Ticket #63 (closed defect: fixed)

Opened 3 years ago

Last modified 2 years ago

REXML doesn't work with UTF-16 XML files

Reported by: d_rems@… Owned by: ser
Priority: normal Milestone: 3.1.6
Component: XPath Version: 3.1.3
Severity: normal Keywords:
Cc: Ruby version: 1.8.4
Operating system: All

Description (last modified by ser)

c:/ruby/lib/ruby/1.8/rexml/source.rb:140:in `initialize': undefined method `encode' for #<REXML::IOSource:0x2cd77d8> (NoMethodError)
	from c:/ruby/lib/ruby/1.8/rexml/source.rb:16:in `create_from'
	from c:/ruby/lib/ruby/1.8/rexml/parsers/baseparser.rb:123:in `stream='
	from c:/ruby/lib/ruby/1.8/rexml/parsers/baseparser.rb:100:in `initialize'
	from c:/ruby/lib/ruby/1.8/rexml/parsers/treeparser.rb:8:in `initialize'
	from c:/ruby/lib/ruby/1.8/rexml/document.rb:190:in `build'
	from c:/ruby/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
	from C:/ruby/rems/xmlTest.rb:11

It looks like method encode is not (en)coded ;-)

by TheR

Attachments

a.xml (157.9 kB) - added by d_rems@… 3 years ago.
Original XML file in UTF 16 format
XMLTEST.SVG (131.4 kB) - added by Richard.Schmidt@… 2 years ago.
Sample UTF-16 file which doesn't work.

Change History

Changed 3 years ago by d_rems@…

Original XML file in UTF 16 format

Changed 2 years ago by Richard.Schmidt@…

  • os changed from Windows to All

We are running on UNIX and I tried to resolve this problem by forcing it to reload using the following code:

class IOSource < Source
    #attr_reader :block_size

    # block_size has been deprecated
    def initialize(arg, block_size=500)
      @er_source = @source = arg
      @to_utf = false
      # Determining the encoding is a deceptively difficult issue to resolve.
      # First, we check the first two bytes for UTF-16.  Then we
      # assume that the encoding is at least ASCII enough for the '>', and
      # we read until we get one of those.  This gives us the XML declaration,
      # if there is one.  If there isn't one, the file MUST be UTF-8, as per
      # the XML spec.  If there is one, we can determine the encoding from
      # it.
      str = @source.read( 2 )
      if (str[0] == 254 && str[1] == 255) || (str[0] == 255 && str[1] == 254)
        @encoding = check_encoding( str )
        # 7/17/2006 - Richard Schmidt @ Pacificorp
        # Due to a bug in the code (the encode method is not defined), the
        # following code was added to force the loading of the encoding module
        #print "Class: IOSource - encoding = #{@encoding} \n"
        # Because the encoding for SVG files is coming back as UNILE instead of UTF-16,
        # we override it
        @encoding = 'UTF-16' if @encoding == 'UNILE'
        #print "Requiring file rexml/encodings/#{@encoding}.rb \n"
        require File.join("rexml", "encodings", "#{@encoding}.rb")
        return Encoding.apply(self, @encoding)
        # 7/17/2006 - End of modification
        @line_break = encode( '>' )
      else
        @line_break = '>'
      end
      super str+@source.readline( @line_break )
    end
end
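
For readers on newer Rubies, the BOM check described in the comment above can be sketched as follows. This is a minimal illustration, not REXML's code: the helper name sniff_utf16_bom is hypothetical, and it uses String#getbyte because the `str[0] == 254` comparison in the snippet above relies on Ruby 1.8 returning an Integer from String#[] (Ruby 1.9 and later return a one-character String, so that comparison would never match).

```ruby
require "stringio"

# Hypothetical helper (not part of REXML): read the first two bytes of an
# IO-like object and report which UTF-16 byte order mark, if any, they form.
def sniff_utf16_bom(io)
  head = io.read(2)
  return nil if head.nil? || head.bytesize < 2

  # getbyte returns an Integer on every Ruby version.
  case [head.getbyte(0), head.getbyte(1)]
  when [0xFE, 0xFF] then "UTF-16BE"
  when [0xFF, 0xFE] then "UTF-16LE"
  end
end

p sniff_utf16_bom(StringIO.new("\xFE\xFF\x00<".b))
p sniff_utf16_bom(StringIO.new("\xFF\xFE<\x00".b))
p sniff_utf16_bom(StringIO.new("<?xml version='1.0'?>"))
```

The helper returns nil when no BOM is found, which corresponds to the else branch above where the plain '>' is used as the line break.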

While this resolved the problems with the encode and decode methods not being defined, it still didn't resolve all the problems. We now get an exception from the treeparser:

[pdxemspds02] 66 /rgrdev/home/p17904/projects/display_point_rename> /usr/local/bin/ruby xmltest.rb XMLTEST.SVG
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:85:in `parse': REXML::ParseException
        from /usr/local/lib/ruby/1.8/rexml/document.rb:178:in `build'
        from /usr/local/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
        from xmltest.rb:8

Resolution of this problem is critical for us and we can't wait until 3.1.6 is available. Does anyone have a solution or workaround for this?

Changed 2 years ago by Richard.Schmidt@…

Sample UTF-16 file which doesn't work.

Changed 2 years ago by ser

  • description modified

Formatting changes.

Also, Matz has fixed this in Ruby 1.8 CVS:

Changes:

  • Encoding#encoding= to return boolean value to tell if the body is really converted or not.
  • Specific conversion library (e.g. rexml/encodings/UTF-16.rb) to have higher precedence.
  • UTF-16#decode_utf16 should work on strings without a BOM.

matz.
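
The first change in the list (a boolean return from encoding=) can be illustrated roughly like this. SketchEncoding and SketchSource are hypothetical stand-ins, not the real REXML classes, and the counter exists only to make the effect visible. Note that the value of an expression like `src.encoding = x` is always x, so the boolean is only observable through super inside the overriding method.

```ruby
# Hypothetical stand-in for the REXML::Encoding module.
module SketchEncoding
  attr_reader :encoding

  # Returns true only when the stored encoding actually changed.
  def encoding=(enc)
    enc = enc.nil? ? nil : enc.upcase
    return false if defined?(@encoding) && enc == @encoding
    @encoding = enc
    true
  end
end

# Hypothetical stand-in for REXML::Source.
class SketchSource
  include SketchEncoding
  attr_reader :line_break_updates

  def initialize
    @line_break_updates = 0
  end

  def encoding=(enc)
    # super reaches SketchEncoding#encoding=; skip the follow-up work
    # when it reports that nothing changed.
    return unless super
    @line_break_updates += 1
  end
end

src = SketchSource.new
src.encoding = "utf-16"
src.encoding = "UTF-16" # same encoding after upcasing, so no second update
puts src.line_break_updates
```

With this shape, the subclass recomputes derived state (the line break in the real patch) only when the encoding really changed.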

--- lib/rexml/encoding.rb       22 Aug 2006 15:25:43 -0000      1.10
+++ lib/rexml/encoding.rb       11 Sep 2006 02:36:44 -0000
@@ -26,17 +26,18 @@ module REXML
         $VERBOSE = false
-        return if defined? @encoding and enc == @encoding
+                               enc = enc.nil? ? nil : enc.upcase
+        return false if defined? @encoding and enc == @encoding
         if enc and enc != UTF_8
-          @encoding = enc.upcase
-          begin
-            require 'rexml/encodings/ICONV.rb'
-            Encoding.apply(self, "ICONV")
-          rescue LoadError, Exception => err
-            raise ArgumentError, "Bad encoding name #@encoding" unless @encoding =~ /^[\w-]+$/
-            @encoding.untaint 
-            enc_file = File.join( "rexml", "encodings", "#@encoding.rb" )
-            begin
-              require enc_file
-              Encoding.apply(self, @encoding)
-            rescue LoadError
-              puts $!.message
+                                       @encoding = enc
+                                       raise ArgumentError, "Bad encoding name #@encoding" unless @encoding =~ /^[\w-]+$/
+                                       @encoding.untaint 
+                                       enc_file = File.join( "rexml", "encodings", "#@encoding.rb" )
+                                       begin
+                                               require enc_file
+                                               Encoding.apply(self, @encoding)
+          rescue LoadError, Exception
+                                               begin
+                                                       require 'rexml/encodings/ICONV.rb'
+                                                       Encoding.apply(self, "ICONV")
+            rescue LoadError => err
+              puts err.message
               raise ArgumentError, "No decoder found for encoding #@encoding.  Please install iconv."
@@ -52,2 +53,3 @@ module REXML
       end
+                       true
     end
Index: lib/rexml/source.rb
===================================================================
RCS file: /var/cvs/src/ruby/lib/rexml/source.rb,v
retrieving revision 1.9
diff -p -u -1 -r1.9 source.rb
--- lib/rexml/source.rb 22 Aug 2006 15:25:43 -0000      1.9
+++ lib/rexml/source.rb 11 Sep 2006 02:36:44 -0000
@@ -46,3 +46,3 @@ module REXML
                def encoding=(enc)
-                       super
+                       return unless super
                        @line_break = encode( '>' )
Index: lib/rexml/encodings/UTF-16.rb
===================================================================
RCS file: /var/cvs/src/ruby/lib/rexml/encodings/UTF-16.rb,v
retrieving revision 1.5
diff -p -u -1 -r1.5 UTF-16.rb
--- lib/rexml/encodings/UTF-16.rb       9 Apr 2005 17:03:32 -0000       1.5
+++ lib/rexml/encodings/UTF-16.rb       11 Sep 2006 02:36:44 -0000
@@ -18,5 +18,6 @@ module REXML
     def decode_utf16(str)
+      str = str[2..-1] if /^\376\377/ =~ str
       array_enc=str.unpack('C*')
       array_utf8 = []
-      2.step(array_enc.size-1, 2){|i| 
+      0.step(array_enc.size-1, 2){|i| 
         array_utf8 << (array_enc.at(i+1) + array_enc.at(i)*0x100)
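
As a rough illustration of what the decode_utf16 hunk above is aiming for (strip the BOM only when it is present, then start at byte 0), here is a minimal sketch for current Ruby. The name decode_utf16be_sketch is hypothetical, the input is assumed to be big-endian, and like the patched method it does not combine surrogate pairs, so it only covers BMP characters.

```ruby
# Hypothetical helper, not the REXML method: decode big-endian UTF-16 bytes
# to a UTF-8 string, with or without a leading byte order mark.
def decode_utf16be_sketch(str)
  bytes = str.b
  # Drop the BOM only if it is actually there, as in the patch.
  bytes = bytes.byteslice(2..-1) if bytes.start_with?("\xFE\xFF".b)
  # 'n*' reads each byte pair as a big-endian 16-bit unsigned integer;
  # 'U*' writes each integer back out as a UTF-8 encoded code point.
  bytes.unpack("n*").pack("U*")
end

puts decode_utf16be_sketch("\xFE\xFF\x00h\x00i".b)
puts decode_utf16be_sketch("\x00h\x00i".b)
```

Both calls go through the same code path; the only difference is whether the two BOM bytes are skipped first, which is the case the old `2.step` loop got wrong for BOM-less input.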

Changed 2 years ago by ser

  • status changed from new to closed
  • resolution set to fixed

Actually, Matz's fix was correct. The UNILE decoder was busted. This has been fixed in changeset:1235

Changed 2 years ago by ser

"correct" -> "incorrect"
