Ticket #110 (closed defect: fixed)

Opened 17 months ago

Last modified 12 months ago

doc.to_s leaves BOM in place, removes descriptor from file for UTF-16

Reported by: steenvoorden Owned by: ser
Priority: normal Milestone: 3.1.8
Component: DOM Version: 3.1.6
Severity: normal Keywords: to_s UTF-16
Cc: Ruby version: Other
Operating system: MacOS

Description (last modified by ser)

I'm trying to store an XML file in a field in a MySQL 5.0 database.

I'm using the code:

       require 'rexml/document'

       # Parse the XML file and store the serialized document
       file = File.new(@electrocardiogram.xml_file)
       doc = REXML::Document.new(file)
       @electrocardiogram.xml_data = doc.to_s

The table is defined as:

CREATE TABLE electrocardiograms (
id int(11) NOT NULL auto_increment,
.....
xml_data mediumtext,
.....
PRIMARY KEY  (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This works fine for UTF-8 XML files: the field in the database contains the full contents of the XML file.

However, if the file is a UTF-16 file, the field contains the BOM (FF FE) followed by the second line of the XML file. The first line of the file, <?xml version="1.0" encoding="UTF-16" ?>, is removed completely. Inspecting the string returned by the to_s call shows that it already contains this error.
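The symptom described above can be checked on any serialized string with a few lines of plain Ruby. This is only a diagnostic sketch, not part of REXML: the helper name inspect_xml_head and the constant UTF16_LE_BOM are made up for illustration, and it assumes Ruby 2.1 or later (for String#b and String#scrub) and a little-endian BOM (FF FE) as reported here.

```ruby
# Hypothetical diagnostic helper; not part of REXML.
# The BOM bytes FF FE are the ones reported in this ticket (UTF-16LE).
UTF16_LE_BOM = "\xFF\xFE".b

# Reports whether a serialized XML string starts with a UTF-16LE BOM and
# whether an XML declaration ("<?xml") follows at the start of the text.
def inspect_xml_head(serialized)
  raw = serialized.b                       # binary copy, original untouched
  has_bom = raw.start_with?(UTF16_LE_BOM)
  text =
    if has_bom
      # Drop the two BOM bytes and transcode the rest to UTF-8.
      raw.byteslice(2..-1).force_encoding("UTF-16LE")
         .encode("UTF-8", invalid: :replace, undef: :replace)
    else
      # Assumption: anything without the BOM is treated as UTF-8.
      raw.force_encoding("UTF-8").scrub
    end
  { bom: has_bom, declaration: text.lstrip.start_with?("<?xml") }
end

# A sample shaped like the faulty output: BOM present, declaration missing.
faulty = UTF16_LE_BOM + "<root/>".encode("UTF-16LE").b
result = inspect_xml_head(faulty)
# result[:bom] is true and result[:declaration] is false for this sample.
```

Running the helper on the output of doc.to_s for both attached files would show whether the UTF-16 case is the only one that loses its declaration.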

Without being able to put the data in a string, I can't even do base64 encoding. (I believe MySQL doesn't support UTF-16, although the contents of the field look good apart from the observed omission.)
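For reference, base64 encoding does not have to go through to_s at all if the raw file bytes are what needs to be stored. The sketch below is an assumption about how that could look, not something taken from REXML or from this ticket's fix: the helper names encode_xml_bytes and decode_xml_bytes are hypothetical, and it relies on Array#pack with the "m0" directive, which needs Ruby 1.9 or later (newer than the Ruby 1.8.6 mentioned in this report).

```ruby
# Hypothetical helpers; names are illustrative only.

# Encodes raw bytes as Base64 without line breaks ("m0"), giving an
# ASCII-only string that fits in a utf8 text column unchanged.
def encode_xml_bytes(bytes)
  [bytes].pack("m0")
end

# Reverses encode_xml_bytes and returns the original bytes as a binary string.
def decode_xml_bytes(encoded)
  encoded.unpack("m0").first
end

# Round trip with UTF-16LE sample bytes, BOM included.
sample = "\xFF\xFE".b + "<root/>".encode("UTF-16LE").b
stored = encode_xml_bytes(sample)
restored = decode_xml_bytes(stored)
```

Base64 grows the data by roughly one third, so a file of about 200 kB still fits comfortably in a mediumtext column. The trade-off is that the column is no longer searchable or readable as XML.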

Is this a bug in to_s?

I'm using Locomotive 2.0.8 on Mac OS X 10.4, Ruby 1.8.6 and REXML 3.1.6.

Attached are a UTF-8 file and a UTF-16 file. Ruud

Attachments

15ac5980-4461-11dc-4823-003168c40029.xml (202.6 kB) - added by steenvoorden 17 months ago.
UTF-16 file
040999 040444 -5.xml (201.3 kB) - added by steenvoorden 17 months ago.
UTF-8 file

Change History

Changed 17 months ago by steenvoorden

UTF-16 file

Changed 17 months ago by steenvoorden

UTF-8 file

Changed 16 months ago by ser

  • milestone set to 3.1.8

Changed 15 months ago by ser

  • status changed from new to assigned
  • description modified

Changed 15 months ago by ser

  • status changed from assigned to closed
  • resolution set to fixed

Yeah, UTF-16 has always been troublesome. changeset:1286 should fix this.

Changed 12 months ago by rubys

changeset:1288 adds the test case
