Ticket #83 (assigned defect)

Opened 2 years ago

Last modified 16 months ago

REXML does not appear to handle special characters inside a set of tags

Reported by: muppetmaster
Owned by: ser
Priority: normal
Milestone: Deferred
Component: DOM
Version: 3.1.3
Severity: normal
Keywords:
Cc: Depili
Ruby version: 1.8.5
Operating system: Unix

Description (last modified by ser)

This is a call to Google's Geocode API, and it does return valid XML. The error is shown below; here is the field REXML does not like:

<AdministrativeAreaName>Cataluña</AdministrativeAreaName>

As you can see, a special Spanish character (ñ) is returned.

/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:85:in `parse': #<REXML::ParseException: Missing end tag for 'AdministrativeAreaName' (got "AdministrativeArea") (REXML::ParseException)
Line: 
Position: 
Last 80 unconsumed characters:
</Country></AddressDetails><Point><coordinates>2.148474,41.390046,0</coordinates>>
/usr/lib/ruby/1.8/rexml/parsers/baseparser.rb:311:in `pull'
/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:21:in `parse'
/usr/lib/ruby/1.8/rexml/document.rb:176:in `build'
/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
Missing end tag for 'AdministrativeAreaName' (got "AdministrativeArea")
Line: 
Position: 
Last 80 unconsumed characters:
</Country></AddressDetails><Point><coordinates>2.148474,41.390046,0</coordinates>
Line: 
Position: 
Last 80 unconsumed characters:
</Country></AddressDetails><Point><coordinates>2.148474,41.390046,0</coordinates>       from /usr/lib/ruby/1.8/rexml/document.rb:176:in `build'
        from /usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'

Change History

Changed 2 years ago by ser

  • status changed from new to assigned
  • description modified

Added wiki formatting to make the bug report easier to read.

Can you attach the original XML? Is there an XML declaration with the encoding on it, or is Google returning invalid XML?

--- SER
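
The question above, whether the declared encoding matches the bytes actually sent, can be checked before the payload reaches REXML. The following is only a sketch: it assumes the raw response is available as a Ruby string on Ruby 1.9 or later, and the helper name `describe_xml_encoding` is hypothetical, not part of REXML.

```ruby
# Hypothetical helper (not part of REXML): report the encoding named in the
# XML declaration and whether the raw bytes are well-formed UTF-8.
def describe_xml_encoding(raw)
  bytes = raw.b # binary copy, so the regex never trips over invalid bytes
  declared = bytes[/\A\s*<\?xml[^>]*?encoding=["']([^"']+)["']/i, 1]
  utf8_view = bytes.dup.force_encoding(Encoding::UTF_8)
  { declared: declared, valid_utf8: utf8_view.valid_encoding? }
end

# A Latin-1 "ö" (byte 0xF6) inside a document that claims to be UTF-8.
mislabeled = %(<?xml version="1.0" encoding="UTF-8"?><name>Nikit\xF6n</name>)
p describe_xml_encoding(mislabeled)
```

If `declared` is missing or says UTF-8 while `valid_utf8` is false, the document is probably in a different encoding than it claims, which would fit the symptoms reported in this ticket.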

Changed 2 years ago by muppetmaster

I had posted the details here:

http://railsforum.com/viewtopic.php?id=529

I will re-run tomorrow and post the XML file itself in proper form.

Changed 2 years ago by muppetmaster

<?xml version="1.0" encoding="UTF-8"?> <kml xmlns="http://earth.google.com/kml/2.0"><Response><name>Comte urgell 240, barcelona, spain</name><Status><code>200</code><request>geocode</request></Status><Placemark><address>Carrer del Comte d'Urgell 240, 08036 Barcelona, Spain</address><AddressDetails Accuracy="8" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><Country><CountryNameCode>ES</CountryNameCode><AdministrativeArea><AdministrativeAreaName>Cataluña</AdministrativeAreaName><SubAdministrativeArea><SubAdministrativeAreaName>Barcelona</SubAdministrativeAreaName><Locality><LocalityName>Barcelona</LocalityName><Thoroughfare><ThoroughfareName>Carrer del Comte d'Urgell 240</ThoroughfareName></Thoroughfare><PostalCode><PostalCodeNumber>08036</PostalCodeNumber></PostalCode></Locality></SubAdministrativeArea></AdministrativeArea></Country></AddressDetails><Point><coordinates>2.148474,41.390046,0</coordinates></Point></Placemark></Response></kml>

Changed 2 years ago by Depili

I can also confirm this bug, but it seems erratic. Parsing the following file always fails:

<organizers>

<organizer>

<name>Niko Nikitön</name>

</organizer>

</organizers>

But the following is parsed fine:

<organizers>

<organizer>

<name>Niko Nikitön organizer</name>

</organizer>

</organizers>

This also works fine:

<organizers>

<organizer>

<name>Mikko Löpönen</name>

</organizer> <organizer>

<name>Niko NikitöÖäÄn Organizer</name>

</organizer>

</organizers>

So it's not just the special characters, but something else, and this is blocking my development of software for parsing Finnish names from an XML file :/

Changed 2 years ago by Depili

  • cc Depili added

Changed 2 years ago by Depili

I have done additional testing, and adding an <?xml version="1.0" encoding="ISO-8859-1"?> declaration to the beginning of the file solves the parsing problem, but now REXML breaks the special characters (the file is saved as latin1, but the special characters end up in the database as ö etc.) :/

Changed 2 years ago by ser

Please attach an offending XML document. The URL that was provided returns a kml document that contains no non-7-bit ASCII characters.

Please note that the XML document must contain a proper XML declaration and correct encoding if the encoding is not UTF-8. Once parsed, all text returned by REXML is in UTF-8 format. If you want to convert it back to ISO-8859-1, you need to perform the conversion yourself. For example:

   utf_8_text = element.text
   iso_8859_1_text = utf_8_text.unpack("U*").pack("C*")
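
The `unpack`/`pack` idiom above dates from Ruby 1.8, where strings carried no encoding information. On Ruby 1.9 and later, `String#encode` performs the same conversion and raises an error if a character has no ISO-8859-1 equivalent. A minimal sketch, assuming a Ruby installation where the `rexml` library is available; the sample document is made up for illustration:

```ruby
require "rexml/document"

# Made-up sample document; REXML hands back element text as UTF-8.
xml = %(<?xml version="1.0" encoding="UTF-8"?><name>Niko Nikit\u00F6n</name>)
element = REXML::Document.new(xml).root

utf_8_text = element.text
# Transcode UTF-8 -> ISO-8859-1. Raises Encoding::UndefinedConversionError
# for characters that do not exist in ISO-8859-1.
iso_8859_1_text = utf_8_text.encode("ISO-8859-1")

puts iso_8859_1_text.encoding
puts iso_8859_1_text.bytesize
```

Unlike `pack("C*")`, which returns an untagged byte string, `encode` returns a string tagged as ISO-8859-1, so later code can tell which encoding it holds.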

Changed 2 years ago by ser

  • milestone changed from 3.1.6 to Deferred

Changed 16 months ago by patoche

  • ruby_version changed from 1.8.2 to 1.8.5
  • os changed from MacOS to Unix

Hi,

I'm having the same problem trying to parse the result of a geocoding request. I get the following response:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Response>
<name>16 Weststrasse, Dietikon, 8953, Switzerland</name>
<Status>
<code>200</code>
<request>geocode</request>
</Status>
<Placemark id="p1">
<address>Weststrasse 16, 8953 Dietikon, Switzerland</address>
<AddressDetails Accuracy="8" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0">
<Country>
<CountryNameCode>CH</CountryNameCode>
<AdministrativeArea>
<AdministrativeAreaName>Zürich</AdministrativeAreaName>
<SubAdminitrativeArea>
<SubAdministrativeAreaName>Dietikon</SubAdministrativeAreaName>
<Locality>
<LocalityName>Dietikon</LocalityName>
<Thoroughfare>
<ThoroughfareName>Weststrasse 16</ThoroughfareName>
</Thoroughfare>
<PostalCode>
<PostalCodeNumber>8953</PostalCodeNumber>
</PostalCode>
</Locality>
</SubAdministrativeArea>
</AdministrativeArea>
</Country>
</AddressDetails>
<Point>
<coordinates>8.393334,47.402911,0</coordinates>
</Point>
</Placemark>
</Response>
</kml>

and parsing gives me:

>> doc = REXML::Document.new(xml)
REXML::ParseException: #<REXML::ParseException: Missing end tag for 'SubAdminitativeArea' (got "SubAdministrativeArea")
Line: 
Position: 
Last 80 unconsumed characters:
</AdministrativeArea></Country></AddressDetails><Point><coordinates>8.393334,47.4>
/usr/lib/ruby/1.8/rexml/parsers/baseparser.rb:315:in `pull'
/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:21:in `parse'
/usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
(irb):3:in `new'
(irb):3:in `irb_binding'
/usr/lib/ruby/1.8/irb/workspace.rb:52:in `irb_binding'
/usr/lib/ruby/1.8/irb/workspace.rb:52
...
Missing end tag for 'SubAdminitativeArea' (got "SubAdministrativeArea")
Line: 
Position: 
Last 80 unconsumed characters:
</AdministrativeArea></Country></AddressDetails><Point><coordinates>8.393334,47.4
Line: 
Position: 
Last 80 unconsumed characters:
</AdministrativeArea></Country></AddressDetails><Point><coordinates>8.393334,47.4
        from /usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:89:in `parse'
        from /usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
        from /usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
        from (irb):3:in `new'
        from (irb):3
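
When a response fails like this inside an application, catching the exception keeps one bad document from aborting the whole run and leaves a short log line to attach to a report. A rough sketch, assuming the `rexml` library is available; the wrapper name `try_parse` is hypothetical and the malformed input is a made-up stand-in for the geocoder response:

```ruby
require "rexml/document"

# Hypothetical wrapper: return the parsed document, or nil after logging the
# first line of REXML's error message.
def try_parse(xml)
  REXML::Document.new(xml)
rescue REXML::ParseException => e
  warn "REXML could not parse the document: #{e.message.lines.first.to_s.chomp}"
  nil
end

doc = try_parse("<a><b>text</a>") # mismatched tags, so doc is nil here
```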
