Ticket #93 (closed defect: invalid)

Opened 2 years ago

Last modified 2 years ago

ASCII Character 160 (Hex A0) can trigger BUFFER EMPTY

Reported by: hjgalbraith
Owned by: ser
Priority: normal
Milestone:
Component: Stream
Version: 3.1.4
Severity: normal
Keywords:
Cc:
Ruby version: 1.8.4
Operating system: Windows

Description

I was testing some XML files and ran across hex A0 characters that were somehow embedded. While this does not generate an exception, it does cause the parser to ignore the remaining data in the file. I suspect that hex A0 is an anomaly, but it does suggest that the parser should either ignore unused ASCII characters such as this (ASCII 160) or generate an exception so that the application is notified that there is a problem.

The work-around is to scan each file for offending characters and change or remove them. Not pretty and a performance issue.
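A minimal sketch of that work-around, for illustration only. It assumes a modern Ruby (1.9 or later, which has per-string encodings; the 1.8.4 named in this ticket does not) and it assumes the stray bytes come from Windows-1252, which is only a guess for files produced on Windows. The helper name `to_utf8` is hypothetical and is not part of REXML.

```ruby
# Hypothetical helper, not part of REXML.
# Returns the input as UTF-8 text. If the bytes are already valid UTF-8 they
# are passed through unchanged; otherwise they are ASSUMED to be in `fallback`
# (Windows-1252 by default, a guess) and transcoded to UTF-8.
def to_utf8(raw, fallback = Encoding::Windows_1252)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # encode raises Encoding::UndefinedConversionError for bytes that the
  # fallback encoding does not define, so a wrong guess is not silent.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# Usage: a lone 0xA0 byte becomes the two-byte UTF-8 sequence C2 A0.
fixed = to_utf8("a\xA0b".b)
p fixed.bytes  # [97, 194, 160, 98]
```

Transcoding keeps the character (a no-break space under this assumption) instead of deleting it, and the validity check is a single pass over the bytes, so the cost is linear in the file size.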

I am sorry, but I don't know the version number of REXML that is included with my Ruby release.

Change History

Changed 2 years ago by ser

  • status changed from new to closed
  • resolution set to invalid

A lone 0xA0 byte is not valid UTF-8; bytes greater than 0x7F must not appear in an XML document unless they are

  1. part of a valid UTF-8 sequence, or
  2. in documents which contain an XML declaration that declares an encoding for which 0xA0 is a legal character.
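The two cases can be checked at the byte level. The sketch below assumes a modern Ruby (1.9 or later) and uses ISO-8859-1 purely as an example of a declared encoding in which 0xA0 is legal; `hex_bytes` is a hypothetical helper used only for display.

```ruby
# Hypothetical display helper: the bytes of a string as uppercase hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }
end

# Case 1: as part of a valid UTF-8 sequence. U+00A0 (no-break space) is
# encoded in UTF-8 as the two bytes C2 A0, so 0xA0 is a continuation byte.
nbsp = "\u00A0"
p hex_bytes(nbsp)                   # ["C2", "A0"]

# Case 2: under a declared encoding such as ISO-8859-1, the single byte
# 0xA0 is legal and stands for the same character, U+00A0.
latin1 = "\xA0".b.force_encoding(Encoding::ISO_8859_1)
p latin1.valid_encoding?            # true
p latin1.encode(Encoding::UTF_8) == nbsp  # true

# Neither case: the same single byte read as UTF-8 is malformed.
p "\xA0".b.force_encoding(Encoding::UTF_8).valid_encoding?  # false
```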

XML documents are UTF-8 by default.

As an example:

$ ruby -e '"\xA0".unpack("U*")'
-e:1:in `unpack': malformed UTF-8 character (ArgumentError)
        from -e:1

For more information, see http://en.wikipedia.org/wiki/Utf-8 and a couple of paragraphs after http://www.w3.org/TR/xml/#NT-EncName, which says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

REXML should really throw a more descriptive exception.
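As a rough illustration of what a more descriptive error could report, here is a sketch that an application can run before handing data to the parser. This is not REXML's implementation; `MalformedInputError` and `check_utf8!` are hypothetical names, and the code assumes a modern Ruby (1.9 or later).

```ruby
# Hypothetical error class, not part of REXML.
class MalformedInputError < StandardError; end

# Hypothetical pre-parse check: returns the input tagged as UTF-8, or raises
# with the value and byte offset of the first byte that is not valid UTF-8.
def check_utf8!(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  offset = 0
  text.each_char do |ch|
    unless ch.valid_encoding?
      raise MalformedInputError,
            format("invalid UTF-8 byte 0x%02X at byte offset %d",
                   ch.bytes.first, offset)
    end
    offset += ch.bytesize
  end
  text
end

begin
  check_utf8!("ab\xA0c".b)
rescue MalformedInputError => e
  puts e.message  # invalid UTF-8 byte 0xA0 at byte offset 2
end
```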
