Ticket #150 (new defect)

Opened 2 months ago

SAX2 doesn't handle entity references consistently

Reported by: candlerb Owned by: ser
Priority: normal Milestone:
Component: SAX2 Version: 3.1.7
Severity: normal Keywords: sax2 entity normalization escaping
Cc: Ruby version: 1.8.6
Operating system: Linux

Description

$ ruby -vrrexml/rexml -e 'p REXML::VERSION,PLATFORM'
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]
"3.1.7.2"
"i686-linux"

The SAX2 parser only partially unescapes entities: some are unescaped and some are not. This makes it impossible for the application to process character data correctly, since it cannot tell whether a given "&" has already been unescaped.

Example:

require 'rexml/parsers/sax2parser'

source = <<EOS
<foo>
  Bill &amp; Ben
  Foo &#38; Bar
  abc &#38;flurble; def
  abc &flurble; def
</foo>
EOS

# Listener that prints every SAX2 event it receives
listener = Object.new
def listener.method_missing(*args)
  p args
end

parser = REXML::Parsers::SAX2Parser.new(source)
parser.listen(listener)
parser.parse

Results:

[:start_document]
[:start_element, nil, "foo", "foo", {}]
[:progress, 5]
[:characters, "\n  Bill &amp; Ben\n  Foo & Bar\n  abc &flurble; def\n  abc &flurble; def\n"]
[:progress, 83]
[:end_element, nil, "foo", "foo"]
[:progress, 5]
[:characters, "\n"]
[:progress, 0]
[:end_document]

Notice that "abc &#38;flurble; def" and "abc &flurble; def" are identical when presented to the characters() method, and yet one has had entity substitution performed and the other has not.

IMO, a well-behaved parser should either

  1. Not perform any entity unescaping, and return raw data always; or
  2. Always perform entity unescaping, and call a user method for any entity which it does not understand.

I had a look at the Java spec for SAX/SAX2, see https://jaxp-sources.dev.java.net/nonav/docs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)

This doesn't make it clear whether the data passed to characters() has already been unescaped. However, lower down the page there is a skippedEntity() method, so I guess that in the case of "abc &flurble; def" you would get

  • characters("abc ")
  • skippedEntity("flurble")
  • characters(" def")

That would allow the application to take whatever action it requires on seeing an unknown entity.
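
To make the proposed behaviour concrete, here is a rough sketch of the event splitting described above. It is not REXML code: entity_events and the :skipped_entity event name are hypothetical, made up for illustration, and the entity-name pattern is a simplification of the real XML Name rule. It expands the five predefined entities and numeric character references, and reports anything else as a skipped entity.

```ruby
# Hypothetical helper, NOT part of REXML: turns raw character data into
# SAX-style events, expanding references it knows and reporting the rest.
PREDEFINED = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'"
}.freeze

def entity_events(raw)
  events = []
  text = "".dup
  # Emit any buffered text as one :characters event
  flush = lambda do
    events << [:characters, text] unless text.empty?
    text = "".dup
  end

  pos = 0
  # Simplified reference syntax: &#xHEX; &#DEC; or &name;
  raw.scan(/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_][\w.\-]*);/) do
    m = Regexp.last_match
    text << raw[pos...m.begin(0)]   # literal text before the reference
    pos = m.end(0)
    ref = m[1]
    if ref.start_with?("#x")
      text << [ref[2..-1].hex].pack("U")
    elsif ref.start_with?("#")
      text << [ref[1..-1].to_i].pack("U")
    elsif PREDEFINED.key?(ref)
      text << PREDEFINED[ref]
    else
      # Unknown entity: hand it to the application instead of guessing
      flush.call
      events << [:skipped_entity, ref]
    end
  end
  text << raw[pos..-1]              # trailing literal text
  flush.call
  events
end

p entity_events("abc &flurble; def")
p entity_events("abc &#38;flurble; def")
```

With this approach the two problem lines from the example are no longer confused: the first yields characters, skipped entity, characters, while the second yields a single characters event containing the literal text "abc &flurble; def", because the replacement text of a character reference is not scanned again.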
