Ticket #150 (new defect)
Opened 2 months ago
SAX2 doesn't handle entity references consistently
Description
$ ruby -vrrexml/rexml -e 'p REXML::VERSION,PLATFORM'
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]
"3.1.7.2"
"i686-linux"
The SAX2 parser only partially unescapes entities: some are unescaped, some are not. This makes it impossible for the application to process character data correctly, since it cannot tell whether a given "&" has already been unescaped or not.
Example:
require 'rexml/parsers/sax2parser'

source = <<EOS
<foo>
 Bill & Ben
 Foo & Bar
 abc &flurble; def
 abc &flurble; def
</foo>
EOS

l = Object.new
def l.method_missing(*args)
  p args
end

p = REXML::Parsers::SAX2Parser.new(source)
p.listen(l)
p.parse
Results:
[:start_document]
[:start_element, nil, "foo", "foo", {}]
[:progress, 5]
[:characters, "\n Bill & Ben\n Foo & Bar\n abc &flurble; def\n abc &flurble; def\n"]
[:progress, 83]
[:end_element, nil, "foo", "foo"]
[:progress, 5]
[:characters, "\n"]
[:progress, 0]
[:end_document]
Notice that the last two lines of character data both reach the characters() method as "abc &flurble; def", and yet one has had entity substitution performed, and one has not.
IMO, a well-behaved parser should either:
- Not perform any entity unescaping, and return raw data always; or
- Always perform entity unescaping, and call a user method for any entity which it does not understand.
I had a look at the Java spec for SAX/SAX2, see https://jaxp-sources.dev.java.net/nonav/docs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int)
The spec doesn't make it clear whether the data passed to characters() has already been unescaped. However, lower down the page there is a skippedEntity() method, so I guess that in the case of "abc &flurble; def" you would get:
- characters("abc ")
- skippedEntity("flurble")
- characters(" def")
That would allow the application to take whatever action it requires on seeing an unknown entity.
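To make the proposed event sequence concrete, here is a minimal sketch in plain Ruby. It is not REXML code and does not use the REXML API: entity_events, PREDEFINED, ENTITY_REF and the :skipped_entity event name are all hypothetical, chosen only for illustration. It assumes the parser resolves numeric references and the five predefined XML entities itself, and reports anything else as a skipped entity.

```ruby
# Hypothetical helper, NOT part of REXML: turn raw character data into
# SAX-style events. Known entities are resolved; unknown ones are reported
# separately so the application can decide what to do with them.
PREDEFINED = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'"
}.freeze

# Matches &#xHEX;, &#DEC; or &name; and captures the part between & and ;
ENTITY_REF = /&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_][\w.\-]*);/

def entity_events(raw, entities = PREDEFINED)
  events = []
  buffer = ''
  rest = raw
  while (m = ENTITY_REF.match(rest))
    buffer += m.pre_match
    ref = m[1]
    if ref.start_with?('#x')
      buffer += [ref[2..-1].to_i(16)].pack('U')   # hexadecimal character reference
    elsif ref.start_with?('#')
      buffer += [ref[1..-1].to_i].pack('U')       # decimal character reference
    elsif entities.key?(ref)
      buffer += entities[ref]                     # known named entity
    else
      # Unknown entity: flush the text seen so far, then report the entity.
      events << [:characters, buffer] unless buffer.empty?
      events << [:skipped_entity, ref]
      buffer = ''
    end
    # Continue after the reference; substituted text is never rescanned.
    rest = m.post_match
  end
  buffer += rest
  events << [:characters, buffer] unless buffer.empty?
  events
end

entity_events('abc &flurble; def').each { |event| p event }
```

With this approach an unknown reference such as &flurble; arrives as its own event between two characters events, while an escaped ampersand followed by "flurble;" arrives as one characters event containing the literal text, so the application can tell the two cases apart.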
