Ticket #44 (assigned enhancement)

Opened 3 years ago

Last modified 8 months ago

Text class escapes '&notanxmlentityref;' incorrectly.

Reported by: carl@… Owned by: ser
Priority: normal Milestone: 3.2.0
Component: DOM Version: 3.1.2
Severity: normal Keywords: entity escape
Cc: Ruby version: 1.8.2
Operating system: Linux

Description

Note, this was tested with Ruby 1.8.4 besides 1.8.2 but it was not a choice when I submitted this ticket.

With irb, do the following test.

% irb
irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> REXML::Text.new("&", false, false).to_s
=> "&"
irb(main):003:0> REXML::Text.new("&notanxmlentityref;", false, false).to_s
=> "&notanxmlentityref;"

This behavior is incorrect. Since raw=false the '&' in each of these strings should be escaped but it is not in either case. The correct output should be the following.

% irb
irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> REXML::Text.new("&", false, false).to_s
=> "&"
irb(main):003:0> REXML::Text.new("&notanxmlentityref;", false, false).to_s
=> "&notanxmlentityref;"

That way the original string is recovered (unescaped) like this: "&" -> "&" rather than this: "&" -> "&" !!! Not the original string.

If you are interested in how this problem came up I was trying to escape some C++ code that looked like this:

int *intp = &aninteger;

This should've been escaped as int *intp = &aninteger; but instead was escaped as int *intp = &aninteger; When mozilla firefox went to parse this text node produced by REXML it complained that aninteger was an undefined entity reference (the right thing for it to do in this case)

Change History

follow-up: ↓ 4   Changed 2 years ago by ser

  • priority changed from high to normal
  • status changed from new to assigned
  • severity changed from major to enhancement
  • milestone changed from 3.1.4 to 3.2.0

Your interpretation of the meaning of "raw" is incorrect. If you look at the documentation for "Text#to_s", you'll see an example describing exactly what you're seeing:

t = Text.new( "< & &s; russell", false, nil, false )
t.to_s   #-> "&lt; &amp; &s; russell"

"raw" will escape any text that isn't already escaped. It will not mangle already valid entities.

However, I see your point, and I'll consider changing the behavior of "raw" for a 3.2 release. The correct way of including code like this currently is to use CData:

e = Element.new( "x" )
e.add( CData.new( "int *intp = &aninteger;" ))
puts e
# => <x><![CDATA[int *intp = &aninteger;]]></x>

follow-up: ↓ 3   Changed 21 months ago by SamRuby

The "correct" way can cause other problems. Witness:

e.add( CData.new(']]>') )
RuntimeError: Illegal character ']]>' in raw string "]]>"
        from /usr/lib/ruby/1.8/rexml/text.rb:88:in `initialize'
        from /usr/lib/ruby/1.8/rexml/cdata.rb:16:in `initialize'
        from (irb):9

in reply to: ↑ 2 ; follow-up: ↓ 5   Changed 12 months ago by ser

Replying to SamRuby:

The "correct" way can cause other problems. Witness:

']]>' is an illegal character sequence, so that's not a problem; REXML behaves correctly.

For reference, http://www.w3schools.com/xml/xml_cdata.asp (from the "Notes" section)

A CDATA section cannot contain the string "]]>", therefore, nested CDATA sections are not allowed.

and, formally, the XML spec

  [20]     CData     ::=     (Char* - (Char* ']]>' Char*))

in reply to: ↑ 1   Changed 11 months ago by Arsen7

Replying to ser:

"raw" will escape any text that isn't already escaped. It will not mangle already valid entities. However, I see your point, and I'll consider changing the behavior of "raw" for a 3.2 release.

I found a similiar problem. Not exactly with 'raw'. The problem is - how to distinguish escaped text from unescaped one. It would be nice to have a method to unconditionally escape the entities.

I expected the unnormalized(normalized(string)) to be exactly the same as given. The problem was the condition 'if its escaped'. Maybe 'normalize' and 'raw' should work unconditionally?

class RexmlTest < Test::Unit::TestCase
  VERBATIM_TEXT = "a < b > c & d \" e ' f &amp; &amp;amp; &quot; &#13;" +
                  " &lt; &gt; &apos; &some; &thing \\ <- backslash".freeze
 
  BAD_TEXT      = "a < b > c & d \" e ' f & &amp; \" &#13; " +
                  " < > ' &some; &thing \\ <- backslash".freeze
 
  # Expected behaviour:
  def test_normalize_is_reversible
    assert_equal VERBATIM_TEXT, REXML::Text.unnormalize( REXML::Text.normalize( VERBATIM_TEXT ))
  end
 
  # Current behaviour,
  # ruby 1.8.5 (2006-12-04 patchlevel 2) [i686-linux]
  # "3.1.4"
  # "i686-linux"
 
  def test_normalize_is_not_reversible
    assert_equal BAD_TEXT, REXML::Text.unnormalize( REXML::Text.normalize( VERBATIM_TEXT ))
  end
end

in reply to: ↑ 3   Changed 11 months ago by Arsen7

Replying to ser:

A CDATA section cannot contain the string "]]>", therefore, nested CDATA sections are not allowed.

The problem how to escape the ']]>' emerges. :-)

There is a nice solution mentioned, among others, in http://lists.xml.org/archives/xml-dev/199802/msg00088.html - replacing ']]>' with ']]]]><![CDATA[>'

CData.new might perform such replacement automatically, but that would cause at least two CData instances to be created.. right?

  Changed 8 months ago by martin

I think this bug has bitten me, too, but in a slightly different way. The behaviour of .value is strange for raw=false and entities in text nodes:

tn = REXML::Text.new('&gt;xx', true, nil, false)

=> "&gt;xx"

tn.to_s

=> "&amp;gt;xx"

tn.value

=> ">xx"

From current doc: "If this value is false, the string may contain any characters, and REXML will escape any and all defined entities whose values are contained in the text.". So either raw=false doesn't escape, or .value double unescapes?

When the universe ends and all comes down, the only thing left will be an incorrectly double-escaped string ...

Note: See TracTickets for help on using tickets.