Ticket #109 (new enhancement)

Opened 16 months ago

Add ability to ignore bad UTF-8 encoded characters

Reported by: ser
Owned by: ser
Priority: low
Milestone: 3.1.8
Component: DOM
Version: 3.1.4
Severity: minor
Keywords:
Cc:
Ruby version: 1.8.5
Operating system: Linux

Description

Holden Karau's original email:

I've been doing some work parsing a lot of RSS feeds in the wild, and one thing I've found useful is removing invalid UTF-8 characters. This patch is against rexml_3.1.7 and modifies encodings/ICONV.rb and encodings/UTF-8.rb. I've attached the patch, but in case it gets mangled you can grab it from my website at http://www.holdenkarau.com/~holden/projects/rexml/rexml_strip_invalid.diff. I'm not sure if this should go into REXML, but I think it would be useful for a lot of people.

Solution

Holden's solution is pretty close to what we want to do. Add a UTF-8/Ignore encoding. Users can then construct their own sources when they want to parse malformed XML documents:

s = Source.new( xml_file, "UTF-8/Ignore" )
d = Document.new(s)
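
Until such an encoding exists, roughly the same effect can be had by cleaning the input before REXML sees it. The following is only a sketch under stated assumptions: `strip_invalid_utf8` is a hypothetical helper, not part of REXML or of Holden's patch, and it relies on `String#scrub`, which needs Ruby 2.1 or later (not the 1.8.5 this ticket was filed against).

```ruby
# Hypothetical helper (not part of REXML): drop byte sequences that are
# not valid UTF-8. Assumes Ruby 2.1+ for String#scrub.
def strip_invalid_utf8(raw)
  # Work on a copy tagged as UTF-8 so the caller's string is not mutated.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # Replace every invalid byte sequence with the empty string.
  text.scrub("")
end

# Sample input: a valid two-byte "é" followed by a stray 0xFF byte.
raw = "<title>caf\xC3\xA9 \xFF menu</title>".b
clean = strip_invalid_utf8(raw)
```

The cleaned string could then be passed to `REXML::Document.new` as usual. Silently dropping bytes hides data corruption, which is presumably why the ticket proposes making this an opt-in encoding instead of changing the default UTF-8 behavior.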

Attachments

rexml_strip_invalid.diff (1.4 kB) - added by ser 16 months ago.
Holden's patch to modify REXML's default UTF-8 behavior

Change History

Changed 16 months ago by ser

Holden's patch to modify REXML's default UTF-8 behavior
