Ticket #109 (new enhancement)
Opened 16 months ago
Add ability to ignore bad UTF-8 encoded characters
| Reported by: | ser | Owned by: | ser |
|---|---|---|---|
| Priority: | low | Milestone: | 3.1.8 |
| Component: | DOM | Version: | 3.1.4 |
| Severity: | minor | Keywords: | |
| Cc: | | Ruby version: | 1.8.5 |
| Operating system: | Linux | | |
Description
Holden Karau's original email:
I've been doing some work parsing a lot of RSS feeds in the wild, and one thing which I've found useful is removing invalid UTF-8 characters. This patch is against rexml_3.1.7 and modifies encodings/ICONV.rb and encodings/UTF-8.rb. I've attached the patch, but in case it gets mangled you can grab it from my website at http://www.holdenkarau.com/~holden/projects/rexml/rexml_strip_invalid.diff. I'm not sure if this should go into REXML, but I think it would be useful for a lot of people.
Solution
Holden's solution is pretty close to what we want to do. Add a UTF-8/Ignore encoding. Users can then construct their own sources when they want to parse malformed XML documents:
s = Source.new( xml_file, "UTF-8/Ignore" )
d = Document.new(s)
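Until such an encoding exists, the same effect can be approximated by cleaning the input before handing it to the parser. The following is only a sketch, not Holden's patch and not REXML code: it assumes Ruby 2.1 or later (for String#scrub, which is not available on the 1.8.5 this ticket was filed against), and strip_invalid_utf8 is a hypothetical helper name.

```ruby
# Hypothetical helper, not part of REXML: drops byte sequences that are
# not valid UTF-8, roughly what an "ignore" encoding would do on input.
def strip_invalid_utf8(raw)
  # Work on a copy tagged as UTF-8 so the caller's string is left untouched.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # String#scrub (Ruby 2.1+) replaces each invalid sequence; "" deletes it.
  text.scrub("")
end

# Sample input: a valid two-byte character followed by a stray 0xFF byte.
dirty = "<title>caf\xC3\xA9 \xFF feed</title>".b
clean = strip_invalid_utf8(dirty)
puts clean.valid_encoding?
```

The cleaned string could then be passed to Document.new in place of the raw feed. Note that silently dropping bytes can hide a wrongly declared encoding, which is one reason to keep this behaviour opt-in rather than the default.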
