REXML Memory Usage

Attributes improvement

Error: Failed to load processor Timestamp
No macro or processor named 'Timestamp' found

I've been working on a branch that implements memory optimization by simplifying attributes. This code is still broken -- not all of the unit tests pass yet -- but the results so far ar encouraging.

The lowest two datasets are not on the same scale as the rest. The lowest two sets refer to the number of bytes of memory used per byte of parsed XML. The Y-axis units are, therefore, not KB, but b, and doesn't refer to total memory useage.

The short version is that memory use is halved. A 5MB XML file parsed by the trunk consumes over 450MB of memory, while the new branch consumes under 250. The Mem/File bars show that REXML averages about 100 bytes of memory used per byte of parsed XML for the trunk, and about 50b/b for the optimization.

If the bug caveat wasn't enough, let me scare you with a couple more issues:

  1. Not all the unit tests pass; not all of the API is supported (for example, element.attributes["attr"] << "Foo" doesn't do what you want it to. Temporary
  2. Parsing is slower. I don't know why, because there's no reason for it to be so, and I expect that -- along with getting the unit tests to pass -- I'll fix this, and we may actually end up with faster parsing. Temporary
  3. Accessing attributes is slower. This is because I'm not doing a lot of the pre-processing of attributes anymore. I used to break up the attribute prefix/name pairs and build a tree, which I'm not doing anymore -- now I just store the strings, and create the Attribute objects on demand. This not only reduces memory use for parsed trees, but it also should allow speed optimizations at parse time, since we're not creating a bunch of extra objects. Possibly permanent

That said, I have high hopes for this branch being merged with the main tree, and fairly soon at that. There isn't much work to do on it. Most of the unit tests already pass, and there are a small number of design decisions I have to make. Do I change the API to remove some convenience functions, or do I add some overhead to support those functions? I'm inclined to add the overhead, even though it'll mean either a performance hit, or a smaller improvement in the memory optimization.

Sourcecode at http://www.germane-software.com/repos/rexml/branches/memory_attributes


Notes about Ruby memory use

These are notes, not gospel. They may be inaccurate, or downright wrong. I reserve this place as a scratchpad for notes as I explore various aspects of Ruby's memory use, WRT REXML.

REXML is limited in both speed and memory use by the cost of Ruby objects.

The absolutely least-memory way we can store an XML tree (without resorting to un-Ruby-like constructions) is as an Array of Arrays. A tree, at its simplest, is a node with children; the children are going to be stored in an Array, or an Array-like structure (for the purposes of this discussion, it is assumed that Ruby's Array implementation is reasonably space efficient, and any alternative will be not significantly more efficient). A Ruby Array consumes about 99 bytes per instance. Therefore, a document with 1000 elements will consume at least 100,000 bytes, or 100kB for just the tree structure.(?)

A test with a trivial class:

class S ; end

shows that each instance consumes about 30 bytes.