Parsing Huge XML Files in Pentaho Kettle

Recently I was working with Amazon Data Feed files.  These files are about 1GB zipped up and about 15GB unzipped.  And they’re XML.  These files…  are HUGE!


For files of this magnitude, the traditional Get Data from XML step does not work, since that step requires Kettle to load the entire XML document into memory, which at 15GB will cause Kettle to freeze and crash.  However, there is an alternative step called XML Input Stream (StAX), which parses the XML as a stream instead.  This post will discuss how to configure that StAX step.

To start off, this is how the transformation looks:

[Image: overview of the transformation]

1. XML Input Stream (StAX):

The main fields to complete here indicate…

  • Where the data is coming from
  • Whether the data should be limited (limiting the data initially is helpful as you figure out the structure of the data)

This is what my configuration looks like:

[Image: StAX step configuration]

The options below those specify what kind of data you want extracted from the XML file.  StAX essentially parses the file node by node and converts every node/tag/attribute to a row.  This is what the data looks like in preview:

[Image: preview of the StAX output rows]

The level of detail here is extremely granular.  Every start element, end element, attribute, etc. is given its own row.  Therefore, when running the transformation on a full file, expect to see 100M+ rows being passed through.  Processing should still be fast, though.
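
To make the row-per-node idea concrete, here is a small made-up fragment (purely illustrative; it is not the actual structure of the Amazon feed):

  <Item>
    <ASIN>B000EXAMPLE</ASIN>
    <Title>Sample Product</Title>
  </Item>

The StAX step turns this into one row per event: a START_ELEMENT row for Item, a START_ELEMENT row for ASIN, a row carrying the text B000EXAMPLE in xml_data_value, an END_ELEMENT row for ASIN, and so on down the fragment.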

2. Filter

The second step, the filter step, filters out the START_ELEMENT and END_ELEMENT rows, thus reducing the amount of data Kettle needs to work on.  This works if you are not interested in the start/end element information.  You can always keep it; however, the subsequent JavaScript step depends on there being only one xml_data_name row for the unique attribute of each entity.

[Image: Filter Rows configuration]
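
For reference, the condition boils down to keeping every row whose event type is neither a start nor an end element.  In my copy of Kettle the StAX step writes the event type to a field called xml_data_type_description (double-check the field name in your own preview), so the filter is roughly:

  xml_data_type_description <> START_ELEMENT
  AND xml_data_type_description <> END_ELEMENT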

3. JavaScript

The transformation depends on having only one xml_data_name row for the unique attribute of each entity.  That unique attribute is used to increment a grouping variable, which differentiates one entity from the next.

Here’s my configuration for this (my explanation for this configuration is below the images):

[Image: JavaScript step start script]

[Image: JavaScript step main script]

First, I set up a start script.  What does this do?  It initializes a variable called “group” upon start-up of the step.  In this case, the group variable is being set to 0.
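
In case it helps, the entire start script can be this one line (the variable name is whatever you like; I use "group"):

  var group = 0;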

Second, in a standard JavaScript tab, I increment the “group” variable based on data passing through the JavaScript step.  Upon previewing the content of the XML, I noticed that the first unique attribute being passed in is called “ASIN”:

[Image: preview showing the ASIN attribute]

Therefore, I chose to increment my "group" variable whenever xml_data_name is equal to the value "ASIN".  Note that this could be different for other XML files, so always make sure to review your incoming data thoroughly to determine which value is unique; it may not always be the first one!
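
As a sketch, the main script then only needs the increment below.  Remember to add "group" to the output fields grid at the bottom of the JavaScript step so the value travels along with each row.  The field name xml_data_name comes from the StAX step, and "ASIN" is specific to this particular feed:

  // Start a new group every time the unique attribute appears again,
  // i.e. every time a new entity begins.
  if (xml_data_name == "ASIN") {
    group = group + 1;
  }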

4. Denorm

Using the newly created "group" variable, I can now denorm the data into a format that's easier to read.  In the denorm step (Kettle's Row Denormaliser), I pivot off data in the xml_path field and bring in data from the xml_data_value field.

[Image: Row Denormaliser configuration]

The denorm step provides the option to concatenate/group elements together, which makes it easier to combine elements that might have multiple values.
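
In Row Denormaliser terms, my setup amounts to roughly the following.  The key values and target field names here are hypothetical placeholders; take the real paths from the preview of your own file:

  Group field:  group
  Key field:    xml_path
  Target fields (one per value you want), for example:
    key value /Items/Item/ASIN   ->  target field asin,   value field xml_data_value
    key value /Items/Item/Title  ->  target field title,  value field xml_data_value

The concatenate/group option mentioned above is chosen per target field via the aggregation column.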

(Note: do NOT attempt to sort the data by the grouping variable, since it may cause Kettle to crash — sorting is another step that is run in memory, which is what we want to avoid doing with a data file this large.)

And with that, your XML should be starting to look like a really nice CSV:

[Image: final denormalized output]

To recap:

  1. Open the file in the StAX step — this will convert every aspect of the XML into a row.
  2. Filter out START_ELEMENT and END_ELEMENT.
  3. Figure out what the unique field is and use JavaScript to increment and create a variable for grouping.
  4. Use the grouping variable to denormalize the data of interest.

To get started, feel free to use my sample transformation.
