Parsing Huge XML Files in Pentaho Kettle

Recently I was working with Amazon Data Feed files.  These files are about 1GB zipped up and about 15GB unzipped.  And they’re XML.  These files…  are HUGE!


For files of this magnitude, the traditional Get Data from XML step does not work since that step requires Kettle to load the XML data in memory, which at 15GB will cause Kettle to freeze and crash.  However, there is an alternative step called XML Input Stream (StAX) which can be used to parse the XML data.  This post will discuss how to configure that StAX step.

To start off, this is how the transformation looks:

[Image: transformation overview (stax_overview)]

1. XML Input Stream (StAX):

The main fields to complete here indicate…

  • Where the data is coming from
  • Whether the data should be limited (limiting the data initially is helpful as you figure out the structure of the data)

This is what my configuration looks like:

[Image: XML Input Stream (StAX) step configuration (stax_main)]

The options below the ones mentioned specify what kind of data you want extracted from the XML file.  StAX essentially parses the file node-by-node and converts every node/tag/attribute to a row.  This is what the data looks like in preview:

[Image: preview of StAX output rows (stax_preview)]

The level of detail here is extremely granular.  Every start element, end element, attribute, etc. is given its own row.  Therefore, when running the transformation on a full file, expect to see 100M+ rows being passed through.  Processing should still be fast, though.
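To make the granularity concrete, here is a rough illustration of the kind of rows the StAX step emits (the field names come from the step's output; the element names and values are placeholders, not taken from the actual Amazon feed):

    xml_data_type_description   xml_path             xml_data_name   xml_data_value
    START_ELEMENT               /Items/Item          Item
    ATTRIBUTE                   /Items/Item          ASIN            B00EXAMPLE
    START_ELEMENT               /Items/Item/Title    Title
    CHARACTERS                  /Items/Item/Title    Title           Example Product Title
    END_ELEMENT                 /Items/Item/Title    Title
    END_ELEMENT                 /Items/Item          Item

One logical entity therefore fans out into many rows, which is why the row counts get so large.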

2. Filter

The second step, the filter step, filters out the START_ELEMENT and END_ELEMENT rows, thus reducing the amount of data Kettle needs to work on.  This works if you are not interested in the start/end element information.  You can always keep it; however, the subsequent JavaScript step depends on there being only one xml_data_name row for the unique attribute of each entity.
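For reference, the filter condition is built on the xml_data_type_description field that the StAX step outputs. A sketch of the Filter Rows condition, where rows that satisfy it continue down the "true" hop to the JavaScript step:

    xml_data_type_description <> START_ELEMENT
    AND
    xml_data_type_description <> END_ELEMENT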

[Image: Filter Rows configuration (stax_filter)]

3. JavaScript

The transformation is dependent on having only one xml_data_name for unique attributes.  These unique attributes are used to increment a grouping variable that is used to differentiate one entity from the next.

Here’s my configuration for this (my explanation for this configuration is below the images):

[Image: JavaScript start script (stax_start_script)]

[Image: JavaScript main script (stax_main_script)]

First, I set up a start script.  What does this do?  It initializes a variable called “group” upon start-up of the step.  In this case, the group variable is being set to 0.
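In the Start Script tab this is a one-liner; a minimal sketch:

    // runs once, before any rows are processed
    var group = 0;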

Second, in a standard JavaScript tab, I increment the “group” variable based on data passing through the JavaScript step.  Upon previewing the content of the XML, I noticed that the first unique attribute being passed in is called “ASIN”:

[Image: preview showing the ASIN attribute (stax_asin.png)]

Therefore, I chose to increment my “group” variable based on whether xml_data_name is equal to the value “ASIN”.  Note that this could be different for different XMLs, so always make sure to review your incoming data thoroughly to determine which value is unique, as it may not always be the first one!
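A minimal sketch of the main script, assuming (as above) that ASIN is the unique attribute; substitute whatever value is unique in your own file. The “group” variable is then added to the step’s output fields as an Integer so the following steps can see it:

    // runs for every row; bump the counter each time a new entity starts
    if (xml_data_name == "ASIN") {
      group = group + 1;
    }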

4. Denorm

Using the newly created “group” variable, I can now denorm the data into a format that’s easier to read.  In the denorm step, I pivot off data in the xml_path field and bring in data from the xml_data_value field.

[Image: Row Denormaliser configuration (stax_denorm)]

The denorm step provides the option to concatenate/group elements together, which makes things easier in terms of combining elements that might have multiple values.
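As a rough sketch (the key values and target field names below are placeholders; use the actual xml_path values you see in preview), the Row Denormaliser settings map onto the StAX fields like this:

    Key field:    xml_path
    Group field:  group

    Target fieldname   Value fieldname   Key value            Type     Aggregation
    Title              xml_data_value    /Items/Item/Title    String   Concatenate strings separated by ,
    Brand              xml_data_value    /Items/Item/Brand    String   Concatenate strings separated by ,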

(Note: do NOT attempt to sort the data by the grouping variable, since it may cause Kettle to crash — sorting is another step that is run in memory, which is what we want to avoid doing with a data file this large.)

And with that, your XML should be starting to look like a really nice CSV:

[Image: final denormalised output (stax_final.png)]

To recap:

  1. Open the file in the StAX step — this will convert every aspect of the XML into a row.
  2. Filter out START_ELEMENT and END_ELEMENT.
  3. Figure out what the unique field is and use JavaScript to increment and create a variable for grouping.
  4. Use the grouping variable to denorm the data of interest.

To get started, feel free to use my sample transformation.

12 thoughts on “Parsing Huge XML Files in Pentaho Kettle”

  1. I want to extract the full structure of a node, including its sub-children. How can I do it using the StAX parser? (My input file is 10 GB.)

  2. Thanks for writing this article, it’s very helpful. Although I do have a question… During the JavaScript stage you increment a variable when a certain “unique key” is found, and you even warn that it might not be the first value in the incoming data.

    In my case it’s not the first value, so the counter starts at 0 and only increments the first time the “unique key” is found, which is in the middle of the incoming data. All rows before the first key end up in “group 0” and the first key found becomes “group 1”. This consequently messes up the denormaliser, and the data after this step is wrong. Do you have any idea how to overcome this? Thanks

  3. I can’t see the file, it gives a 404 error.

    I’m trying to read and parse many XML files in Pentaho but I can’t 😦

  4. Hi morannachum,
    I have a large (8GB) XML file I want to transfer into my database. I tried your solution but I only manage to get one (the last) row from the XML into my tables.
    I thought it might have to do with the grouping but can’t figure it out.

    • Hi Andre, it’s possible that it is the grouping. Are you using a unique attribute for the grouping? Is it the first attribute for the data element?

      If not, it may be better to leverage the START_ELEMENT value for grouping (see my earlier response to Filipe). I was able to group using the START_ELEMENT value by switching around the Filter step and the JavaScript step and then leveraging the START_ELEMENT values for my grouping. Here is an example: https://morannachum.github.io/files/StAX%20Example%20Switched.zip
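      A minimal sketch of that variant, assuming the repeating element is called “Item” (substitute the element name that wraps each record in your own file):

          // count a new group each time the wrapping element opens
          if (xml_data_type_description == "START_ELEMENT" && xml_data_name == "Item") {
            group = group + 1;
          }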

    • Hi Rogerio,

      It seems that the file you’re trying to parse has a number of different pieces of information.

      When reviewing the file, I noted the following categories: balancete, fluxoCaixa, idadeSaldo, lucrosPrejuizos, saldoProvEventosSinistrosLiq, coberturaAssistencial, eventosIndenizaveis, agrupamentoContratos, corresponsabilidadeCedida, contraprestacoesPecuniarias, testeAdequacaoPassivo, modeloPadraoCapital, creditosDebitosCRC, fundosInvestimentosCRC.

      Each one of these categories structures the data within it slightly differently. The method in this article works best for large files where the data is the same.

      To resolve your situation, I would recommend trying to split the file by the above categories before parsing.

      Let me know if that works for you.

