Parsing Huge XML Files in Pentaho Kettle

Recently I was working with Amazon Data Feed files.  These files are about 1GB zipped up and about 15GB unzipped.  And they’re XML.  These files…  are HUGE!


For files of this magnitude, the traditional Get Data from XML step does not work since that step requires Kettle to load the XML data in memory, which at 15GB will cause Kettle to freeze and crash.  However, there is an alternative step called XML Input Stream (StAX) which can be used to parse the XML data.  This post will discuss how to configure that StAX step.

To start off, this is how the transformation looks:

[Image: transformation overview (stax_overview)]

1. XML Input Stream (StAX):

The main fields to complete here indicate…

  • Where the data is coming from
  • Whether the data should be limited (limiting the data initially is helpful as you figure out the structure of the data)

This is what my configuration looks like:

[Image: XML Input Stream (StAX) step configuration (stax_main)]

The options below the ones mentioned specify what kind of data you want extracted from the XML file.  StAX essentially parses the file node-by-node and converts every node/tag/attribute to a row.  This is what the data looks like in preview:

[Image: preview of StAX output rows (stax_preview)]

The level of detail here is extremely granular.  Every start element, end element, attribute, etc. is given its own row.  Therefore, when running the transformation on a full file, expect to see 100M+ rows being passed through.  Processing should still be fast, though.
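To make the granularity concrete, here is a rough illustration of the kind of rows the StAX step emits (the field names come from the step's output; the element names and values are placeholders, not taken from the actual Amazon feed):

    xml_data_type_description   xml_path             xml_data_name   xml_data_value
    START_ELEMENT               /Items/Item          Item
    ATTRIBUTE                   /Items/Item          ASIN            B00EXAMPLE
    START_ELEMENT               /Items/Item/Title    Title
    CHARACTERS                  /Items/Item/Title    Title           Example Product Title
    END_ELEMENT                 /Items/Item/Title    Title
    END_ELEMENT                 /Items/Item          Item

One logical entity therefore fans out into many rows, which is why the row counts get so large.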

2. Filter

The second step, the filter step, filters out the START_ELEMENT and END_ELEMENT rows, thus reducing the amount of data Kettle needs to work on.  This works if you are not interested in the start/end element information.  You can always keep it; however, the subsequent JavaScript step depends on there being only one xml_data_name row for the unique attribute of each entity.
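For reference, the filter condition is built on the xml_data_type_description field that the StAX step outputs. A sketch of the Filter Rows condition, where rows that satisfy it continue down the "true" hop to the JavaScript step:

    xml_data_type_description <> START_ELEMENT
    AND
    xml_data_type_description <> END_ELEMENT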

[Image: Filter Rows configuration (stax_filter)]

3. JavaScript

The transformation is dependent on having only one xml_data_name for unique attributes.  These unique attributes are used to increment a grouping variable that is used to differentiate one entity from the next.

Here’s my configuration for this (my explanation for this configuration is below the images):

[Image: JavaScript start script (stax_start_script)]

[Image: JavaScript main script (stax_main_script)]

First, I set up a start script.  What does this do?  It initializes a variable called “group” upon start-up of the step.  In this case, the group variable is being set to 0.
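In the Start Script tab this is a one-liner; a minimal sketch:

    // runs once, before any rows are processed
    var group = 0;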

Second, in a standard JavaScript tab, I increment the “group” variable based on data passing through the JavaScript step.  Upon previewing the content of the XML, I noticed that the first unique attribute being passed in is called “ASIN”:

[Image: preview showing the ASIN attribute (stax_asin.png)]

Therefore, I chose to increment my “group” variable based on whether xml_data_name is equal to the value “ASIN”.  Note that this could be different for different XMLs, so always make sure to review your incoming data thoroughly to determine which value is unique, as it may not always be the first one!
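A minimal sketch of the main script, assuming (as above) that ASIN is the unique attribute; substitute whatever value is unique in your own file. The “group” variable is then added to the step’s output fields as an Integer so the following steps can see it:

    // runs for every row; bump the counter each time a new entity starts
    if (xml_data_name == "ASIN") {
      group = group + 1;
    }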

4. Denorm

Using the newly created “group” variable, I can now denorm the data into a format that’s easier to read.  In the denorm step, I pivot off data in the xml_path field and bring in data from the xml_data_value field.

[Image: Row Denormaliser configuration (stax_denorm)]

The denorm step provides the option to concatenate/group elements together, which makes things easier in terms of combining elements that might have multiple values.
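As a rough sketch (the key values and target field names below are placeholders; use the actual xml_path values you see in preview), the Row Denormaliser settings map onto the StAX fields like this:

    Key field:    xml_path
    Group field:  group

    Target fieldname   Value fieldname   Key value            Type     Aggregation
    Title              xml_data_value    /Items/Item/Title    String   Concatenate strings separated by ,
    Brand              xml_data_value    /Items/Item/Brand    String   Concatenate strings separated by ,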

(Note: do NOT attempt to sort the data by the grouping variable, since it may cause Kettle to crash — sorting is another step that is run in memory, which is what we want to avoid doing with a data file this large.)

And with that, your XML should be starting to look like a really nice CSV:

[Image: final denormalised output (stax_final.png)]

To recap:

  1. Open the file in the StAX step — this will convert every aspect of the XML into a row.
  2. Filter out START_ELEMENT and END_ELEMENT.
  3. Figure out what the unique field is and use JavaScript to increment and create a variable for grouping.
  4. Use the grouping variable to denorm the data of interest.

To get started, feel free to use my sample transformation.

12 thoughts on “Parsing Huge XML Files in Pentaho Kettle”

  1. I want to extract the full structure of a node, including its sub-children. How can I do it using the StAX parser? (My input file is 10 GB.)

  2. Thanks for writing this article, it’s very helpful. Although I do have a question… During the JavaScript stage you increment a variable when a certain “unique key” is found, and you even warn that it might not be the first value in the incoming data.

    In my case it’s not the first value, so the counter starts at 0 and only increments the first time the “unique key” is found, which is in the middle of the incoming data. All rows before the first key end up in “group 0” and the first key found becomes “group 1”. This consequently messes up the denormaliser, and the data after this step is wrong. Do you have any idea how to overcome this? Thanks

  3. I can’t see the file, it gives a 404 error.

    I’m trying to read and parse many XML files in Pentaho but I can’t 😦

  4. Hi morannachum,
    I have a large (8GB) XML file I want to transfer into my database. I tried your solution but I only manage to get one (the last) row from the XML into my tables.
    I thought it might have to do with the grouping but can’t figure it out.

    • Hi Andre, it’s possible that it is the grouping. Are you using a unique attribute for the grouping? Is it the first attribute for the data element?

      If not, it may be better to leverage the START_ELEMENT value for grouping (see my earlier response to Filipe). I was able to group using the START_ELEMENT value by switching around the Filter step and the JavaScript step and then leveraging the START_ELEMENT values for my grouping. Here is an example: https://morannachum.github.io/files/StAX%20Example%20Switched.zip
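      A minimal sketch of that variant, assuming the repeating element is called “Item” (substitute the element name that wraps each record in your own file):

          // count a new group each time the wrapping element opens
          if (xml_data_type_description == "START_ELEMENT" && xml_data_name == "Item") {
            group = group + 1;
          }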

    • Hi Rogerio,

      It seems that the file you’re trying to parse has a number of different pieces of information.

      When reviewing the file, I noted the following categories: balancete, fluxoCaixa, idadeSaldo, lucrosPrejuizos, saldoProvEventosSinistrosLiq, coberturaAssistencial, eventosIndenizaveis, agrupamentoContratos, corresponsabilidadeCedida, contraprestacoesPecuniarias, testeAdequacaoPassivo, modeloPadraoCapital, creditosDebitosCRC, fundosInvestimentosCRC.

      Each one of these categories structures the data within it slightly differently. The method in this article works best for large files where the data is the same.

      To resolve your situation, I would recommend trying to split the file by the above categories before parsing.

      Let me know if that works for you.

