Unzipping Multiple Files in Pentaho Kettle

I was recently attempting to download a lot of data from a site using Pentaho Kettle. Downloading the files was simple with the setup from my previous post. However, once the files were downloaded, I realized they were all in .zip format. I needed to unzip them.

The process was fairly difficult for me and the web was of little to no help. Therefore, I thought I’d put together this post so that future users of Pentaho Kettle don’t have to go through the same headaches I did.

The step to unzip files is only available in Jobs (not transformations). The icon looks like this:

[Image: Unzip Step Icon]

This is what the step looks like when you double-click it (with some fields filled in by me):

[Image: Unzip Step Main Menu]

The step, as it is shown here, is configured to unzip multiple files. If you only have one file to unzip, then this process is much easier. Additionally, you’ll notice that I have set up a ${PATH} variable for my job that points to a directory on my computer.

The main fields to note here are:

  • Zip File Name: If you have just one file to unzip, you can treat this like any other file field in Pentaho Kettle: browse to the file on your computer and select it. If you have multiple files, specify the folder that contains the .zip files instead.
  • Source Wildcard (RegExp): Here you enter a regular expression that selects which files in the folder given in “Zip File Name” get unzipped. To select every file ending in .zip, use “.*\.zip”.
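
If you want to check a wildcard before putting it into the step, here is a minimal Python sketch that applies the same expression to a list of file names. The file names are made up, and I'm assuming the expression has to match the whole file name. That is my understanding of how Kettle applies it, but I haven't confirmed it for every version.

```python
import re

# The same expression entered in "Source Wildcard (RegExp)".
ZIP_PATTERN = re.compile(r".*\.zip")

def select_zip_files(file_names):
    """Return the names that the pattern matches in full (assumed behavior)."""
    return [name for name in file_names if ZIP_PATTERN.fullmatch(name)]

# Hypothetical folder contents, for illustration only.
names = ["data_01.zip", "data_02.zip", "readme.txt", "archive.zip.bak"]
print(select_zip_files(names))  # ['data_01.zip', 'data_02.zip']
```

The backslash matters here. Without it, the dot matches any character, so “.*.zip” would also accept a name like “datazip”.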

After specifying where the zipped files reside, you need to specify where you would like them unzipped. Enter your target directory in the “Target Directory” field.

Additionally, Pentaho Kettle offers some other options, such as pulling only specific items from the zipped file (if it contains multiple items) or excluding certain items. You’ll see in my example screen capture that I’m only pulling files ending in .txt from the zipped files I’m unzipping.

Lastly, in my setup of the step, I elected to overwrite any existing documents, since I intended to run and re-run the step a couple of times (for troubleshooting).
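
For anyone who wants to see the same logic outside of Kettle, here is a rough Python sketch of what my configuration amounts to: go through every .zip file in a folder, pull out only the .txt entries, and overwrite anything already in the target directory. This is not how Kettle implements the step. The function name `unzip_all` and its parameters are my own, and I'm again assuming the patterns must match the whole name.

```python
import re
import zipfile
from pathlib import Path

def unzip_all(source_dir, target_dir,
              zip_pattern=r".*\.zip", include_pattern=r".*\.txt"):
    """Extract matching entries from every matching archive in source_dir.

    Returns the names of the extracted entries.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    extracted = []
    for archive in sorted(source.iterdir()):
        # Only look at files whose names match the zip wildcard.
        if not (archive.is_file() and re.fullmatch(zip_pattern, archive.name)):
            continue
        with zipfile.ZipFile(archive) as zf:
            for entry in zf.namelist():
                if entry.endswith("/"):
                    continue  # skip directory entries
                if re.fullmatch(include_pattern, entry):
                    # extract() replaces an existing file of the same name,
                    # so re-running the function is safe.
                    zf.extract(entry, target)
                    extracted.append(entry)
    return extracted
```

You would call it with your own folders, for example `unzip_all("downloads", "unzipped")`, where both folder names are placeholders.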

That’s about it. I realize it looks simple, but it took me about an hour to get this hammered out. It seemed like no one else at my company had had a chance to unzip multiple files at once in the past.

Feel free to download my example as a starting point.
