Running Scripts in Pentaho Kettle, the Sequel

Surprise!  One of this blog’s most successful posts is about how to run scripts in Pentaho Kettle.

Confession: I wrote that post a long time ago (in fact, it was one of my very first posts about Pentaho Kettle).  Since then, I’ve learned a lot more about Kettle and about running scripts in it.  Therefore, I wanted to offer a refresher on the original post and a new recommendation for a better way to run scripts in Kettle.

More recently, I’ve been running scripts in Kettle like this:

[Screenshot: scripts_general]

What’s different?

  1. The “Insert Script” option is checked, meaning that the second tab, “Script”, is now available for us to fill in.  This tab acts like our Terminal in Kettle: anything you can run in Terminal, you can execute in the Script tab, but I’ll get more into that later.
  2. The “Working Directory” is now an environment variable.  This is an improvement over our previous configuration, since it allows for greater transferability of the Kettle job from one person to another.

On the “Script” tab, this is my configuration:

[Screenshot: scripts_script]

In here, I’m using environment variables to specify my output directories, which makes the job easier to hand off when exchanging it with others.  Additionally, I am not relying on the machine’s system Python, but rather on a version of Python specific to a virtualenv.  This, again, better ensures that when I transfer my job to other people, they are able to recreate the virtual environment I’ve created on my machine and run the job without a problem.
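
For illustration, the contents of my Script tab follow this pattern (the script name and the ${OUTPUT_DIR} variable here are hypothetical stand-ins for my actual values):

# Kettle executes these lines as if typed into Terminal from the working directory.
# Use the virtualenv's Python rather than the system installation.
example/bin/python my_script.py -o ${OUTPUT_DIR}/output.json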

In Practice

Let’s say I wrote a script that:

  1. Pings an API
  2. Places the returned data in a JSON file

The script takes in two inputs: a link to ping the API and an output filename where the returned JSON will be placed.  This is my script:

import requests
import argparse
import json
import datetime
import sys

# Writes to a JSON file.
# Input: filename/location string, data in the form of an array of
# dictionaries
###################################################################
def toJSON(filename, data):
    with open(filename, 'w') as outfile:
        json.dump(data, outfile)

# Calls a given link.
# Input: API link to ping
###################################################################
def callAPI(link):
    response = requests.get(link)
    return response

# Parses incoming command-line arguments.
######################################################################
def commandLineSetup():
    commandParser = argparse.ArgumentParser(
        description="Pings an API link for data and outputs data to a JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")

    args = commandParser.parse_args()

    return args.link, args.outputfile

######################################################################
# MAIN
######################################################################

def main():
    LINK, OUTPUT = commandLineSetup()

    # Check that proper inputs were provided
    if not LINK or not OUTPUT:
        print(str(datetime.datetime.now()) + " - Insufficient inputs provided")
        sys.exit(1)

    print(str(datetime.datetime.now()) + " - Calling link %s" % LINK)
    response = callAPI(LINK)
    print(str(datetime.datetime.now()) + " - Outputting to file %s" % OUTPUT)
    toJSON(OUTPUT, response.json())
    print(str(datetime.datetime.now()) + " - Done!")

if __name__ == "__main__":
    main()

Notice that my script relies on requests, which is not part of the Python standard library (argparse, on the other hand, has shipped with Python since version 2.7).  I use requests to ping the API to retrieve data and argparse to parse passed-in information from the command line.  To accommodate this dependency, I create a virtual environment called “example”, which has a requirements.txt file.
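
For reference, the requirements.txt for this script can be a single line, and the environment setup (on macOS/Linux, assuming virtualenv is installed) might look something like this:

# requirements.txt
requests

# create the "example" virtualenv and install the dependencies into it
virtualenv example
example/bin/pip install -r requirements.txt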

Once my virtualenv is configured, I can test out my Python script in my terminal window by running a command from within the working directory:

[Screenshot: scripts_terminal]

My working directory in this case is ~/Desktop/Personal/Sites/Blog/Projects.  This is also where I have my job saved:

[Screenshot: scripts_job]

Therefore, when configuring my script for execution within Kettle, I can use the variable ${Internal.Job.Filename.Directory} to specify my working directory and enter the same command as I did in Terminal; everything will then execute just as it did there:

[Screenshots: scripts_job_general, scripts_job_script]

To check out my example job, please download this file (make sure to create the virtual environment before attempting to run the job; name the virtual environment “example”).

Recap

When executing scripts in Kettle, it is better to use the “Insert script” option, since it allows for:

  • Better job transferability
  • Easier use of virtual environments
  • Integration of Kettle environment variables

I hope you find this useful!

Parsing Huge XML Files in Pentaho Kettle

Recently I was working with Amazon Data Feed files.  These files are about 1GB zipped up and about 15GB unzipped.  And they’re XML.  These files…  are HUGE!


For files of this magnitude, the traditional Get Data from XML step does not work since that step requires Kettle to load the XML data in memory, which at 15GB will cause Kettle to freeze and crash.  However, there is an alternative step called XML Input Stream (StAX) which can be used to parse the XML data.  This post will discuss how to configure that StAX step.

To start off, this is how the transformation looks:

[Screenshot: stax_overview]

1. XML Input Stream (StAX):

The main fields to complete here indicate…

  • Where the data is coming from
  • Whether the data should be limited (limiting the data initially is helpful as you figure out the structure of the data)

This is what my configuration looks like:

[Screenshot: stax_main]

The options below the ones mentioned specify what kind of data you want extracted from the XML file.  StAX essentially parses the file node-by-node and converts every node/tag/attribute to a row.  This is what the data looks like in preview:

[Screenshot: stax_preview]

The level of detail here is extremely granular.  Every start element, end element, attribute, etc. is given its own row.  Therefore, when running the transformation on a full file, expect to see 100M+ rows being passed through.  Processing should still be fast, though.
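
For a sense of what that granularity looks like, a few consecutive rows for a single element might look roughly like this (the field names are the step’s output fields; the values are illustrative, not real feed data):

xml_data_type_description    xml_data_name    xml_data_value
START_ELEMENT                ASIN
CHARACTERS                   ASIN             B00EXAMPLE1
END_ELEMENT                  ASIN
START_ELEMENT                Title
CHARACTERS                   Title            Some Product Name
END_ELEMENT                  Title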

2. Filter

The second step, the filter step, removes the rows where xml_data_type_description is START_ELEMENT or END_ELEMENT, thus reducing the amount of data Kettle needs to work on.  This works if you are not interested in the start/end element information.  You can always keep those rows; however, the subsequent JavaScript step depends on each entity producing only one row whose xml_data_name matches its unique attribute.

[Screenshot: stax_filter]

3. JavaScript

The transformation depends on each entity producing only one row whose xml_data_name matches its unique attribute.  That unique attribute is used to increment a grouping variable, which differentiates one entity from the next.

Here’s my configuration for this (my explanation for this configuration is below the images):

[Screenshot: stax_start_script]

[Screenshot: stax_main_script]

First, I set up a start script.  What does this do?  It initializes a variable called “group” upon start-up of the step.  In this case, the group variable is being set to 0.

Second, in a standard JavaScript tab, I increment the “group” variable based on data passing through the JavaScript step.  Upon previewing the content of the XML, I noticed that the first unique attribute being passed in is called “ASIN”:

[Screenshot: stax_asin]

Therefore, I chose to increment my “group” variable on whether xml_data_name is equal to the value “ASIN”.  Note that this could be different for different XMLs, so always make sure to review your incoming data thoroughly to determine which value is unique — it may not always be the first one!
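
Putting that together, the two scripts amount to something like the following sketch (remember to add “group” as an output field of the JavaScript step so downstream steps can see it):

// Start script: runs once, before any rows arrive.
var group = 0;

// Main script: runs once per row; each "ASIN" row marks the start of a new entity.
if (xml_data_name == "ASIN") {
    group = group + 1;
}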

4. Denorm

Using the newly created “group” variable, I can now denorm the data into a format that’s easier to read.  In the denorm step, I pivot off data in the xml_path field and bring in data from the xml_data_value field.

[Screenshot: stax_denorm]

The denorm step provides the option to concatenate/group elements together, which makes things easier in terms of combining elements that might have multiple values.
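
In sketch form, my configuration boils down to: group field = group, key field = xml_path, value field = xml_data_value, with a target-fields grid along these lines (the key values below are hypothetical paths, not the actual Amazon feed paths):

Target fieldname    Value fieldname    Key value            Aggregation
asin                xml_data_value     /Product/ASIN        -
title               xml_data_value     /Product/Title       -
category            xml_data_value     /Product/Category    Concatenate strings separated by ,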

(Note: do NOT attempt to sort the data by the grouping variable, since it may cause Kettle to crash — sorting is another step that is run in memory, which is what we want to avoid doing with a data file this large.)

And with that, your XML should be starting to look like a really nice CSV:

[Screenshot: stax_final]

To recap:

  1. Open the file in the StAX step — this will convert every aspect of the XML into a row.
  2. Filter out START_ELEMENT and END_ELEMENT.
  3. Figure out what the unique field is and use JavaScript to increment and create a variable for grouping.
  4. Use the grouping variable to denorm the data of interest.

To get started, feel free to use my sample transformation.

Unzipping Multiple Files in Pentaho Kettle

I was recently attempting to download a lot of data from a site using Pentaho Kettle. Downloading the files was simple with the setup from my previous post. However, once the files were downloaded, I realized they were all in .zip format. I needed to unzip them.

The process was fairly difficult for me and the web was of little to no help. Therefore, I thought I’d put together this post so that future users of Pentaho Kettle don’t have to go through the same headaches I did.

The step to unzip files is only available in Jobs (not transformations). The icon looks like this:

[Image: Unzip step icon]

This is what the step looks like when you double-click it (with some fields filled in by me):

[Screenshot: Unzip step main menu]

The step, as it is shown here, is configured to unzip multiple files. If you only have one file to unzip, then this process is much easier. Additionally, you’ll notice that I have set up a ${PATH} variable for my job that points to a directory on my computer.

The main fields to note here are:

  • Zip File Name: If you have just one file to unzip, you can treat this like any other directory field in Pentaho Kettle and simply browse to the file location on your computer and select it. If you have multiple files, you’ll want to specify the folder in which the .zip files are present.
  • Source Wildcard (RegExp): In here, you specify that you want all files within the folder specified in “Zip File Name” that end in .zip to be unzipped. Use “.*\.zip” to do so (a quick check of this pattern is sketched below).
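
If you want to sanity-check the pattern, a quick test (here in Python; Kettle itself applies Java regular expressions to each file name in the folder) behaves like this:

import re

# ".*\.zip" matches any file name ending in ".zip" (case-sensitive)
pattern = re.compile(r".*\.zip")
for name in ["data1.zip", "archive.ZIP", "readme.txt"]:
    print(name, bool(pattern.fullmatch(name)))
# data1.zip True, archive.ZIP False, readme.txt False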

After specifying where the zipped files reside, you need to specify where you would like them unzipped. Specify your target directory in the “Target Directory” field.

Additionally, Pentaho Kettle offers you some other options, such as whether you only want to pull specific items from the zipped file (if there are multiple items in it) or if you want to exclude certain items. You’ll see in my example screen capture that I’m only pulling files ending in a .txt from the zipped files I’m unzipping.

Lastly, in my setup of the step, I elected to overwrite any existing documents, since I intended to run and re-run the step a couple of times (for troubleshooting).

That’s about it. I realize it looks simple, but it took me about an hour to get this hammered out. It seemed like no one else at my company had had a chance to unzip multiple files at once in the past.

Feel free to download my example as a starting point.

Downloading Files in Pentaho Kettle

For one of my transformations in Pentaho Kettle, I wanted to download data from an internet source. The source did not have an API, so I could not use a REST step to call on it. Instead, the source had a link that connected to a CSV that could then be downloaded in the browser. I wanted to download the data from this URL and automatically place it on my computer so that I could manipulate it as necessary. To do this, I used an HTTP step. This step is only available in Jobs (not in Transformations). The icon looks like this:

[Image: HTTP step icon]

This is what the menu looks like when you double-click on the step:

[Screenshot: HTTP step menu]

The main fields to look at here are the “URL” field and the “Target file” field.

  • URL: This field is where you specify the URL to which you want Kettle to connect in order to access the file you would like to download.
  • Target file: This field is where you specify where you want Kettle to store the file once it’s been downloaded. It’s best to use variable locations in order to make sure that you can run this from any computer without running into issues with directory setup and names.

This is what my completed step looked like:

[Screenshot: HTTP Single File Complete]

You can download the Job I set up here. Feel free to manipulate the setup in order for it to suit your needs.

You’ll notice that Kettle also provides options for authentication (in case there is a login process involved in accessing the file) and for uploading a file. I did not experiment with either of these options, but I would be interested in hearing from those who have.

As you can imagine, having access to this step can be fairly useful. Now, what happens if we want to download multiple files at once? That’s where things get a bit more complicated. To see my solution to that issue, you can download this Job. The solution here is not as clear-cut as when we download one file from a single source. Here’s a step-by-step of how I set up the multi-file download job:

  1. START: Naturally, I start the job with a START step.
  2. PATH: Next, I use the “Set Variables” step to set my PATH variable. This is the variable I will use to specify the directory in which I would like to save the files once I download them. (Note: I could have also used this same process of assigning a PATH variable in the last job, but instead elected to specify the path directly in the HTTP step. This step is not strictly necessary, but it is a good habit. Use a static directory here, since path variables get a bit murky when passed from one job to another. The original setup I used – C:\temp – may not work on your machine, so make sure to change it as necessary before running my job.)
  3. FILES: Here I used a “Transformation” step to call on a transformation named files.ktr that I set up in conjunction with this job. That transformation identifies what to name the file once it’s downloaded and the URL source of the file. Here’s that transformation step-by-step:
    • Data Grid: I set up a data grid with the columns filename and url. Under the Data tab, I specify the data for these columns (sample rows are sketched just after this list):
      [Screenshot: Files]
    • Copy Rows to Results: Next I copy the created rows to the job using the “Copy Rows to Results” step. Note that between this step and the preceding step you can use JavaScript or some other combination of steps to add uniqueness to your filenames. I’ll leave that up to you to figure out.
  4. DOWNLOAD: Next I reference a job using the step “Job”. This job is structured similarly to our previous single-file download job, but instead of referencing constants, we now reference the variables we set up. This is where all the magic happens, so note the changes we made:
    • While still configuring the job setup in the MAIN job, we specify under the “Advanced” tab to “Copy the results to parameters” and also to “Execute for every input row”.
      [Screenshot: Advanced Job Configuration]
    • Still in the setup of the DOWNLOAD job in the MAIN job, we also specify that we are passing parameters:
      [Screenshot: Parameters]
    • Now in the DOWNLOAD job, we go to Edit > Settings and specify under the Parameters tab for the job to anticipate and utilize the parameters FILENAME and URL.
      [Screenshot: Parameters]
    • Lastly, in the HTTP step in the DOWNLOAD job, we use the parameter names instead of the constants we used in the single-file download.
      [Screenshot: HTTP Variables]
  5. SUCCESS: We connect to the Success step to mark the completion of the job.
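
To make step 3 concrete, the rows in the Data Grid might look something like this (the file names and URLs are placeholders, not my actual sources):

filename       url
report1.csv    http://example.com/files/report1.csv
report2.csv    http://example.com/files/report2.csv

With “Execute for every input row” checked, the DOWNLOAD job then runs once per row, receiving that row’s values through the FILENAME and URL parameters.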

And that’s it! A bit more complicated, but definitely handy for downloading batches of files.

Big thanks to Raffael Volger who initially posted on this.

Running Scripts in Pentaho Kettle

WAIT! Before you read this, please know that I’ve published an update to this article here.  I think the update is much more useful, so maybe read the update and then come back?  Or don’t come back at all?  Whatever you prefer!

Recently, I found myself needing to run a Python script from inside Pentaho Kettle.  I wanted to convert a CSV file into JSON format and found that Kettle was running for an extremely long time just to complete this simple process.  Since I already had a solid Python script in place to complete this task, I decided I could use that instead of relying on the traditional Kettle steps.

To run an external script in Kettle, you need to use the Shell step.  This step is currently (as of Pentaho Kettle 5.2.0.0) only available in Jobs – not Transformations.  The step icon looks like this:

[Image: Shell Command icon]

When you double-click on the step, you will encounter a menu like this:

[Screenshot: Shell step configuration menu]

The main items you should be concerned with are the fields “Script file name” and “Working directory”.

  • Script file name: This is the name of the script you would like to run. Alternatively, if you would like to input custom code, you can use the Script tab to do so. In my case, I had a saved .py script that I could use. I specified that script in this field. Make sure to include the file location in the name.
  • Working directory: This is the directory in which the Python executable can be found. In my case, this is a folder on the C drive. You do not need to specify the exact executable – just the folder in which it’s present. (See the example configuration after this list.)
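
For instance, the two fields might be filled in like this (the paths are hypothetical; adjust them to wherever your script and Python installation live):

Script file name:  C:\scripts\csv_to_json.py
Working directory: C:\Python27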

And that’s it! This is how my final configured step looked:

[Screenshot: completed Shell step]

Since all my scripting was in the included Python file, I did not rely on the Script tab for anything. When a file is specified, Kettle automatically fills the Script tab with the command “python [file name]”.

As can be seen from the screenshots, Kettle also provides options for logging. If your script outputs information about its progress, timing, or anything else, you can store that information in a particular file.

Kettle also offers specific options for iterating scripts over every row.

I did not experiment with either the logging option or the iteration option, but would be interested in hearing from anybody that might have. Happy scripting!

Pentaho Kettle MySQL Connection Errors

It’s been a minute.

I’ve been working on a number of personal projects recently and have not taken the time to document my development. I’ve learned a ton about everything from JavaScript, to the ETL software Pentaho Kettle, to MySQL and Python, and I feel like now might be a good time to recap some of the issues I ran into during development, along with the solutions and workarounds I found or devised. More than anything, I hope this helps others who run into similar issues (and possibly helps me too, if I ever run into them again).

The first issue I would like to recount is trouble connecting to my GoDaddy MySQL server via Pentaho Kettle. For the longest time, this was an absolute struggle for me, with no explanation in sight. I had set up my MySQL server on my hosting service, GoDaddy, without any issue and had been able to connect successfully via HeidiSQL. However, when I attempted to set up my MySQL connection in Pentaho Kettle, I kept getting this absolutely horrendous error message:


Error connecting to database [devdb2] : org.pentaho.di.core.exception.KettleDatabaseException: 
Error occured while trying to connect to the database

Exception while loading class
org.gjt.mm.mysql.Driver


org.pentaho.di.core.exception.KettleDatabaseException: 
Error occured while trying to connect to the database

Exception while loading class
org.gjt.mm.mysql.Driver


    at org.pentaho.di.core.database.Database.normalConnect(Database.java:368)
    at org.pentaho.di.core.database.Database.connect(Database.java:317)
    at org.pentaho.di.core.database.Database.connect(Database.java:279)
    at org.pentaho.di.core.database.Database.connect(Database.java:269)
    at org.pentaho.di.core.database.DatabaseFactory.getConnectionTestReport(DatabaseFactory.java:86)
    at org.pentaho.di.core.database.DatabaseMeta.testConnection(DatabaseMeta.java:2464)
    at org.pentaho.ui.database.event.DataHandler.testDatabaseConnection(DataHandler.java:533)
    at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDomContainer.java:329)
    at org.pentaho.ui.xul.impl.AbstractXulComponent.invoke(AbstractXulComponent.java:139)
    at org.pentaho.ui.xul.impl.AbstractXulComponent.invoke(AbstractXulComponent.java:123)
    at org.pentaho.ui.xul.swt.tags.SwtButton.access$500(SwtButton.java:26)
    at org.pentaho.ui.xul.swt.tags.SwtButton$4.widgetSelected(SwtButton.java:119)
    at org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown Source)
    at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
    at org.eclipse.jface.window.Window.runEventLoop(Window.java:820)
    at org.eclipse.jface.window.Window.open(Window.java:796)
    at org.pentaho.ui.xul.swt.tags.SwtDialog.show(SwtDialog.java:378)
    at org.pentaho.ui.xul.swt.tags.SwtDialog.show(SwtDialog.java:304)
    at org.pentaho.di.ui.core.database.dialog.XulDatabaseDialog.open(XulDatabaseDialog.java:115)
    at org.pentaho.di.ui.core.database.dialog.DatabaseDialog.open(DatabaseDialog.java:62)
    at org.pentaho.di.ui.spoon.delegates.SpoonDBDelegate.newConnection(SpoonDBDelegate.java:493)
    at org.pentaho.di.ui.spoon.delegates.SpoonDBDelegate.newConnection(SpoonDBDelegate.java:478)
    at org.pentaho.di.ui.spoon.Spoon.newConnection(Spoon.java:7770)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDomContainer.java:329)
    at org.pentaho.ui.xul.impl.AbstractXulComponent.invoke(AbstractXulComponent.java:139)
    at org.pentaho.ui.xul.impl.AbstractXulComponent.invoke(AbstractXulComponent.java:123)
    at org.pentaho.ui.xul.swt.tags.SwtMenuitem.access$100(SwtMenuitem.java:27)
    at org.pentaho.ui.xul.swt.tags.SwtMenuitem$1.widgetSelected(SwtMenuitem.java:77)
    at org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown Source)
    at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
    at org.pentaho.di.ui.spoon.Spoon.readAndDispatch(Spoon.java:1183)
    at org.pentaho.di.ui.spoon.Spoon.start(Spoon.java:6966)
    at org.pentaho.di.ui.spoon.Spoon.main(Spoon.java:567)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.pentaho.commons.launcher.Launcher.main(Launcher.java:134)
Caused by: org.pentaho.di.core.exception.KettleDatabaseException: 
Exception while loading class
org.gjt.mm.mysql.Driver

    at org.pentaho.di.core.database.Database.connectUsingClass(Database.java:423)
    at org.pentaho.di.core.database.Database.normalConnect(Database.java:352)
    ... 50 more
Caused by: java.lang.ClassNotFoundException: org.gjt.mm.mysql.Driver
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at org.pentaho.di.core.database.Database.connectUsingClass(Database.java:414)
    ... 51 more

I still shudder looking at this.

Fortunately, the solution to this problem is not a complicated one. I found the answer in this post from StackOverflow. The connection fails because Pentaho Kettle is missing the MySQL Java connector library. This is how I solved the issue on my computer:

  1. Download the most recent version of the MySQL Java Connector. (I needed to create an Oracle MySQL account in order to download this driver. To get the .zip format, I selected the Platform Independent option from the drop-down menu.)
  2. Extract the files from the downloaded .zip file and find the one labeled mysql-connector-java-5.1.34-bin.jar.
  3. Place this file into your Pentaho Kettle lib folder. The path should be something like C:\Program Files\pentaho\kettle\data-integration\lib.
  4. Restart Pentaho Kettle.
