Running Scripts in Pentaho Kettle, the Sequel

Surprise!  One of this blog’s most successful posts is about how to run scripts in Pentaho Kettle.

Confession: I wrote that post a long time ago (in fact, it was one of my very first posts about Pentaho Kettle).  And since then, I’ve learned a lot more about Kettle and about running scripts in Kettle.  Therefore, I wanted to offer a refresher to the original post and a new recommendation on how to better run scripts in Kettle.

More recently, I’ve been running scripts in Kettle like this:

scripts_general

What’s different?

  1. The “Insert Script” option is checked, which makes the second tab, “Script”, available for us to fill in.  This tab acts like our Terminal in Kettle: anything that you can run in Terminal, you can execute in the Script tab.  I’ll get more into that later.
  2. The “Working Directory” is now an environment variable.  This is an improvement over our previous configuration, since it makes the Kettle job easier to transfer from one person to another.

On the “Script” tab, this is my configuration:

scripts_script

In here, I’m using environment variables to specify my output directories, which makes the job easier to transfer when exchanging it with other people.  Additionally, I am not relying on the machine’s version of Python, but rather on a version of Python specific to a virtualenv.  This, again, better ensures that when I transfer my job to other people, they can recreate the virtual environment I’ve created on my machine and run the job without a problem.

In Practice

Let’s say I wrote a script that:

  1. Pings an API
  2. Places the returned data in a JSON file

The script takes in two inputs: a link to ping the API and an output filename where the returned JSON will be placed.  This is my script:

import requests
import argparse
import json
import datetime

# Writes to a JSON file.
# Input: filename/location string, data in the form of array of
# dictionaries
###################################################################
def toJSON(filename, data):
    with open(filename, 'w') as outfile:
        json.dump(data, outfile)

# Call a given link.
# Input: API link to ping
###################################################################
def callAPI(link):
    response = requests.get(link)
    return response

# Parses incoming information
######################################################################
def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                        "and outputs data to JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")

    args = commandParser.parse_args()

    return args.link, args.outputfile

######################################################################
# MAIN
######################################################################

def main():
    LINK, OUTPUT = commandLineSetup()

    # Check that proper inputs were provided
    if not LINK or not OUTPUT:
        print(str(datetime.datetime.now()) + " - Insufficient inputs provided")
        exit()

    # The parenthesized print calls below run under both Python 2 and Python 3
    print(str(datetime.datetime.now()) + " - Calling link %s" % LINK)
    response = callAPI(LINK)
    print(str(datetime.datetime.now()) + " - Outputting to file %s" % OUTPUT)
    toJSON(OUTPUT, response.json())
    print(str(datetime.datetime.now()) + " - Done!")

if __name__ == "__main__":
    main()
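
As a possible refinement (not part of the original script), argparse can enforce the two inputs itself, which would make the manual “Insufficient inputs” check unnecessary.  The sketch below assumes the same two options as above; the helper name build_parser, the URL, and the filename are placeholders of mine.

```python
import argparse


def build_parser():
    """Build a parser with the same two options as the script above."""
    parser = argparse.ArgumentParser(
        description="Pings an API link for data and outputs data to JSON file"
    )
    # required=True makes argparse print a usage message and exit with
    # status 2 when an option is missing, so no manual check is needed.
    parser.add_argument("-l", "--link", required=True,
                        help="API link to ping for information")
    parser.add_argument("-o", "--outputfile", required=True,
                        help="Name/Path of the JSON output file")
    return parser


# Passing a list to parse_args() lets us try the parser without a shell.
# The URL and filename are placeholders, not real endpoints or paths.
args = build_parser().parse_args(
    ["-l", "https://example.com/api", "-o", "output.json"]
)
print("%s -> %s" % (args.link, args.outputfile))
```

The trade-off is that a missing input now ends the run with argparse’s own usage message rather than the timestamped log line used in the script.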

Notice that my script relies on the requests and argparse packages.  requests is not part of Python’s standard library, and argparse only ships with Python from versions 2.7 and 3.2 onward, so on older interpreters it must be installed separately.  I use requests to ping the API to retrieve data and argparse to parse passed-in information from the command line.  To accommodate these modules, I create a virtual environment called “example”, which has a requirements.txt file.
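
When handing a job to someone else, it can help to confirm that the script really runs under the virtual environment’s interpreter and not the machine’s Python.  Here is a minimal sketch of such a check; the function name in_virtualenv is my own, and it is not part of the script above.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("Interpreter: %s" % sys.executable)
print("Inside a virtual environment: %s" % in_virtualenv())
```

Run with the Python binary inside the “example” environment, the interpreter path printed should point into that environment’s folder.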

Once my virtualenv is configured, I can test the Python script inside it by running a command in my terminal window from within the working directory:

scripts_terminal

My working directory in this case is ~/Desktop/Personal/Sites/Blog/Projects.  This is also where I have my job saved:

scripts_job

Therefore, when configuring my script for execution within Kettle, I can use the variable ${Internal.Job.Filename.Directory} to specify my working directory and enter the same command as I did in Terminal.  Everything then executes just as it did in Terminal:

scripts_job_general

scripts_job_script
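
As a rough analogy only (this is not how Kettle is implemented), setting a working directory for a command behaves like passing cwd to Python’s subprocess module.  The sketch below uses a temporary folder as a stand-in for the job’s directory, and the helper name run_in_directory is hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_in_directory(command, working_dir):
    """Run command with working_dir as its current directory; return its stdout."""
    # check_output raises CalledProcessError on a non-zero exit status.
    return subprocess.check_output(command, cwd=working_dir,
                                   universal_newlines=True)


# Stand-in for the job's folder; in Kettle this would be whatever
# ${Internal.Job.Filename.Directory} resolves to.
job_dir = os.path.realpath(tempfile.mkdtemp())

# Ask a child Python process which directory it is running from.
output = run_in_directory(
    [sys.executable, "-c", "import os; print(os.getcwd())"], job_dir
)
print(output.strip())
```

Because the child process starts in that folder, any relative paths in the command resolve against it, which is why the same command works both in Terminal and in the Script tab.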

To check out my example transformation, please download this file (make sure to create the virtual environment before attempting to run the job; name the virtual environment “example”).

Recap

When executing scripts in Kettle, it is better to use the “Insert script” option, since it allows for:

  • Better job transferability
  • Easier use of virtual environments
  • Integration of Kettle environment variables

I hope you find this useful!
