Running Scripts in Pentaho Kettle, the Sequel

Surprise!  One of this blog’s most successful posts is about how to run scripts in Pentaho Kettle.

Confession: I wrote that post a long time ago (in fact, it was one of my very first posts about Pentaho Kettle).  Since then, I’ve learned a lot more about Kettle and about running scripts in it.  Therefore, I wanted to offer a refresher on the original post and a new recommendation for a better way to run scripts in Kettle.

More recently, I’ve been running scripts in Kettle like this:

[Image: scripts_general]

What’s different?

  1. The “Insert Script” option is checked, meaning that the second tab, “Script”, is now available for us to fill in.  This tab acts like our Terminal within Kettle: anything that you can run in Terminal, you can execute in the Script tab, but I’ll get more into that later.
  2. The “Working Directory” is now an environment variable.  This is an improvement over our previous configuration, since it makes the Kettle job easier to transfer from one person to another.

On the “Script” tab, this is my configuration:

[Image: scripts_script]

Here, I’m using environment variables to specify my output directories, which makes jobs easier to share with other people.  Additionally, I am not relying on the machine’s version of Python, but rather on a version of Python specific to a virtualenv.  This, again, better ensures that when I transfer my job to other people, they can recreate the virtual environment I’ve created on my machine and run the job without a problem.
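As a rough sketch of what such a Script tab entry can look like (the script name, variable name, and URL below are illustrative, not the exact values from my job):

```shell
# Run the script with the virtualenv's own interpreter; Kettle
# substitutes ${OUTPUT_DIR} before the shell executes the line.
example/bin/python call_api.py -l "https://api.example.com/data" -o ${OUTPUT_DIR}/output.json
```

Because the interpreter path is relative to the working directory, the same line works on any machine where the virtual environment has been recreated.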

In Practice

Let’s say I wrote a script that:

  1. Pings an API
  2. Places the returned data in a JSON file

The script takes in two inputs: a link to ping the API and an output filename where the returned JSON will be placed.  This is my script:

import requests
import argparse
import json
import datetime
import sys

# Writes data to a JSON file.
# Input: filename/location string, data parsed from JSON (dict or
# list of dictionaries)
###################################################################
def toJSON(filename, data):
    with open(filename, 'w') as outfile:
        json.dump(data, outfile)

# Calls a given link.
# Input: API link to ping
###################################################################
def callAPI(link):
    response = requests.get(link)
    return response

# Parses incoming command-line arguments
######################################################################
def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                        "and outputs the data to a JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/path of the JSON output file")

    args = commandParser.parse_args()

    return args.link, args.outputfile

######################################################################
# MAIN
######################################################################

def main():
    LINK, OUTPUT = commandLineSetup()

    # Check that proper inputs were provided
    if not LINK or not OUTPUT:
        print("%s - Insufficient inputs provided" % datetime.datetime.now())
        sys.exit(1)

    print("%s - Calling link %s" % (datetime.datetime.now(), LINK))
    response = callAPI(LINK)
    print("%s - Outputting to file %s" % (datetime.datetime.now(), OUTPUT))
    toJSON(OUTPUT, response.json())
    print("%s - Done!" % datetime.datetime.now())

if __name__ == "__main__":
    main()

Notice that my script relies on the requests package, which is not part of Python’s standard library (argparse, by contrast, has shipped with the standard library since Python 2.7).  I use requests to ping the API to retrieve data and argparse to parse the arguments passed in from the command line.  To accommodate the external dependency, I create a virtual environment called “example”, which has a requirements.txt file.
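For reference, the requirements.txt for a script like this can be a single line (unpinned here for brevity; pin an exact version if you want fully reproducible installs):

```text
requests
```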

Once my virtualenv is configured, I can test the Python script inside the virtualenv by running a command from within the working directory in my terminal window:

[Image: scripts_terminal]
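That test might look something like this (the script filename is illustrative):

```shell
# Activate the "example" virtualenv, then run the script against the API.
source example/bin/activate
python call_api.py -l "https://api.example.com/data" -o output.json
```

If the script runs cleanly here, the same command should behave identically inside Kettle.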

My working directory in this case is ~/Desktop/Personal/Sites/Blog/Projects.  This is also where I have my job saved:

[Image: scripts_job]

Therefore, when configuring my script for execution within Kettle, I can use the variable ${Internal.Job.Filename.Directory} to specify my working directory, enter the same command as I did in Terminal, and everything will execute just as it did there:

[Image: scripts_job_general]

[Image: scripts_job_script]

To check out my example transformation, please download this file (make sure to create the virtual environment before attempting to run the job; name the virtual environment “example”).

Recap

When executing scripts in Kettle, it is better to use the “Insert script” option, since it allows for:

  • Better job transferability
  • Easier use of virtual environments
  • Integration of Kettle environment variables

I hope you find this useful!

Setting Up Virtual Environments for Python

I’ve been greatly encouraged by colleagues and friends to try working on a MacBook instead of a Windows machine. Given that there is no cost to me, since the computer is provided by my company, I thought I’d give it a try. Thus far, I’ve been enjoying it greatly. And one of my favorite things that I’ve learned thus far is the ease with which one can build and manage virtual environments.

A virtual environment is different from a virtual machine. A virtual machine requires allocation of disk space and RAM. A virtual environment is simply an isolated working copy of Python. The two main benefits to setting up virtual environments are

  • you can work in two completely different environments simultaneously, and
  • changes made to the environment are tracked and can later be replicated by a team member who picks up your project

Now, if you’re like me when I first heard about this, you’re probably feeling a bit intimidated. No worries! Let’s do a quick step-by-step on how to set all of this up so that you can start configuring and using your own virtual environments.

  1. Start by installing the virtualenv module on your machine via Terminal. To do this, use the command
    $ pip install virtualenv

    If you don’t have pip already installed, I would encourage you to install it using the following command:

    $ sudo easy_install pip

    It will come in handy later when you’re installing packages in all your new environments.

    [Image: Installing virtualenv]

  2. Now that you have the virtualenv module installed on your machine, you can start setting up virtual environments. Yes, it’s that simple! To start off, it’s best practice to set up your virtual environment within the folder of the project that you are working on. So, through the Terminal, navigate to the folder where your project is stored.

    [Image: Navigating to Directories via Terminal]
  3. Once within the directory, create the virtual environment using the command
    $ virtualenv project

    Another best practice is to name your virtual environment the name of the project on which you are working. In my case, the name project is befitting, but make sure to select the right name for yourself.

    Note that if you have multiple instances of Python installed on your machine, you can also specify which instance you would like this particular virtual environment to use, for example:

    $ virtualenv -p python3 project

    We do not specify an instance in the command above, so the environment will simply use the default version on the machine.

  4. Now, before we start using our new environment, we need to configure it with modules specific to our project. To do so, activate the environment using the command
    $ source project/bin/activate

    Make sure to change the first word after “source” to the name of your own virtual environment. Once the environment is activated, its name will appear at the front of your command prompt.

    When your virtual environment is active, configure it as you wish. In this example, I install the elasticsearch module in my project virtual environment:

    [Image: Creating Your Virtual Environment]

  5. Once you have finished configuring your environment, leave it by using the command
    $ deactivate
  6. To verify that your virtual environment is indeed different from your default installation of Python, you can run a simple check for the new modules in the default Python version and then again in the virtual environment version. Note in my example that elasticsearch is present in my virtual environment, but not in my default version of Python:

    [Image: Testing Virtual Environment Against Default Python Version]
  7. Now that you have your virtual environment set up, you can run scripts specific to your project within that environment without modifying your default version of Python or affecting other projects you may be working on. Additionally, you can share your virtual environment with collaborating colleagues. To share it, while in your project’s directory within the Terminal, use the following command to create a requirements file:
    $ pip freeze > requirements.txt

    This will create a file in your project folder with all the modules present in your virtual environment.

    [Image: Creating the requirements.txt File]

    When collaborating with other people, send them this requirements file and have them place it in their project directory. They should then create and activate a virtual environment of their own in that directory, and run the following command:

    $ pip install -r requirements.txt

    This will install the same modules into their environment, so that they can run your files without any issues.
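The steps above can be condensed into a few commands. Here is a minimal sketch using Python 3’s built-in venv module as a stand-in for virtualenv (the activate/freeze/deactivate workflow is identical):

```shell
# Create and inspect a throwaway virtual environment named "project".
python3 -m venv project                      # step 3: create the environment
. project/bin/activate                       # step 4: activate it
python -c "import sys; print(sys.prefix)"    # step 6: the prefix now points inside ./project
pip freeze > requirements.txt                # step 7: record installed modules
deactivate                                   # step 5: leave the environment
```

After this, requirements.txt sits in the current directory, ready to be shared with collaborators.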

For those seeking to learn more about virtual environments and all the things you can do with them, I highly encourage you to check out the full virtualenv documentation. A big thank you to my colleague who took time out of her day today to educate me about this.
