WAIT! Before you read this, please know that I’ve published an update to this article here. I think the update is much more useful, so maybe read the update and then come back? Or don’t come back at all? Whatever you prefer!
Recently, I found myself needing to run a Python script from inside Pentaho Kettle. I wanted to convert a CSV file into JSON format and found that Kettle was running for an extremely long time just to complete this simple process. Since I already had a solid Python script in place to complete this task, I decided I could use that instead of relying on the traditional Kettle steps.
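For readers who don't have a conversion script on hand, here is a minimal sketch of what a CSV-to-JSON script can look like. It is not the script I used; the function name `csv_to_json` and the two command-line arguments are assumptions made for illustration, and it assumes the CSV file has a header row.

```python
import csv
import json
import sys


def csv_to_json(csv_path, json_path):
    """Read a CSV file with a header row and write it as a JSON array of objects.

    Returns the number of data rows converted.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(csv_path, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))

    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(rows, dst, indent=2)

    return len(rows)


if __name__ == "__main__":
    # Hypothetical usage: python csv_to_json.py input.csv output.json
    if len(sys.argv) == 3:
        count = csv_to_json(sys.argv[1], sys.argv[2])
        print(f"Converted {count} rows")
    else:
        print("Usage: python csv_to_json.py <input.csv> <output.json>")
```

Note that `csv.DictReader` reads every value as a string, so numeric columns end up as quoted strings in the JSON unless you convert them yourself.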
To run an external script in Kettle, you need to use the Shell step. This step is currently (as of Pentaho Kettle 188.8.131.52) only available in Jobs – not Transformations. The step icon looks like this:
When you double-click on the step, you will encounter a menu like this:
The main items you should be concerned with are the fields “Script file name” and “Working directory”.
- Script file name: This is the script you would like to run. Alternatively, if you would like to input custom code, you can use the Script tab to do so. In my case, I had a saved .py script that I could use, so I specified that script in this field. Make sure to include the full path to the file, not just its name.
- Working directory: This is where the Python shell can be found. In my case, this is a folder on the C drive. You do not need to specify the exact executable – just the folder in which it’s present.
And that’s it! This is how my final configured step looked:
Since all my scripting was in the included Python file, I did not rely on the Script tab for anything. When a file is specified, Kettle automatically fills the Script tab with the command “python [file name]”.
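Before wiring a script into Kettle, it can help to confirm that the same kind of command works on its own. The sketch below launches a script with a Python interpreter from a chosen working directory and collects its exit code and output. It only approximates what the step does and says nothing about how Kettle implements it; `run_script` is a hypothetical helper name.

```python
import subprocess
import sys


def run_script(script_path, working_dir=None):
    """Run a Python script as a separate process, similar to `python <script>`.

    Returns a tuple of (exit code, standard output, standard error).
    """
    result = subprocess.run(
        [sys.executable, script_path],  # the interpreter running this code
        cwd=working_dir,                # directory the script is started in
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr
```

A non-zero exit code or anything on standard error here is a good hint that the script would also fail when Kettle runs it.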
As can be seen from the screenshots, Kettle also provides options for logging. If your script outputs information about its progress, timing, or anything else, you can store that information in a particular file.
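I haven't tested this with Kettle's logging options, but a script can only be logged if it writes something out. Here is a small sketch of timestamped progress messages, assuming the log picks up whatever the script prints to standard output. The helper names `log` and `timed` are made up for this example.

```python
import sys
import time


def log(message, stream=None):
    """Write a timestamped line and flush it so it appears immediately."""
    if stream is None:
        stream = sys.stdout
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    stream.write(f"{stamp} {message}\n")
    stream.flush()


def timed(label, func, *args, stream=None):
    """Call func(*args), logging when it starts and how long it took."""
    start = time.perf_counter()
    log(f"{label}: started", stream)
    result = func(*args)
    elapsed = time.perf_counter() - start
    log(f"{label}: finished in {elapsed:.2f}s", stream)
    return result
```

Wrapping the main conversion call in `timed` would give two lines per run: one when the work starts and one with the elapsed time.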
Kettle also offers specific options for iterating scripts over every row.
I did not experiment with either the logging option or the iteration option, but I would be interested in hearing from anybody who has. Happy scripting!