Setting Up a Virtual Environment for Python 3

This is a follow-up to my previous article on how to set up virtual environments for Python.

I recently started coding a bit again and attempted to set up a virtual environment per those older instructions, only to realize that they were outdated: Python 2.7 is now largely deprecated (replaced by Python 3.x), and the tool I used to install pip (easy_install) has been superseded by better options (Homebrew).

So, with that in mind, I thought I would provide an update to that older article on how to set up a virtual environment.

  1. First off, you need to make sure that you have Python 3 installed on your computer. You can check for this by running the command…

    python3 --version

    If you get an error in return, something like this:
    [Screenshot: python3 --version returning an error]

    …then it’s fair to assume that you don’t have Python 3 installed.

    To install Python 3, you will need to download a GCC package through Xcode (Apple ID is required for this). This is also a huge package (11GB zipped when I downloaded it), so allow for some time here. If you are installing Xcode for the first time, don’t forget to run the following command in Terminal once the application is installed in order to configure the command line integration:

    xcode-select --install

    If you’re after a quicker option, according to this good guide, not all of Xcode is needed. So, if you want a smaller package, you can download Command Line Tools (an Apple account is required) or the even smaller OSX-GCC-Installer package. (Full disclosure: I downloaded Xcode and therefore can’t vouch for the efficacy of the alternative options listed here.)

    Once you have the GCC package installed via whatever means, you will need to install Homebrew. (I know! All this just to install Python 3. I hope it’s worth it!) The main step here is to run the following command in Terminal:

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

    If Homebrew has trouble installing, it’ll often tell you why and how you can restart the installation process using a specific command.

    Once Homebrew is installed, the process for installing Python 3 is very straightforward. Just run the following command in Terminal and you’re all set:

    brew install python

    Check the output for any errors and run the version command once more to verify that Python 3 is installed:

    python3 --version
    [Screenshot: python3 --version returning a version number]

    If any errors are returned, review the outputs from the Xcode and Homebrew installations to make sure that both were installed successfully, and reference the previously linked guide on installing those two packages if anything looks off.
  2. Now comes the easy part: in Terminal, navigate to the directory where you want to create your virtual environment.
  3. Once within the directory, create the virtual environment using the command,

    python3 -m venv project

    …where “project” is the desired name for your virtual environment. I named mine “example” in the screenshot below:
    [Screenshot: running the venv command for a virtual environment named “example”]

  4. You can now activate this virtual environment and install packages into it just as you did before.

    Use the following command to activate the environment:

    source project/bin/activate

    And use pip commands within the activated environment to install packages.
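
    For example, to install a package (requests here is just an illustration):

    pip install requests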

    Use the deactivate command to exit the virtual environment:

    deactivate
    [Screenshot: activating, configuring, and deactivating a virtual environment created with the venv command]

  5. If you need to load or freeze requirements, the same commands as before work here, too:

    pip freeze > requirements.txt

    pip install -r requirements.txt

    If your requirements file comes from an older version of Python, I would recommend installing the packages one by one rather than loading the whole file with the install command, since some package versions may no longer be compatible.

And that’s essentially it! If you don’t have Xcode and Homebrew, this process can be quite tedious. But once those utilities are installed, creating virtual environments for Python 3 is an easy process.

How to Troubleshoot Crontab Issues

The other day, I updated the operating system on the server on which my crontab was installed. I thought that this would be a harmless upgrade to the system — nothing too crazy. However, I was wrong: the crontab that had been working perfectly well on the old operating system now seemingly ceased to work. Nothing had changed in my setup of the scripts or the crontab, so why was it not working?

I set about troubleshooting the issue with some guidance from this article.

Disclaimer: After all my troubleshooting, I ultimately did not resolve the core of my issue and instead decided to give up on crontab and use a different solution. So, if you’re looking for a solution to my crontab problems at the end of this article, you will not find it.

  1. Verify my crontab settings: First thing I did was open up my crontab to make sure that everything was configured properly (and as I’d left it). I did this using the command…
    env EDITOR=nano crontab -e

    And sure enough the setup of my crontab was as I had left it:

    0 * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py

    While I was here, I updated the “0” in my job to “*” so that the job would run every minute; this would make it easier to see what was going on (see the updated entry below). I then saved and closed the crontab.
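
    The entry now looked like this:

    * * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py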

  2. Make sure the command can run successfully from Terminal: My second thought, after seeing that everything was all right with the file, was to attempt to run the full command from my crontab file in Terminal and see if it worked:
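    cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py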



    It did!
  3. Attempt to install crontab with admin permissions: Once I verified that the command could run outside of crontab without an issue, I started considering that maybe I didn’t have proper permissions to install the crontab on the server. So, I re-opened the crontab file with sudo privileges:

    sudo env EDITOR=nano crontab -e

    This is what I saw:

    What happened to my crontab command? Well, just to be sure, I went ahead and entered the same command into this crontab file that I had opened with the sudo permissions and checked if that produced the result I wanted.  It did not.

  4. Make sure the application has proper permissions: When I closed my last crontab edit (with sudo permissions), this little dialog box showed up:

    So, I speculated that perhaps iTerm, the Terminal application I was using to edit and install the crontab did not have proper permissions to install the crontab file on my server. I opened up my Mac’s “System Preferences”, accessible from the Apple icon in the upper left corner of the screen:

    Navigated to the “Security & Privacy” tab:

    Scrolled to the “Full Disk Access” portion and unlocked the screen:

    Then added the iTerm application to the list of applications with “Full Disk Access” using the “+” button under the list of applications to the right:

    With iTerm now having “Full Disk Access”, I re-attempted to save the crontab (using sudo permissions), but had no success.

  5. Brute-force a crontab into its expected placement: So, I turned to the internet and found the article listed at the top that seemed pretty helpful. It stated, “There is a system-wide /etc/crontab file…”

    I neglected to read the rest of the sentence and instead went searching for the existence of the system-wide crontab file in the specified /etc/ directory.

    This is a directory that is otherwise hidden, so I used the “Spotlight” search feature on Mac (accessed via Cmd + Space) to search for the /etc/ directory and found it:

    And sure enough, once I opened that directory, I noticed that it had no file called “crontab” in it.

    In the absence of a “crontab” file, I decided I’d add one myself in a rather unorthodox manner. I opened a new file in Sublime Text, wrote my crontab command into the file, saved it to my “Desktop”, and then dragged it from that directory into the “etc” directory. What could go wrong?

    And yet, despite all this, nothing seemed to change! My crontab job was still failing to run.

     

  6. Check crontab run logs: So, I scanned the web article for some other ideas on what may be going on and spotted a recommendation to run the command…
    ps -ef | grep cron | grep -v grep

    …to check whether my crontab had run. And sure enough it had! Multiple times, even!

    To better understand this output, I looked up how to get the column headers for the ps output and found this helpful writeup. I went ahead and amended my command to the following:

    ps -ef | head -1 && ps -ef | grep cron | grep -v grep

    Now that it seemed that something was running, I did some accounting: up to this point, I had potentially set up three different crontab files. One was set up via crontab using my own user permissions, another via crontab using sudo permissions, and the last was brute-forced into the “/etc/” directory as a text file.

    Of those three, it seemed that one or two were running. I suspected that the command labeled “(cron)” was a result of my sudo permissions manipulation and that the command labeled “/usr/sbin/cron” was coming from the one set up with my standard user permissions. As to why two instances of each existed, I was not sure. So, in an effort to verify this, I deleted my brute-forced crontab file from the “/etc/” directory and waited a minute to see if the logs looked any different.

    The logs did not look any different. Thus, my next step was to delete the crontab I added using sudo permissions. To do this, I opened the crontab file again using sudo permissions, deleted its contents, and saved. Again I waited a minute and re-ran the logs to see if anything had changed:

    This time around, I noted that the duplicate commands had been removed. This let me know two things: a) the crontab I had originally configured using my standard user permissions was running, and b) it was generating two commands. I suspected two commands were showing because I was chaining two commands in my crontab entry: one to change directories and another to run a Python script.
     

  7. Split the command and leverage logs: I was more befuddled than ever at this point. It seemed that all my poking around had been for nothing: my crontab was installed and running successfully. So maybe it was something in my crontab entry itself that wasn’t working. I decided to employ logs in order to troubleshoot further. I broke up my command into different components and recorded output messages into separate logs in order to see where an error might be occurring:
    * * * * * echo 'success' >> /tmp/log1.log
    * * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && echo 'success' >> /tmp/log2.log
    * * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py >> /tmp/log3.log
    * * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py && echo 'success' >> /tmp/log4.log

    The first thing I noted when running the new crontab was that each command line in the file generated two commands in the crontab run logs:

    So my previous assumption that the number of commands in the log corresponded to the number of commands in a line was not correct.

    Second, I noted that while I expected 4 log files to be generated, only 3 were:

    And, unfortunately, they were all empty. This was not what I expected at all since when I ran the third command line from my crontab file in Terminal…

    cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py >> /tmp/log3.log

    …it generated a log file with a value in it:


    I was starting to suspect that crontab was having trouble running my Python script.

  8. Try running an executable file: Better informed about what was happening within my crontab, but somewhat exasperated by my situation, I decided to convert my Python script to an executable file. I figured an executable might be easier for crontab to process, since it would remove the dependency on the “python2.7” command. Unfortunately, even this failed!
  9. Use a different solution: So, feeling completely out of options, I decided to give up on crontab and try something different altogether: Automator.

I apologize for the disappointing conclusion. Nonetheless, I think the troubleshooting process for my issue is helpful in understanding how crontab works and how to check that crontab is running. The main takeaways:

  • When troubleshooting with crontab, update the command to run every minute to provide a quicker feedback loop.
  • Verify the format of your crontab.
  • Attempt the commands from the crontab file in Terminal to make sure they work on their own.
  • Leverage crontab logs to check that crontab is installed and running correctly.
  • Add output logs to your commands in the crontab file to better understand where failure is occurring.

How to Leverage Automator to Work the Same Way as Crontab

After struggling for a long time with trying to make my crontab work, I eventually gave up and sought out a different method to set up an automated way for my script to run. I came across multiple articles online mentioning Automator, a macOS application that allows you to automate certain functions on your computer. Many of these guides mentioned creating an Automator job and then scheduling it to recur via iCal.

Here is how I went about doing that:

  1. I opened Automator via Spotlight:

    For our purposes, since we want this to run on its own, I created a new Automator Application (File → New or Cmd+N and then select Application):

    Automator Applications are self-running whereas Automator Workflows require user inputs to kick off.

    Now let’s go over this screen a bit, because it was very confusing to me when I first opened this application:

    Within the application, you have three main panels. The far left panel with the expandable/collapsible directories is a way to navigate through the different actions available for automation. When you first open a new file in Automator, you are automatically placed at the top of this directory (Library) and all actions are displayed.

    The second panel, the one in the middle, is where the individual actions for automation are accessible. You can scroll up and down to see some of the actions at your disposal: you can launch an application, copy and paste files from one folder to another, download URLs, or do many other things using the actions in this column.

    The third panel on the far right of the screen is the workflow panel. This allows you to build your workflow using the actions in the middle panel. This is where Automator becomes really powerful since it allows you to string together simple actions to form complex workflows: you can download URLs, then copy the files downloaded from one directory to another, and then launch a script to run a process on those newly transferred files. In order to build these workflows, you drag and drop actions from the middle panel into the far right panel.

    At the top right corner, you have Record, Step, Stop, and Run buttons that will assist in checking your workflow once you have something down.

  2. For my Automator Application, I wanted to run a Shell script. So, I filtered the left panel directory to Utilities, then searched for the “Run Shell Script” command in the middle panel, and dragged it into the right panel to create my workflow:

  3. My next step was to configure this action. In its panel, I entered the Terminal command for running my Shell script:
    /Users/mnachum/Desktop/./hello_world.sh >> /tmp/log.log

  4. I then ran my Application to verify that it worked:

    Checking the logs within Automator, everything seemed to have worked as expected. I also checked the outputted logs from my command (/tmp/log.log) and verified that those looked correct, as well.

    I saved my Application:

  5. I now had my Automator Application running my desired task, but I needed to schedule it to run regularly without input from me. Many forums online mentioned being able to create an appointment in iCal and then connecting that appointment to the running of the workflow, but this seemed like outdated advice to me. When searching for this functionality in the current iCal interface, I could not find a way to make appointments or events that linked to applications or scripts.

    However, I did note in Automator the ability to create an event in iCal that would launch an application. To get to this option, I opened a new Automator Workflow…

    …filtered to Calendar events in the directory panel, and dragged and dropped the “New Calendar Event” option into my Workflow.  Under the “Alarm” dropdown, there’s an option to “Open file”:

    I configured my New Calendar Event to open my previously configured Automator Application:

    Once the “Open file” option is selected under the “Alarm” field, an application needs to be selected. I selected my previously saved Automator Application.

  6. With my scheduling Workflow completed, I ran the Workflow:

    …and verified that the Event was added to my iCal, as expected:

    Whether or not you save this workflow is up to you. My desired outcome for this Automator Workflow was complete, but I still decided to save the Workflow in case I needed to troubleshoot later.

  7. From my iCal, I can now configure the Event to recur:

And that’s it!  That’s how I was able to stop using crontab to run my scripts and instead started using a combination of Automator Applications/Workflows and iCal Events. Note that from my experience, the Automator and iCal applications do not need to be open in order for the automated job to complete.

When the Automator Application does run, you will be able to see that it is running via the appearance of a sun-like icon in the toolbar:

How to Convert a Python Script to a Shell Script

Why might you want to convert your Python script to a Shell script?

Originally, I thought that converting my Python script to a Shell script might make it easier for my crontab to run it, since the main benefit of a Shell script (as far as I can tell) is that it removes dependencies related to directories. Instead of putting a lot of information in your crontab command about changing directories, initializing virtual environments, and running scripts, a Shell script can wrap it all into one simple command.

So without further ado, here is how you can turn your Python script into a Shell script:

  1. The first step is easy: have a Python script that you want to turn into a Shell script. In my case, it was a small script called “hello_world.py”.
  2. In Sublime, open a new file and copy the following:
    #!/bin/sh
    CWD="$(pwd)"
    path/to/python python_script.py

    Adjust the last command to be specific to your file.

    Be mindful that the Shell script runs relative to the directory from which it is invoked. That’s what the “pwd” command on the second line captures: “pwd” prints the current working directory. You can test this out by navigating to a directory in Terminal and then running the “pwd” command:
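
    For example (the directory here is illustrative):

    cd ~/Desktop
    pwd
    # prints the current working directory, e.g. /Users/<your-username>/Desktop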

    To be safe, you can always use the absolute directory to your virtual environment and Python script. So, in my case, if I were to save the Shell script into my directory…

    /Users/mnachum/Desktop/how to troubleshoot crontab

    …I could format my Shell script as follows:
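
    A sketch of that version (using the same interpreter and script paths as in the full script below):

    #!/bin/sh
    CWD="$(pwd)"
    /Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello/bin/python /Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello_world.py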

    But, to make sure that the Shell script would continue to work even if I moved it outside of its original directory, I ultimately formatted it as follows:

    #!/bin/sh
    cd "$(dirname "$0")";
    CWD="$(pwd)"
    echo $CWD
    /Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello/bin/python /Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello_world.py
    
  3. Once the file is all set up, save it with the extension .sh:

  4. Now, in your Terminal application, navigate to the file’s directory and run the command:
    chmod u+x python_script_shell.sh

    In my case, this would be…

    chmod u+x hello_world.sh

  5. And that’s it! You now have an executable Shell script! You can run it in Terminal using the following command:

    ./hello_world.sh


    And, if you were like me and used the absolute virtual environment and Python script directories, then you can move the Shell script outside of its initial directory and still run it. In the example below, I moved the Shell script out of its folder and onto the Desktop:

For additional information, reference this StackOverflow thread.

How to Pass in Variables to Your Python Script from the Command Line

Python makes it very convenient to configure your script to take in certain command line inputs.

It’s important that you, as a writer of Python code, configure your scripts to take advantage of this, since it makes your scripts more modular.  And modularity is the key to successful development, since it allows you to develop more quickly by building on a foundation that you’ve already put in place.

Plus, you can feel like a true champ when a colleague asks you for help and in no time you are able to provide them with a script you’ve already developed and which solves their exact conundrum.


The module I use in my Python scripts to allow for the passing of command line inputs is called argparse. It ships with Python 2.7 and later; if your environment is missing it for some reason, you can install it in your virtual environment with:

pip install argparse

I will be referencing this Python script throughout this example, so feel free to download and reference it.

Once you have argparse installed, import it into your script and create a function to collect command line inputs. This is the format I use for that function:

def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                  "and outputs data to JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")

    args = commandParser.parse_args()

    return args.link, args.outputfile

Let’s break down what each of these elements means:

  • commandLineSetup():  This is the name of our command line parsing function.  You can name it whatever you want.  commandLineSetup works for me!
  • argparse.ArgumentParser: Here we are configuring the argument parser, which is the main object type of the argparse module.  There are multiple variables you can configure when setting up this object.  The main one I like to configure is the “description” variable.  This provides users with some information about what your script does and what they should expect.
  • add_argument: Here we are providing the ArgumentParser with information about what arguments to expect via the command line and how to differentiate one argument from another.  For example, to run our script, we need two variables: a URL and an output file.  Information about our URL will come after the string “-l” in the command line and information about our output file will come after the string “-o”.  Additionally, we provide some help text to the user to clarify what these variables mean and what our script needs to run successfully.
  • parse_args: This is a method of the ArgumentParser class that we can use to parse the passed-in arguments using our defined criteria.
  • return: We return the parsed arguments from our function, commandLineSetup().

To utilize this function, we need to call it when our script begins to run.

This is how I call it:

LINK, OUTPUT = commandLineSetup()

Notice that the order of the variable assignment in my call to the function mirrors the order of the variables in the return clause of the function. (Note: You do not need to capitalize your variables; for me, it’s a personal preference, since it allows me to clearly note which variables are passed in and which are not.)

Lastly, I include a conditional statement in my script to verify that I have indeed received all the variables necessary to run my operations successfully:

if not LINK or not OUTPUT:
    print str(datetime.datetime.now()) + " - Insufficient inputs provided"
    exit()

The conditional statement stops the script from running if either input variable is missing and prints an error message in Terminal to notify the user of what happened. As the developer, you can write as many conditionals as you want in your script. For example, I could have added a regex clause to verify that the passed in OUTPUT string ends with a “.json” file extension:

if re.search(r'\.json$', OUTPUT) is None:
    print str(datetime.datetime.now()) + " - Incorrect format for JSON file detected"
    exit()

Similarly, I could have set up a different conditional statement to verify that the passed in link is a valid URL.
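
One simple version of that check (verifying only the http/https prefix, and assuming the re module is imported) might look like this:

if not re.match(r'https?://', LINK):
    print str(datetime.datetime.now()) + " - Invalid URL provided"
    exit()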

In Practice

Once you’ve set up your argument parsing function, you can do all kinds of nifty things in Terminal. For example, by using the “-h” flag, you can return information about what your script does and what variables it anticipates:

[Screenshot: running the script with -h]

If we pass in information for our variables, we can see that the script is able to run to completion:

[Screenshot: running the script with both inputs provided]

We can also test that the opposite is true: if we don’t pass certain variables to our script, it fails with an error message:

[Screenshot: running the script with incomplete inputs]

Isn’t this nice?

Recap

I encourage you to learn how to parse command line arguments in your Python scripts and to do this for all your scripts so that you can add modularity to them and impress your friends.

My complete Python example (used in this post) is available here.

You can learn more about the argparse module here.

Running Scripts in Pentaho Kettle, the Sequel

Surprise!  One of this blog’s most successful posts is about how to run scripts in Pentaho Kettle.

Confession: I wrote that post a long time ago (in fact, it was one of my very first posts about Pentaho Kettle).  And since then, I’ve learned a lot more about Kettle and about running scripts in Kettle.  Therefore, I wanted to offer a refresher to the original post and a new recommendation on how to better run scripts in Kettle.

More recently, I’ve been running scripts in Kettle like this:

[Screenshot: shell script step, General tab]

What’s different?

  1. The “Insert Script” option is checked, meaning that the second tab, “Script”, is now available for us to fill in.  This tab acts like our Terminal in Kettle.  Anything that you can run in Terminal, you can execute in the Script tab, but I’ll get more into that later.
  2. The “Working Directory” is now an environment variable.  This is an improvement over our previous configuration, since it allows for greater transferability of the Kettle job from one person to another.

On the “Script” tab, this is my configuration:

[Screenshot: shell script step, Script tab]

Here, I’m using environment variables to specify my output directories, which makes the job easier to transfer to other people.  Additionally, I am not relying on the machine’s version of Python, but rather on a version of Python specific to a virtualenv.  This, again, better ensures that when I transfer my job to other people, they are able to recreate the virtual environment I’ve created on my machine and run the job without a problem.

In Practice

Let’s say I wrote a script that:

  1. Pings an API
  2. Places the returned data in a JSON file

The script takes in two inputs: a link to ping the API and an output filename where the returned JSON will be placed.  This is my script:

import requests
import argparse
import json
import datetime

# Writes to a JSON file.
# Input: filename/location string, data in the form of array of
# dictionaries
###################################################################
def toJSON(filename, data):
    with open(filename, 'w') as outfile:
        json.dump(data, outfile)

# Call a given link.
# Input: API link to ping
###################################################################
def callAPI(link):
    response = requests.get(link)
    return response

# Parses incoming information
######################################################################
def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                  "and outputs data to JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")

    args = commandParser.parse_args()

    return args.link, args.outputfile

######################################################################
# MAIN
######################################################################

def main():
    LINK, OUTPUT = commandLineSetup()

    # Check that proper inputs were provided
    if not LINK or not OUTPUT:
        print str(datetime.datetime.now()) + " - Insufficient inputs provided"
        exit()

    print str(datetime.datetime.now()) + " - Calling link %s" % LINK
    response = callAPI(LINK)
    print str(datetime.datetime.now()) + " - Outputting to file %s" % OUTPUT
    toJSON(OUTPUT, response.json())
    print str(datetime.datetime.now()) + " - Done!"

if __name__ == "__main__":
    main()

Notice that my script relies on two packages: requests (which is not part of the standard library) and argparse.  I use requests to ping the API to retrieve data and argparse to parse passed-in information from the command line.  To accommodate these two modules, I create a virtual environment called “example”, which has a requirements.txt file (sketched below).
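
A rough sketch of what that setup might look like (package versions omitted; the file and environment names follow the post):

# requirements.txt
requests
argparse

# create the virtual environment and install the requirements into it
virtualenv example
example/bin/pip install -r requirements.txt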

Once my virtualenv is configured, I can test out my Python script in my terminal window by running a command from within the working directory:

[Screenshot: running the script in Terminal from the working directory]
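
The command looked roughly like this (the script name and API link here are placeholders):

example/bin/python api_script.py -l "<API link>" -o output.json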

My working directory in this case is ~/Desktop/Personal/Sites/Blog/Projects.  This is also where I have my job saved:

[Screenshot: the job file saved in the working directory]

Therefore, when configuring my script for execution within Kettle, I can use the variable ${Internal.Job.Filename.Directory} to specify my working directory and enter the same command as I did in Terminal, and everything will execute just as it did there:

[Screenshots: the job’s General and Script tabs]

To check out my example job, please download this file (make sure to create the virtual environment before attempting to run the job; name the virtual environment “example”).

Recap

When executing scripts in Kettle, it is better to use the “Insert script” option, since it allows for:

  • Better job transferability
  • Easier compatibility of virtual environments
  • Integration of Kettle environment variables

I hope you find this useful!

Parsing Huge XML Files in Pentaho Kettle

Recently I was working with Amazon Data Feed files.  These files are about 1GB zipped up and about 15GB unzipped.  And they’re XML.  These files…  are HUGE!


For files of this magnitude, the traditional Get Data from XML step does not work since that step requires Kettle to load the XML data in memory, which at 15GB will cause Kettle to freeze and crash.  However, there is an alternative step called XML Input Stream (StAX) which can be used to parse the XML data.  This post will discuss how to configure that StAX step.

To start off, this is how the transformation looks:

[Screenshot: transformation overview]

1. XML Input Stream (StAX):

The main fields to complete here indicate…

  • Where the data is coming from
  • Whether the data should be limited (limiting the data initially is helpful as you figure out the structure of the data)

This is what my configuration looks like:

[Screenshot: XML Input Stream (StAX) step configuration]

The options below the ones mentioned specify what kind of data you want extracted from the XML file.  StAX essentially parses the file node-by-node and converts every node/tag/attribute to a row.  This is what the data looks like in preview:

[Screenshot: preview of the StAX step output]

The level of detail here is extremely granular.  Every start element, end element, attribute, etc. is given its own row.  Therefore, when running the transformation on a full file, expect to see 100M+ rows being passed through.  Processing should still be fast, though.

2. Filter

The second step, the filter step, filters out the START_ELEMENT and END_ELEMENT rows, thus reducing the amount of data Kettle needs to work on.  This works if you are not interested in the start/end element information.  You can always keep it; however, the subsequent JavaScript step depends on having only one xml_data_name row per unique attribute of the entity.

[Screenshot: filter step configuration]

3. JavaScript

The transformation is dependent on having only one xml_data_name for unique attributes.  These unique attributes are used to increment a grouping variable that is used to differentiate one entity from the next.

Here’s my configuration for this (my explanation for this configuration is below the images):

[Screenshot: Start Script tab]

[Screenshot: main script tab]

First, I set up a start script.  What does this do?  It initializes a variable called “group” upon start-up of the step.  In this case, the group variable is being set to 0.

Second, in a standard JavaScript tab, I increment the “group” variable based on data passing through the JavaScript step.  Upon previewing the content of the XML, I noticed that the first unique attribute being passed in is called “ASIN”:

[Screenshot: preview showing “ASIN” as the first unique attribute]

Therefore, I chose to increment my “group” variable whenever xml_data_name is equal to the value “ASIN” (see the sketch below).  Note that this could be different for different XMLs, so always make sure to review your incoming data thoroughly to determine which value is unique — it may not always be the first one!
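
For reference, here is a rough sketch of what those two scripts could contain (the exact step configuration may differ, and "group" also needs to be added as an output field of the JavaScript step):

// Start Script: runs once when the step starts
var group = 0;

// Main script: runs for every row and starts a new group at each "ASIN" row
if (xml_data_name == "ASIN") {
  group = group + 1;
}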

4. Denorm

Using the newly created “group” variable, I can now denorm the data into a format that’s easier to read.  In the denorm step, I pivot off data in the xml_path field and bring in data from the xml_data_value field.

[Screenshot: denorm step configuration]

The denorm step provides the option to concatenate/group elements together, which makes things easier in terms of combining elements that might have multiple values.

(Note: do NOT attempt to sort the data by the grouping variable, since it may cause Kettle to crash — sorting is another step that is run in memory, which is what we want to avoid doing with a data file this large.)

And with that, your XML should be starting to look like a really nice CSV:

[Screenshot: final denormalized output]

To recap:

  1. Open the file in the StAX step — this will convert every aspect of the XML into a row.
  2. Filter out START_ELEMENT and END_ELEMENT.
  3. Figure out what the unique field is and use JavaScript to increment and create a variable for grouping.
  4. Use the grouping variable to denorm the data of interest.

To get started, feel free to use my sample transformation.

Workaround for python setup.py egg_info Error

Recently, I was working in Python and trying to install the pandas module using the pip command, but kept getting an error like this:

InstallationError: Command python setup.py egg_info failed with error 
code 1 in /var/www/python/virtualenv

So frustrating!

I spent about two hours trying to figure out how to resolve this issue and wanted to share my solution with you here.

To summarize, I downloaded the necessary package using apt-get and then moved the module into my virtual environment via bash command. So, even if you did not experience the same error message as me, this post might be helpful if you’re just looking to move modules from the general Python directory into your virtual environment.

For those looking to get a better understanding of the difference between apt-get and pip, I recommend Aditya’s answer on this StackOverflow post (I don’t have much experience in this).

For those looking to resolve the error:

  1. To start, you need to install the desired module using the apt-get command. For pandas, that’s this:
    sudo apt-get install python-pandas

    This installs the module in a system-wide location on your server.

  2. Find where your recently-downloaded module is located on the server. It should be in your dist-packages directory under your installation of Python.

    The complete path for me was located at /usr/lib/python2.7/dist-packages.

    You can also launch Python using the command python -v and see where all the different loaded packages reside upon start-up of the application.

  3. Create a directory for the module in your virtual environment:
    cd /var/www/python/virtualenv/lib/python2.7/site-packages && mkdir pandas
  4. Now move all files from the system-wide module directory into your virtual environment:
     mv /usr/lib/python2.7/dist-packages/pandas/* /var/www/python/virtualenv/lib/python2.7/site-packages/pandas/

And that’s it! You should now be able to access the module in your virtual environment. You can verify this by launching your virtual environment version of Python and attempting to import the module. No errors means success.
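
For example, using the virtual environment path from the error message above (adjust the paths to your own setup):

/var/www/python/virtualenv/bin/python -c "import pandas; print pandas.__version__"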

Do note that given discrepancies between apt-get and pip, the version of the module you transferred may not be the most recent version available.

Also, this might not be the best solution for your problem! I was stuck on the error for a long time and opted for a workaround like this to alleviate my issues after trying a number of different recommendations across the web. If you find a better solution, please forward it my way.

Unzipping Multiple Files in Pentaho Kettle

I was recently attempting to download a lot of data from a site using Pentaho Kettle. Downloading the files was simple with the setup from my previous post. However, once the files were downloaded, I realized they were all in .zip format. I needed to unzip them.

The process was fairly difficult for me and the web was of little to no help. Therefore, I thought I’d put together this post so that future users of Pentaho Kettle don’t have to go through the same headaches I did.

The step to unzip files is only available in Jobs (not transformations). The icon looks like this:

[Screenshot: the unzip step icon]

This is what the step looks like when you double-click it (with some fields filled in by me):

[Screenshot: the unzip step’s main configuration window]

The step, as it is shown here, is configured to unzip multiple files. If you only have one file to unzip, then this process is much easier. Additionally, you’ll notice that I have set up a ${PATH} variable for my job that points to a directory on my computer.

The main fields to note here are:

  • Zip File Name: If you have just one file to unzip, you can treat this like any other directory field in Pentaho Kettle and simply browse to the file location on your computer and select it. If you have multiple files, you’ll want to specify the folder in which the .zip files are present.
  • Source Wildcard (RegExp): In here, you specify that you want all files within the folder specified in “Zip File Name” that end in a .zip format to be unzipped. Use “.*\.zip” to do so.

After specifying where the zipped files reside, you need to specify where you would like them unzipped. Specify your target directory in the “Target Directory” field.

Additionally, Pentaho Kettle offers you some other options, such as whether you only want to pull specific items from the zipped file (if there are multiple items in it) or if you want to exclude certain items. You’ll see in my example screen capture that I’m only pulling files ending in a .txt from the zipped files I’m unzipping.

Lastly, in my setup of the step, I elected to overwrite any existing documents, since I intended to run and re-run the step a couple of times (for troubleshooting).

That’s about it. I realize it looks simple, but it took me about an hour to get this hammered out. It seemed like no one else at my company had had a chance to unzip multiple files at once in the past.

Feel free to download my example as a starting point.
