Setting Up a Virtual Environment for Python 3

This is a follow-up to my previous article on how to set up virtual environments for Python.

I recently started coding a bit again, attempted to set up a virtual environment per those older instructions, and realized that they were outdated: Python 2.7 is now largely deprecated in favor of Python 3.x, and the tool I previously used to install pip (easy_install) has been superseded by better options (Homebrew).

So, with that in mind, I thought I would provide an update to that older article on how to set up a virtual environment.

  1. First off, you need to make sure that you have Python 3 installed on your computer. You can check for this by running the command…

    python3 --version

    If you get an error in return, something like this:
    [Screenshot: the python3 --version command failing]

    …then it’s fair to assume that you don’t have Python 3 installed.

    To install Python 3, you will first need to install a GCC package, which you can get by downloading Xcode (an Apple ID is required for this). Xcode is a huge package (11GB zipped when I downloaded it), so allow for some time here. If you are installing Xcode for the first time, don’t forget to run the following command in Terminal once the application is installed in order to configure the command line integration:

    xcode-select --install

    If you want a quicker option, then according to this good guide, not all of Xcode is needed: you can download just the Command Line Tools (you must have an Apple account) or the even smaller OSX-GCC-Installer package. (Full disclosure: I downloaded Xcode, so I can’t vouch for the efficacy of these alternatives.)

    Once you have the GCC package installed via whatever means, you will need to install Homebrew (I know! All this just to install Python 3. I hope it’s worth it!) The main step here is to run the following command in Terminal:

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

    If Homebrew has trouble installing, it’ll often tell you why and how you can restart the installation process using a specific command.

    Once Homebrew is installed, the process for installing Python 3 is very straightforward. Just run the following command in Terminal and you’re all set:

    brew install python

    Check the output for any errors and run the version command once more to verify that Python 3 is installed:

    python3 --version
    [Screenshot: the python3 --version command returning a version number]

    If any errors are returned, review the output from the Xcode and Homebrew installations to make sure that both completed successfully; the previously linked guide covers the installation of those two packages in more detail.
  2. Now comes the easy part: in Terminal, navigate to the directory where you want to create your virtual environment.
  3. Once within the directory, create the virtual environment using the command,

    python3 -m venv project

    …where “project” is the desired name for your virtual environment. I named mine “example” in the screenshot below:
    [Screenshot: running the venv command to create a virtual environment named “example”]

  4. You can now activate this virtual environment and install packages into it just as you always have.

    Use the following command to activate the environment:

    source project/bin/activate

    And use pip commands within the activated environment to install packages.

    Use the deactivate command to exit the environment when you’re done:

    deactivate
    [Screenshot: activating and deactivating the virtual environment]
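
    If you want to confirm that you’re really working inside the environment, here’s a quick sanity check (my own sketch, not part of the original steps) you can run from the activated environment’s Python:

    import sys

    print(sys.executable)  # should point into project/bin/
    print(sys.prefix)      # should point at the project/ directory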

  5. If you need to load or freeze requirements, the same commands as before work here, too:

    pip freeze > requirements.txt

    pip install -r requirements.txt

    If you’re installing requirements that were generated under an older version of Python, I would recommend installing them one by one rather than loading the whole file with the install command, so that a single incompatible package doesn’t halt the entire installation; a rough sketch of that approach is below.
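
    For reference, here’s a minimal sketch of that one-at-a-time approach (my own illustration, not from the original article; it assumes you run it from inside the activated environment and that requirements.txt sits in the current directory):

    import subprocess
    import sys

    with open("requirements.txt") as f:
        for line in f:
            package = line.strip()
            if not package or package.startswith("#"):
                continue  # skip blank lines and comments
            # use the active interpreter's pip so packages land in this environment
            if subprocess.call([sys.executable, "-m", "pip", "install", package]) != 0:
                print("Could not install {}; skipping".format(package))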

And that’s essentially it! If you don’t have Xcode and Homebrew, this process can be quite tedious. But once those utilities are installed, creating virtual environments for Python 3 is an easy process.

How to Leverage Automator to Work the Same Way as Crontab

After struggling for a long time to make my crontab work, I eventually gave up and sought out a different way to run my script automatically. I came across multiple articles online mentioning Automator, a macOS application that allows you to automate certain functions on your computer. Many of these guides mentioned creating an Automator job and then scheduling it to recur via iCal.

Here is how I went about doing that:

  1. I opened Automator via Spotlight:

    For our purposes, since we want this to run on its own, I created a new Automator Application (File → New or Cmd+N and then select Application):

    Automator Applications are self-running whereas Automator Workflows require user inputs to kick off.

    Now let’s go over this screen a bit, because it was very confusing to me when I first opened this application:

    Within the application, you have three main panels. The far left panel with the expandable/collapsible directories is a way to navigate through the different actions available for automation. When you first open a new file in Automator, you are automatically placed at the top of this directory (Library) and all actions are displayed.

    The second panel, the one in the middle, is where the individual actions for automation are accessible. You can scroll up and down to see some of the actions at your disposal: you can launch an application, copy and paste files from one folder to another, download URLs, or do many other things using the actions in this column.

    The third panel on the far right of the screen is the workflow panel. This allows you to build your workflow using the actions in the middle panel. This is where Automator becomes really powerful since it allows you to string together simple actions to form complex workflows: you can download URLs, then copy the files downloaded from one directory to another, and then launch a script to run a process on those newly transferred files. In order to build these workflows, you drag and drop actions from the middle panel into the far right panel.

    At the top right corner, you have Record, Step, Stop, and Run buttons that will assist in checking your workflow once you have something down.

  2. For my Automator Application, I wanted to run a Shell script. So, I filtered the left panel directory to Utilities, then searched for the “Run Shell Script” command in the middle panel, and dragged it into the right panel to create my workflow:

  3. My next step was to customize that action. In the “Run Shell Script” box in the workflow panel, I entered the Terminal command for running my shell script:
    /Users/mnachum/Desktop/./hello_world.sh >> /tmp/log.log

  4. I then ran my Application to verify that it worked:

    Checking the logs within Automator, everything seemed to have worked as expected. I also checked the output log from my command (/tmp/log.log) and verified that it looked correct as well.

    I saved my Application:

  5. I now had my Automator Application running my desired task, but I needed to schedule it to run regularly without input from me. Many forums online mentioned being able to create an appointment in iCal and then connecting that appointment to the running of the workflow, but this seemed like outdated advice to me. When searching for this functionality in the current iCal interface, I could not find a way to make appointments or events that linked to applications or scripts.

    However, I did note in Automator the ability to create an event in iCal that would launch an application. To get to this option, I opened a new Automator Workflow…

    …filtered to Calendar events in the directory panel, and dragged and dropped the “New Calendar Event” option into my Workflow.  Under the “Alarm” dropdown, there’s an option to “Open file”:

    I configured my New Calendar Event to open my previously configured Automator Application:

    Once the “Open file” option is selected under the “Alarm” field, an application needs to be selected. I selected my previously saved Automator Application.

  6. With my scheduling Workflow completed, I ran the Workflow:

    …and verified that the Event was added to my iCal, as expected:

    Whether or not you save this workflow is up to you. I had already achieved my desired outcome for this Automator Workflow, but I still decided to save it in case I needed to troubleshoot later.

  7. From my iCal, I can now configure the Event to recur:

And that’s it!  That’s how I was able to stop using crontab to run my scripts and instead started using a combination of Automator Applications/Workflows and iCal Events. Note that from my experience, the Automator and iCal applications do not need to be open in order for the automated job to complete.

When the Automator Application does run, you will be able to see that it is running via the appearance of a sun-like icon in the toolbar:

How to Pass in Variables to Your Python Script from the Command Line

Python makes it very convenient to configure your script to take in certain command line inputs.

It’s important that you, as a writer of Python code, configure your scripts to take advantage of this, since it makes your scripts more modular.  And modularity is the key to successful development, since it allows you to develop more quickly by building on a foundation that you’ve already put in place.

Plus, you can feel like a true champ when a colleague asks you for help and in no time you are able to provide them with a script you’ve already developed and which solves their exact conundrum.


The module I use in my Python scripts to allow for the passing of command line inputs is called argparse. It has been part of the standard library since Python 2.7; if you’re on an older version of Python, you will need to run the following command to install it in your virtual environment:

pip install argparse

I will be referencing this Python script throughout this example, so feel free to download it and follow along.

Once you have argparse installed, import it into your script and create a function to collect command line inputs. This is the format I use for that function:

def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                  "and outputs data to JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")

    args = commandParser.parse_args()

    return args.link, args.outputfile

Let’s break down what each of these elements means:

  • commandLineSetup():  This is the name of our command line parsing function.  You can name it whatever you want.  commandLineSetup works for me!
  • argparse.ArgumentParser: Here we are configuring the argument parser, which is the main object type of the argparse module.  There are multiple variables you can configure when setting up this object.  The main one I like to configure is the “description” variable.  This provides users with some information about what your script does and what they should expect.
  • add_argument: Here we are providing the ArgumentParser with information about what arguments to expect via the command line and how to differentiate one argument from another.  For example, to run our script, we need two variables: a URL and an output file.  Information about our URL will come after the string “-l” in the command line and information about our output file will come after the string “-o”.  Additionally, we provide some help text to the user to clarify what these variables mean and what our script needs to run successfully.
  • parse_args: This is a method of the ArgumentParser class that we can use to parse the passed-in arguments according to the criteria we defined.
  • return: We return the parsed arguments from our function, commandLineSetup().

To utilize this function, we need to call it when our script begins to run.

This is how I reference it in my script:

LINK, OUTPUT = commandLineSetup()

Notice that the order of the variable assignment in my call to the function mirrors the order of the variables in the return clause of the function. (Note: You do not need to capitalize your variables; for me, it’s a personal preference, since it allows me to clearly note which variables are passed in and which are not.)

Lastly, I include a conditional statement in my script to verify that I have indeed received all the variables necessary to run my operations successfully:

if not LINK or not OUTPUT:
    print str(datetime.datetime.now()) + " - Insufficient inputs provided"
    exit()

The conditional statement stops the script from running if I am missing either input variable and returns an error message in Terminal to notify the user of what has happened. As the developer, you can write as many conditionals as you want in your script. For example, I could have added a regex clause to verify that the passed-in OUTPUT string ends with a “.json” file extension:

if re.search(r'\.json$', OUTPUT) is None:
    print str(datetime.datetime.now()) + " - Incorrect format for JSON file detected"
    exit()

Similarly, I could have set up a different conditional statement to verify that the passed in link is a valid URL.
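
For instance, a minimal sketch of such a check (my own illustration, not from the original post; it uses the standard library’s urlparse module and the same Python 2 style as the rest of the script) might look like this:

from urlparse import urlparse

parsed = urlparse(LINK)
# a loose sanity check: require both a scheme (e.g. http/https) and a host
if not parsed.scheme or not parsed.netloc:
    print str(datetime.datetime.now()) + " - Invalid URL provided"
    exit()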

In Practice

Once you’ve set up your argument parsing function, you can do all kinds of nifty things in Terminal. For example, by passing the “-h” flag, you can see information about what your script does and what variables it anticipates:

[Screenshot: output of the script’s -h help text]

If we pass in information for our variables, we can see that the script is able to run to completion:

[Screenshot: running the script with both inputs provided]

We can also test that the opposite is true: if we don’t pass certain variables to our script, it fails with an error message:

[Screenshot: running the script with missing inputs and the resulting error message]

Isn’t this nice?

Recap

I encourage you to learn how to parse command line arguments in your Python scripts and to do this for all your scripts so that you can add modularity to them and impress your friends.

My complete Python example (used in this post) is available here.

You can learn more about the argparse module here.

Running Scripts in Pentaho Kettle, the Sequel

Surprise!  One of this blog’s most successful posts is about how to run scripts in Pentaho Kettle.

Confession: I wrote that post a long time ago (in fact, it was one of my very first posts about Pentaho Kettle).  And since then, I’ve learned a lot more about Kettle and about running scripts in Kettle.  Therefore, I wanted to offer a refresher to the original post and a new recommendation on how to better run scripts in Kettle.

More recently, I’ve been running scripts in Kettle like this:

[Screenshot: the “General” tab of the shell script job entry]

What’s different?

  1. The “Insert Script” option is checked, meaning that the second tab, “Script”, is now available for us to fill in.  This tab acts like our Terminal in Kettle.  Anything that you can run in Terminal, you can execute in the Script tab, but I’ll get more into that later.
  2. The “Working Directory” is now an environment variable.  This is an improvement over our previous configuration, since it allows for greater transferability of the Kettle job from one person to another.

On the “Script” tab, this is my configuration:

[Screenshot: the “Script” tab of the shell script job entry]

Here, I’m using environment variables to specify my output directories, which makes the job easier to transfer when exchanging it with other people.  Additionally, I am not relying on the machine’s version of Python, but rather on a version of Python specific to a virtualenv.  This, again, better ensures that when I transfer my job to other people, they are able to recreate the virtual environment I’ve created on my machine and run the job without a problem.

In Practice

Let’s say I wrote a script that:

  1. Pings an API
  2. Places the returned data in a JSON file

The script takes in two inputs: a link to ping the API and an output filename where the returned JSON will be placed.  This is my script:

import requests
import argparse
import json
import datetime

# Writes to a JSON file.
# Input: filename/location string, data in the form of array of
# dictionaries
###################################################################
def toJSON(filename, data):
    with open(filename, 'w') as outfile:
        json.dump(data, outfile)

# Call a given link.
# Input: API link to ping
###################################################################
def callAPI(link):
    response = requests.get(link)
    return response

# Parses incoming information
######################################################################
def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                        "and outputs data to JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")

    args = commandParser.parse_args()

    return args.link, args.outputfile

######################################################################
# MAIN
######################################################################

def main():
    LINK, OUTPUT = commandLineSetup()

    # Check that proper inputs were provided
    if not LINK or not OUTPUT:
        print str(datetime.datetime.now()) + " - Insufficient inputs provided"
        exit()

    print str(datetime.datetime.now()) + " - Calling link %s" % LINK
    response = callAPI(LINK)
    print str(datetime.datetime.now()) + " - Outputting to file %s" % OUTPUT
    toJSON(OUTPUT, response.json())
    print str(datetime.datetime.now()) + " - Done!"

if __name__ == "__main__":
    main()

Notice that my script relies on two modules that may not be available in every environment: requests and argparse.  I use requests to ping the API to retrieve data and argparse to parse passed-in information from the command line.  To accommodate these two modules, I create a virtual environment called “example”, which has a requirements.txt file.

Once my virtualenv is configured, I can test out my Python script from a terminal window by running a command from within the working directory:

[Screenshot: running the script from Terminal inside the virtualenv]

My working directory in this case is ~/Desktop/Personal/Sites/Blog/Projects.  This is also where I have my job saved:

[Screenshot: the Kettle job saved in the working directory]

Therefore, when configuring my script for execution within Kettle, I can use the variable ${Internal.Job.Filename.Directory} to specify my working directory and enter the same command as I did in Terminal and everything will execute just as it did in Terminal:

[Screenshots: the job’s “General” and “Script” tabs]

To check out my example job, please download this file (make sure to create the virtual environment before attempting to run the job; name the virtual environment “example”).

Recap

When executing scripts in Kettle, it is better to use the “Insert script” option, since it allows for:

  • Better job transferability
  • Easier compatibility of virtual environments
  • Integration of Kettle environment variables

I hope you find this useful!

Parsing Huge XML Files in Pentaho Kettle

Recently I was working with Amazon Data Feed files.  These files are about 1GB zipped up and about 15GB unzipped.  And they’re XML.  These files…  are HUGE!


For files of this magnitude, the traditional Get Data from XML step does not work since that step requires Kettle to load the XML data in memory, which at 15GB will cause Kettle to freeze and crash.  However, there is an alternative step called XML Input Stream (StAX) which can be used to parse the XML data.  This post will discuss how to configure that StAX step.

To start off, this is how the transformation looks:

[Screenshot: overview of the transformation]

1. XML Input Stream (StAX):

The main fields to complete here indicate…

  • Where the data is coming from
  • Whether the data should be limited (limiting the data initially is helpful as you figure out the structure of the data)

This is what my configuration looks like:

[Screenshot: the XML Input Stream (StAX) step configuration]

The options below the ones mentioned specify what kind of data you want extracted from the XML file.  StAX essentially parses the file node-by-node and converts every node/tag/attribute to a row.  This is what the data looks like in preview:

[Screenshot: preview of the rows produced by the StAX step]

The level of detail here is extremely granular.  Every start element, end element, attribute, etc. is given its own row.  Therefore, when running the transformation on a full file, expect to see 100M+ rows being passed through.  Processing should still be fast, though.

2. Filter

The second step, the filter step, filters out the START_ELEMENT and END_ELEMENT rows, thus reducing the amount of data Kettle needs to work on.  This works if you are not interested in the start/end element information.  You can always keep it; however, the subsequent JavaScript step depends on there being only one xml_data_name row per unique attribute of the entity.

[Screenshot: the filter step configuration]

3. JavaScript

As mentioned above, the transformation depends on having only one xml_data_name per unique attribute.  That unique attribute is used to increment a grouping variable, which is then used to differentiate one entity from the next.

Here’s my configuration for this (my explanation for this configuration is below the images):

[Screenshot: the JavaScript step’s start script]

[Screenshot: the JavaScript step’s main script]

First, I set up a start script.  What does this do?  It initializes a variable called “group” upon start-up of the step.  In this case, the group variable is being set to 0.

Second, in a standard JavaScript tab, I increment the “group” variable based on data passing through the JavaScript step.  Upon previewing the content of the XML, I noticed that the first unique attribute being passed in is called “ASIN”:

[Screenshot: preview showing “ASIN” as the first unique attribute]

Therefore, I chose to increment my “group” variable whenever xml_data_name is equal to the value “ASIN”.  Note that this could be different for different XML files, so always make sure to review your incoming data thoroughly to determine which value is unique; it may not always be the first one!

4. Denorm

Using the newly created “group” variable, I can now denorm the data into a format that’s easier to read.  In the denorm step, I pivot off data in the xml_path field and bring in data from the xml_data_value field.

[Screenshot: the denorm step configuration]

The denorm step provides the option to concatenate/group elements together, which makes things easier in terms of combining elements that might have multiple values.

(Note: do NOT attempt to sort the data by the grouping variable, since it may cause Kettle to crash — sorting is another step that is run in memory, which is what we want to avoid doing with a data file this large.)

And with that, your XML should be starting to look like a really nice CSV:

[Screenshot: the final, denormalized output]

To recap:

  1. Open the file in the StAX step — this will convert every aspect of the XML into a row.
  2. Filter out START_ELEMENT and END_ELEMENT.
  3. Figure out what the unique field is and use JavaScript to increment and create a variable for grouping.
  4. Use the grouping variable to denormalize the data of interest.

To get started, feel free to use my sample transformation.

Workaround for python setup.py egg_info Error

Recently, I was working in Python and trying to install the pandas module using the pip command, but kept getting an error like this:

InstallationError: Command python setup.py egg_info failed with error code 1 in /var/www/python/virtualenv

So frustrating!

I spent about two hours trying to figure out how to resolve this issue and wanted to share my solution with you here.

To summarize, I downloaded the necessary package using apt-get and then moved the module into my virtual environment with a couple of bash commands. So, even if you did not experience the same error message as me, this post might be helpful if you’re just looking to move modules from the general Python directory into your virtual environment.

For those looking to get a better understanding of the difference between apt-get and pip, I recommend Aditya’s answer on this StackOverflow post (I don’t have much experience in this).

For those looking to resolve the error:

  1. To start, you need to install the desired module using the apt-get command. For pandas, that’s this:
    sudo apt-get install python-pandas

    This installs the module in a system-wide location on your server.

  2. Find where your recently-downloaded module is located on the server. It should be in your dist-packages directory under your installation of Python.

    The complete path for me was located at /usr/lib/python2.7/dist-packages.

    You can also launch Python using the command python -v and see where all the different loaded packages reside upon start-up of the interpreter.
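
    Alternatively, a quick way to locate it (my own sketch, assuming the python-pandas package from step 1 is already installed) is to ask the system Python directly:

    import pandas
    print(pandas.__file__)  # should print a path under dist-packages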

  3. Create a directory for the module in your virtual environment:
    cd /var/www/python/virtualenv/lib/python2.7/site-packages && mkdir pandas
  4. Now move all files from the system-wide module directory into your virtual environment:
     mv /usr/lib/python2.7/dist-packages/pandas/* /var/www/python/virtualenv/lib/python2.7/site-packages/pandas/

And that’s it! You should now be able to access the module in your virtual environment. You can verify this by launching your virtual environment version of Python and attempting to import the module. No errors means success.
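
For example, a minimal check (my own sketch, again using pandas) from your virtual environment’s Python interpreter:

import pandas

print(pandas.__version__)  # the module imports cleanly
print(pandas.__file__)     # should now point into the virtualenv's site-packages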

Do note that given discrepancies between apt-get and pip, the version of the module you transferred may not be the most recent version available.

Also, this might not be the best solution for your problem! I was stuck on the error for a long time and opted for a workaround like this to alleviate my issues after trying a number of different recommendations across the web. If you find a better solution, please forward it my way.
