This is a follow-up to my previous article on how to set up virtual environments for Python.
I recently started coding a bit again and attempted to set up a virtual environment per those older instructions, only to realize that they were outdated: Python 2.7 is now largely deprecated (replaced by Python 3.x), and the tool I previously used to install pip (easy_install) has been superseded by better options (Homebrew).
So, with that in mind, I thought I would provide an update to that older article on how to set up a virtual environment.
# Check whether Python 3 is already installed
python3 --version
# Install the Xcode command line tools
xcode-select --install
# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
# Install the latest Python 3 via Homebrew
brew install python
# Confirm the new Python 3 version
python3 --version
# Create a virtual environment named "project" using the built-in venv module
python3 -m venv project
# Activate the virtual environment
source project/bin/activate
# Deactivate it when you're done
deactivate
# Export the environment's installed packages to a requirements file
pip freeze > requirements.txt
# Install packages from a requirements file
pip install -r requirements.txt
And that’s essentially it! If you don’t have Xcode and Homebrew, this process can be quite tedious. But once those utilities are installed, creating virtual environments for Python 3 is an easy process.
The other day, I updated the operating system on the server on which my crontab was installed. I thought this would be a harmless upgrade to the system, nothing too crazy. However, I was wrong: the crontab that had been working perfectly well on the old operating system seemingly ceased to work. Nothing had changed in my setup of the scripts or the crontab, so why was it not working?
I set about troubleshooting the issue with some guidance from this article:
Disclaimer: After all my troubleshooting, I ultimately did not resolve the core of my issue and instead decided to give up on crontab and use a different solution. So, if you’re looking for a solution to my crontab problems at the end of this article, you will not find it.
env EDITOR=nano crontab -e
And sure enough the setup of my crontab was as I had left it:
0 * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py
While I was here, I thought to update the "0" in my job to "*" so that the job would run every minute, which would make it easier to troubleshoot what was going on. I then saved and closed the crontab.
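For reference, the updated entry looked like this:

* * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py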
This is what I saw:
What happened to my crontab command? Well, just to be sure, I went ahead and entered the same command into this crontab file that I had opened with sudo permissions and checked whether that produced the result I wanted. It did not.
So, I speculated that perhaps iTerm, the Terminal application I was using to edit and install the crontab did not have proper permissions to install the crontab file on my server. I opened up my Mac’s “System Preferences”, accessible from the Apple icon in the upper left corner of the screen:
Navigated to “Security & Privacy” tab:
Scrolled to the “Full Disk Access” portion and unlocked the screen:
Then added the iTerm application to the list of applications with “Full Disk Access” using the “+” button under the list of applications to the right:
With iTerm now having “Full Disk Access”, I re-attempted to save the crontab (using sudo permissions), but had no success.
This is a directory that is otherwise hidden, so I used the “Spotlight” search feature on Mac (accessed via Cmd + Space) to search for the /etc/ directory and found it:
And sure enough, once I opened that directory, I noticed that it had no file called “crontab” in it.
In the absence of a “crontab” file, I decided I’d add one myself in a rather unorthodox manner. I opened a new file in Sublime Text, wrote my crontab command into the file, saved it to my “Desktop”, and then dragged it from that directory into the “etc” directory. What could go wrong?
And yet, despite all this, nothing seemed to change! My crontab job was still failing to run.
ps -ef | grep cron | grep -v grep
…to check if my crontab had run. And sure enough it had! Multiple times even!
To better understand this output, I looked up how to keep the column headers when piping ps output through grep and found this helpful writeup. I went ahead and amended my command to the following:
ps -ef | head -1 && ps -ef | grep cron | grep -v grep
Now that it seemed that something was running, I did some accounting: up to this point, I may have set up three different crontab files. One was set up via crontab using my own user permissions, another was set up via crontab using sudo permissions, and the last one was brute-forced into the "/etc/" directory using a text file.
Of those three, it seemed that one or two were running. I suspected that the command labeled "(cron)" was a result of my sudo permissions manipulation and that the command labeled "/usr/sbin/cron" was coming from the one set up with my standard user permissions. As to why two instances existed for both, I was not sure. So, in an effort to verify this, I deleted my brute-forced crontab file from the "/etc/" directory and waited a minute to see if the logs looked any different.
The logs did not look any different. Thus, my next step was to delete the crontab I added using sudo permissions. To do this, I opened the crontab file again using sudo permissions, deleted its contents, and saved. Again I waited a minute and re-ran the logs to see if anything had changed:
This time around, I noted that the duplicate commands had been removed. This told me a couple of things: a) the crontab I had originally configured using my standard user permissions was running, and b) it was generating two commands. I suspected two commands were showing because I was concatenating two commands in my crontab: one command to change directories and another to run a Python script. To test this theory, I replaced my crontab with four test entries:
* * * * * echo 'success' >> /tmp/log1.log
* * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && echo 'success' >> /tmp/log2.log
* * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py >> /tmp/log3.log
* * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py && echo 'success' >> /tmp/log4.log
The first thing I noted when running the new crontab was that each command line in the file generated two commands in the crontab run logs:
So my previous assumption that the number of commands in the log corresponded to the number of commands in a line was not correct.
Second, I noted that while I expected 4 logs to generate, only 3 did:
And, unfortunately, they were all empty. This was not what I expected at all since when I ran the third command line from my crontab file in Terminal…
cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py >> /tmp/log3.log
…it generated a log file with a value in it:
I was starting to suspect that crontab was having trouble running my Python script.
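In hindsight, one way to confirm that suspicion would have been to also capture stderr in the log file, so that any error from the Python interpreter (a bad path, an import failure, a permissions error) would be written out. A minimal sketch, reusing the third test entry from above:

* * * * * cd ~/Desktop/how\ to\ troubleshoot\ crontab/ && python2.7 hello_world.py >> /tmp/log3.log 2>&1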
I apologize for the disappointing conclusion. Nonetheless, I think the troubleshooting process for my issue is helpful in understanding how crontab works and how to check that crontab is running. The main takeaways:
After struggling for a long time with trying to make my crontab work, I eventually gave up and sought out a different method to set up an automated way for my script to run. I came across multiple articles online mentioning Automator, a macOS application that allows you to automate certain functions on your computer. Many of these guides mentioned creating an Automator job and then scheduling it to recur via iCal.
Here is how I went about doing that:
Automator Applications are self-running whereas Automator Workflows require user inputs to kick off.
Now let’s go over this screen a bit, because it was very confusing to me when I first opened this application:
Within the application, you have three main panels. The far left panel with the expandable/collapsible directories is a way to navigate through the different actions available for automation. When you first open a new file in Automator, you are automatically placed at the top of this directory (Library) and all actions are displayed.
The second panel, the one in the middle, is where the individual actions for automation are accessible. You can scroll up and down to see some of the actions at your disposal: you can launch an application, copy and paste files from one folder to another, download URLs, or do many other things using the actions in this column.
The third panel on the far right of the screen is the workflow panel. This allows you to build your workflow using the actions in the middle panel. This is where Automator becomes really powerful since it allows you to string together simple actions to form complex workflows: you can download URLs, then copy the files downloaded from one directory to another, and then launch a script to run a process on those newly transferred files. In order to build these workflows, you drag and drop actions from the middle panel into the far right panel.
At the top right corner, you have Record, Step, Stop, and Run buttons that will assist in checking your workflow once you have something down.
/Users/mnachum/Desktop/./hello_world.sh >> /tmp/log.log
Checking the logs within Automator, everything seemed to have worked as expected. I also checked the outputted logs from my command (/tmp/log.log) and verified that those looked correct, as well.
I saved my Application:
…filtered to Calendar events in the directory panel, and dragged and dropped the “New Calendar Event” option into my Workflow. Under the “Alarm” dropdown, there’s an option to “Open file”:
I configured my New Calendar Event to open my previously configured Automator Application:
Once the “Open file” option is selected under the “Alarm” field, an application needs to be selected. I selected my previously saved Automator Application.
…and verified that the Event was added to my iCal, as expected:
Whether or not you save this workflow is up to you. My desired outcome for this Automator Workflow was completed but I still decided to save my Workflow in case I needed to troubleshoot.
And that’s it! That’s how I was able to stop using crontab to run my scripts and instead started using a combination of Automator Applications/Workflows and iCal Events. Note that from my experience, the Automator and iCal applications do not need to be open in order for the automated job to complete.
When the Automator Application does run, you will be able to see that it is running via the appearance of a sun-like icon in the toolbar:
Why might you want to convert your Python script to a Shell script?
Originally, I thought that converting my Python script to a Shell script might make it easier for my crontab to run it, since the main benefit of a Shell script (as far as I can tell) is that it removes dependencies related to directories. Instead of putting a lot of information in your crontab command about changing directories, initializing virtual environments, and running scripts, a Shell script can bundle it all into one simple command.
So without further ado, here is how you can turn your Python script into a Shell script:
#!/bin/sh
CWD="$(pwd)"
path/to/python python_script.py
Adjust the last command to be specific to your file.
Be mindful that the Shell script runs in whatever directory you invoke it from. That's what the "pwd" command captures: "pwd" prints the current working directory, not necessarily the directory in which the script file lives. You can test this out by navigating to a directory in Terminal and then running the "pwd" command:
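For example, using the project directory from this article:

cd ~/Desktop/how\ to\ troubleshoot\ crontab
pwd
# prints: /Users/mnachum/Desktop/how to troubleshoot crontab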
To be safe, you can always use the absolute directory to your virtual environment and Python script. So, in my case, if I were to save the Shell script into my directory…
/Users/mnachum/Desktop/how to troubleshoot crontab
…I could format my Shell script as follows:
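(The following is a sketch; it mirrors the final script below, minus the dirname and echo lines.)

#!/bin/sh
CWD="$(pwd)"
/Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello/bin/python /Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello_world.py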
But, just to be safe that the Shell script will continue to work even if I moved it outside of its original directory, I formatted it as follows:
#!/bin/sh
cd "$(dirname "$0")";
CWD="$(pwd)"
echo $CWD
/Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello/bin/python /Users/mnachum/Desktop/how\ to\ troubleshoot\ crontab/hello_world.py
chmod u+x python_script_shell.sh
In my case, this would be…
chmod u+x hello_world.sh
./hello_world.sh
And, if you were like me and used the absolute virtual environment and Python script directories, then you can move the Shell script outside of its initial directory and still run it. In the example below, I moved the Shell script out of its folder and onto the Desktop:
For additional information, reference this StackOverflow thread.
Python makes it very convenient to configure your script to take in certain command line inputs.
It’s important that you, as a writer of Python code, configure your scripts to take advantage of this, since it makes your scripts more modular. And modularity is the key to successful development, since it allows you to develop more quickly by building on a foundation that you’ve already put in place.
Plus, you can feel like a true champ when a colleague asks you for help and in no time you are able to provide them with a script you’ve already developed and which solves their exact conundrum.
The module I use in my Python scripts to allow for the passing of command line inputs is called argparse. It ships with the standard library as of Python 2.7; if you're on an older version of Python, you will need to run the following command to install it in your virtual environment:
pip install argparse
I will be referencing this Python script throughout this example, so feel free to download and reference.
Once you have argparse installed, import it into your script and create a function to collect command line inputs. This is the format I use for that function:
def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                        "and outputs data to JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")
    args = commandParser.parse_args()
    return args.link, args.outputfile
Let’s break down what each of these elements means:
To utilize this function, we need to call it when our script begins to run.
This is my reference:
LINK, OUTPUT = commandLineSetup()
Notice that the order of the variable assignment in my call to the function mirrors the order of the variables in the return clause of the function. (Note: You do not need to capitalize your variables; for me, it’s a personal preference, since it allows me to clearly note which variables are passed in and which are not.)
Lastly, I include a conditional statement in my script to verify that I have indeed received all the variables necessary to run my operations successfully:
if not LINK or not OUTPUT:
    print str(datetime.datetime.now()) + " - Insufficient inputs provided"
    exit()
The conditional statement stops the script from running if either input variable is missing and returns an error message in Terminal to notify the user of what has happened. As the developer, you can write as many conditionals as you want in your script. For example, I could have added a regex clause to verify that the passed in OUTPUT string ends with a ".json" file extension:
if re.search(r'\.json$', OUTPUT) is None:
    print str(datetime.datetime.now()) + " - Incorrect format for JSON file detected"
    exit()
Similarly, I could have set up a different conditional statement to verify that the passed in link is a valid URL.
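A minimal sketch of what that might look like (this only checks for an http/https prefix and assumes "import re" at the top of the script):

if not re.match(r'https?://', LINK):
    print str(datetime.datetime.now()) + " - Invalid URL provided for the link"
    exit()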
Once you've set up your argument parsing function, you can do all kinds of nifty things in Terminal. For example, by passing the "-h" flag, you can return information about what your script does and what variables it expects:
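With the parser defined above, argparse generates help output roughly along these lines (the script filename here is hypothetical):

python api_to_json.py -h
usage: api_to_json.py [-h] [-l LINK] [-o OUTPUTFILE]

Pings an API link for data and outputs data to JSON file

optional arguments:
  -h, --help            show this help message and exit
  -l LINK, --link LINK  API link to ping for information
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        Name/Path of the JSON output file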
If we pass in information for our variables, we can see that the script is able to run to completion:
We can also test that the opposite is true: if we don't pass certain variables to our script, it will fail with an error message:
Isn’t this nice?
I encourage you to learn how to parse command line arguments and to do this for all your Python scripts so that you can add modularity to them and impress your friends.
My complete Python example (used in this post) is available here.
You can learn more about the argparse module here.
Surprise! One of this blog’s most successful posts is about how to run scripts in Pentaho Kettle.
Confession: I wrote that post a long time ago (in fact, it was one of my very first posts about Pentaho Kettle). And since then, I’ve learned a lot more about Kettle and about running scripts in Kettle. Therefore, I wanted to offer a refresher to the original post and a new recommendation on how to better run scripts in Kettle.
More recently, I’ve been running scripts in Kettle like this:
What’s different?
On the “Script” tab, this is my configuration:
In here, I'm using environment variables to specify my output directories, which makes the job easier to transfer when exchanging it with other people. Additionally, I am not relying on the machine's version of Python, but rather a version of Python specific to a virtualenv. This, again, better ensures that when I transfer my job to other people, they are able to recreate the virtual environment I've created on my machine and run the job without a problem.
Let’s say I wrote a script that:
The script takes in two inputs: a link to ping the API and an output filename where the returned JSON will be placed. This is my script:
import requests
import argparse
import json
import datetime

# Writes to a JSON file.
# Input: filename/location string, data in the form of array of
# dictionaries
###################################################################
def toJSON(filename, data):
    with open(filename, 'w') as outfile:
        json.dump(data, outfile)

# Call a given link.
# Input: API link to ping
###################################################################
def callAPI(link):
    response = requests.get(link)
    return response

# Parses incoming information
######################################################################
def commandLineSetup():
    commandParser = argparse.ArgumentParser(description="Pings an API link for data "
                                                        "and outputs data to JSON file")
    commandParser.add_argument("-l", "--link", help="API link to ping for information")
    commandParser.add_argument("-o", "--outputfile", help="Name/Path of the JSON output file")
    args = commandParser.parse_args()
    return args.link, args.outputfile

######################################################################
# MAIN
######################################################################
def main():
    LINK, OUTPUT = commandLineSetup()

    # Check that proper inputs were provided
    if not LINK or not OUTPUT:
        print str(datetime.datetime.now()) + " - Insufficient inputs provided"
        exit()

    print str(datetime.datetime.now()) + " - Calling link %s" % LINK
    response = callAPI(LINK)

    print str(datetime.datetime.now()) + " - Outputting to file %s" % OUTPUT
    toJSON(OUTPUT, response.json())

    print str(datetime.datetime.now()) + " - Done!"

if __name__ == "__main__":
    main()
Notice that my script relies on two imported packages: requests, which is not part of the standard library, and argparse. I use requests to ping the API to retrieve data and argparse to parse passed-in information from the command line. To accommodate these two modules, I create a virtual environment called "example", which has a requirements.txt file.
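A sketch of how that environment might be created (I'm assuming virtualenv for Python 2.7 here; the exact commands depend on your setup):

virtualenv example
source example/bin/activate
pip install requests
pip freeze > requirements.txt
deactivate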
Once my virtualenv is configured, I can test out my Python script in my virtualenv in my terminal window by running a command from within the working directory:
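The command looks something like the following (the script and output filenames are hypothetical; the point is to call the virtualenv's copy of Python directly):

example/bin/python api_to_json.py -l "<API link>" -o output.json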
My working directory in this case is ~/Desktop/Personal/Sites/Blog/Projects. This is also where I have my job saved:
Therefore, when configuring my script for execution within Kettle, I can use the variable ${Internal.Job.Filename.Directory} to specify my working directory and enter the same command as I did in Terminal and everything will execute just as it did in Terminal:
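In other words, the "Insert script" contents end up being a sketch along these lines (again, with hypothetical filenames):

cd ${Internal.Job.Filename.Directory}
example/bin/python api_to_json.py -l "<API link>" -o output.json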
To check out my example transformation, please download this file (make sure to create the virtual environment before attempting to run the job; name the virtual environment “example”).
When executing scripts in Kettle, it is better to use the “Insert script” option, since it allows for:
I hope you find this useful!
Recently I was working with Amazon Data Feed files. These files are about 1GB zipped up and about 15GB unzipped. And they’re XML. These files… are HUGE!
For files of this magnitude, the traditional Get Data from XML step does not work since that step requires Kettle to load the XML data in memory, which at 15GB will cause Kettle to freeze and crash. However, there is an alternative step called XML Input Stream (StAX) which can be used to parse the XML data. This post will discuss how to configure that StAX step.
To start off, this is how the transformation looks:
The main fields to complete here indicate…
This is what my configuration looks like:
The options below the ones mentioned specify what kind of data you want extracted from the XML file. StAX essentially parses the file node-by-node and converts every node/tag/attribute to a row. This is what the data looks like in preview:
The level of detail here is extremely granular. Every start element, end element, attribute, etc. is given its own row. Therefore, when running the transformation on a full file, expect to see 100M+ rows being passed through. Processing should still be fast, though.
The second step, the filter step, filters out the START_ELEMENT and END_ELEMENT rows, thus reducing the amount of data Kettle needs to work on. This works if you are not interested in the start/end element information. You can always keep it; however, the subsequent JavaScript step depends on there being only one xml_data_name for the unique attributes of the entity.
The transformation is dependent on having only one xml_data_name for unique attributes. These unique attributes are used to increment a grouping variable that is used to differentiate one entity from the next.
Here’s my configuration for this (my explanation for this configuration is below the images):
First, I set up a start script. What does this do? It initializes a variable called “group” upon start-up of the step. In this case, the group variable is being set to 0.
Second, in a standard JavaScript tab, I increment the “group” variable based on data passing through the JavaScript step. Upon previewing the content of the XML, I noticed that the first unique attribute being passed in is called “ASIN”:
Therefore, I chose to increment my “group” variable on whether xml_data_name is equal to the value “ASIN”. Note that this could be different for different XMLs, so always make sure to review your incoming data thoroughly to determine which value is unique — it may not always be the first one!
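As a sketch, the two scripts in the JavaScript step end up looking something like this (assuming "ASIN" is the unique attribute, as it was in my file; "group" is then added as an output field of the step):

// Start script: runs once when the step starts
var group = 0;

// Main script: runs for every incoming row
if (xml_data_name == "ASIN") {
  group = group + 1;
}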
Using the newly created “group” variable, I can now denorm the data into a format that’s easier to read. In the denorm step, I pivot off data in the xml_path field and bring in data from the xml_data_value field.
The denorm step provides the option to concatenate/group elements together, which makes things easier in terms of combining elements that might have multiple values.
(Note: do NOT attempt to sort the data by the grouping variable, since it may cause Kettle to crash — sorting is another step that is run in memory, which is what we want to avoid doing with a data file this large.)
And with that, your XML should be starting to look like a really nice CSV:
To recap:
To get started, feel free to use my sample transformation.
Recently, I was working in Python and trying to install the pandas module using the pip command, but kept getting an error like this:
InstallationError: Command python setup.py egg_info failed with error code 1 in /var/www/python/virtualenv
So frustrating!
I spent about two hours trying to figure out how to resolve this issue and wanted to share my solution with you here.
To summarize, I downloaded the necessary package using apt-get and then moved the module into my virtual environment via a bash command. So, even if you did not experience the same error message as I did, this post might be helpful if you're just looking to move modules from the general Python directory into your virtual environment.
For those looking to get a better understanding of the difference between apt-get and pip, I recommend Aditya’s answer on this StackOverflow post (I don’t have much experience in this).
For those looking to resolve the error:
sudo apt-get install python-pandas
This installs the module in a system-wide location on your server.
The complete path for me was located at /usr/lib/python2.7/dist-packages.
You can also launch Python using the command python -v and see where all the different loaded packages reside upon start-up of the application.
cd /var/www/python/virtualenv/lib/python2.7/site-packages && mkdir pandas
mv /usr/lib/python2.7/dist-packages/pandas/* /var/www/python/virtualenv/lib/python2.7/site-packages/pandas/
And that’s it! You should now be able to access the module in your virtual environment. You can verify this by launching your virtual environment version of Python and attempting to import the module. No errors means success.
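A quick way to check, assuming the virtual environment path used throughout this post:

source /var/www/python/virtualenv/bin/activate
python -c "import pandas; print(pandas.__version__)"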
Do note that given discrepancies between apt-get and pip, the version of the module you transferred may not be the most recent version available.
Also, this might not be the best solution for your problem! I was stuck on the error for a long time and opted for a workaround like this to alleviate my issues after trying a number of different recommendations across the web. If you find a better solution, please forward it my way.
I was recently attempting to download a lot of data from a site using Pentaho Kettle. Downloading the files was simple with the setup from my previous post. However, once the files were downloaded, I realized they were all in .zip format. I needed to unzip them.
The process was fairly difficult for me and the web was of little to no help. Therefore, I thought I’d put together this post so that future users of Pentaho Kettle don’t have to go through the same headaches I did.
The step to unzip files is only available in Jobs (not transformations). The icon looks like this:
This is what the step looks like when you double-click it (with some fields filled in by me):
The step, as it is shown here, is configured to unzip multiple files. If you only have one file to unzip, then this process is much easier. Additionally, you’ll notice that I have set up a ${PATH} variable for my job that points to a directory on my computer.
The main fields to note here are:
After specifying where the zipped files reside, you need to specify where you would like them unzipped. Specify your target directory in the "Target Directory" field.
Additionally, Pentaho Kettle offers you some other options, such as whether you only want to pull specific items from the zipped file (if there are multiple items in it) or if you want to exclude certain items. You’ll see in my example screen capture that I’m only pulling files ending in a .txt from the zipped files I’m unzipping.
Lastly, in my setup of the step, I elected to overwrite any existing documents, since I intended to run and re-run the step a couple of times (for troubleshooting).
That’s about it. I realize it looks simple, but it took me about an hour to get this hammered out. It seemed like no one else at my company had had a chance to unzip multiple files at once in the past.
Feel free to download my example as a starting point.