Downloading Files in Pentaho Kettle

For one of my transformations in Pentaho Kettle, I wanted to download data from an internet source. The source did not have an API, so I could not use a REST step to call on it. Instead, the source had a link that connected to a CSV that could then be downloaded in the browser. I wanted to download the data from this URL and automatically place it on my computer so that I could manipulate it as necessary. To do this, I used an HTTP step. This step is only available in Jobs (not in Transformations). The icon looks like this:

HTTP Icon

This is what the menu looks like when you double-click on the step:

HTTP Menu

The main fields to look at here are the “URL” field and the “Target file” field.

  • URL: This field is where you specify the URL to which you want Kettle to connect in order to access the file you would like to download.
  • Target file: This field is where you specify where you want Kettle to store the file once it’s been downloaded. It’s best to use variable locations in order to make sure that you can run this from any computer without running into issues with directory setup and names.

This is what my completed step looked like:

HTTP Single File Complete
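
For readers who think in code, here is a rough Python sketch of what I understand the step to be doing with those two fields: fetch the URL and write the response to the target file. This is not how Kettle implements it, and the function name `download` and the example URL are placeholders I made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, target_file: str) -> Path:
    """Fetch `url` and write the response body to `target_file`.

    `url` plays the role of the "URL" field and `target_file` the role of
    the "Target file" field. Both names are illustrative, not Kettle's.
    """
    target = Path(target_file)
    # Create the target directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (hypothetical URL and path):
# download("https://example.com/data.csv", "/tmp/data.csv")
```

The sketch ignores authentication and uploads, just as I did in my own job.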

You can download the Job I set up here. Feel free to manipulate the setup in order for it to suit your needs.

You’ll notice that Kettle also provides options for authentication (in case there is a login process involved in accessing the file) and for uploading a file. I did not experiment with either of these options, but would be interested in hearing from those who have.

As you can imagine, having access to this step can be fairly useful. Now, what if we want to download multiple files at once? That’s where things get a bit more complicated. To see my solution to that issue, you can download this Job. The solution here is not as clear-cut as when we download one file from a single source. Here’s a step-by-step of how I set up the multi-file download job:

  1. START: Naturally, I start the job with a START step.
  2. PATH: Next, I use the “Set Variables” step to set my PATH variable. This is the variable I will use to specify the directory in which I would like to save the files once I download them. (Note: I could have used this same process of assigning a PATH variable in the last job, but instead elected to specify the path directly in the HTTP step. This step is not truly necessary, but is a good habit. Use a static directory here, since path variables get a bit murky when they are passed from one job to another. The original setup I used, C:\temp, may not work on your machine, so make sure to change it as necessary before running my job.)
  3. FILES: Here I used a “Transformation” step to call on a transformation named files.ktr that I set up in conjunction with this job. That transformation identifies what to name the file once it’s downloaded and the URL source of the file. Here’s that transformation step-by-step:
    • Data Grid: I set up a data grid with the columns filename and url. Under the Data tab, I specify the data for these columns:
      Files
    • Copy Rows to Results: Next I copy the created rows to the job using the “Copy Rows to Results” step. Note that between this step and the preceding step you can use JavaScript or some other combination of steps to add uniqueness to your filenames. I’ll leave that up to you to figure out.
  4. DOWNLOAD: Next I reference a job using the step “Job”. This job is structured similarly to our previous single-file download job, but instead of referencing constants, we now reference the variables we set up. This is where all the magic happens, so note the changes we made:
    • While still configuring the job setup in the MAIN job, we specify under the “Advanced” tab to “Copy the results to parameters” and also to “Execute for every input row”.
      Advanced Job Configuration
    • Still in the setup of the DOWNLOAD job in the MAIN job, we also specify that we are passing parameters:
      Parameters
    • Now in the DOWNLOAD job, we go to Edit > Settings and specify under the Parameters tab for the job to anticipate and utilize the parameters FILENAME and URL.
      Parameters
    • Lastly, in the HTTP step in the DOWNLOAD job, we use the parameter names instead of the constants we used in the single-file download.
      HTTP Variables
  5. SUCCESS: We connect to the Success step to mark the completion of the job.
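
To make the flow of the steps above easier to follow, here is a minimal Python sketch of the same idea: a list of rows with filename and url columns, a loop that runs once per row, and `${...}` placeholders that are filled in from that row. It only mimics the logic and is not how Kettle works internally. The function names `resolve` and `download_all` are my own inventions, and I am assuming the target file is built as `${PATH}/${FILENAME}`, which may not match your setup.

```python
import shutil
import urllib.request
from pathlib import Path
from string import Template


def resolve(text, variables):
    """Fill in ${NAME} placeholders, loosely like Kettle's variable substitution."""
    return Template(text).substitute(variables)


def download_all(rows, path):
    """Download every row's url to <path>/<filename> and return the saved paths.

    `rows` stands in for the Data Grid output passed on by "Copy Rows to
    Results"; `path` stands in for the PATH variable.
    """
    saved = []
    for row in rows:  # comparable to "Execute for every input row"
        # Comparable to "Copy the results to parameters".
        params = {"PATH": path, "FILENAME": row["filename"], "URL": row["url"]}
        url = resolve("${URL}", params)
        # Assumed layout of the target file; adjust to your own job.
        target = Path(resolve("${PATH}/${FILENAME}", params))
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved


# Example call (hypothetical rows and directory):
# download_all(
#     [{"filename": "a.csv", "url": "https://example.com/a.csv"},
#      {"filename": "b.csv", "url": "https://example.com/b.csv"}],
#     "/tmp/downloads",
# )
```

Keeping the list of files separate from the download logic is the same design choice as splitting files.ktr from the DOWNLOAD job: adding another file only means adding another row.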

And that’s it! A bit more complicated, but definitely handy for downloading batches of files.

Big thanks to Raffael Volger who initially posted on this.
