Automation of Impala jobs within Cloudera’s Hadoop distribution

Impala is a very effective way to run analytical queries in Hadoop, so it is natural to want to automate Impala jobs. A typical use case is a daily reporting query, where Impala can be a faster and more efficient way of querying large amounts of data in Hadoop than Hive.

How to do it?

Since Impala jobs cannot be scheduled directly within Oozie or from Impala's CLI, another solution is needed. Assuming that Hadoop is installed on a Linux cluster, we can use cron, a daemon service of Linux operating systems that makes it easy to schedule tasks and scripts. It is the equivalent of Task Scheduler in Windows, so if our Hadoop cluster is installed on Windows machines, we can use Task Scheduler instead of cron.

Creating the script

So all we have to do is create a script containing an impala-shell command that runs our SQL query. But what if we have a daily or weekly SQL script in which the dates in the query matter? Then we have to create Bash date variables and insert them into the Impala query. However, impala-shell itself knows nothing about the variables of the Bash script: it receives the query as a plain string and runs exactly that static SQL text.

The solution is a Bash script that generates a fresh SQL script on every run. On each run it writes the current date values into the SQL script and then executes the script it has just created. That's how it works!

Assuming that we run the script on a node that runs an impalad daemon, here is an example script that runs daily:
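The original listing is not reproduced here, so the following is only a minimal sketch of the approach described above, not the author's script. The working directory, the table name web_logs and the column event_date are hypothetical placeholders. The impala-shell options used are -f (query file), -B (delimited output) and -o (output file). The sketch only prints a message when impala-shell is not on the PATH, so that the SQL generation step can be tried on any machine.

```shell
#!/usr/bin/env bash
# Sketch: regenerate a dated SQL file on every run, then execute it with impala-shell.
set -eu

# Hypothetical locations; adjust to your environment.
WORK_DIR="${WORK_DIR:-/tmp/impala_daily}"
SQL_FILE="$WORK_DIR/daily_report.sql"

# Current date in YYYY-MM-DD form; this is the value pasted into the query.
REPORT_DATE="$(date +%Y-%m-%d)"
OUT_FILE="$WORK_DIR/report_${REPORT_DATE}.txt"

mkdir -p "$WORK_DIR"

# Write a fresh SQL script. The here-document is unquoted, so Bash expands
# ${REPORT_DATE} before the text reaches the file.
# web_logs and event_date are made-up names for illustration.
cat > "$SQL_FILE" <<EOF
SELECT COUNT(*) AS daily_events
FROM web_logs
WHERE event_date = '${REPORT_DATE}';
EOF

# Run the generated script on the local impalad, if impala-shell is available.
if command -v impala-shell >/dev/null 2>&1; then
    impala-shell -f "$SQL_FILE" -B -o "$OUT_FILE"
else
    echo "impala-shell not found on PATH; only generated $SQL_FILE" >&2
fi
```

Because the SQL file is rewritten on every run, the query always contains the date of the day on which cron starts the script. A production version would probably exit with a non-zero status when impala-shell is missing, so that the failure is visible.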

Inserting our script into Crontab

Cron reads which script to run, and when, from the crontab file. On Linux every user has their own crontab file, which they can edit to schedule their own tasks.

To open the crontab that belongs to our user account, just type crontab -e into a terminal.

This opens a text file to which we can add our new entry. Each line defines one scheduled task in this format:
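The entry itself is not reproduced here. Judging from the description that follows, it presumably looked like the line below. The path to script.sh is a placeholder, not the original one.

```
# minute  hour  day  month  dayofweek  command
0         4     *    *      *          /path/to/script.sh
```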

This entry starts script.sh every morning at 4:00. The five parameters and their possible values, in order, are: minute (0-59), hour (0-23), day of month (1-31), month (1-12) and day of week (0-6, where 0 is Sunday). If we don't want to restrict one or more of the parameters, we just put a * in place of a value.
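To make the field order concrete, here is a small illustrative helper that splits a crontab entry into its five schedule fields and the command. The function name describe_cron_entry and the script paths are made up for this sketch. It does not validate the value ranges.

```shell
#!/usr/bin/env bash
# Sketch: label the five schedule fields of a crontab entry.
describe_cron_entry() {
    local minute hour day month weekday command
    # read splits on whitespace; the last variable receives the rest of the line.
    # No file name expansion happens here, so a literal * stays a *.
    read -r minute hour day month weekday command <<EOF
$1
EOF
    printf 'minute=%s hour=%s day=%s month=%s dayofweek=%s command=%s\n' \
        "$minute" "$hour" "$day" "$month" "$weekday" "$command"
}

# Daily at 4:00, as in the entry described above (the path is a placeholder).
describe_cron_entry '0 4 * * * /path/to/script.sh'
# Every Monday at 6:30, for a weekly report.
describe_cron_entry '30 6 * * 1 /path/to/weekly_script.sh'
```

The second call shows how a weekly job differs from a daily one: only the day-of-week field changes from * to a fixed value.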