How to schedule Cloudera Impala data pipelines in Apache Oozie?

Oozie is a workflow scheduler built on Hadoop that lets us create workflows and schedule them. We can build data pipelines whose components can be Java code, Sqoop, Pig, Hive, shell scripts and so on. Inside a workflow, jobs can be defined to run either in parallel or in sequence. HUE provides a graphical interface for Oozie, where we can conveniently define, manage and monitor our jobs. If we do not want to use the HUE GUI, Oozie components such as workflows and coordinators can also be defined in XML files. There is also a command-line interface for Oozie, so we can submit our workflows and coordinators from the shell.
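As a rough illustration of the command-line route, the sketch below writes a `job.properties` file and submits it with the `oozie` client. The host names, ports, the HDFS application path and the helper names `write_job_properties` and `submit_workflow` are hypothetical placeholders, not values from this article.

```shell
#!/bin/bash
# Sketch only: every host, port and path below is a hypothetical placeholder.

# Write a minimal job.properties file that points Oozie at a workflow on HDFS.
write_job_properties() {
  local target="$1"
  # Quoted 'EOF' keeps ${...} literal so that Oozie, not bash, resolves it.
  cat > "$target" <<'EOF'
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/impala-wf
EOF
}

# Submit and start the workflow; OOZIE_CMD can be overridden for a dry run.
submit_workflow() {
  local oozie_url="$1" properties="$2"
  "${OOZIE_CMD:-oozie}" job -oozie "$oozie_url" -config "$properties" -run
}
```

We could then call `write_job_properties job.properties` followed by `submit_workflow http://oozie.example.com:11000/oozie job.properties`, replacing the URL with that of our own Oozie server.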

It is really important to emphasize that Oozie is built on Hadoop, which means that it uses HDFS as its default file system and runs on YARN. We should keep this in mind when we define our jobs. So Oozie seems to be a really nice and useful tool on Hadoop, until we want to use it with Cloudera Impala, because Oozie does not support Impala jobs. However, it does support Shell jobs, and since Impala has a CLI, we should be able to call impala-shell from a script inside a Shell job. Let's see how it works!

Creating the workflow 

  1. First, create a workflow, add a Shell job to it and give the job a name. 

  1. Set the Shell command parameter to the HDFS path of our shell script. It is good to know that Oozie runs every workflow and coordinator from its own working directory. 

  1. To pass input parameters to the shell script, add them as arguments of the Shell job. Shell scripts receive their arguments in a strict, positional order, so it is important to note the order in which we define them. 

  1. Give the workflow the script file's HDFS path, too. 
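
For readers who skip the HUE editor, the steps above correspond roughly to a Shell action in a workflow definition. The sketch below writes such a file from a shell function. The script name `run_impala.sh`, the properties `tableName`, `loadDate` and `scriptPath`, and the schema versions are assumptions chosen for illustration, so check them against your own Oozie installation.

```shell
#!/bin/bash
# Sketch only: names, properties and schema versions are illustrative assumptions.

# Write a workflow.xml with a single Shell action into the given directory.
write_workflow() {
  local dir="$1"
  mkdir -p "$dir"
  # Quoted 'EOF' keeps ${...} literal so that Oozie, not bash, resolves it.
  cat > "$dir/workflow.xml" <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.4" name="impala-shell-wf">
  <start to="impala-shell-node"/>
  <action name="impala-shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- The command to run, looked up in the job's working directory -->
      <exec>run_impala.sh</exec>
      <!-- Arguments reach the script as $1 and $2, in this order -->
      <argument>${tableName}</argument>
      <argument>${loadDate}</argument>
      <!-- HDFS path of the script, copied into the working directory -->
      <file>${scriptPath}#run_impala.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Shell action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF
}
```

Calling `write_workflow ./impala-wf` produces a `workflow.xml` that could then be uploaded to the application directory on HDFS.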

The shell script 

Let's see an example shell script. 

This is the magic part. Without this line the job won't run.

This is why we had to note the order of the given parameters. 

We cannot pass the query directly as a parameter of impala-shell, because then we would not be able to put parameters inside the query. 

When defining the query string, make sure you use double quotes, because only then will the shell expand the parameters inside the string. 
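
The difference between the two quoting styles is easy to check in isolation. In this small sketch the variable `table` and both queries are made up for illustration.

```shell
#!/bin/bash
# Hypothetical parameter value; in a real script it would come from an argument.
table="sales"

# Single quotes: the shell leaves $table untouched.
single_quoted='SELECT COUNT(*) FROM $table'

# Double quotes: the shell substitutes the value of $table.
double_quoted="SELECT COUNT(*) FROM $table"

echo "$single_quoted"
echo "$double_quoted"
```

The first line printed still contains the literal text `$table`, while the second one reads `SELECT COUNT(*) FROM sales`.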

Finally, pass the query string to impala-shell.
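
Putting the pieces described in this section together, a script of this kind might look like the sketch below. It is a reconstruction under assumptions, not the original listing: the `PYTHON_EGG_CACHE` export is only a commonly used candidate for the line the job depends on, and the Impala daemon host, the table and column names and the function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: host, table and column names are hypothetical placeholders.

# Assumption: impala-shell needs a writable Python egg cache inside the
# Oozie working directory; the source does not show which line it relies on.
export PYTHON_EGG_CACHE=./myeggs

# Build the query from positional arguments: $1 = table name, $2 = load date.
build_query() {
  local table="$1" load_date="$2"
  # Double quotes let the shell expand the parameters inside the string.
  echo "SELECT COUNT(*) FROM ${table} WHERE load_date = '${load_date}'"
}

# Hand the finished query string to impala-shell (-i host, -q query).
run_query() {
  local query
  query="$(build_query "$1" "$2")"
  "${IMPALA_SHELL:-impala-shell}" -i "${IMPALAD_HOST:-impalad.example.com}" -q "$query"
}

# Run only when both arguments are supplied, as the Shell job would do.
if [ "$#" -eq 2 ]; then
  run_query "$1" "$2"
fi
```

Saved under a hypothetical name such as `run_impala.sh`, the script would be called as `run_impala.sh sales 2020-01-01`, with the two values arriving in the order in which the arguments were defined in the workflow.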