R offers great features for performing statistical analysis on a database or creating data visualizations quickly and easily. It would be worthwhile to bring these advantages of R into a Hadoop environment.
It is possible to create Spark jobs in R. However, Cloudera's Hadoop distribution has no built-in support for using Impala from R; we can only communicate with the Impala service via SQL.
There is a solution that lets you communicate with Impala daemons from R. It also works when the Hadoop environment with Impala is installed remotely and reached from your desktop client. For this we need a special R package called RImpala. With RImpala we can connect to a remote Impala daemon over a JDBC connection, so we can write R code that invokes the Impala service while the Impala query itself runs on the Hadoop cluster. Since it is a Hadoop cluster, we can assume it holds a large amount of data that could not be processed on our local machine with R tools alone.
What do we need for this process?
Throughout this post we assume that a Cloudera Hadoop distribution is already installed somewhere, and that we have a client device where R can be installed. The network must allow us to reach the Impala service (at least one node of the cluster where an Impala daemon is running) from our client; the default port is 21050.
On the client you need to have:
- installed R environment
- installed Java (>=1.5)
- Cloudera Impala JDBC connector
- RImpala package
- RStudio (optional)
How it works
First we need to install the RImpala package:
- Open RStudio.
- In RStudio, click Install Packages under the Tools menu, then simply type the package name (RImpala) into the pop-up window.
- Choose a directory for the package to download.
Connecting to Impala
- First we have to load the RImpala library with library("RImpala"). This also loads the rJava package.
After that we have to pass the path of JDBC connector to R with
rimpala.init(impala_home = NULL, libs="C:/Users/username/Documents/R/impala-jdbc-0.5-2")
When using Windows, we have to take care to use forward slashes instead of backslashes.
If the previous steps succeeded, we are able to connect to Impala.
Authentication with Kerberos is also possible; in that case we would have to use an additional parameter called principal.
- If we have done everything well, R signals this with TRUE, and we are now connected to the Impala service.
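The steps above can be sketched as follows. This is a minimal example, not a definitive recipe: the connector path, host IP, and port are placeholders you should replace with your own values.

```r
# Load the package; this also pulls in rJava
library("RImpala")

# Point RImpala at the Cloudera Impala JDBC connector jars
# (example path; use forward slashes even on Windows)
rimpala.init(libs = "C:/Users/username/Documents/R/impala-jdbc-0.5-2")

# Connect to a cluster node running an Impala daemon
# (placeholder IP; 21050 is the default Impala daemon port)
rimpala.connect(IP = "192.168.1.10", port = "21050")
# Returns TRUE on success
```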
Creating Impala query with R
There are 10 functions implemented in the RImpala package. We have already seen init and connect. The most important one is rimpala.query(). With the query function we can pass an SQL query to Impala as a string parameter of the function. And, what is really the strength of RImpala, we can easily assign the result of the remotely run query to a variable in our R environment.
From that point our data has been transferred to our client, inside R. We can now do statistical analysis with the power and efficiency of the R language.
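For instance, once the query result sits in an R data frame, ordinary R tooling applies to it directly. The data frame and its columns below are hypothetical stand-ins for whatever rimpala.query() returned:

```r
# 'tracks' stands for a data frame fetched via rimpala.query();
# we build a small stand-in with hypothetical columns for illustration
tracks <- data.frame(genre      = c("rock", "jazz", "rock"),
                     length_sec = c(215, 340, 198))

# Standard R analysis on the fetched (small) data
summary(tracks$length_sec)
aggregate(length_sec ~ genre, data = tracks, FUN = mean)
```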
The most important thing to keep in mind is that we are copying data from Hadoop to our client. Hadoop is not called a Big Data framework by accident: it holds amounts of data that cannot be stored and/or processed on our client. So when we issue an Impala query, we have to make sure that Big Data has turned into Small Data by the time it is assigned to the variable on our client. The border between the two is not fixed, but a good data scientist always knows how much data can still be considered "small" in the given environment.
a <- rimpala.query(Q="SELECT * FROM predie.music LIMIT 10", isDDL=FALSE, fetchSize="10")
Variable a now holds the result table of the query passed as a string in the Q parameter. Note that the query function has 3 parameters, and the last 2 could have been omitted; in that case they take their default values, isDDL = FALSE and fetchSize = "10000". The isDDL parameter tells the function whether the query is a DDL statement; since we mostly write ordinary SQL queries, the default is FALSE. The fetchSize parameter tells the function the maximum number of records we expect to get back. If fetchSize is smaller than the number of records actually returned, we have effectively killed R and it freezes. So it is worth using a LIMIT clause, or thinking about the size of the result and the fetch size, before we submit the function.
That said, having a maximum fetch size is not fundamentally a bad thing. Think about it: if we submitted a SELECT * FROM table query and there were no fetch size, not only R but our whole client would freeze.
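One way to keep the result small is to let Impala do the filtering and aggregation on the cluster, so only the reduced result crosses the wire. A sketch, reusing the predie.music table from above (the artist column is a hypothetical example):

```r
# Aggregate on the cluster; at most 100 small rows come back to R
top_artists <- rimpala.query(
  Q = "SELECT artist, COUNT(*) AS plays
       FROM predie.music
       GROUP BY artist
       ORDER BY plays DESC
       LIMIT 100",
  isDDL = FALSE,
  fetchSize = "100"
)
```

The LIMIT clause and the matching fetchSize together guarantee that the query cannot flood the client, no matter how large the underlying table is.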
After all, what is it good for?
In the end we have to draw a conclusion: what is RImpala actually good for? It does not interpret R code as Impala queries and run it on the cluster; it can only submit SQL queries to Impala. So what is the use case of RImpala?
Already filtered or aggregated small data can be copied into R easily, without any file transfer or ETL tool. That is its real power. Imagine we have a data lake and would like to run a statistical analysis on just a part of the data, or create a data visualization as easily as possible; R would be one of the best solutions, and with the help of RImpala we can very easily work with the currently relevant part of our data lake.
It is not a rare case, right?