Generating a normal distribution with Apache Zeppelin running in a Docker container on the Microsoft Azure cloud platform

Apache Zeppelin with Spark on Docker on Microsoft Azure

This project was mainly motivated by a wish to try out the capabilities of Apache Zeppelin, which seems to have a lot of potential in the hands of a data scientist. The project is built around a dice game. The user decides how many dice to throw and how many times to throw them. Exciting so far, right? What is more exciting is that, with a sufficient number of throws, we expect a Gaussian distribution curve to emerge. Let's see whether it does.

Apache Zeppelin 

Apache Zeppelin is essentially a notebook that focuses mainly on the Apache Spark engine. A notebook is an interactive, mostly web-based environment where you can combine code, documentation and plots. It runs your code snippets immediately in the linked interpreter, and you see the results right at the end of the block. The smallest unit in such an environment is a block. Each block can use only one interpreter, for instance Scala, Python or SQL.


A notebook like Zeppelin is an appropriate tool when you would like to present your research, or when you simply want to create something reproducible and easy to understand. Imagine that you want to quickly create a working app with a visualization and hand it over to the end user. You can hide the code, leaving only the input fields and the visualizations visible, just as I did in my case.

I think notebooks will be an essential tool for every data scientist in the near future, because they are lightweight, simple and have all the capabilities a data scientist needs, but no more. And here is another remark: Zeppelin is a really good tool if you would like to use it on your own. However, I think it needs some basic features such as authentication and authorization to be enterprise-ready; right now it can barely be called that.


There is an excellent initiative out there called ZeppelinHub. Its creators realized that sharing interactive notebooks is not easy. Notebooks are stored as JSON-structured files on the host machine, and these files can be migrated and edited easily, but launching an interactive notebook requires a Zeppelin environment. Why would an end user have a Zeppelin environment? To be honest, it is not even simple to install one (see the Docker in Azure chapter). Implementing the authentication and authorization features mentioned above could actually solve this problem: we would be able to operate a shared Zeppelin server, which would solve it at least in-house (for instance, inside an enterprise). ZeppelinHub, however, is intended to handle this over the internet, so it could be used outside an enterprise as well. Unfortunately it is still in beta, but you can ask for an account.

Docker in Azure 

Docker is a really neat piece of technology, and there are several good reasons to use it. In this post I won't go deep into how either Docker or Azure works; I am only writing about why they were good partners in this project.

In this case I needed Docker because I realized that installing Zeppelin onto a working Hadoop cluster, or even onto a single Ubuntu node, is not "one-command easy". I really did not want to spend my time on installation and dependency issues, so I decided to search for an appropriate image on Docker Hub and pull it onto my machine. The machine is actually an Ubuntu Server running in my Microsoft Azure cloud environment, so I reached it via SSH.

Pulling the image from Docker Hub: 

Running the image and creating external folders on the host machine for storing notebook files (your life will be easier in the future if you do it this way): 

Now we only have to create the appropriate port forwarding rules on our machine or cloud environment, and double check them. After that we are able to use our Zeppelin Notebook! 

Based on this experience, I strongly recommend that whenever you want to try a new tool, especially an open-source one, you use Docker and get a Docker image of the tool. You will save yourself plenty of time. Nowadays a prepared Docker image is already available for almost every tool that is used by at least a small community.

Dice game 

I would love to show my project via ZeppelinHub, but so far I haven't received my beta account. So for now I am just pasting some of the code snippets here.

In the first block I am importing the Python packages and modules I will use later: numpy, mstats and pyspark.sql. 

I put %pyspark on the first line every time I use Python code in a block. This first line of the block determines the interpreter. If it is left out, the code automatically goes to the Scala interpreter, so %scala or %spark is not needed when you write Scala code.

I am creating the input fields in Scala: 


Next, I get the values from the Scala input blocks and run my Spark code in Python in order to throw those dice! I start this block with %pyspark again.

I create all the throws and store them in a Spark RDD. Next, I separate them for later use and add an index to each separate throw. This produces a separate tuple for each throw, holding an index and the values of that one throw, so each tuple looks like this:

(index, list(numDice random integers between 1 and 6))

The code: 
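As a rough illustration of that tuple structure, here is a minimal sketch in plain Python rather than Spark. The names `make_throws`, `num_dice` and `num_throws` are assumptions made for this example, and an ordinary list stands in for the RDD.

```python
import random


def make_throws(num_dice, num_throws, seed=None):
    """Return a list of (index, [die values]) tuples, one tuple per throw."""
    rng = random.Random(seed)  # seeded generator, so runs can be repeated
    return [
        (index, [rng.randint(1, 6) for _ in range(num_dice)])
        for index in range(num_throws)
    ]


# Hypothetical inputs: 3 dice, thrown 5 times.
throws = make_throws(num_dice=3, num_throws=5, seed=42)
for index, dice in throws:
    print(index, dice)
```

Each printed line shows the index of a throw followed by the list of values rolled in that throw.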

To make the normal distribution visible, I take the sum of each throw. A very low or very high sum means that the throw consists of extreme values, for example, with 3 dice: 1, 1, 1 or 6, 6, 6. Mixed values in a throw are more probable, because more combinations give a sum near the mean than give 3 or 18. That's why the sums will follow an approximately normal distribution.
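The counting argument can be checked directly by enumerating every possible outcome. The sketch below is plain Python, not part of the Zeppelin notebook, and the helper name `sum_counts` is an assumption made for this example.

```python
from collections import Counter
from itertools import product


def sum_counts(num_dice):
    """Count how many ordered outcomes of num_dice dice give each sum."""
    return Counter(
        sum(faces) for faces in product(range(1, 7), repeat=num_dice)
    )


counts = sum_counts(3)
for total in sorted(counts):
    # One '#' per outcome, which draws a small text histogram.
    print(f"{total:2d} {counts[total]:3d} {'#' * counts[total]}")
```

With three dice, only one of the 216 possible outcomes gives a sum of 3 and only one gives 18, while 27 outcomes each give 10 and 11.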

After that I create a schema object and a dataframe, and I register the dataframe as a temp table, which can be queried with SQL.
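The Spark temp table needs a running Spark context, but the kind of query that can feed the histogram is easy to sketch with Python's built-in sqlite3 module as a stand-in. The table name `throws`, the column names `idx` and `total`, and the generated sample data are all assumptions made for this example, not the names used in the project.

```python
import random
import sqlite3

rng = random.Random(0)
num_dice, num_throws = 3, 1000  # hypothetical input values

# One row per throw: (index, sum of the dice in that throw).
rows = [
    (index, sum(rng.randint(1, 6) for _ in range(num_dice)))
    for index in range(num_throws)
]

# In-memory SQLite table standing in for the Spark temp table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE throws (idx INTEGER, total INTEGER)")
conn.executemany("INSERT INTO throws VALUES (?, ?)", rows)

# Frequency of each sum, ordered so that it can be plotted as a histogram.
histogram = conn.execute(
    "SELECT total, COUNT(*) AS freq FROM throws GROUP BY total ORDER BY total"
).fetchall()

for total, freq in histogram:
    print(total, freq)
```

A GROUP BY query of this shape returns one row per distinct sum, which is the form a bar chart needs.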

And finally I get my Gaussian distribution!