If you are considering introducing a data visualization tool on top of your Hadoop environment that has to fit into your enterprise, Tableau could be your choice. Tableau has features like a server-desktop client architecture and Active Directory integration, which are typical requirements for a business intelligence tool in an enterprise environment.
Data stored in Hadoop can be visualized and integrated into a BI system with other tools too, such as Spotfire and QlikView. Yet almost everywhere I go, I see the Tableau logo on drawings of Hadoop-based architectures. Why is that? I honestly have no idea. Maybe their marketing works really well, or maybe I just go to the wrong places. Either way, their product does work with Hadoop, and I will dig into the mechanics of that later, but first let's talk a bit about the architecture of Tableau itself.
Tableau Desktop is the product with which end users create visualizations. Users can do this either with local data or with data coming directly from a connected database. Within Tableau Desktop terminology there are three hierarchical levels in a project:
- A sheet represents the lowest level, where a user can create a visualization based on the connected data.
- The next level is the dashboard. On a dashboard you are able to place multiple sheets, and you can also create actions and filters.
- At the top level there is the story. A story is a group of dashboards and sheets, usually collecting insights that belong to the same topic.
Users can publish their reports, whether they are stories, sheets or dashboards, as new views onto Tableau Server.
On the server we can browse Tableau reports and view or even edit them. The server has a detailed permission control system that determines how each user or user group can access the reports, or denies access to some of them entirely. There is also a three-level hierarchy for storing reports on Tableau Server. With these levels and the permission controls we can create a sophisticated access system for our users and groups that fits the enterprise's divisions. The three levels are:
- A view is basically a report in Tableau Server terminology. A view can be a published sheet, dashboard or a story.
- A workbook can contain multiple views. When publishing, we can publish each item (sheet, story or dashboard) from our Tableau Desktop project file separately. Of course, if we have already put everything under a story, this is unnecessary. However, we can publish multiple stories from one Tableau Desktop project under a single workbook. Thus one workbook corresponds to one Tableau project file.
- A project is where you store multiple workbooks. An important thing to mention: a project behaves like a folder, not a group. That means you cannot have the same workbook synchronized across multiple projects, as you could if a project were a logical grouping rather than a physical place.
Connecting Tableau to Hadoop
We have arrived at the most exciting part. There are a lot of embedded connectors in Tableau that support connecting to Hadoop; here I am going to cover connecting to Cloudera's Hadoop distribution.
If you would like to connect Tableau to Cloudera Hadoop, you can do it via ODBC, through either Impala or Hive. When connecting to Hive, you have to use HiveServer's external port; when connecting to Impala, you can use any node that has an Impala daemon running on it. For a Cloudera-Tableau connection you will need either Hive's ODBC driver or Impala's ODBC driver installed on your client machine.
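As a rough sketch of what this ODBC channel looks like outside of Tableau, here is a hedged Python example using pyodbc. The driver name, host name and port are assumptions; check the driver name your Cloudera ODBC installation actually registers, and point the host at one of your own Impala daemon nodes.

```python
def impala_connection_string(host, port=21050,
                             driver="Cloudera ODBC Driver for Impala"):
    """Build a DSN-less ODBC connection string for an Impala daemon.

    The default driver name matches Cloudera's Impala ODBC driver,
    but verify it against what your installation registers.
    """
    return f"DRIVER={{{driver}}};HOST={host};PORT={port}"

if __name__ == "__main__":
    # pyodbc is imported here so the string builder above works even
    # without the driver installed.
    import pyodbc

    # Any node running an Impala daemon will do; 21050 is Impala's
    # usual ODBC/JDBC port. Host and table name are placeholders.
    conn = pyodbc.connect(impala_connection_string("impala-node-01.example.com"))
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM sales")
    print(cur.fetchone()[0])
    conn.close()
```

The same builder works for a Hive connection by swapping in the Hive ODBC driver name and HiveServer's port instead of Impala's.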
Using Hive as the query engine under a BI application can mean a relatively poor user experience, because by default Hive uses MapReduce to work with the data on HDFS, and MapReduce can be slow even on modest amounts of data. So if you are considering a BI layer on top of a Cloudera cluster, I recommend using Impala instead of Hive. With a direct connection between a Tableau report and Hadoop, every time an end user clicks a filter or a new view on the visualization, a new query is sent directly to the database. Yes, Hadoop can work on vast amounts of data really quickly, but in this case you are probably developing an application for business users, and they won't care what is under the reporting layer. They will only care about their user experience, so your application has to be responsive and react quickly to any user interaction.
Using extracts versus a direct connection to the database
Even when you are using Impala and responsiveness is good, there will be cases when you are on the border of acceptable response time. It will happen simply because of the hardware limitations of your Hadoop cluster, and because queries cannot be optimized to the extreme in all cases. Impala is really fast for analytics, but when you join more tables or do more complicated math, response times increase, and at some point they become annoying for end users.
That is when you can do two things. One option is creating new tables, which means new files on HDFS; the other is creating Tableau extracts. A Tableau extract is a feature that saves the current view of the data to your local or Tableau Server file system, and Tableau then reads the data from there. Choosing between these options is not simple, and sometimes you have to combine them in order to fix the perceived performance and thus the user experience.
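To illustrate the first option, here is a hedged sketch of pre-aggregating a raw fact table into a smaller, Parquet-backed Impala table with a CREATE TABLE ... AS SELECT statement, so that Tableau's queries hit the small rollup instead of the raw data. The table and column names are made up for the example.

```python
# Sketch: pre-aggregating data in Impala so Tableau queries hit a
# small, purpose-built table instead of the raw fact table.
# Table and column names here are hypothetical.
def build_daily_rollup_sql(source="sales_raw", target="sales_daily"):
    """Return an Impala CTAS statement that rolls the raw table up
    to one row per day and region, stored as Parquet."""
    return (
        f"CREATE TABLE {target} STORED AS PARQUET AS "
        f"SELECT sale_date, region, "
        f"SUM(amount) AS total_amount, COUNT(*) AS order_count "
        f"FROM {source} GROUP BY sale_date, region"
    )

print(build_daily_rollup_sql())
```

You would run the generated statement through impala-shell or an ODBC connection; the point is that the rollup collapses joins and aggregations once, ahead of time, instead of on every click in the report.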
For instance, in some cases creating an extract can easily be worse than using a real-time database connection. If the view is too large, your local machine, or even Tableau Server's machine, won't handle it easily. Then you have to reduce your Impala tables and create smaller files, or redesign your report on top of your data. You can also create extracts periodically in a fully automated way, if, for example, a daily or weekly refresh of the data is enough and you don't need real-time access.
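One way to automate those periodic refreshes is to script Tableau Server's tabcmd command-line utility from a cron or scheduled job. This is only a sketch: the workbook name is a placeholder, and it assumes you have already logged in with tabcmd against your own server.

```python
# Sketch: triggering a Tableau extract refresh from a scheduled job
# using the tabcmd tool that ships with Tableau Server.
# The workbook name is a placeholder.
import subprocess

def refresh_command(workbook):
    """Build the tabcmd invocation that refreshes a workbook's extracts."""
    return ["tabcmd", "refreshextracts", "--workbook", workbook]

if __name__ == "__main__":
    # Assumes a prior `tabcmd login` session against your Tableau Server.
    subprocess.run(refresh_command("Daily Sales"), check=True)
```

Dropped into a daily cron entry, this keeps the extract fresh without anyone opening Tableau Desktop.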
Another big question, when you run more complex algorithms on your data, is whether Impala or Tableau should be doing the work. Tableau's in-memory computing can sometimes be faster, but in many cases it is better to leave that work to Impala.
You have multiple options for analyzing and visualizing your data with Tableau on top of Hadoop. The most important thing to always take into consideration is user experience. Visualizations have to be as dynamic as they can be, while staying smooth and responsive, and of course sexy too. So we have to find the golden mean between these capabilities and the working methods of the architecture.