When it comes to Big Data, in a technological sense we are usually talking about distributed data management systems, and Hadoop is definitely their flagship. Although everyone talks about Hadoop, today it is only one part of the technological arsenal used for managing huge amounts of data.
Structure of Hadoop ecosystem
Still, Hadoop can be said to be the standard ecosystem of distributed software tools, even though some of those tools do not strictly require it. At first glance this ecosystem may seem complicated, but a closer look shows that all technologies related to Hadoop can be classified into three groups. Of course we could make a more precise separation with more classes, and many such classifications can be found on the internet, but these three are enough for a basic understanding of the structure.
Data Storage
The base of the structure is the method of storing the data. This layer is responsible for storing data within the distributed system in a way that makes it reachable for the processing engines. The standard technology here is HDFS, the Hadoop Distributed File System.
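The core idea of HDFS, splitting a file into fixed-size blocks and replicating each block on several nodes, can be sketched in pure Python. This is a simplified illustration only: the tiny block size, the node names and the round-robin placement are assumptions for the example, not HDFS's actual rack-aware placement policy.

```python
from itertools import cycle

BLOCK_SIZE = 8        # bytes per block here; the HDFS default is 128 MB
REPLICATION = 3       # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]   # hypothetical data nodes

def put_file(data: bytes):
    """Split data into blocks and assign each block to REPLICATION nodes."""
    placement = {}
    node_ring = cycle(NODES)
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # naive round-robin placement; real HDFS also considers racks and load
        replicas = [next(node_ring) for _ in range(REPLICATION)]
        placement[block_id] = (block, replicas)
    return placement

placement = put_file(b"hello distributed world!")
for block_id, (block, replicas) in placement.items():
    print(block_id, block, replicas)
```

Losing any single node leaves every block with at least two surviving copies, which is exactly why replication is the basis of HDFS's fault tolerance.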
Data Processing Engines
This layer defines how we reach the data. There are many technologies on the market, developed for different use cases. MapReduce is said to be the most common way of processing data in Hadoop; nowadays, however, it is no longer so widely used, because in many use cases it has been dethroned by newer technologies such as Spark and Impala.
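The MapReduce model itself can be illustrated without Hadoop at all: a map phase emits key-value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group. A minimal single-machine word-count sketch in plain Python (the framework normally runs these phases on many nodes in parallel):

```python
from collections import defaultdict

def map_phase(line):
    # emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # group all emitted values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # aggregate each group; for word count this is a simple sum
    return key, sum(values)

lines = ["Hadoop is a framework", "Spark is an engine"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # 'is' appears twice, every other word once
```

Everything engine-specific hides in the shuffle: distributing map tasks, moving grouped data across the network and scheduling the reducers is exactly what MapReduce, and its successors, do for us.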
User Interfaces
While the other two layers are essential parts of the structure, they do not have to be visible to users who are not developers or system engineers of the ecosystem. This layer is intended to serve data analysts, and in a well-developed system they sometimes do not even have to know that they are working on top of a Hadoop environment. This layer is the interface to the data processing engines, offering different ways to work with data, for example SQL, Java or Python, depending on what the engine understands. The well-known HiveQL (SQL) and Pig Latin belong here; both are interpreted by the MapReduce engine. For the Spark engine, code can be written in Java, Python or Scala, among others. But anything can be developed on top of an engine; for example, it is also possible to write Spark code in R.
The keys of distributed data management
First of all, Hadoop and the technologies built on it are massively parallel. This means separate computers work simultaneously on one process, coordinated by the ruleset of the engine. To make this possible, the work has to be split into tasks by a master node. Distributed technologies are typically built on a master-slave architecture, whether we are talking about MapReduce, Spark, Impala or the file system, HDFS. The master nodes are also the ones that monitor the health of the slave nodes and provide fault tolerance.
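The master-slave division of labour can be sketched on a single machine: a "master" function splits the input, hands the chunks to parallel workers and gathers their results. Here Python's thread pool stands in for the slave nodes; this is an illustration of the coordination pattern under that assumption, not how a real cluster scheduler is implemented (real engines run workers on separate machines, not threads).

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    """A 'slave' task: count the words in its own piece of the data."""
    return sum(len(line.split()) for line in chunk)

def master(lines, workers=4):
    """The 'master': split the input, distribute chunks, gather results."""
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # the pool plays the role of the slave nodes working simultaneously
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(worker, chunks))

lines = ["one two three", "four five", "six"] * 100
print(master(lines))  # 600
```

A real master also has to notice when a worker dies and reschedule its chunk elsewhere; that health monitoring is the part the sketch leaves out.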
Storing data and processing it happen at the same place, on the same computer. Pieces of data are processed locally at the individual nodes before being gathered together to create the final result. This implies that the nodes have to be powerful both in storage and in processing capacity. If we use in-memory tools like Spark, the memory capacity of the nodes also has to be strong.
Shared-nothing architecture is also a key characteristic of the ecosystem. It means that neither processor time, nor memory, nor storage is shared among the nodes. They are independent, which makes eliminating single points of failure (SPOF) easier, and this architecture also provides the most important property of Hadoop and Hadoop-related technologies: linear scalability.