After all, what is Hadoop? Is Hadoop a database? I have often heard "that company uses the Hadoop database", but when I started studying Big Data I realized that things weren't quite like that.
So if it isn't a database, what is it?
Hadoop is an ecosystem for distributed computing, that is, it was created to process very large amounts of data (petabytes) at high speed. This ecosystem is composed of several systems and technologies.
The idea behind Hadoop is to perform heavy processing by splitting the work across several nodes of a cluster in order to increase computational power. To make this possible, the nodes of the cluster share a file system called HDFS (Hadoop Distributed File System), which stores files with large amounts of data, and the processing is done using a programming model called MapReduce.
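To make the model concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are taken from the command line and are purely illustrative: the map phase emits a (word, 1) pair for every word in its input split, and the reduce phase sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The compiled class would be packaged into a jar and submitted to the cluster with the hadoop jar command; the resource manager then distributes the map and reduce tasks across the nodes, each one reading its share of the data from HDFS.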
Below are some of the systems that may be part of this ecosystem, with a brief explanation of each one.
HDFS - Hadoop's distributed file system; it stores files in large blocks spread and replicated across the nodes of the cluster (see the sketch after this list).
MapReduce - Programming model for large-scale processing, based on a mapping (map) phase and a reduction (reduce) phase.
YARN - Resource management platform responsible for managing the cluster's computational resources and for scheduling them across applications.
Hive - Converts SQL queries into MapReduce jobs.
Pig - Scripting language for creating MapReduce jobs.
HBase - A column-oriented (columnar) NoSQL database that runs on top of HDFS and provides fast access to large amounts of data.
Flume - System for collecting and exporting large amounts of log data into HDFS.
Ambari - Monitoring of Hadoop clusters.
Sqoop - Tool for exporting data from relational DBMSs into Hadoop. It uses JDBC and generates a Java data-export class for each table in the relational schema.
Oozie / Control-M - Scheduler/task manager and workflow engine for Hadoop jobs.
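To illustrate the HDFS item above, the sketch below uses Hadoop's FileSystem Java API to write a small file into HDFS and read it back. The path /tmp/example.txt is hypothetical, and the cluster address is assumed to come from a core-site.xml on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file. HDFS splits large files into big blocks (128 MB by default)
    // and replicates each block across DataNodes.
    Path path = new Path("/tmp/example.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back through the same API.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```

The same FileSystem API is what MapReduce jobs, Hive, and the other ecosystem tools use under the hood to reach the data stored on the cluster.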
Today Hadoop is maintained by the Apache Foundation, and its best-known enterprise distributions are Cloudera and Hortonworks.
Hadoop is a distributed system for data storage and processing. Its core consists of a file system (HDFS), a programming model for distributed processing (MapReduce), and a resource manager (YARN). Around this core, several projects and tools were created that integrate with it and add new features, turning Hadoop into a true ecosystem: NoSQL databases (HBase), software for transferring data between RDBMSs and Hadoop (Sqoop), and so on.
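As an example of one of those ecosystem projects, here is a minimal sketch using the HBase Java client to write and read a single cell. The table name "users", the column family "info", and the row key are hypothetical and would need to exist on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Picks up the ZooKeeper quorum and other settings from hbase-site.xml.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "user1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the cell back by row key.
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```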
The post below is an excellent introduction to the subject and is worth reading:
https://medium.com/dataengineerbr/um-quase-guia-nada-completo-sobre-o-hadoop-a3eeee170deb
Hadoop's own website explains what it is: a distributed computing software platform in Java, aimed at clustering and processing large masses of data; it was designed to scale out across many servers.
– Miguel Garcia Silvestre