Hadoop is a Database? What is Hadoop?

Asked

Viewed 4,353 times

5

After all, what is Hadoop? Hadoop is a database? I have often heard "that company uses the Hadoop database". But when I started studying Big Data I saw that actually things weren’t quite like that.

So if it’s not a database, what is?

  • On Hadoop’s own website explains what it is, Hadoop with a distributed computing software platform in Java aimed at clustering and processing large data masses, it was designed to expand servers

2 answers

11


Hadoop is an ecosystem for distributed computing, that is, created to handle the processing of large amounts of data (petabytes) at high speed. This ecosystem is composed of several systems/technologies.

The idea of Hadoop is to perform a heavy processing by dividing the task into several nodes (cluster) in order to increase computational power. For this to happen, a file system is used on the nodes of each cluster called HDFS (Hadoop Distributed file system), which holds files with large amounts of data and the processing is performed using a programming technique called Mapreduce.

The following is an example of systems that may be part of this ecosystem and a brief explanation of each one. inserir a descrição da imagem aqui

HDFS -Hadoop file system, this file system works in a distributed way, using large memory blocks.

Map Reduce -Programming model for large-scale processing. Based on mapping(map) and reduction (reduce).

Yarn -It is a resource management platform responsible for cluster management of computational resources, as well as resource scheduling.

Hive -Converts SQL queries to Mapreduces.

Pig -Language for creating Mapreduces

Hbase -A column-oriented Nosql database (columnar) that can be used over the HDFS. Provides access to large amounts of high-speed data.

Flume -Log export system, containing large amount of data to the HDFS

Anbari -Monitoring of Hadoop clusters

Sqoop -DBMS data export tool for Hadoop. Uses JDBC, generates a Java class of data export for each table in the relational schema

Oozie / Control-M -Scheduler/task manager and Workflows for Adoop.

Today Hadoop is maintained by the Apache Foundation. And has as distributions Enterprise better known to Cloudera and Hortonworks.

0

Hadoop is a distributed system for data storage and processing. Its core consists of a file system (HDFS), a programming model for distributed processing (Mapreduce), and a resource manager (YARN). From this core or integrating with it, several projects and software were created that added new features to Hadoop, transforming it into a true ecosystem. E.g.: Nosql (HBASE) databases, software for data transfer between RDBMS and Hadoop (Sqoop), etc.

This post below is an excellent introduction to the subject, worth reading:

https://medium.com/dataengineerbr/um-quase-guia-nada-completo-sobre-o-hadoop-a3eeee170deb

Browser other questions tagged

You are not signed in. Login or sign up in order to post.