What is an RDD (Resilient Distributed Dataset)?

I’m studying Spark in Python, and the acronym RDD keeps appearing. However, I can’t understand what this term actually refers to.

So I’d like to know: what is a Resilient Distributed Dataset (RDD) in the context of Spark?

2 Answers

Resilient Distributed Datasets (RDDs) abstract a set of objects distributed across the cluster, usually held in main memory. The underlying data can be stored in traditional file systems, in HDFS (Hadoop Distributed File System), and in some NoSQL databases such as Cassandra and HBase. The RDD is the main object of the Spark programming model, because it is on these objects that data processing is executed.
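
For example, a minimal sketch of loading such data into an RDD (the file paths here are hypothetical, and `sc` is the SparkContext that the pyspark shell creates for you):

# hypothetical paths; sc is the pyspark shell's SparkContext
localData = sc.textFile("data/sample.txt")              # local file system
hdfsData = sc.textFile("hdfs:///user/data/sample.txt")  # HDFS
localData.take(2)  # first two lines of the file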

I recommend reading this article: https://www.devmedia.com.br/introducao-ao-apache-spark/34178

But to sum it up, it’s basically a way for multiple machines to process the same job or task in parallel.

The main abstraction Spark offers is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
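
To illustrate the partitioning, a quick sketch (again assuming the pyspark shell’s `sc`; the exact partition layout may vary):

data = sc.parallelize(range(10), 4)  # request 4 partitions
data.getNumPartitions()              # 4
data.glom().collect()                # elements grouped by partition, e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]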

In Spark there are action and transformation functions; transformations are lazily evaluated, so they only run when some action is called (as demonstrated below).

An important point is that RDDs are immutable.

Creating an RDD:

# sc is the SparkContext, created automatically in the pyspark shell
lst = [2, 4, 5, 6, 7]
testData = sc.parallelize(lst)

Object type:

type(testData)
pyspark.rdd.RDD

Calling an action:

testData.count()
5

Another action:

testData.collect()
[2, 4, 5, 6, 7]

Transformation:

# lazy evaluation
testFltr = testData.filter(lambda x: x > 3)  # note that a new variable is created, since the RDD is immutable

Now, when an action is called, the transformation is actually executed:

testFltr.collect()
[4, 5, 6, 7]
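
And because RDDs are immutable, the original testData is untouched by the filter:

testData.collect()
[2, 4, 5, 6, 7]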
