Now let us see the details of data structures in Spark such as Resilient Distributed Datasets and Data Frames, along with related concepts such as the Directed Acyclic Graph and Lazy Evaluation:
Data Structures – RDD and Data Frames
Quick overview of APIs – Transformations and Actions
Directed Acyclic Graph and Lazy Evaluation
Difference between Python list and RDD
Resilient Distributed Datasets
A Resilient Distributed Dataset (RDD for short) is the fundamental data structure in Spark. An RDD is:
In-memory – data is held and processed in memory across the cluster
Distributed – data is partitioned across the nodes of the cluster
Resilient – lost partitions can be recomputed from lineage information
Creation of RDD
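Here is a minimal sketch of creating RDDs, assuming a local PySpark environment; the file path and the sample collection are placeholders:

```python
from pyspark import SparkContext

# A local SparkContext for experimentation (a cluster manager can be used instead)
sc = SparkContext(master="local[*]", appName="rdd-creation")

# Create an RDD from a text file (the path is a placeholder)
orders = sc.textFile("/public/retail_db/orders")

# Create an RDD from an existing Python collection
numbers = sc.parallelize([1, 2, 3, 4, 5])
```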
Execution Life Cycle
Data from files is divided into RDD partitions and each partition is processed by a separate task
By default the HDFS block size (128 MB) is used to determine the number of partitions
We can increase (but not decrease) the number of partitions by passing an additional parameter to sc.textFile (see the sketch after this list)
By default, when data is loaded into memory each record is represented as a Java object
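A small sketch of controlling partitions with the optional minPartitions argument of sc.textFile, assuming the sc and placeholder path from the earlier sketch:

```python
# Default partitioning is derived from the underlying block size
orders = sc.textFile("/public/retail_db/orders")
print(orders.getNumPartitions())

# Request more partitions; minPartitions acts as a lower bound, so it can
# increase but not decrease the number of partitions
orders_more = sc.textFile("/public/retail_db/orders", minPartitions=8)
print(orders_more.getNumPartitions())
```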
RDD Persistence
Typically, data is not loaded into memory as soon as an RDD is created in a program; partitions are loaded into memory only when the data is actually processed. If we need to retain an RDD in memory for an extended period of time, we have to use RDD persistence.
Let us see what happens when an RDD is loaded into memory
Records are converted into Java objects
Partitions are loaded into memory
As tasks are completed, the corresponding RDD partitions are flushed out of memory
We can persist RDD partitions at different storage levels (see the sketch after this list):
MEMORY_ONLY (default)
MEMORY_AND_DISK
DISK_ONLY
and more
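A minimal sketch of persisting an RDD, assuming the orders RDD from the earlier sketches; cache() is shorthand for persist() at the default MEMORY_ONLY level:

```python
from pyspark import StorageLevel

# Persist at a non-default storage level; partitions that do not fit in
# memory spill to disk instead of being recomputed
orders.persist(StorageLevel.MEMORY_AND_DISK)

# ... run jobs that reuse the orders RDD ...

# Release the persisted partitions once they are no longer needed
orders.unpersist()
```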
Data Frames
Many times data has structure. Processing it with RDDs and the Core APIs is somewhat tedious and cryptic. We can use Data Frames to address these issues. Here are some of the advantages of using Data Frames:
Flexible APIs – Data Frame native operations as well as SQL (see the preview sketch after this list)
Readable code
Better organized and manageable code
Uses the latest optimizers
Processes data in binary format
Can generate execution plans based on collected statistics
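As a brief, hedged preview of the flexible APIs (Data Frames are covered in detail in the next chapter), the same filter can be expressed through native operations and through SQL; the path and the order_status column are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-preview").getOrCreate()

# Read structured data (path and column names are placeholders)
orders_df = spark.read.json("/public/retail_db_json/orders")

# Data Frame native operations
orders_df.filter(orders_df.order_status == "COMPLETE").show()

# Equivalent SQL
orders_df.createOrReplaceTempView("orders")
spark.sql("SELECT * FROM orders WHERE order_status = 'COMPLETE'").show()
```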
We will talk about processing data using Data Frames in the next chapter. For now we will focus on the Core APIs.
Overview of Transformations and Actions
Spark Core APIs are categorized into Transformations and Actions. Let us explore them at a high level.
Transformations are APIs that take an RDD as input and return another RDD as output. They do not trigger execution; they only update the DAG. Actions take an RDD as input and return a primitive data type or a regular collection to the driver program; we can also use actions to save output to files. Actions trigger execution of the DAG.
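A minimal sketch contrasting the two categories, reusing the placeholder orders RDD from the earlier sketches; the field position and output path are assumptions:

```python
# Transformations – return a new RDD and only update the DAG
completed = orders.filter(lambda line: "COMPLETE" in line)      # no job runs yet
order_dates = completed.map(lambda line: line.split(",")[1])    # still no job

# Actions – trigger execution of the DAG and return results to the driver
print(order_dates.count())    # runs the job and returns a number
print(order_dates.take(5))    # returns a small Python list to the driver

# Actions can also be used to save the output to files (placeholder path)
order_dates.saveAsTextFile("/user/training/order_dates")
```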
Directed Acyclic Graph and Lazy Evaluation
There are many APIs in Spark, but most of them do not trigger execution of a Spark job.
When we create a SparkContext object, it procures resources on the cluster.
APIs used to read data (such as textFile) as well as APIs used to process data (such as map, filter, flatMap etc.) do not trigger immediate execution. They create variables of type RDD, which also point to a DAG.
These APIs run in the driver program and build the DAG. The DAG describes how the job should be executed once it is triggered, and each RDD variable has a DAG associated with it.
When APIs categorized as actions (such as take, collect or saveAsTextFile) are used, the DAG associated with the variable is executed.
We can look at the DAG details by invoking toDebugString on the variables created; this is available in both Scala and PySpark.
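A short sketch of inspecting the DAG and observing lazy evaluation, assuming the sc and placeholder path from earlier; in PySpark, toDebugString() returns bytes, so it is decoded before printing:

```python
orders = sc.textFile("/public/retail_db/orders")
completed = orders.filter(lambda line: "COMPLETE" in line)

# Nothing has executed yet; inspect the DAG built by the transformations
print(completed.toDebugString().decode("utf-8"))

# The job runs only when an action is invoked
print(completed.count())
```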