We need to deal with two types of data structures in Spark – RDDs and Data Frames.
We will look at both in detail as we proceed.
- RDDs have been around for quite some time. An RDD is the low-level data structure that Spark uses to distribute data between tasks while the data is being processed.
- An RDD can be created using SparkContext APIs such as textFile (see the first sketch after this list).
- An RDD is divided into partitions while the data is being processed. Each partition is processed by one task.
- The number of RDD partitions is typically based on the HDFS block size, which is 128 MB by default. We can control the minimum number of partitions by passing additional arguments (such as minPartitions) while invoking APIs such as textFile.
- A Data Frame is nothing but an RDD with structure. We can access the attributes of a Data Frame by name (see the Data Frame sketch below).
- Typically we read data from file systems such as HDFS, S3, Azure Blob Storage, the local file system, etc.
- Based on the file format, we need to use different APIs available in Spark to read data into an RDD or a Data Frame (see the reading examples below).
- Spark uses HDFS APIs to read data from and write data to the underlying file system.
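
Here is a minimal PySpark sketch of creating an RDD with textFile and controlling the minimum number of partitions. The path /public/retail_db/orders and the local master setting are assumptions for illustration; point it at any text data set you have.

```python
# Minimal sketch, assuming a local Spark setup and a hypothetical
# text data set at /public/retail_db/orders.
from pyspark import SparkContext

sc = SparkContext(master="local[2]", appName="rdd-partitions-demo")

# Create an RDD from a text file; minPartitions sets a lower bound
# on the number of partitions the data is split into.
orders = sc.textFile("/public/retail_db/orders", minPartitions=4)

# Each partition is processed by one task.
print(orders.getNumPartitions())

sc.stop()
```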
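
The next sketch shows a Data Frame, i.e. an RDD with structure, and how its attributes are accessed by name. The column names (order_id, order_status, etc.) and the path are hypothetical and only for illustration.

```python
# Minimal sketch, assuming a hypothetical CSV data set whose columns
# are order_id, order_date, order_customer_id and order_status.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

df = spark.read.csv(
    "/public/retail_db/orders",
    schema="order_id INT, order_date STRING, order_customer_id INT, order_status STRING"
)

# Attributes of a Data Frame are accessed by name.
df.select("order_id", "order_status").show(5)

# The underlying RDD (and its partitions) is still accessible.
print(df.rdd.getNumPartitions())

spark.stop()
```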
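
Finally, a sketch of reading different file formats from different file systems. Only the URI scheme changes between HDFS, S3, and the local file system, because Spark goes through the Hadoop FileSystem (HDFS) APIs underneath. All paths and the bucket name are hypothetical.

```python
# Minimal sketch, assuming hypothetical data sets in CSV, JSON and
# Parquet formats; replace the paths with real locations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-formats-demo").getOrCreate()

# Different APIs for different file formats.
orders_csv = spark.read.csv("hdfs:///public/retail_db/orders", header=True)
orders_json = spark.read.json("hdfs:///public/retail_db_json/orders")
orders_parquet = spark.read.parquet("hdfs:///public/retail_db_parquet/orders")

# The same APIs work across file systems; only the URI scheme changes,
# since Spark uses Hadoop FileSystem (HDFS) APIs underneath.
local_orders = spark.read.csv("file:///data/retail_db/orders", header=True)
# s3_orders = spark.read.csv("s3a://my-bucket/retail_db/orders", header=True)  # needs S3 credentials configured

spark.stop()
```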