Let us have a quick review of the Core APIs available in Spark. We will cover the Data Frame APIs and Spark SQL at a later point in time.
- SparkContext exposes APIs such as textFile and sequenceFile to read data from files into a distributed collection called an RDD.
- RDD stands for Resilient Distributed Dataset; it is simply a distributed collection.
- It is typically loaded onto the executors created at the time of execution.
- RDD exposes two kinds of APIs: Transformations and Actions.
- Transformations take one RDD as input and return another RDD as output, while Actions trigger execution and bring data back to the driver program.
- Examples of Transformations
- Row Level Transformations – map, filter, flatMap, etc.
- Aggregations – reduceByKey, aggregateByKey
- Joins – join, leftOuterJoin, rightOuterJoin
- Sorting – sortByKey
- Ranking – groupByKey followed by flatMap with a lambda function
- Except for Row Level Transformations, most other transformations have to go through a shuffle phase and trigger a new stage.
- Row Level Transformations are also known as Narrow Transformations.
- Transformations that trigger a shuffle and a new stage are also called Wide Transformations.
- Examples of Actions
- Preview Data: take, takeSample, top, takeOrdered
- Convert into Python List: collect
- Total Aggregation: reduce
- Writing into Files: saveAsTextFile, saveAsSequenceFile