Let us have a quick review of the Core APIs available in Spark. We will cover the Data Frame APIs and Spark SQL at a later point in time.
- SparkContext exposes APIs such as textFile and sequenceFile to read data from files into a distributed collection called an RDD.
- RDD stands for Resilient Distributed Dataset; it is simply a distributed collection.
- It is typically loaded onto the executors created at the time of execution.
- RDD exposes two kinds of APIs: Transformations and Actions.
- Transformations take one RDD as input and return another RDD as output, while Actions trigger execution and bring data back to the driver program.
- Examples of Transformations
- Row Level Transformations – map, filter, flatMap, etc.
- Aggregations – reduceByKey, aggregateByKey
- Joins – join, leftOuterJoin, rightOuterJoin
- Sorting – sortByKey
- Ranking – groupByKey followed by flatMap with a lambda function
- Except for Row Level Transformations, most other transformations have to go through a shuffle phase and trigger a new stage.
- Row Level Transformations are also known as Narrow Transformations.
- Transformations that trigger a shuffle and a new stage are also called Wide Transformations.
- Examples of Actions
- Preview Data: take, takeSample, top, takeOrdered
- Convert into Python List: collect
- Total Aggregation: reduce
- Writing into Files: saveAsTextFile, saveAsSequenceFile