Quick Review of APIs

Let us quickly review the core APIs available in Spark. We will cover the Data Frame APIs and Spark SQL later.

  • SparkContext exposes APIs such as textFile and sequenceFile to read data from files into a distributed collection called an RDD.
  • RDD stands for Resilient Distributed Dataset and it is nothing but a distributed collection.
  • Its data is typically loaded onto the executors created at the time of execution.
  • RDD exposes two kinds of APIs: Transformations and Actions.
  • Transformations take an RDD as input (two RDDs in the case of joins) and return another RDD; they are evaluated lazily. Actions trigger execution and either return data to the driver program or write it to files.
  • Examples of Transformations
    • Row Level Transformations – map, filter, flatMap, etc.
    • Aggregations – reduceByKey, aggregateByKey
    • Joins – join, leftOuterJoin, rightOuterJoin
    • Sorting – sortByKey
    • Ranking – groupByKey followed by flatMap with a lambda function
    • Except for Row Level Transformations, most of the other transformations have to go through the shuffle phase and trigger a new stage.
    • Row Level Transformations are also known as Narrow Transformations.
    • Transformations that trigger a shuffle and a new stage are also called Wide Transformations.
  • Examples of Actions
    • Preview Data: take, takeSample, top, takeOrdered
    • Convert into Python List: collect
    • Total Aggregation: reduce
    • Writing into Files: saveAsTextFile, saveAsSequenceFile

