Advanced Transformations

Now that we understand basic APIs such as map, flatMap, and reduce, let us look at a few advanced operations.

  • mapPartitions
  • ranking using groupByKey

mapPartitions

APIs such as map, filter, and flatMap work on individual records. Any of this functionality can also be implemented using mapPartitions; the difference is in how the function is executed.

  • For map, filter, flatMap – the lambda function is executed once per record
  • For mapPartitions – the lambda function is executed once per partition
  • As part of the lambda function in mapPartitions
    • Receive the records of the partition as an iterator and process them as a collection
    • Apply Python map or filter, or flatten nested results
    • Return a collection (any iterable)
  • The elements of the collection returned from the lambda function become the records of the new RDD
  • Use cases where mapPartitions can perform better: looking up data in a database. Instead of creating a connection for each record, we can establish the connection once per partition (if a database lookup is required as part of data processing)
  • Here is an example of getting word count using mapPartitions
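The word count can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name count_words_in_partition, the sample lines, and the input path in the comments are all made up for this example. The per-partition function is plain Python, so it is exercised here on local lists standing in for partitions; the commented lines show how it would be passed to mapPartitions, assuming an existing SparkContext named sc.

```python
from collections import Counter


def count_words_in_partition(lines):
    """Runs once per partition; `lines` is an iterator over the partition's records."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Return an iterable; each (word, count) pair becomes one record of the new RDD.
    return iter(counts.items())


# Two local lists standing in for two partitions (made-up sample data).
partitions = [["hello world", "hello spark"], ["spark is fast"]]
partial_counts = [
    pair
    for partition in partitions
    for pair in count_words_in_partition(iter(partition))
]

# With Spark (assuming a SparkContext `sc` and a hypothetical input path):
#   lines = sc.textFile("/path/to/text")
#   word_counts = lines.mapPartitions(count_words_in_partition) \
#       .reduceByKey(lambda a, b: a + b)

# Locally, merge the per-partition counts the way reduceByKey would.
totals = Counter()
for word, count in partial_counts:
    totals[word] += count
```

Because each partition emits at most one pair per distinct word, the function is called only twice here (once per partition), and the final merge has less data to combine than it would with one (word, 1) pair per occurrence.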

ranking using groupByKey

groupByKey is an API which groups all the values that share the same key. It can be used to solve problems such as ranking within each group.

  • Task 1: Get top N products by price in each category
    • Let us read the products data into an RDD
    • Convert the data to (k, v) pairs using product category id as key and the entire product record as value
    • Use groupByKey
    • Use first to get the first record, and read its second element into a regular Python collection variable (productsPerCategory)
    • Develop a function to get the top N products by price from that list
    • Validate the function using productsPerCategory
    • Invoke the function on the output of groupByKey as part of flatMap
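The steps of Task 1 can be sketched as below. The record layout is an assumption, not something stated above: comma-separated fields with the category id at index 1 and the price at index 4, and a non-empty price in every record. The sample records, the function name get_top_n_products, and the input path are made up for illustration; products_per_category plays the role of productsPerCategory.

```python
def get_top_n_products(products_per_category, top_n):
    """Return the top_n records with the highest price (ties cut off arbitrarily at N)."""
    return sorted(
        products_per_category,
        key=lambda rec: float(rec.split(",")[4]),  # assumed: price is field 4
        reverse=True,
    )[:top_n]


# Made-up sample standing in for the values of one key after groupByKey.
products_per_category = [
    "1,2,Product A,,59.98,img_a",
    "2,2,Product B,,129.99,img_b",
    "3,2,Product C,,89.99,img_c",
    "4,2,Product D,,129.99,img_d",
]

# Validate the function locally before using it in Spark.
top_2 = get_top_n_products(products_per_category, 2)
top_2_ids = [rec.split(",")[0] for rec in top_2]

# With Spark (assuming a SparkContext `sc` and a hypothetical input path):
#   products = sc.textFile("/path/to/products")
#   products_map = products.map(lambda p: (int(p.split(",")[1]), p))
#   top_n_products = products_map.groupByKey() \
#       .flatMap(lambda kv: get_top_n_products(list(kv[1]), 3))
```

flatMap is used rather than map so that the result is an RDD of individual product records instead of one list per category.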

  • Task 2: Get the products with the top N prices in each category (products sharing a price are all included)
    • Let us read the products data into an RDD
    • Convert the data to (k, v) pairs using product category id as key and the entire product record as value
    • Use groupByKey
    • Use first to get the first record, and read its second element into a regular Python collection variable (productsPerCategory)
    • Develop a function to get the top N priced products from that list (simulating dense rank)
    • Validate the function using productsPerCategory
    • Invoke the function on the output of groupByKey as part of flatMap
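A possible sketch of Task 2 follows, under the same assumed record layout (comma-separated, category id at index 1, price at index 4, price never empty). It simulates dense rank by first finding the N highest distinct prices and then keeping every product at or above the lowest of them. The function names, sample records, and input path are made up for illustration.

```python
def get_price(record):
    """Extract the price, assuming it is field 4 of a comma-separated record."""
    return float(record.split(",")[4])


def get_top_n_priced_products(products_per_category, top_n):
    """Return all records whose price is among the top_n distinct prices."""
    records = list(products_per_category)
    distinct_prices = sorted({get_price(rec) for rec in records}, reverse=True)
    top_prices = distinct_prices[:top_n]
    if not top_prices:
        return []
    min_top_price = top_prices[-1]
    selected = [rec for rec in records if get_price(rec) >= min_top_price]
    return sorted(selected, key=get_price, reverse=True)


# Made-up sample standing in for the values of one key after groupByKey.
products_per_category = [
    "1,2,Product A,,59.98,img_a",
    "2,2,Product B,,129.99,img_b",
    "3,2,Product C,,89.99,img_c",
    "4,2,Product D,,129.99,img_d",
]

# Top 2 prices are 129.99 and 89.99, so three products qualify.
top_2_priced = get_top_n_priced_products(products_per_category, 2)
top_2_priced_ids = [rec.split(",")[0] for rec in top_2_priced]

# With Spark (assuming a SparkContext `sc` and a hypothetical input path):
#   products = sc.textFile("/path/to/products")
#   products_map = products.map(lambda p: (int(p.split(",")[1]), p))
#   top_n_priced = products_map.groupByKey() \
#       .flatMap(lambda kv: get_top_n_priced_products(list(kv[1]), 3))
```

Unlike Task 1, the number of records returned per category can exceed N whenever several products share one of the top N prices.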