Now that we understand basic APIs such as map, flatMap and reduce, let us look at a few advanced operations:
mapPartitions
ranking using groupByKey
mapPartitions
APIs such as map, filter and flatMap work on individual records. We can implement any of them using mapPartitions; the difference is in how the function is executed.
For map, filter and flatMap: the lambda function is executed once per record
For mapPartitions: the lambda function is executed once per partition and receives an iterator over that partition's records
As part of the lambda function in mapPartitions:
Process the partition's data as a collection
Apply Python map, filter or flattening logic (for example itertools.chain.from_iterable)
Return a collection (any iterable)
The elements of the collection returned from the lambda function are added to the resulting RDD
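The difference in execution can be sketched in plain Python. This is only an illustration: the two lists stand in for the partitions of an RDD, and the names to_upper, to_upper_partition and calls are hypothetical. With Spark, the same functions would be passed to rdd.map and rdd.mapPartitions respectively.

```python
from itertools import chain

# Counters used only to make the number of function calls visible
calls = {"per_record": 0, "per_partition": 0}

def to_upper(record):
    """Suitable for rdd.map: called once for every record."""
    calls["per_record"] += 1
    return record.upper()

def to_upper_partition(records):
    """Suitable for rdd.mapPartitions: called once for every partition."""
    calls["per_partition"] += 1
    # records is an iterator; Python's map returns another iterable
    return map(str.upper, records)

# Assumption: two plain lists stand in for the two partitions of an RDD
partitions = [["a", "b", "c"], ["d", "e"]]

# What map does: one call per record
mapped = [to_upper(r) for part in partitions for r in part]

# What mapPartitions does: one call per partition, results flattened
map_partitioned = list(
    chain.from_iterable(to_upper_partition(iter(part)) for part in partitions)
)

print(mapped == map_partitioned)
print(calls)
```

Both approaches produce the same elements, but the per-record function runs five times while the per-partition function runs only twice.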
Use cases where mapPartitions can perform better: looking up data in a database. Instead of creating a connection for each record, we can establish the connection once per partition (if a database lookup is required as part of data processing).
Word count is one example that can be implemented using mapPartitions.
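A minimal sketch of word count with mapPartitions, assuming an existing SparkContext passed in as sc and a hypothetical text file path. The function names are illustrative. The per-partition function is also exercised on plain lists so that it can be checked without a cluster.

```python
from collections import Counter
from operator import add

def words_in_partition(lines):
    """Called once per partition; lines is an iterator over its records."""
    for line in lines:
        for word in line.split(" "):
            if word:  # skip empty strings produced by repeated spaces
                yield (word, 1)

def word_count(sc, path):
    """Assumption: sc is a SparkContext and path points to a text file."""
    lines = sc.textFile(path)
    return lines.mapPartitions(words_in_partition).reduceByKey(add)

# Local check without Spark: each inner list stands in for one partition
partitions = [["hello world", "hello spark"], ["spark is fast"]]
local_counts = Counter()
for part in partitions:
    for word, count in words_in_partition(iter(part)):
        local_counts[word] += count

print(dict(local_counts))
```

The elements yielded by words_in_partition become the records of the new RDD, and reduceByKey then adds up the counts for each word.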
ranking using groupByKey
groupByKey is a very powerful API which groups the values based on the key. It can be used to solve problems such as ranking.
Task 1: Get top N products by price in each category
Let us read products data into an RDD
Convert the data to (k, v) using product category id as key and the entire product record as value
Use groupByKey
Use first to get one record, and read its second element into a regular Python collection variable (productsPerCategory)
Develop a function to get the top N products by price from that list
Validate the function using productsPerCategory
Invoke the function on output of groupByKey as part of flatMap
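The steps of Task 1 can be sketched as follows. The record layout is an assumption: comma-separated fields with the category id in the second position and the price in the fifth, and no commas inside field values. The function names, sc and path are hypothetical, and the sample records are made up for validation.

```python
def product_price(product):
    """Assumption: the price is the fifth comma-separated field."""
    return float(product.split(",")[4])

def to_category_pair(product):
    """Assumption: the category id is the second comma-separated field."""
    return (int(product.split(",")[1]), product)

def get_top_n_products(products, top_n):
    """Return the top_n products with the highest price from one category."""
    return sorted(products, key=product_price, reverse=True)[:top_n]

def top_n_products_per_category(sc, path, top_n):
    """Assumption: sc is a SparkContext and path points to the products data."""
    products = sc.textFile(path)
    grouped = products.map(to_category_pair).groupByKey()
    # kv[1] holds all product records of one category
    return grouped.flatMap(lambda kv: get_top_n_products(kv[1], top_n))

# Validate the function on a regular Python list, as the text suggests
productsPerCategory = [
    "1,2,Quest Tent,,59.98,img1",
    "2,2,Under Armour Cleat,,129.99,img2",
    "3,2,Under Armour Jersey,,89.99,img3",
    "4,2,Nike Shoe,,129.99,img4",
    "5,2,Riddell Helmet,,199.99,img5",
]
top_3 = get_top_n_products(productsPerCategory, 3)
for product in top_3:
    print(product)
```

Here exactly N records are returned per category, so when several products share a price, some of them may be cut off at the boundary.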
Task 2: Get the products with the top N prices in each category (products with the same price share a rank)
Let us read products data into an RDD
Convert the data to (k, v) using product category id as key and the entire product record as value
Use groupByKey
Use first to get one record, and read its second element into a regular Python collection variable (productsPerCategory)
Develop a function to get the top N priced products from that list (simulating dense rank)
Validate the function using productsPerCategory
Invoke the function on output of groupByKey as part of flatMap
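Task 2 can be sketched in the same way. As before, the record layout (category id in the second field, price in the fifth, no embedded commas), the function names, sc and path are assumptions. The function first finds the N highest distinct prices and then keeps every product that has one of those prices, which is how dense rank behaves.

```python
def product_price(product):
    """Assumption: the price is the fifth comma-separated field."""
    return float(product.split(",")[4])

def get_top_n_priced_products(products, top_n):
    """Return all products whose price is among the top_n distinct prices."""
    products = list(products)  # the input is iterated more than once
    top_prices = set(
        sorted({product_price(p) for p in products}, reverse=True)[:top_n]
    )
    ranked = sorted(products, key=product_price, reverse=True)
    return [p for p in ranked if product_price(p) in top_prices]

def top_n_priced_products_per_category(sc, path, top_n):
    """Assumption: sc is a SparkContext and path points to the products data."""
    products = sc.textFile(path)
    grouped = products.map(lambda p: (int(p.split(",")[1]), p)).groupByKey()
    return grouped.flatMap(lambda kv: get_top_n_priced_products(kv[1], top_n))

# Validate the function on a regular Python list
productsPerCategory = [
    "1,2,Quest Tent,,59.98,img1",
    "2,2,Under Armour Cleat,,129.99,img2",
    "3,2,Under Armour Jersey,,89.99,img3",
    "4,2,Nike Shoe,,129.99,img4",
    "5,2,Riddell Helmet,,199.99,img5",
]
top_2_priced = get_top_n_priced_products(productsPerCategory, 2)
for product in top_2_priced:
    print(product)
```

With N set to 2, three products are returned for this sample, because two of them share the second highest price.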