Now that we understand collections and how to manipulate them using traditional loops, let us look at existing APIs such as map and reduce for processing collection data.
Define problem statements
Develop myFilter, myMap and myReduce APIs
Understanding existing packages and APIs
Developing Solutions using Map Reduce APIs
Define Problem Statements
Let us look at a few similar problem statements and understand how we can build solutions for them using conventional loops.
Filtering
Get COMPLETE orders from orders data set
Get orders placed on 2013-07-25
Get order items for given order id
In all 3 cases we need to iterate through the collection, filter based on the criteria and return a new collection.
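As a rough illustration of the loop-based approach, the first filtering scenario could be sketched as below. The sample records and the comma-delimited layout (order_id, order_date, order_customer_id, order_status) are assumptions made for the example, and `get_complete_orders` is a hypothetical helper name.

```python
# Hypothetical sample records; the comma-delimited layout
# (order_id, order_date, order_customer_id, order_status) is assumed.
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
    "4,2013-07-26 00:00:00.0,8827,COMPLETE",
]


def get_complete_orders(orders):
    """Return the orders whose 4th field (order_status) is COMPLETE."""
    complete_orders = []
    for order in orders:
        # Split the record and compare the status field with the criteria
        if order.split(",")[3] == "COMPLETE":
            complete_orders.append(order)
    return complete_orders


complete_orders = get_complete_orders(orders)
```

The other two scenarios differ only in the condition inside the `if`, which is what makes a generic filter function attractive.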
Mapping
Get order_id and order_status from orders (1st and 4th fields of orders data)
Get order_item_order_id and order_item_subtotal from order_items (2nd and 5th field of order_items data)
Get order_month from orders data (extract year and month from 2nd field)
In all 3 cases we need to iterate through the collection, transform each record and add the result to a new collection.
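A minimal sketch of the loop-based mapping, assuming the same hypothetical comma-delimited order records; `get_id_and_status` and `get_order_months` are illustrative names only.

```python
# Hypothetical sample records; the comma-delimited layout
# (order_id, order_date, order_customer_id, order_status) is assumed.
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-08-02 00:00:00.0,12111,COMPLETE",
]


def get_id_and_status(orders):
    """Return (order_id, order_status) tuples built from the 1st and 4th fields."""
    result = []
    for order in orders:
        fields = order.split(",")
        result.append((int(fields[0]), fields[3]))
    return result


def get_order_months(orders):
    """Return the year and month (YYYY-MM) taken from the 2nd field."""
    result = []
    for order in orders:
        # The first 7 characters of the date field are year and month
        result.append(order.split(",")[1][:7])
    return result


order_statuses = get_id_and_status(orders)
order_months = get_order_months(orders)
```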
Reduce (on filtered and mapped order item subtotal based on order_id)
Get total revenue by adding all the revenues
Get minimum of order item subtotal
Get maximum of order item subtotal
In all 3 cases we need to initialize an aggregator, loop through the values in the collection and combine each value with the aggregator (add it for the total, compare it for the minimum or maximum).
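The aggregation pattern might look like the following with plain loops. The subtotal values are made up for illustration, and both function names are hypothetical.

```python
# Hypothetical order item subtotals, already filtered and mapped
subtotals = [199.99, 250.0, 129.99]


def get_total(values):
    """Add every value to an aggregator that starts at zero."""
    total = 0.0
    for value in values:
        total += value
    return total


def get_minimum(values):
    """Keep the smallest value seen so far; assumes values is not empty."""
    minimum = values[0]
    for value in values[1:]:
        if value < minimum:
            minimum = value
    return minimum


total_revenue = get_total(subtotals)
min_subtotal = get_minimum(subtotals)
```

The maximum follows the same shape with the comparison reversed.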
Develop myFilter, myMap and myReduce APIs
Now let us see how we can leverage lambda functions to develop generic functions that filter data, apply a transformation or mapping, and perform aggregations using reduce.
myFilter function
Define function with two arguments
first argument – lambda function with one argument (at run time we pass a code snippet which returns True or False)
second argument – collection
Develop the logic which will iterate through the elements in the collection, apply the passed filter criteria and add the elements which satisfy the criteria to a new collection.
Here is the code and also sample invocations covering all 3 scenarios discussed above.
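One possible version of myFilter, with an invocation for each of the 3 filtering scenarios, is sketched here. The sample records and their comma-delimited layouts are assumptions made for the example, not data from a real file.

```python
def myFilter(f, c):
    """Return a new list with the elements of c for which f(element) is true."""
    result = []
    for element in c:
        if f(element):
            result.append(element)
    return result


# Hypothetical sample records (layouts assumed):
# orders: order_id, order_date, order_customer_id, order_status
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
    "4,2013-07-26 00:00:00.0,8827,COMPLETE",
]
# order_items: order_item_id, order_item_order_id, product id,
# quantity, order_item_subtotal, product price
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

# Scenario 1: COMPLETE orders
complete_orders = myFilter(lambda o: o.split(",")[3] == "COMPLETE", orders)
# Scenario 2: orders placed on 2013-07-25
orders_on_date = myFilter(lambda o: o.split(",")[1].startswith("2013-07-25"), orders)
# Scenario 3: order items for a given order id (2 here)
items_for_order = myFilter(lambda i: int(i.split(",")[1]) == 2, order_items)
```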
myMap function
Define function with two arguments
first argument – lambda function with one argument (at run time we pass a code snippet which transforms one record into another)
second argument – collection
Develop the logic which will iterate through the elements in the collection, apply the passed transformation rule to each element and add the transformed elements to a new collection.
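A minimal sketch of myMap along these lines, invoked for the 3 mapping scenarios described earlier. The sample records and field layouts are assumed for illustration.

```python
def myMap(f, c):
    """Return a new list holding f(element) for every element of c."""
    result = []
    for element in c:
        result.append(f(element))
    return result


# Hypothetical sample records (comma-delimited layouts assumed)
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-08-02 00:00:00.0,256,PENDING_PAYMENT",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
]

# order_id and order_status (1st and 4th fields)
order_statuses = myMap(
    lambda o: (int(o.split(",")[0]), o.split(",")[3]), orders
)
# order_item_order_id and order_item_subtotal (2nd and 5th fields)
item_subtotals = myMap(
    lambda i: (int(i.split(",")[1]), float(i.split(",")[4])), order_items
)
# order_month (year and month from the 2nd field)
order_months = myMap(lambda o: o.split(",")[1][:7], orders)
```

Unlike myFilter, the output always has the same number of elements as the input.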
myReduce function
Define function with two arguments
first argument – lambda function with 2 arguments (at run time we need to pass logic which combines the 2 into one value, such as an arithmetic operation)
second argument – collection
Develop the logic which will iterate through the elements in the collection, apply the aggregation and return one value.
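myReduce could be sketched as follows. This version seeds the aggregator with the first element, so it assumes a non-empty list; that is a design choice for the example rather than the only way to write it. The subtotal values are made up.

```python
def myReduce(f, c):
    """Combine the elements of c into one value by applying f pairwise.

    Assumes c is a non-empty list: the first element seeds the aggregator.
    """
    result = c[0]
    for element in c[1:]:
        result = f(result, element)
    return result


# Hypothetical order item subtotals, already filtered and mapped
subtotals = [199.99, 250.0, 129.99]

total_revenue = myReduce(lambda a, b: a + b, subtotals)
min_subtotal = myReduce(lambda a, b: a if a < b else b, subtotals)
max_subtotal = myReduce(lambda a, b: a if a > b else b, subtotals)
```

Seeding with the first element lets the same function serve total, minimum and maximum without a separate initial value for each.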
Understanding existing packages and APIs
Now that we have seen how to develop reusable functions to process data, let us understand the existing APIs in different Python packages.
map
filter
functools reduce (in Python 3)
itertools has several functions
numpy
pandas
and more
We will review some of the APIs by going through help. In place of myFilter, myMap and myReduce, you can leverage the existing APIs to get similar functionality.
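As a small illustration of the swap, the built-in filter and map and functools.reduce take the function first and the collection second, just like the hand-written versions. The numbers below are arbitrary sample values.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# In Python 3, filter and map return lazy iterators,
# so wrap them in list() to materialize the results.
evens = list(filter(lambda n: n % 2 == 0, numbers))
squares = list(map(lambda n: n * n, numbers))

# reduce lives in functools in Python 3
total = reduce(lambda a, b: a + b, numbers)

# help(filter), help(map) and help(reduce) print the documentation
# for each API in an interactive session.
```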
Developing Solutions using Map Reduce APIs
Now, let us understand how to build applications using existing APIs.
Get revenue for given order id from order_items
Use filter to get the items for a given order id
Use map to get order item subtotals
Use reduce to aggregate. We can also use sum to get the total of the elements in a numeric list.
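The three steps above can be sketched as one small function. The sample records and the comma-delimited order_items layout are assumptions, and `get_order_revenue` is a hypothetical name.

```python
from functools import reduce

# Hypothetical sample records; layout assumed: order_item_id,
# order_item_order_id, product id, quantity, order_item_subtotal, price
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]


def get_order_revenue(order_items, order_id):
    """Filter items for order_id, map to subtotals, reduce to a total."""
    items = filter(lambda i: int(i.split(",")[1]) == order_id, order_items)
    subtotals = map(lambda i: float(i.split(",")[4]), items)
    # The initial value 0.0 covers the case where no item matches
    return reduce(lambda total, s: total + s, subtotals, 0.0)


revenue = get_order_revenue(order_items, 2)

# Same result using sum instead of reduce
revenue_with_sum = sum(
    float(i.split(",")[4]) for i in order_items if int(i.split(",")[1]) == 2
)
```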
There are no built-in APIs to perform by-key aggregations directly
We need to use packages such as itertools (part of the standard library) or pandas
itertools approach – Get revenue for each order id from order_items
Read data into collection – order items
Sort data using the sort function on the key we are going to group by – order item order id (groupby only groups consecutive records that share the same key)
Group data using the groupby function of itertools, passing the key on which we need to get the aggregation
groupby returns an iterator in which each element contains
key on which data is grouped
collection corresponding to the key
Apply the map function to process the collection corresponding to each key and return the sum of order item subtotal
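A sketch of the itertools approach, using an in-memory list in place of reading a file. The sample records, the field layout and the helper names `order_id_of` and `subtotal_of` are all assumptions made for the example.

```python
from itertools import groupby

# Hypothetical sample records, deliberately not sorted by order id.
# Layout assumed: order_item_id, order_item_order_id, product id,
# quantity, order_item_subtotal, price
order_items = [
    "2,2,1073,1,199.99,199.99",
    "5,4,897,2,49.98,24.99",
    "1,1,957,1,299.98,299.98",
    "3,2,502,5,250.0,50.0",
    "6,4,365,5,299.95,59.99",
    "4,2,403,1,129.99,129.99",
]


def order_id_of(item):
    """Grouping key: the 2nd field, order_item_order_id."""
    return int(item.split(",")[1])


def subtotal_of(item):
    """Value to aggregate: the 5th field, order_item_subtotal."""
    return float(item.split(",")[4])


# groupby only merges adjacent records, so sort on the same key first
sorted_items = sorted(order_items, key=order_id_of)

# Each element produced by groupby is a (key, group iterator) pair
revenue_per_order = list(
    map(
        lambda pair: (pair[0], round(sum(map(subtotal_of, pair[1])), 2)),
        groupby(sorted_items, key=order_id_of),
    )
)
```

Each group iterator is consumed inside the lambda before groupby advances to the next key, which matters because the groups share one underlying iterator.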
pandas approach – Get revenue for each order id from order_items
We will look into pandas in detail as part of the next chapter
Create a list of column names
Pass the path and column names to the pandas read_csv function to create a data frame
We can refer to attributes in data frames using names
Apply the group by function to group data using order item order id and invoke the aggregate function sum on order item subtotal – this will return a new data frame which contains order item order id and revenue.
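The pandas steps might be sketched as below. An in-memory `io.StringIO` buffer stands in for the file path so the example is self-contained; the sample rows and every column name other than order_item_order_id and order_item_subtotal are assumptions.

```python
import io

import pandas as pd

# Column names; only order_item_order_id and order_item_subtotal come
# from the text, the remaining names are assumed for illustration.
columns = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price",
]

# Hypothetical sample data standing in for the order_items file
sample_csv = io.StringIO(
    "1,1,957,1,299.98,299.98\n"
    "2,2,1073,1,199.99,199.99\n"
    "3,2,502,5,250.0,50.0\n"
    "4,2,403,1,129.99,129.99\n"
    "5,4,897,2,49.98,24.99\n"
    "6,4,365,5,299.95,59.99\n"
)

# The data has no header row, so column names are supplied with names=
order_items = pd.read_csv(sample_csv, names=columns)

# Group by order id, sum the subtotals, and name the result column revenue
revenue_per_order = (
    order_items.groupby("order_item_order_id")["order_item_subtotal"]
    .sum()
    .reset_index(name="revenue")
)
```

With a real file, the path string would be passed to read_csv in place of the buffer.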