Manipulating Collections using Map Reduce APIs – Python 3

Now that we understand collections and how to manipulate them using traditional loops, let us look at existing APIs such as map, filter and reduce to process collection data.

  • Define problem statements
  • Develop myFilter, myMap and myReduce APIs
  • Understanding existing packages and APIs
  • Developing Solutions using Map Reduce APIs

Define Problem Statements

Let us look at a few similar problem statements and understand how we can build solutions using conventional loops.

  • Filtering
    • Get COMPLETE orders from orders data set
    • Get orders placed on 2013-07-25
    • Get order items for given order id
    • In all 3 cases we need to iterate through the collection, filter based on the criteria and return a new collection.

  • Mapping
    • Get order_id and order_status from orders (1st and 4th fields of orders data)
    • Get order_item_order_id and order_item_subtotal from order_items (2nd and 5th fields of order_items data)
    • Get order_month from orders data (extract year and month from the 2nd field)
    • In all 3 cases we need to iterate through the collection, transform individual records and add them to a new collection.

  • Reduce (on order item subtotals, filtered and mapped for a given order_id)
    • Get total revenue by adding all the revenues
    • Get minimum of order item subtotal
    • Get maximum of order item subtotal
    • In all 3 cases we need to initialize an aggregator, loop through the values in the collection and combine each value with the aggregator (add for the total, compare for the minimum and maximum).
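The three patterns above can be sketched with conventional loops. This is a minimal illustration, assuming each record is a comma-separated string; the sample records and every field value other than the positions named above are made up.

```python
# Hypothetical sample records (assumed comma-separated strings).
# orders: 1st field is order_id, 2nd is the order date, 4th is order_status
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
]
# order_items: 2nd field is order_item_order_id, 5th is order_item_subtotal
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

# Filtering: keep only COMPLETE orders
complete_orders = []
for order in orders:
    if order.split(",")[3] == "COMPLETE":
        complete_orders.append(order)

# Mapping: transform each order into (order_id, order_status)
id_and_status = []
for order in orders:
    fields = order.split(",")
    id_and_status.append((int(fields[0]), fields[3]))

# Reduce: total revenue for order id 2 (filter, map and aggregate in one loop)
revenue = 0.0
for item in order_items:
    fields = item.split(",")
    if int(fields[1]) == 2:
        revenue += float(fields[4])
revenue = round(revenue, 2)
```

Each loop repeats the same skeleton (iterate, test or transform, collect), which is what the generic functions in the next section factor out.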

Develop myFilter, myMap and myReduce APIs

Now let us see how we can leverage lambda functions to develop generic functions that filter data, apply a transformation (mapping) and perform aggregations (reduce).

  • myFilter function
    • Define a function with two arguments
    • first argument – lambda function with one argument (at run time we pass a code snippet which returns True or False)
    • second argument – collection
    • Develop the logic which will iterate through the elements in the collection, apply the passed filter criteria and add the elements which satisfy the criteria to a new collection.
    • Here is the code and also sample invocations covering all 3 scenarios discussed above.
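A minimal sketch of such a function follows. It is an illustration rather than a definitive listing, and it assumes the records are comma-separated strings; the sample data is hypothetical.

```python
def myFilter(f, c):
    """Return a new list with the elements of c for which f(element) is true."""
    result = []
    for e in c:
        if f(e):
            result.append(e)
    return result


# Hypothetical sample records (assumed comma-separated strings)
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

# Get COMPLETE orders (status is the 4th field)
complete = myFilter(lambda o: o.split(",")[3] == "COMPLETE", orders)

# Get orders placed on 2013-07-25 (date is the 2nd field)
on_date = myFilter(lambda o: o.split(",")[1].startswith("2013-07-25"), orders)

# Get order items for a given order id (order id is the 2nd field)
items_for_order = myFilter(lambda oi: int(oi.split(",")[1]) == 2, order_items)
```

Only the lambda changes between the three invocations; the iteration logic is written once.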

  • myMap function
    • Define a function with two arguments
    • first argument – lambda function with one argument (at run time we pass a code snippet which transforms one record into another)
    • second argument – collection
    • Develop the logic which will iterate through the elements in the collection, apply the passed transformation to each element and add the transformed elements to a new collection.
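The myMap steps above might look as follows. This is a sketch under the assumption that records are comma-separated strings and that order_month is represented as a "YYYY-MM" string; the sample data is hypothetical.

```python
def myMap(f, c):
    """Return a new list containing f(element) for every element of c."""
    result = []
    for e in c:
        result.append(f(e))
    return result


# Hypothetical sample records (assumed comma-separated strings)
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

# order_id and order_status (1st and 4th fields)
id_status = myMap(lambda o: (int(o.split(",")[0]), o.split(",")[3]), orders)

# order_item_order_id and order_item_subtotal (2nd and 5th fields)
item_subtotals = myMap(
    lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])),
    order_items,
)

# order_month: first 7 characters of the date, assumed to be "YYYY-MM"
order_months = myMap(lambda o: o.split(",")[1][:7], orders)
```

Unlike myFilter, the result always has the same number of elements as the input.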

  • myReduce function
    • Define a function with two arguments
    • first argument – lambda function with 2 arguments (at run time we need to pass logic which combines the 2 values, for example an arithmetic operation)
    • second argument – collection
    • Develop the logic which will iterate through the elements in the collection, apply the aggregation and return one value.
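One possible shape for myReduce is sketched below. Because the function takes only two arguments, this version seeds the aggregator with the first element; raising ValueError for an empty collection is a design choice of this sketch. The subtotals are hypothetical values.

```python
def myReduce(f, c):
    """Combine the elements of c into one value by applying f pairwise."""
    it = iter(c)
    try:
        # Seed the aggregator with the first element
        acc = next(it)
    except StopIteration:
        raise ValueError("myReduce() called on an empty collection") from None
    for e in it:
        acc = f(acc, e)
    return acc


# Hypothetical order item subtotals, already filtered and mapped
subtotals = [199.99, 250.0, 129.99]

total = myReduce(lambda x, y: x + y, subtotals)
minimum = myReduce(lambda x, y: x if x < y else y, subtotals)
maximum = myReduce(lambda x, y: x if x > y else y, subtotals)
```

Seeding with the first element works for sum, minimum and maximum alike, whereas starting from a fixed value such as 0 would give a wrong minimum for all-positive data.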

Understanding existing packages and APIs

Now that we have seen how to develop reusable functions to process the data, let us understand the existing APIs in different Python packages.

  • map
  • filter
  • functools reduce (in Python 3)
  • itertools has several functions
  • numpy
  • pandas
  • and more

We will review some of these APIs by going through their help (for example help(map)). In place of myFilter, myMap and myReduce, you can use the existing filter, map and functools.reduce APIs to get similar functionality.
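As a quick illustration of the built-in equivalents, the sketch below uses hypothetical comma-separated order records. Note that in Python 3 filter and map return lazy iterators rather than lists, so the results are wrapped in list to materialize them.

```python
from functools import reduce  # reduce lives in functools in Python 3

# Hypothetical sample records (assumed comma-separated strings)
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-26 00:00:00.0,12111,COMPLETE",
]

# filter(function, iterable) plays the role of myFilter
complete_iter = filter(lambda o: o.split(",")[3] == "COMPLETE", orders)
is_lazy = not isinstance(complete_iter, list)  # an iterator, not a list
complete = list(complete_iter)

# map(function, iterable) plays the role of myMap
order_ids = list(map(lambda o: int(o.split(",")[0]), orders))

# reduce(function, iterable) plays the role of myReduce
max_order_id = reduce(lambda x, y: x if x > y else y, order_ids)

# In an interactive session, the documentation can be read with:
# help(filter); help(map); help(reduce)
```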

Developing Solutions using Map Reduce APIs

Now, let us understand how to build applications using existing APIs.

  • Get revenue for given order id from order_items
    • Use filter to get the items for a given order id
    • Use map to get the order item subtotals
    • Use reduce to aggregate. We can also use sum to get the total of the elements in a numeric list.
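The filter, map and reduce steps for a single order id can be chained as in this sketch. It assumes comma-separated order item records with the order id in the 2nd field and the subtotal in the 5th; the sample data is hypothetical.

```python
from functools import reduce

# Hypothetical sample records (assumed comma-separated strings)
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]
order_id = 2

# Step 1: keep only the items that belong to the given order id
items = filter(lambda oi: int(oi.split(",")[1]) == order_id, order_items)

# Step 2: extract the subtotal of each remaining item
subtotals = list(map(lambda oi: float(oi.split(",")[4]), items))

# Step 3: aggregate, either with reduce or with the built-in sum
revenue = reduce(lambda x, y: x + y, subtotals)
revenue_with_sum = sum(subtotals)
```

The subtotals are materialized with list because a map iterator can be consumed only once, and here it is aggregated twice.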

  • The built-in map, filter and reduce APIs do not directly support by-key aggregations
  • We need to use a module or package such as itertools (standard library) or pandas (third party)
  • itertools approach – Get revenue for each order id from order_items
    • Read data into a collection – order items
    • Sort the data using the sort function (or sorted) based on the key we are going to use to group – order item order id. groupby only groups consecutive records with the same key, so the data must be sorted first.
    • Group the data using the groupby function of itertools, passing the key on which we need the aggregation
    • groupby returns an iterator in which each element contains
      • the key on which the data is grouped
      • an iterator over the records corresponding to the key
    • Apply the map function to process the records corresponding to each key and return the sum of order item subtotal
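The itertools steps above can be sketched as follows. The records are hypothetical comma-separated strings defined inline (standing in for data read from a file), and the helper name order_id_of is introduced only for this illustration.

```python
from itertools import groupby

# Hypothetical sample records; the last one is deliberately out of order
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
    "5,1,365,2,100.0,50.0",
]


def order_id_of(item):
    """Grouping key: order_item_order_id, the 2nd field."""
    return int(item.split(",")[1])


# Sort by the grouping key; groupby only merges adjacent records
order_items_sorted = sorted(order_items, key=order_id_of)

# Each element produced by groupby is a (key, iterator of records) pair
grouped = groupby(order_items_sorted, key=order_id_of)

# Sum the subtotal (5th field) of the records in each group
revenue_per_order = list(map(
    lambda kv: (kv[0], round(sum(float(oi.split(",")[4]) for oi in kv[1]), 2)),
    grouped,
))
```

Without the sort, order id 1 would appear as two separate groups because its records are not adjacent in the input.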

  • pandas approach – Get revenue for each order id from order_items
    • We will look into pandas in detail as part of the next chapter
    • Create a list of column names
    • Pass the path and column names to the pandas read_csv function to create a data frame
    • We can refer to columns in the data frame by name
    • Apply the groupby function to group the data by order item order id and invoke the aggregate function sum on order item subtotal – this returns a new pandas object (a Series or a data frame, depending on how the column is selected) which contains order item order id and revenue.
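A minimal pandas sketch of these steps is shown below. It makes several assumptions: an in-memory io.StringIO buffer stands in for the file path, the sample rows are hypothetical, and every column name other than order_item_order_id and order_item_subtotal is a guess made for illustration.

```python
import io

import pandas as pd

# Stand-in for a file path: read_csv also accepts file-like objects
csv_data = io.StringIO(
    "1,1,957,1,299.98,299.98\n"
    "2,2,1073,1,199.99,199.99\n"
    "3,2,502,5,250.0,50.0\n"
    "4,2,403,1,129.99,129.99\n"
)

# Column names; only the 2nd and 5th are taken from the text above
columns = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price",
]

# The data has no header row, so the names are supplied explicitly
order_items = pd.read_csv(csv_data, names=columns)

# Group by order id and sum the subtotal; selecting a single column
# yields a Series, which reset_index turns back into a data frame
revenue_per_order = (
    order_items
    .groupby("order_item_order_id")["order_item_subtotal"]
    .sum()
    .round(2)
    .reset_index(name="revenue")
)
```

Here pandas takes care of the grouping internally, so no explicit sort is needed before the aggregation.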