Aggregations using reduce
Let us understand how to perform global aggregations using `reduce`.
- We can use `reduce` on an iterable to return an aggregated result.
- It takes the aggregation logic and the iterable as arguments. We can pass the aggregation logic either as a regular function or as a lambda function.
- `reduce` returns a single value, such as an `int` or a `float`. The result is typically of the same type as the elements in the collection being processed.
- Unlike `map` and `filter`, which are built in, we need to import `reduce` from `functools`.
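As a minimal sketch of the points above, the same aggregation can be written once with a regular function and once with a lambda. The names `add` and `numbers` and the sample values are illustrative assumptions, not part of the data sets used in this notebook.

```python
from functools import reduce


def add(total, element):
    """Aggregation logic written as a regular two-argument function."""
    return total + element


numbers = [10, 20, 30, 40]  # illustrative values, not from the data sets

# The same aggregation, passed as a regular function and as a lambda
total_with_function = reduce(add, numbers)
total_with_lambda = reduce(lambda total, element: total + element, numbers)

# With float elements the aggregated result is a float as well
float_total = reduce(add, [1.5, 2.5])

print(total_with_function)  # 100
print(total_with_lambda)    # 100
print(float_total)          # 4.0
```

Both calls give the same result; the lambda form is convenient when the logic is a single expression.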
In [1]:
%run 02_preparing_data_sets.ipynb
In [2]:
orders[:10]
Out[2]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [3]:
len(orders)
Out[3]:
68883
In [4]:
order_items[:10]
Out[4]:
['1,1,957,1,299.98,299.98', '2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99', '5,4,897,2,49.98,24.99', '6,4,365,5,299.95,59.99', '7,4,502,3,150.0,50.0', '8,4,1014,4,199.92,49.98', '9,5,957,1,299.98,299.98', '10,5,365,5,299.95,59.99']
In [5]:
len(order_items)
Out[5]:
172198
Task 1 – Get Count
Use orders to get the total number of records for a given month (201307).
- Filter the data.
- Perform row level transformation by changing each record to 1.
- Use reduce to aggregate.
In [6]:
orders[:10]
Out[6]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [7]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
In [8]:
order.split(',')
Out[8]:
['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']
In [9]:
order.split(',')[1]
Out[9]:
'2013-07-25 00:00:00.0'
In [10]:
order.split(',')[1][:7]
Out[10]:
'2013-07'
In [11]:
order.split(',')[1][:7].replace('-', '')
Out[11]:
'201307'
In [12]:
int(order.split(',')[1][:7].replace('-', ''))
Out[12]:
201307
In [13]:
orders_filtered = filter(
    lambda order: int(order.split(',')[1][:7].replace('-', '')) == 201307,
    orders
)
In [14]:
list(orders_filtered)[:10]
Out[14]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [15]:
# filter returns a one-time iterator, so recreate it before counting
orders_filtered = filter(
    lambda order: int(order.split(',')[1][:7].replace('-', '')) == 201307,
    orders
)
len(list(orders_filtered))
Out[15]:
1533
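The filter is rebuilt in each cell here because `filter` and `map` return one-time iterators in Python 3: once `list` or `reduce` has consumed them, nothing is left. The sketch below shows this on a small illustrative list; `numbers` and `evens` are assumed names.

```python
numbers = [1, 2, 3, 4]  # illustrative values

evens = filter(lambda n: n % 2 == 0, numbers)

first_pass = list(evens)   # consumes the iterator
second_pass = list(evens)  # nothing left to read

print(first_pass)   # [2, 4]
print(second_pass)  # []
```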
In [16]:
orders_filtered = filter(
    lambda order: int(order.split(',')[1][:7].replace('-', '')) == 201307,
    orders
)
In [17]:
orders_mapped = map(
    lambda order: 1,
    orders_filtered
)
In [18]:
list(orders_mapped)[:10]
Out[18]:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
In [19]:
orders_filtered = filter(
    lambda order: int(order.split(',')[1][:7].replace('-', '')) == 201307,
    orders
)
orders_mapped = map(
    lambda order: 1,
    orders_filtered
)
len(list(orders_mapped))
Out[19]:
1533
In [20]:
from functools import reduce
reduce?
Docstring:
reduce(function, sequence[, initial]) -> value

Apply a function of two arguments cumulatively to the items of a sequence,
from left to right, so as to reduce the sequence to a single value.
For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5). If initial is present, it is placed before the items
of the sequence in the calculation, and serves as a default when the
sequence is empty.
Type: builtin_function_or_method
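The optional `initial` argument described in the docstring can be tried on a small list. This is a minimal sketch with illustrative values; `add` and the result names are assumptions made for the example.

```python
from functools import reduce


def add(tot, ele):
    return tot + ele


without_initial = reduce(add, [1, 2, 3, 4, 5])    # ((((1+2)+3)+4)+5)
with_initial = reduce(add, [1, 2, 3, 4, 5], 100)  # 100 is placed first
empty_with_initial = reduce(add, [], 0)           # initial acts as the default

# Without an initial value, an empty sequence raises TypeError
try:
    reduce(add, [])
    empty_raised = False
except TypeError:
    empty_raised = True

print(without_initial)     # 15
print(with_initial)        # 115
print(empty_with_initial)  # 0
print(empty_raised)        # True
```

Passing an initial value is a reasonable safeguard whenever a filter may leave nothing to aggregate.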
In [21]:
orders_filtered = filter(
    lambda order: int(order.split(',')[1][:7].replace('-', '')) == 201307,
    orders
)
orders_mapped = map(
    lambda order: 1,
    orders_filtered
)
In [22]:
reduce(
    lambda tot, ele: tot + ele,
    orders_mapped
)
Out[22]:
1533
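The filter, map and reduce steps can also be wrapped in one function so that the month becomes a parameter. The sketch below is one possible arrangement, not part of the original notebook: `get_order_count` and `sample_orders` are hypothetical names, the two sample records are copied from the orders output shown earlier, and the initial value of 0 is an added assumption so that a month with no orders returns 0 instead of raising an error.

```python
from functools import reduce


def get_order_count(orders, month):
    """Count the orders whose date falls in the given month (YYYYMM as int)."""
    orders_filtered = filter(
        lambda order: int(order.split(',')[1][:7].replace('-', '')) == month,
        orders
    )
    orders_mapped = map(lambda order: 1, orders_filtered)
    # Initial value 0 avoids a TypeError when no record matches the month
    return reduce(lambda tot, ele: tot + ele, orders_mapped, 0)


# Two records copied from the orders output shown earlier
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]

count_201307 = get_order_count(sample_orders, 201307)
count_201401 = get_order_count(sample_orders, 201401)

print(count_201307)  # 2
print(count_201401)  # 0
```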