Limitations of using Loops¶
There are several limitations using loops.
- If you look at the below examples related to processing collections using loops, most of the functions have similar logic to iterate through elements.
- We are spending more time on coding non business logic.
- It results in too much of code and it can become a maintenance problem.
In [13]:
%run 07_preparing_data_sets.ipynb
In [14]:
def get_customer_orders(orders, customer_id):
orders_filtered = []
for order in orders:
if int(order.split(',')[2]) == customer_id:
orders_filtered.append(order)
return orders_filtered
In [15]:
def get_customer_orders_for_month(orders, customer_id, order_month):
orders_filtered = []
for order in orders:
order_elements = order.split(',')
if (int(order_elements[2]) == customer_id and
order_elements[1].startswith(order_month)):
orders_filtered.append(order)
return orders_filtered
In [16]:
for order in orders:
order_elements = order.split(',')
if int(order_elements[2]) == 12431 \
and order_elements[1].startswith('2014-01') \
and (order_elements[3] in ('PROCESSING', 'PENDING_PAYMENT')):
print(order)
27585,2014-01-12 00:00:00.0,12431,PROCESSING 28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT
- Map Reduce APIs or Higher level libraries such as Pandas will solve these problems.
- We do not have to develop loops and conditionals.
- Loops and Conditionals are taken care by the existing APIs.
- We can just focus on business logic. It can be passed using Lambda Functions.
{note}
Here is the approach using `filter` that comes as part of Map Reduce APIs. You will learn about Map Reduce APIs soon.
In [17]:
orders_filtered = filter(
lambda order: int(order.split(',')[2]) == 12431,
orders
)
list(orders_filtered)
Out[17]:
['3774,2013-08-16 00:00:00.0,12431,CANCELED', '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT', '4032,2013-08-17 00:00:00.0,12431,ON_HOLD', '22812,2013-12-12 00:00:00.0,12431,PENDING', '22927,2013-12-13 00:00:00.0,12431,CLOSED', '25614,2013-12-30 00:00:00.0,12431,CLOSED', '27585,2014-01-12 00:00:00.0,12431,PROCESSING', '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT', '29109,2014-01-21 00:00:00.0,12431,ON_HOLD', '29232,2014-01-21 00:00:00.0,12431,ON_HOLD', '45894,2014-05-06 00:00:00.0,12431,CLOSED', '46217,2014-05-07 00:00:00.0,12431,CLOSED', '49678,2014-05-31 00:00:00.0,12431,PENDING', '51865,2014-06-15 00:00:00.0,12431,PROCESSING', '63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT', '67110,2014-07-14 00:00:00.0,12431,PENDING']
In [18]:
orders_filtered = filter(
lambda order: int(order.split(',')[2]) == 12431
and order.split(',')[1].startswith('2014-01'),
orders
)
list(orders_filtered)
Out[18]:
['27585,2014-01-12 00:00:00.0,12431,PROCESSING', '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT', '29109,2014-01-21 00:00:00.0,12431,ON_HOLD', '29232,2014-01-21 00:00:00.0,12431,ON_HOLD']
In [19]:
orders_filtered = filter(
lambda order: int(order.split(',')[2]) == 12431
and order.split(',')[1].startswith('2014-01')
and (order.split(',')[3] in ('PROCESSING', 'PENDING_PAYMENT')),
orders
)
list(orders_filtered)
Out[19]:
['27585,2014-01-12 00:00:00.0,12431,PROCESSING', '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT']
{note}
Here is the approach using Pandas library. You will learn about how to process data using Pandas in subsequent sections.
In [20]:
import pandas as pd
orders_schema = [
'order_id',
'order_date',
'order_customer_id',
'order_status'
]
orders = pd.read_csv('/data/retail_db/orders/part-00000', names=orders_schema)
In [21]:
orders.query('order_customer_id == 12431')
Out[21]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
3773 | 3774 | 2013-08-16 00:00:00.0 | 12431 | CANCELED |
3869 | 3870 | 2013-08-17 00:00:00.0 | 12431 | PENDING_PAYMENT |
4031 | 4032 | 2013-08-17 00:00:00.0 | 12431 | ON_HOLD |
22811 | 22812 | 2013-12-12 00:00:00.0 | 12431 | PENDING |
22926 | 22927 | 2013-12-13 00:00:00.0 | 12431 | CLOSED |
25613 | 25614 | 2013-12-30 00:00:00.0 | 12431 | CLOSED |
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
45893 | 45894 | 2014-05-06 00:00:00.0 | 12431 | CLOSED |
46216 | 46217 | 2014-05-07 00:00:00.0 | 12431 | CLOSED |
49677 | 49678 | 2014-05-31 00:00:00.0 | 12431 | PENDING |
51864 | 51865 | 2014-06-15 00:00:00.0 | 12431 | PROCESSING |
63145 | 63146 | 2014-02-13 00:00:00.0 | 12431 | PENDING_PAYMENT |
67109 | 67110 | 2014-07-14 00:00:00.0 | 12431 | PENDING |
In [22]:
orders.query('order_customer_id == 12431 & order_date.str.startswith("2014-01")')
Out[22]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
In [23]:
orders.query('order_customer_id == 12431 & ' +
'order_date.str.startswith("2014-01") &' +
'order_status in ("PROCESSING", "PENDING_PAYMENT")'
)
Out[23]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |