Filtering Data¶
Let us perform few tasks to understand how to filter the data in collections using loops and conditionals.
Here are the details about orders.
- Data is in text file format
- Each line in the file contains one record.
- Each record contains 4 attributes which are separated by “,”
- order_id
- order_date
- order_customer_id
- order_status
In [1]:
%%sh
ls -ltr /data/retail_db/orders/part-00000
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 /data/retail_db/orders/part-00000
In [2]:
%%sh
tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE 68875,2014-07-04 00:00:00.0,10637,ON_HOLD 68876,2014-07-06 00:00:00.0,4124,COMPLETE 68877,2014-07-07 00:00:00.0,9692,ON_HOLD 68878,2014-07-08 00:00:00.0,6753,COMPLETE 68879,2014-07-09 00:00:00.0,778,COMPLETE 68880,2014-07-13 00:00:00.0,1117,COMPLETE 68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT 68882,2014-07-22 00:00:00.0,10000,ON_HOLD 68883,2014-07-23 00:00:00.0,5533,COMPLETE
In [3]:
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
In [4]:
type(orders_file)
Out[4]:
_io.TextIOWrapper
In [5]:
orders_raw = orders_file.read()
In [6]:
type(orders_raw)
Out[6]:
str
In [7]:
orders_raw.splitlines?
Signature: orders_raw.splitlines(keepends=False) Docstring: Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true. Type: builtin_function_or_method
In [8]:
orders = orders_raw.splitlines()
In [9]:
type(orders)
Out[9]:
list
In [10]:
orders[:10]
Out[10]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [11]:
type(orders[0])
Out[11]:
str
In [12]:
len(orders)
Out[12]:
68883
In [13]:
%%sh
wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000
Task 1¶
Create a function by name get_customer_orders which take orders list and customer_id as arguments and return all the orders placed by customer_id
In [14]:
orders[:10]
Out[14]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [15]:
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'
In [16]:
int(order.split(',')[2]) == 12111
Out[16]:
True
In [17]:
def get_customer_orders(orders, customer_id):
orders_filtered = []
for order in orders:
if int(order.split(',')[2]) == customer_id:
orders_filtered.append(order)
return orders_filtered
In [18]:
# Use the function and get all the orders which are placed by customer with id 12431
get_customer_orders(orders, 12431)
Out[18]:
['3774,2013-08-16 00:00:00.0,12431,CANCELED', '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT', '4032,2013-08-17 00:00:00.0,12431,ON_HOLD', '22812,2013-12-12 00:00:00.0,12431,PENDING', '22927,2013-12-13 00:00:00.0,12431,CLOSED', '25614,2013-12-30 00:00:00.0,12431,CLOSED', '27585,2014-01-12 00:00:00.0,12431,PROCESSING', '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT', '29109,2014-01-21 00:00:00.0,12431,ON_HOLD', '29232,2014-01-21 00:00:00.0,12431,ON_HOLD', '45894,2014-05-06 00:00:00.0,12431,CLOSED', '46217,2014-05-07 00:00:00.0,12431,CLOSED', '49678,2014-05-31 00:00:00.0,12431,PENDING', '51865,2014-06-15 00:00:00.0,12431,PROCESSING', '63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT', '67110,2014-07-14 00:00:00.0,12431,PENDING']
In [19]:
len(get_customer_orders(orders, 12431))
Out[19]:
16
Task 2¶
Create a function by name get_customer_orders_for_month which take orders list, customer_id and month in the format YYYY-MM as arguments and return all the orders placed by customer_id for a given month.
In [20]:
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'
In [21]:
int(order.split(',')[2]) == 12111
Out[21]:
True
In [22]:
order.split(',')[1].startswith('2013-07')
Out[22]:
True
In [23]:
import datetime as dt
d = dt.datetime.strptime(order.split(',')[1], '%Y-%m-%d %H:%M:%S.%f')
In [24]:
d.year == 2013 and d.month == 7
Out[24]:
True
In [25]:
int(order.split(',')[2]) == 12111 and order.split(',')[1].startswith('2013-07')
Out[25]:
True
In [26]:
def get_customer_orders_for_month(orders, customer_id, order_month):
orders_filtered = []
for order in orders:
order_elements = order.split(',')
if (int(order_elements[2]) == customer_id and
order_elements[1].startswith(order_month)):
orders_filtered.append(order)
return orders_filtered
In [27]:
# Use the function and get all the orders which are placed by customer with id 12431 in January 2014
get_customer_orders_for_month(orders, 12431, '2014-01')
Out[27]:
['27585,2014-01-12 00:00:00.0,12431,PROCESSING', '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT', '29109,2014-01-21 00:00:00.0,12431,ON_HOLD', '29232,2014-01-21 00:00:00.0,12431,ON_HOLD']
Task 3¶
Write ad hoc code to get all the orders which are placed by customer with id 12431 in January 2014 and status is in PENDING_PAYMENT or PROCESSING
In [28]:
for order in orders:
order_elements = order.split(',')
if int(order_elements[2]) == 12431 \
and order_elements[1].startswith('2014-01') \
and (order_elements[3] in ('PROCESSING', 'PENDING_PAYMENT')):
print(order)
27585,2014-01-12 00:00:00.0,12431,PROCESSING 28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT