Validate myJoin Function¶
Let us perform a task to validate myJoin. In this case we will be joining order_id and order_date from orders with order_id and order_item_subtotal from order_items.
- Read orders data set from file.
- Create list of tuples using order_id and order_date with order_id as first element. We can use
myMap
function developed earlier to convert list of strings related to orders to list of tuples with order_id and order_date. - Read order_items data set from file.
- Create list of tuples using order_item_order_id and order_item_subtotal with order_item_order_id as first element. We can use the same
myMap
function to convert list of strings related to order_items to list of tuples with order_item_order_id and order_item_subtotal. - Invoke myJoin function with these collections.
In [19]:
%run 04_develop_myMap_function.ipynb
In [20]:
orders_path = "/data/retail_db/orders/part-00000"
orders = open(orders_path). \
read(). \
splitlines()
In [21]:
orders[:10]
Out[21]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [22]:
len(orders)
Out[22]:
68883
In [23]:
orders_map = myMap(orders,
lambda order: (int(order.split(',')[0]), order.split(',')[1])
)
orders_map[:10]
Out[23]:
[(1, '2013-07-25 00:00:00.0'), (2, '2013-07-25 00:00:00.0'), (3, '2013-07-25 00:00:00.0'), (4, '2013-07-25 00:00:00.0'), (5, '2013-07-25 00:00:00.0'), (6, '2013-07-25 00:00:00.0'), (7, '2013-07-25 00:00:00.0'), (8, '2013-07-25 00:00:00.0'), (9, '2013-07-25 00:00:00.0'), (10, '2013-07-25 00:00:00.0')]
In [24]:
len(orders_map)
Out[24]:
68883
In [25]:
len(dict(orders_map).keys()) # It should be same as count in orders as order ids are unique in orders data set
Out[25]:
68883
In [26]:
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = open(order_items_path). \
read(). \
splitlines()
In [27]:
order_items[:10]
Out[27]:
['1,1,957,1,299.98,299.98', '2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99', '5,4,897,2,49.98,24.99', '6,4,365,5,299.95,59.99', '7,4,502,3,150.0,50.0', '8,4,1014,4,199.92,49.98', '9,5,957,1,299.98,299.98', '10,5,365,5,299.95,59.99']
In [28]:
len(order_items)
Out[28]:
172198
In [29]:
order_items_map = myMap(order_items,
lambda order_item: (int(order_item.split(',')[1]),
float(order_item.split(',')[4])
)
)
In [30]:
order_items_map[:10]
Out[30]:
[(1, 299.98), (2, 199.99), (2, 250.0), (2, 129.99), (4, 49.98), (4, 299.95), (4, 150.0), (4, 199.92), (5, 299.98), (5, 299.95)]
In [31]:
len(order_items_map)
Out[31]:
172198
In [32]:
len(dict(order_items_map).keys())
Out[32]:
57431
In [33]:
%run 10_develop_myJoin_function.ipynb
In [34]:
orders_join = myJoin(orders_map, order_items_map)
In [35]:
orders_join[:10]
Out[35]:
[(1, ('2013-07-25 00:00:00.0', 299.98)), (2, ('2013-07-25 00:00:00.0', 199.99)), (2, ('2013-07-25 00:00:00.0', 250.0)), (2, ('2013-07-25 00:00:00.0', 129.99)), (4, ('2013-07-25 00:00:00.0', 49.98)), (4, ('2013-07-25 00:00:00.0', 299.95)), (4, ('2013-07-25 00:00:00.0', 150.0)), (4, ('2013-07-25 00:00:00.0', 199.92)), (5, ('2013-07-25 00:00:00.0', 299.98)), (5, ('2013-07-25 00:00:00.0', 299.95))]
In [36]:
# As there is one to many relationship between orders and order_items,
# this will be same as number of elements in order_items
len(orders_join)
Out[36]:
172198