Preparing Data Sets¶
We will be primarily using orders and order_items data set to understand about manipulating collections.
- orders is available at path /data/retail_db/orders/part-00000
- order_items is available at path /data/retail_db/order_items/part-00000
- orders – columns
- order_id – it is of type integer and unique
- order_date – it can be considered as string
- order_customer_id – it is of type integer
- order_status – it is of type string
- order_items – columns
- order_item_id – it is of type integer and unique
- order_item_order_id – it is of type integer and refers to orders.order_id
- order_item_product_id – it is of type integer and refers to products.product_id
- order_item_quantity – it is of type integer and represents number of products as an order item with in an order.
- order_item_subtotal – it is item level revenue (product of order_item_quantity and order_item_product_price)
- order_item_product_price – it is product price for each item with in an order.
- orders is parent data set to order_items and will contain one record per order. Each order can contain multiple items.
- order_items is child data set to orders and can contain multiple entries for a given order_item_order_id.### Task 1 – Read orders into collection
Let us read orders data set into the collection called as orders. This will be used later.
In [2]:
orders_path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(orders_path)
In [3]:
orders_raw = orders_file.read()
In [4]:
orders = orders_raw.splitlines()
In [5]:
orders[:10]
Out[5]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [6]:
len(orders) # same as number of records in the file
Out[6]:
68883
Task 2 – Read order_items into collection¶
Let us read order_items data set into the collection called as order_items. This will be used later.
In [7]:
order_items_path = '/data/retail_db/order_items/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\order_items\\part-00000
order_items_file = open(order_items_path)
In [8]:
order_items_raw = order_items_file.read()
In [9]:
order_items = order_items_raw.splitlines()
In [10]:
order_items[:10]
Out[10]:
['1,1,957,1,299.98,299.98', '2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99', '5,4,897,2,49.98,24.99', '6,4,365,5,299.95,59.99', '7,4,502,3,150.0,50.0', '8,4,1014,4,199.92,49.98', '9,5,957,1,299.98,299.98', '10,5,365,5,299.95,59.99']
In [11]:
len(order_items) # same as number of records in the file
Out[11]:
172198