Row level transformations¶
Let us understand how to perform row level transformations using orders data set. Here are the details about orders.
- Data is in text file format
- Each line in the file contains one record.
- Each record contains 4 attributes which are separated by “,”
- order_id
- order_date
- order_customer_id
- order_status
In [1]:
%%sh
ls -ltr /data/retail_db/orders/part-00000
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 /data/retail_db/orders/part-00000
In [2]:
%%sh
tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE 68875,2014-07-04 00:00:00.0,10637,ON_HOLD 68876,2014-07-06 00:00:00.0,4124,COMPLETE 68877,2014-07-07 00:00:00.0,9692,ON_HOLD 68878,2014-07-08 00:00:00.0,6753,COMPLETE 68879,2014-07-09 00:00:00.0,778,COMPLETE 68880,2014-07-13 00:00:00.0,1117,COMPLETE 68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT 68882,2014-07-22 00:00:00.0,10000,ON_HOLD 68883,2014-07-23 00:00:00.0,5533,COMPLETE
In [3]:
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
In [4]:
type(orders_file)
Out[4]:
_io.TextIOWrapper
In [5]:
orders_raw = orders_file.read()
In [6]:
type(orders_raw)
Out[6]:
str
In [7]:
orders_raw.splitlines?
Signature: orders_raw.splitlines(keepends=False) Docstring: Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true. Type: builtin_function_or_method
In [8]:
orders = orders_raw.splitlines()
In [9]:
type(orders)
Out[9]:
list
In [10]:
orders[:10]
Out[10]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [11]:
type(orders[0])
Out[11]:
str
In [12]:
len(orders)
Out[12]:
68883
In [13]:
%%sh
wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000
Task 1¶
Get all order ids and associated statuses. Each record in the output should be comma separated string.
In [14]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED' # -> '1,CLOSED'
In [15]:
# We invokde join on delimiter
In [16]:
str.join?
Signature: str.join(self, iterable, /) Docstring: Concatenate any number of strings. The string whose method is called is inserted in between each given string. The result is returned as a new string. Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs' Type: method_descriptor
In [17]:
':'.join(['1', '2', '3', '4'])
Out[17]:
'1:2:3:4'
In [18]:
order.split(',')[0]
Out[18]:
'1'
In [19]:
order.split(',')[3]
Out[19]:
'CLOSED'
In [20]:
[order.split(',')[0], order.split(',')[3]]
Out[20]:
['1', 'CLOSED']
In [21]:
','.join([order.split(',')[0], order.split(',')[3]])
Out[21]:
'1,CLOSED'
In [22]:
l = [1]
In [23]:
l.append(2)
In [24]:
l
Out[24]:
[1, 2]
In [25]:
order_statuses = []
for order in orders:
order_statuses.append(','.join([order.split(',')[0], order.split(',')[3]]))
In [26]:
order_statuses[:10]
Out[26]:
['1,CLOSED', '2,PENDING_PAYMENT', '3,COMPLETE', '4,CLOSED', '5,COMPLETE', '6,COMPLETE', '7,COMPLETE', '8,PROCESSING', '9,PENDING_PAYMENT', '10,PENDING_PAYMENT']
In [27]:
len(order_statuses)
Out[27]:
68883
In [28]:
order_statuses = [','.join([order.split(',')[0], order.split(',')[3]]) for order in orders] # alternative solution
In [29]:
order_statuses[:10]
Out[29]:
['1,CLOSED', '2,PENDING_PAYMENT', '3,COMPLETE', '4,CLOSED', '5,COMPLETE', '6,COMPLETE', '7,COMPLETE', '8,PROCESSING', '9,PENDING_PAYMENT', '10,PENDING_PAYMENT']
In [30]:
len(order_statuses)
Out[30]:
68883
Task 2¶
Get all order ids, the dates on which order is placed and order status. Each record in the output should be dict with following column names as keys.
- order_id
- order_date
- order_status
In [31]:
{'order_id': 1, 'order_date': '2020-12-22', 'order_status': 'COMPLETE'}
Out[31]:
{'order_id': 1, 'order_date': '2020-12-22', 'order_status': 'COMPLETE'}
In [32]:
def get_order_details(order):
"""Extract order details such as id, date as well as status and return as dict"""
order_values = order.split(',')
return ({
'order_id': int(order_values[0]),
'order_date': order_values[1],
'order_status': order_values[3]
})
In [33]:
get_order_details('1,2013-07-25 00:00:00.0,11599,CLOSED')
Out[33]:
{'order_id': 1, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'CLOSED'}
In [34]:
order_details = []
for order in orders:
order_details.append(get_order_details(order))
In [35]:
order_details[:10]
Out[35]:
[{'order_id': 1, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'CLOSED'}, {'order_id': 2, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PENDING_PAYMENT'}, {'order_id': 3, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 4, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'CLOSED'}, {'order_id': 5, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 6, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 7, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 8, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PROCESSING'}, {'order_id': 9, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PENDING_PAYMENT'}, {'order_id': 10, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PENDING_PAYMENT'}]
In [36]:
len(order_details)
Out[36]:
68883
In [37]:
order_details = [get_order_details(order) for order in orders]
In [38]:
order_details[:10]
Out[38]:
[{'order_id': 1, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'CLOSED'}, {'order_id': 2, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PENDING_PAYMENT'}, {'order_id': 3, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 4, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'CLOSED'}, {'order_id': 5, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 6, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 7, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'COMPLETE'}, {'order_id': 8, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PROCESSING'}, {'order_id': 9, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PENDING_PAYMENT'}, {'order_id': 10, 'order_date': '2013-07-25 00:00:00.0', 'order_status': 'PENDING_PAYMENT'}]
In [39]:
len(order_details)
Out[39]:
68883