Read Delimited Files into List of Tuples
Let us go through reading delimited files into a list of tuples. Here are the steps involved.
- Open the file in read mode (the default).
- Read the data in the file as a string using `read`, and then use `splitlines` to create a collection.
- At this point, we will have a list where each element is a line from the file.
- The data in each element is typically delimited. We have to read the data at the attribute level.
- We typically process the list further to create a list of tuples or a list of dicts. Each string is converted to a tuple or a dict.
- We can use conventional loops, list comprehensions, or functions like `map` to convert each element in the list to a tuple or dict. For now, we will focus on tuples.
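The steps above can be sketched end to end on a small, self-contained sample. This is only an illustration: the temporary file, the two sample lines, and the names `sample_path`, `lines`, and `records` are assumptions made for the sketch, not part of the retail_db data set used below.

```python
import os
import tempfile

# Hypothetical sample data with the same four comma-separated fields as the orders file
sample = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)

# Write the sample to a temporary file so that the sketch does not depend on /data
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    # Steps 1 and 2: open in read mode (the default), read as a string, split into lines
    with open(sample_path) as sample_file:
        lines = sample_file.read().splitlines()
finally:
    os.remove(sample_path)

# Remaining steps: split each line on the delimiter and build a tuple per line
records = []
for line in lines:
    fields = line.split(",")
    records.append((int(fields[0]), fields[1], int(fields[2]), fields[3]))

print(records[0])  # (1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
```

The `with` block closes the file as soon as the data is read, which is one way to avoid leaving the file handle open.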
In [1]:
!ls -ltr /data/retail_db/orders
total 2932
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 part-00000
In [2]:
!head -5 /data/retail_db/orders/part-00000
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
In [3]:
!wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000
- Open the file in read mode.
In [4]:
orders_file = open('/data/retail_db/orders/part-00000')
- Read the data from the file into a list of strings.
In [5]:
orders_list = orders_file.read().splitlines()
In [6]:
type(orders_list)
Out[6]:
list
In [7]:
orders_list[:10]
Out[7]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [8]:
len(orders_list)
Out[8]:
68883
In [9]:
orders_list[0]
Out[9]:
'1,2013-07-25 00:00:00.0,11599,CLOSED'
In [10]:
type(orders_list[0])
Out[10]:
str
- Convert each string in orders_list into a tuple using a conventional `for` loop, which gives us a list of tuples.
In [11]:
order = orders_list[0]
In [12]:
order.split(',')
Out[12]:
['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']
In [13]:
tuple(order.split(','))
Out[13]:
('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')
In [14]:
tuple((int(order.split(',')[0]), order.split(',')[1], int(order.split(',')[2]), order.split(',')[3]))
Out[14]:
(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
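The expression above calls `split` four times on the same string. As a minimal sketch, assuming we are free to introduce a helper (the name `parse_order` is hypothetical and does not appear in the original notebook), the line can be split once and the pieces reused:

```python
def parse_order(order):
    """Convert one comma-delimited order line into a tuple with typed fields."""
    fields = order.split(',')  # split once and reuse the resulting list
    # The first and third fields hold integers; the other two stay as strings
    return (int(fields[0]), fields[1], int(fields[2]), fields[3])


parsed = parse_order('1,2013-07-25 00:00:00.0,11599,CLOSED')
print(parsed)  # (1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
```

A named function like this can be called from a loop, a list comprehension, or `map`, so the conversion logic lives in one place.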
In [15]:
orders_conventional_for = []
for order in orders_list:
    order_details = order.split(',')
    orders_conventional_for.append((int(order_details[0]), order_details[1], int(order_details[2]), order_details[3]))
In [16]:
orders_conventional_for[:10]
Out[16]:
[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'), (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'), (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'), (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'), (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'), (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'), (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'), (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'), (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'), (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]
In [17]:
len(orders_conventional_for)
Out[17]:
68883
In [18]:
orders_conventional_for[0]
Out[18]:
(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
In [19]:
type(orders_conventional_for[0])
Out[19]:
tuple
- Convert each string in orders_list into a tuple using a list comprehension, which gives us a list of tuples.
In [20]:
orders_list_comprehension = [
    (int(order.split(',')[0]), order.split(',')[1], int(order.split(',')[2]), order.split(',')[3])
    for order in orders_list
]
In [21]:
orders_list_comprehension[:10]
Out[21]:
[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'), (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'), (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'), (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'), (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'), (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'), (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'), (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'), (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'), (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]
In [22]:
len(orders_list_comprehension)
Out[22]:
68883
In [23]:
orders_list_comprehension[0]
Out[23]:
(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
In [24]:
type(orders_list_comprehension[0])
Out[24]:
tuple
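The list comprehension can also be written so that each line is split only once. The sketch below feeds an inner generator expression into the comprehension and unpacks the four fields by name. The field names `order_id`, `order_date`, `customer_id`, and `status` are assumptions based on how the data looks, and `sample_orders` is a hypothetical stand-in for orders_list.

```python
# Hypothetical stand-in for orders_list with two lines in the same layout
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]

orders = [
    (int(order_id), order_date, int(customer_id), status)
    # The inner generator splits each line once; the four parts are unpacked by name
    for order_id, order_date, customer_id, status
    in (order.split(',') for order in sample_orders)
]

print(orders[1])  # (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT')
```

Unpacking fails with a `ValueError` if a line does not have exactly four fields, so this form assumes every line is well formed.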
- Convert each string in orders_list into a tuple using the `map` function, and then convert the result into a list of tuples.
In [25]:
orders_list_map = map(
    lambda order: (
        int(order.split(',')[0]), order.split(',')[1], int(order.split(',')[2]), order.split(',')[3]
    ),
    orders_list
)
In [26]:
type(orders_list_map)
Out[26]:
map
In [27]:
list(orders_list_map)[:10]
Out[27]:
[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'), (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'), (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'), (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'), (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'), (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'), (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'), (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'), (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'), (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]
In [28]:
# A map object can be consumed only once, so we recreate it before using it again
orders_list_map = map(
    lambda order: (
        int(order.split(',')[0]), order.split(',')[1], int(order.split(',')[2]), order.split(',')[3]
    ),
    orders_list
)
len(list(orders_list_map))
Out[28]:
68883
In [29]:
orders_list_map = map(
    lambda order: (
        int(order.split(',')[0]), order.split(',')[1], int(order.split(',')[2]), order.split(',')[3]
    ),
    orders_list
)
list(orders_list_map)[0]
Out[29]:
(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
In [30]:
orders_list_map = map(
    lambda order: (
        int(order.split(',')[0]), order.split(',')[1], int(order.split(',')[2]), order.split(',')[3]
    ),
    orders_list
)
type(list(orders_list_map)[0])
Out[30]:
tuple
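The `map` object is rebuilt in each of the cells above because `map` returns a one-shot iterator: once `list` has consumed it, a second pass yields nothing. The sketch below illustrates this on a small sample and shows one way around it, which is to materialize the result into a list once and reuse that list. The names `sample_orders`, `to_tuple`, and `orders_tuples` are hypothetical.

```python
# Hypothetical stand-in for orders_list with two lines in the same layout
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]


def to_tuple(order):
    """Split one order line once and convert the numeric fields to int."""
    fields = order.split(',')
    return (int(fields[0]), fields[1], int(fields[2]), fields[3])


orders_map = map(to_tuple, sample_orders)
first_pass = list(orders_map)   # consumes the iterator
second_pass = list(orders_map)  # the iterator is already exhausted

print(len(first_pass))   # 2
print(len(second_pass))  # 0

# Materialize once, then reuse the list as often as needed
orders_tuples = list(map(to_tuple, sample_orders))
print(orders_tuples[0])        # (1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
print(type(orders_tuples[0]))  # <class 'tuple'>
```

Keeping the list around trades memory for convenience. For a file with 68883 lines this is usually fine, but for much larger inputs it may be preferable to iterate over the `map` object directly.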