List of tuples
Let us see an example of how to read data from a file into a list of tuples using Python.
- When we read data from a file into a list, each element in the list is typically of type binary or string.
- We can convert each element into a `tuple` to simplify processing.
- Once each element is converted to a `tuple`, we can access the elements in the `tuple` using positional notation.
- Let us see an example that reads the data from a file into a list of tuples and accesses the dates.
In [1]:
%%sh
ls -ltr /data/retail_db/orders/part-00000
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 /data/retail_db/orders/part-00000
In [2]:
%%sh
tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE
68875,2014-07-04 00:00:00.0,10637,ON_HOLD
68876,2014-07-06 00:00:00.0,4124,COMPLETE
68877,2014-07-07 00:00:00.0,9692,ON_HOLD
68878,2014-07-08 00:00:00.0,6753,COMPLETE
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
In [3]:
# Reading data from file into a list
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
In [4]:
type(orders_file)
Out[4]:
_io.TextIOWrapper
In [5]:
orders_raw = orders_file.read()
orders_file.close() # release the file handle once the contents are in memory
In [6]:
type(orders_raw)
Out[6]:
str
In [7]:
str.splitlines?
Signature: str.splitlines(self, /, keepends=False)
Docstring:
Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and
true.
Type: method_descriptor
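The behaviour described in the docstring can be checked on a small in-memory string. The sketch below uses a hypothetical two-record sample (`sample_raw`) in place of the real file contents and compares `splitlines()` with `splitlines(keepends=True)` and `split('\n')`.

```python
# Hypothetical two-record sample standing in for the contents of the orders file
sample_raw = (
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
)

# Default: line breaks are dropped and there is no trailing empty element
lines = sample_raw.splitlines()

# keepends=True: every element keeps its line break
lines_keepends = sample_raw.splitlines(keepends=True)

# split('\n') leaves a trailing empty string when the text ends with '\n'
split_on_newline = sample_raw.split('\n')

print(len(lines), len(lines_keepends), len(split_on_newline))
```

This is one reason `splitlines` is convenient here: the number of elements matches the number of records even when the file ends with a newline.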
In [8]:
orders_raw[:10]
Out[8]:
'1,2013-07-'
In [9]:
orders = orders_raw.splitlines()
In [10]:
type(orders)
Out[10]:
list
In [11]:
orders[:10]
Out[11]:
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [12]:
len(orders) # same as number of records in the file
Out[12]:
68883
In [13]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
In [14]:
order
Out[14]:
'1,2013-07-25 00:00:00.0,11599,CLOSED'
In [15]:
order.split(',')
Out[15]:
['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']
In [16]:
tuple(order.split(','))
Out[16]:
('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')
In [17]:
(*order.split(','),) # the * operator unpacks the list inside a tuple literal; same result as tuple(...)
Out[17]:
('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')
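As a quick check that the two conversions above are interchangeable, the following sketch builds the tuple both ways and compares the results. The names `fields`, `via_constructor`, `via_unpacking` and `as_list` are illustrative only.

```python
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
fields = order.split(',')

# Two ways of turning the list of fields into a tuple
via_constructor = tuple(fields)
via_unpacking = (*fields,)

# The trailing comma is what makes the literal a tuple;
# the same unpacking inside square brackets builds a list instead
as_list = [*fields]

print(via_constructor == via_unpacking)
print(type(via_unpacking).__name__, type(as_list).__name__)
```

Both forms produce equal tuples. `tuple(...)` is usually easier to read, while the unpacking form is handy when extra elements need to be added in the same literal.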
In [18]:
order_tuples = [(*order.split(','),) for order in orders]
In [19]:
order_tuples = [tuple(order.split(',')) for order in orders]
In [20]:
type(order_tuples)
Out[20]:
list
In [21]:
order_tuples[0]
Out[21]:
('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')
In [22]:
order_tuples[:3]
Out[22]:
[('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED'),
 ('2', '2013-07-25 00:00:00.0', '256', 'PENDING_PAYMENT'),
 ('3', '2013-07-25 00:00:00.0', '12111', 'COMPLETE')]
In [23]:
len(order_tuples)
Out[23]:
68883
In [24]:
order_dates = [order[1] for order in order_tuples]
In [25]:
order_dates[:3]
Out[25]:
['2013-07-25 00:00:00.0', '2013-07-25 00:00:00.0', '2013-07-25 00:00:00.0']
In [26]:
len(order_dates)
Out[26]:
68883
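Positional access with `order[1]` works, but the index carries no meaning on its own. As an alternative sketch, the tuple can be unpacked directly in the comprehension. The three records below are copied from the output above, and the variable name `order_date` is an illustrative label rather than something defined by the data set.

```python
# First three records from the output above, as tuples of strings
order_tuples = [
    ('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED'),
    ('2', '2013-07-25 00:00:00.0', '256', 'PENDING_PAYMENT'),
    ('3', '2013-07-25 00:00:00.0', '12111', 'COMPLETE'),
]

# Positional (index-based) access, as in the cell above
dates_by_index = [order[1] for order in order_tuples]

# Tuple unpacking: name the field we need and ignore the rest with _
dates_by_unpacking = [order_date for _, order_date, _, _ in order_tuples]

# Negative indexes count from the end of the tuple
last_field_of_first_order = order_tuples[0][-1]

print(dates_by_index == dates_by_unpacking, last_field_of_first_order)
```

Unpacking requires every tuple to have exactly four elements, so it also acts as a light sanity check on the records.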
In [27]:
# We can also change the data types of elements in the tuples
def get_order_details(order):
    # Split the record and convert the first and third fields to int
    order_details = order.split(',')
    return (int(order_details[0]), order_details[1], int(order_details[2]), order_details[3])
In [28]:
order_tuples = [get_order_details(order) for order in orders]
In [29]:
order_tuples[:3]
Out[29]:
[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'),
 (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'),
 (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE')]
In [30]:
order_customer_ids = [order[2] for order in order_tuples]
In [31]:
order_customer_ids[:3]
Out[31]:
[11599, 256, 12111]
In [32]:
type(order_customer_ids[0])
Out[32]:
int
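The same idea can be extended to the date field. The sketch below is a hypothetical variant of `get_order_details` that also parses the second field into a `datetime`. It assumes every record uses the timestamp layout seen in the sample output (`YYYY-MM-DD HH:MM:SS.f`); the function name and the `ORDER_DATE_FORMAT` constant are not part of the original notebook.

```python
from datetime import datetime

# Assumed layout of the date field, based on the sample records shown above
ORDER_DATE_FORMAT = '%Y-%m-%d %H:%M:%S.%f'


def get_order_details_with_date(order):
    # Split the record, then convert the ids to int and the date to datetime
    order_details = order.split(',')
    return (
        int(order_details[0]),
        datetime.strptime(order_details[1], ORDER_DATE_FORMAT),
        int(order_details[2]),
        order_details[3],
    )


sample = get_order_details_with_date('1,2013-07-25 00:00:00.0,11599,CLOSED')
print(type(sample[1]).__name__, sample[1].year, sample[1].month, sample[1].day)
```

With a `datetime` in the tuple, dates can be compared or grouped by month without string slicing. A record whose date does not match the assumed format would raise a `ValueError`.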
In [33]:
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
# Use a context manager so that the file is closed once it has been read
with open(path) as orders_file:
    orders_raw = orders_file.read()
orders = orders_raw.splitlines()
order_tuples = [(*order.split(','),) for order in orders]
order_dates = [order[1] for order in order_tuples]
In [34]:
unique_dates = set(order_dates)
In [35]:
len(unique_dates)
Out[35]:
364
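For readers without access to `/data/retail_db`, the whole flow can be tried on a small temporary file. This is a minimal sketch under stated assumptions: the four records are copied from the outputs above, and the temporary file stands in for the real orders file. It iterates over the file line by line rather than calling `read()` and `splitlines()`, which is a swapped-in alternative and not the approach used in the cells above.

```python
import os
import tempfile

# Four records copied from the outputs above; the real file has 68883
sample_records = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '68882,2014-07-22 00:00:00.0,10000,ON_HOLD',
    '68883,2014-07-23 00:00:00.0,5533,COMPLETE',
]

# Write the sample to a temporary file that stands in for part-00000
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(sample_records) + '\n')
    sample_path = tmp.name

try:
    # Iterate over the file line by line and build one tuple per record
    with open(sample_path) as orders_file:
        order_tuples = [tuple(line.rstrip('\n').split(',')) for line in orders_file]
finally:
    os.remove(sample_path)

# Collect the distinct dates; a set drops the duplicates
unique_dates = {order[1] for order in order_tuples}
earliest_date = min(unique_dates)

print(len(order_tuples), len(unique_dates), earliest_date)
```

Because the dates are stored in year-month-day order, comparing them as plain strings with `min` or `sorted` gives the same ordering as comparing them chronologically.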