Read Delimited data using CSV¶
Let us understand how to read data from delimited files using Python I/O functions as well as csv
. csv
can be used to read iterable of delimited strings into iterable of tuples or dicts.
- We will go through the steps related to how to read the contents of the file into list of tuples. We will also see how we can apply transformations such as changing the data types of elements in each of the tuple.
- Also, we will go through the steps related to how to read the contents of the file into list of dicts.
In [3]:
!ls -lhtr /data/retail_db/orders
total 2.9M -rw-rw-r-- 1 itversity itversity 2.9M Mar 8 02:04 part-00000
In [4]:
!head -5 /data/retail_db/orders/part-00000
1,2013-07-25 00:00:00.0,11599,CLOSED 2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT 3,2013-07-25 00:00:00.0,12111,COMPLETE 4,2013-07-25 00:00:00.0,8827,CLOSED 5,2013-07-25 00:00:00.0,11318,COMPLETE
In [5]:
import csv
In [6]:
orders_file = open('/data/retail_db/orders/part-00000')
In [7]:
orders_list = orders_file.read().splitlines()
In [8]:
type(orders_list)
Out[8]:
list
In [9]:
orders_list[:10]
Out[9]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [10]:
csv.reader?
Docstring: csv_reader = reader(iterable [, dialect='excel'] [optional keyword args]) for row in csv_reader: process(row) The "iterable" argument can be any object that returns a line of input for each iteration, such as a file object or a list. The optional "dialect" parameter is discussed below. The function also accepts optional keyword arguments which override settings provided by the dialect. The returned object is an iterator. Each iteration returns a row of the CSV file (which can span multiple input lines). Type: builtin_function_or_method
In [11]:
orders = csv.reader(orders_list, delimiter=',')
# Default for delimter is ','
# The above piece of code is same as csv.reader(orders_list)
In [12]:
type(orders)
Out[12]:
_csv.reader
In [13]:
orders = csv.reader(orders_list)
list(orders)[:10]
Out[13]:
[['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED'], ['2', '2013-07-25 00:00:00.0', '256', 'PENDING_PAYMENT'], ['3', '2013-07-25 00:00:00.0', '12111', 'COMPLETE'], ['4', '2013-07-25 00:00:00.0', '8827', 'CLOSED'], ['5', '2013-07-25 00:00:00.0', '11318', 'COMPLETE'], ['6', '2013-07-25 00:00:00.0', '7130', 'COMPLETE'], ['7', '2013-07-25 00:00:00.0', '4530', 'COMPLETE'], ['8', '2013-07-25 00:00:00.0', '2911', 'PROCESSING'], ['9', '2013-07-25 00:00:00.0', '5657', 'PENDING_PAYMENT'], ['10', '2013-07-25 00:00:00.0', '5648', 'PENDING_PAYMENT']]
In [14]:
orders = csv.reader(orders_list)
order = list(orders)[0]
In [15]:
order
Out[15]:
['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']
In [16]:
order[0]
Out[16]:
'1'
In [17]:
int(order[0]) if order[0].isdigit() else order[0]
Out[17]:
1
In [18]:
list(map(lambda item: int(item) if item.isdigit() else item, order))
Out[18]:
[1, '2013-07-25 00:00:00.0', 11599, 'CLOSED']
In [19]:
tuple(map(lambda item: int(item) if item.isdigit() else item, order))
Out[19]:
(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
In [20]:
orders = csv.reader(orders_list)
list(
tuple(
map(
lambda order: tuple(map(lambda item: int(item) if item.isdigit() else item, order)),
orders
)
)
)[:10]
Out[20]:
[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'), (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'), (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'), (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'), (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'), (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'), (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'), (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'), (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'), (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]
- Here is the code to convert the data in a CSV file to list of tuples using
csv
.
In [21]:
orders_file = open('/data/retail_db/orders/part-00000')
orders_list = orders_file.read().splitlines()
orders = csv.reader(orders_list)
list(
tuple(
map(
lambda order: tuple(map(lambda item: int(item) if item.isdigit() else item, order)),
orders
)
)
)[:10]
Out[21]:
[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'), (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'), (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'), (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'), (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'), (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'), (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'), (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'), (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'), (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]
- Let us see how we can read iterable of CSV strings into list of dict type objects using
csv.DictReader
.
In [22]:
csv.DictReader?
Init signature: csv.DictReader( f, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds, ) Docstring: <no docstring> File: /usr/local/lib/python3.8/csv.py Type: type Subclasses:
In [23]:
users = [
'1,Scott,Tiger',
'2,Donald,Duck'
]
In [24]:
csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name'])
Out[24]:
<csv.DictReader at 0x7f04dc57d880>
In [25]:
list(csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name']))
Out[25]:
[{'user_id': '1', 'user_first_name': 'Scott', 'user_last_name': 'Tiger'}, {'user_id': '2', 'user_first_name': 'Donald', 'user_last_name': 'Duck'}]
In [26]:
users_dicts = list(csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name']))
In [27]:
users_dicts
Out[27]:
[{'user_id': '1', 'user_first_name': 'Scott', 'user_last_name': 'Tiger'}, {'user_id': '2', 'user_first_name': 'Donald', 'user_last_name': 'Duck'}]
In [28]:
users_dicts[0]
Out[28]:
{'user_id': '1', 'user_first_name': 'Scott', 'user_last_name': 'Tiger'}
In [29]:
users_dicts[0]['user_first_name']
Out[29]:
'Scott'
In [30]:
orders_file = open('/data/retail_db/orders/part-00000')
In [31]:
orders_list = orders_file.read().splitlines()
In [32]:
orders_list[:10]
Out[32]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [33]:
list(csv.DictReader(orders_list, fieldnames=['order_id', 'order_date', 'order_customer_id', 'order_status']))[:10]
Out[33]:
[{'order_id': '1', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '11599', 'order_status': 'CLOSED'}, {'order_id': '2', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '256', 'order_status': 'PENDING_PAYMENT'}, {'order_id': '3', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '12111', 'order_status': 'COMPLETE'}, {'order_id': '4', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '8827', 'order_status': 'CLOSED'}, {'order_id': '5', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '11318', 'order_status': 'COMPLETE'}, {'order_id': '6', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '7130', 'order_status': 'COMPLETE'}, {'order_id': '7', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '4530', 'order_status': 'COMPLETE'}, {'order_id': '8', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '2911', 'order_status': 'PROCESSING'}, {'order_id': '9', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '5657', 'order_status': 'PENDING_PAYMENT'}, {'order_id': '10', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '5648', 'order_status': 'PENDING_PAYMENT'}]
In [34]:
orders = list(csv.DictReader(orders_list, fieldnames=['order_id', 'order_date', 'order_customer_id', 'order_status']))
In [35]:
orders[:3]
Out[35]:
[{'order_id': '1', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '11599', 'order_status': 'CLOSED'}, {'order_id': '2', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '256', 'order_status': 'PENDING_PAYMENT'}, {'order_id': '3', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '12111', 'order_status': 'COMPLETE'}]
In [36]:
order = orders[0]
In [37]:
type(order)
Out[37]:
dict
In [38]:
order
Out[38]:
{'order_id': '1', 'order_date': '2013-07-25 00:00:00.0', 'order_customer_id': '11599', 'order_status': 'CLOSED'}
In [39]:
order['order_date']
Out[39]:
'2013-07-25 00:00:00.0'