Sai

Read Delimited Data using CSV

Read Delimited data using CSV¶

Let us understand how to read data from delimited files using Python I/O functions as well as csv. csv can be used to read iterable of delimited strings into iterable of tuples or dicts.

We will go through the steps related to how to read the contents of the file into list of tuples. We will also see how we can apply transformations such as changing the data types of elements in each of the tuple.
Also, we will go through the steps related to how to read the contents of the file into list of dicts.

In [3]:

!ls -lhtr /data/retail_db/orders

total 2.9M
-rw-rw-r-- 1 itversity itversity 2.9M Mar  8 02:04 part-00000

In [4]:

!head -5 /data/retail_db/orders/part-00000

1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE

In [5]:

import csv

In [6]:

orders_file = open('/data/retail_db/orders/part-00000')

In [7]:

orders_list = orders_file.read().splitlines()

In [8]:

type(orders_list)

Out[8]:

list

In [9]:

orders_list[:10]

Out[9]:

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [10]:

csv.reader?

Docstring:
csv_reader = reader(iterable [, dialect='excel']
                        [optional keyword args])
    for row in csv_reader:
        process(row)

The "iterable" argument can be any object that returns a line
of input for each iteration, such as a file object or a list.  The
optional "dialect" parameter is discussed below.  The function
also accepts optional keyword arguments which override settings
provided by the dialect.

The returned object is an iterator.  Each iteration returns a row
of the CSV file (which can span multiple input lines).
Type:      builtin_function_or_method

In [11]:

orders = csv.reader(orders_list, delimiter=',') 
# Default for delimter is ','
# The above piece of code is same as csv.reader(orders_list) 

In [12]:

type(orders)

Out[12]:

_csv.reader

In [13]:

orders = csv.reader(orders_list)
list(orders)[:10]

Out[13]:

[['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED'],
 ['2', '2013-07-25 00:00:00.0', '256', 'PENDING_PAYMENT'],
 ['3', '2013-07-25 00:00:00.0', '12111', 'COMPLETE'],
 ['4', '2013-07-25 00:00:00.0', '8827', 'CLOSED'],
 ['5', '2013-07-25 00:00:00.0', '11318', 'COMPLETE'],
 ['6', '2013-07-25 00:00:00.0', '7130', 'COMPLETE'],
 ['7', '2013-07-25 00:00:00.0', '4530', 'COMPLETE'],
 ['8', '2013-07-25 00:00:00.0', '2911', 'PROCESSING'],
 ['9', '2013-07-25 00:00:00.0', '5657', 'PENDING_PAYMENT'],
 ['10', '2013-07-25 00:00:00.0', '5648', 'PENDING_PAYMENT']]

In [14]:

orders = csv.reader(orders_list)
order = list(orders)[0]

In [15]:

order

Out[15]:

['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']

In [16]:

order[0]

Out[16]:

'1'

In [17]:

int(order[0]) if order[0].isdigit() else order[0]

Out[17]:

In [18]:

list(map(lambda item: int(item) if item.isdigit() else item, order))

Out[18]:

[1, '2013-07-25 00:00:00.0', 11599, 'CLOSED']

In [19]:

tuple(map(lambda item: int(item) if item.isdigit() else item, order))

Out[19]:

(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')

In [20]:

orders = csv.reader(orders_list)

list(
    tuple(
        map(
            lambda order: tuple(map(lambda item: int(item) if item.isdigit() else item, order)), 
            orders
        )
    )
)[:10]

Out[20]:

[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'),
 (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'),
 (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'),
 (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'),
 (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'),
 (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'),
 (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'),
 (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'),
 (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'),
 (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]

Here is the code to convert the data in a CSV file to list of tuples using csv.

In [21]:

orders_file = open('/data/retail_db/orders/part-00000')
orders_list = orders_file.read().splitlines()

orders = csv.reader(orders_list)

list(
    tuple(
        map(
            lambda order: tuple(map(lambda item: int(item) if item.isdigit() else item, order)), 
            orders
        )
    )
)[:10]

Out[21]:

[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'),
 (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'),
 (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE'),
 (4, '2013-07-25 00:00:00.0', 8827, 'CLOSED'),
 (5, '2013-07-25 00:00:00.0', 11318, 'COMPLETE'),
 (6, '2013-07-25 00:00:00.0', 7130, 'COMPLETE'),
 (7, '2013-07-25 00:00:00.0', 4530, 'COMPLETE'),
 (8, '2013-07-25 00:00:00.0', 2911, 'PROCESSING'),
 (9, '2013-07-25 00:00:00.0', 5657, 'PENDING_PAYMENT'),
 (10, '2013-07-25 00:00:00.0', 5648, 'PENDING_PAYMENT')]

Let us see how we can read iterable of CSV strings into list of dict type objects using csv.DictReader.

In [22]:

csv.DictReader?

Init signature:
csv.DictReader(
    f,
    fieldnames=None,
    restkey=None,
    restval=None,
    dialect='excel',
    *args,
    **kwds,
)
Docstring:      <no docstring>
File:           /usr/local/lib/python3.8/csv.py
Type:           type
Subclasses:

In [23]:

users = [
    '1,Scott,Tiger',
    '2,Donald,Duck'
]

In [24]:

csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name'])

Out[24]:

<csv.DictReader at 0x7f04dc57d880>

In [25]:

list(csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name']))

Out[25]:

[{'user_id': '1', 'user_first_name': 'Scott', 'user_last_name': 'Tiger'},
 {'user_id': '2', 'user_first_name': 'Donald', 'user_last_name': 'Duck'}]

In [26]:

users_dicts = list(csv.DictReader(users, fieldnames=['user_id', 'user_first_name', 'user_last_name']))

In [27]:

users_dicts

Out[27]:

[{'user_id': '1', 'user_first_name': 'Scott', 'user_last_name': 'Tiger'},
 {'user_id': '2', 'user_first_name': 'Donald', 'user_last_name': 'Duck'}]

In [28]:

users_dicts[0]

Out[28]:

{'user_id': '1', 'user_first_name': 'Scott', 'user_last_name': 'Tiger'}

In [29]:

users_dicts[0]['user_first_name']

Out[29]:

'Scott'

In [30]:

orders_file = open('/data/retail_db/orders/part-00000')

In [31]:

orders_list = orders_file.read().splitlines()

In [32]:

orders_list[:10]

Out[32]:

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [33]:

list(csv.DictReader(orders_list, fieldnames=['order_id', 'order_date', 'order_customer_id', 'order_status']))[:10]

Out[33]:

[{'order_id': '1',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '11599',
  'order_status': 'CLOSED'},
 {'order_id': '2',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '256',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': '3',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '12111',
  'order_status': 'COMPLETE'},
 {'order_id': '4',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '8827',
  'order_status': 'CLOSED'},
 {'order_id': '5',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '11318',
  'order_status': 'COMPLETE'},
 {'order_id': '6',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '7130',
  'order_status': 'COMPLETE'},
 {'order_id': '7',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '4530',
  'order_status': 'COMPLETE'},
 {'order_id': '8',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '2911',
  'order_status': 'PROCESSING'},
 {'order_id': '9',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '5657',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': '10',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '5648',
  'order_status': 'PENDING_PAYMENT'}]

In [34]:

orders = list(csv.DictReader(orders_list, fieldnames=['order_id', 'order_date', 'order_customer_id', 'order_status']))

In [35]:

orders[:3]

Out[35]:

[{'order_id': '1',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '11599',
  'order_status': 'CLOSED'},
 {'order_id': '2',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '256',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': '3',
  'order_date': '2013-07-25 00:00:00.0',
  'order_customer_id': '12111',
  'order_status': 'COMPLETE'}]

In [36]:

order = orders[0]

In [37]:

type(order)

Out[37]:

dict

In [38]:

order

Out[38]:

{'order_id': '1',
 'order_date': '2013-07-25 00:00:00.0',
 'order_customer_id': '11599',
 'order_status': 'CLOSED'}

In [39]:

order['order_date']

Out[39]:

'2013-07-25 00:00:00.0'

Sai

Read Delimited Data using CSV

Read Delimited data using CSV¶

Share this post

Join Our Community

Follow Us

Links

Contact Info

Address

Phone

Email