Reading files into collections¶
Let us understand how to read data from files into collections.
- Python have simple and yet rich APIs to perform file I/O
- We can create a file object with open in different modes (by default read only mode)
- To read the contents from the file into memory, we have APIs on top of file object such as read()
- read() will create large string using contents of the files
- If the data have multiple records with new line character as delimiter, we can apply splitlines() on the output of read
- splitlines() will convert the string into list with new line character as delimiter
In [1]:
%%sh
ls -ltr /data/retail_db/orders/part-00000
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 /data/retail_db/orders/part-00000
In [2]:
%%sh
tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE 68875,2014-07-04 00:00:00.0,10637,ON_HOLD 68876,2014-07-06 00:00:00.0,4124,COMPLETE 68877,2014-07-07 00:00:00.0,9692,ON_HOLD 68878,2014-07-08 00:00:00.0,6753,COMPLETE 68879,2014-07-09 00:00:00.0,778,COMPLETE 68880,2014-07-13 00:00:00.0,1117,COMPLETE 68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT 68882,2014-07-22 00:00:00.0,10000,ON_HOLD 68883,2014-07-23 00:00:00.0,5533,COMPLETE
In [3]:
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research
orders_file = open(path)
In [4]:
type(orders_file)
Out[4]:
_io.TextIOWrapper
In [5]:
orders_raw = orders_file.read()
In [6]:
type(orders_raw)
Out[6]:
str
In [7]:
orders_raw.splitlines?
Signature: orders_raw.splitlines(keepends=False) Docstring: Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true. Type: builtin_function_or_method
In [8]:
orders = orders_raw.splitlines()
In [9]:
type(orders)
Out[9]:
list
In [10]:
orders[:10]
Out[10]:
['1,2013-07-25 00:00:00.0,11599,CLOSED', '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT', '3,2013-07-25 00:00:00.0,12111,COMPLETE', '4,2013-07-25 00:00:00.0,8827,CLOSED', '5,2013-07-25 00:00:00.0,11318,COMPLETE', '6,2013-07-25 00:00:00.0,7130,COMPLETE', '7,2013-07-25 00:00:00.0,4530,COMPLETE', '8,2013-07-25 00:00:00.0,2911,PROCESSING', '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT', '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
In [11]:
len(orders) # same as number of records in the file
Out[11]:
68883