Overview of Delimited Text Files
Let us get an overview of delimited text files. "Delimited" simply means "separated".
- Comma-separated (comma-delimited) files are the most common type of delimited text file.
- Lines in a text file are separated by a newline character `\n`, a carriage return `\r`, or the pair `\r\n`.
- You will not be able to see these characters with the naked eye when you open the file or use commands such as `cat`, `head`, etc.
- Each line contains plain text with a delimiter or separator between values. The values typically correspond to the attributes (columns) of the source from which the file was generated.
- We have set up delimited files under /data/retail_db. It has 6 folders, and each folder has a text file in which each line holds data for multiple attributes. The data in each line is separated or delimited by `,`.
- There are other types of delimited text files as well: pipe-separated, tab-separated, etc.
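Since line terminators are invisible in ordinary output, one way to make them visible is Python's `repr`. The following is a minimal sketch: the two records are written inline for illustration and follow the comma-separated layout described above, and the pipe and tab variants are constructed here rather than read from any file.

```python
# A chunk of text as it might sit on disk: two records, each
# terminated by a newline character. The values are illustrative.
raw = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)

# print() renders the newline as a line break, so it stays invisible.
print(raw)

# repr() shows the escape sequence \n explicitly.
print(repr(raw))

# splitlines() breaks on \n, \r\n, and \r alike.
lines = raw.splitlines()

# Splitting one line on the delimiter yields the individual values.
fields = lines[0].split(",")
print(fields)

# The same record expressed with two other common delimiters.
pipe_line = "|".join(fields)
tab_line = "\t".join(fields)
print(pipe_line)
print(repr(tab_line))
```

Plain `str.split` is enough here because none of the values contain the delimiter itself; files with quoted values that embed commas are usually better handled with the standard `csv` module.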
In [1]:
!ls -ltr /data/retail_db
total 20128
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 categories
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 customers
-rw-rw-r-- 1 itversity itversity 1748 Mar 8 02:04 create_db_tables_pg.sql
-rw-rw-r-- 1 itversity itversity 10303297 Mar 8 02:04 create_db.sql
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 departments
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 order_items
-rw-rw-r-- 1 itversity itversity 10297372 Mar 8 02:04 load_db_tables_pg.sql
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 orders
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 products
Here are the folders which contain delimited text files.
- departments
- categories
- products
- customers
- orders
- order_items
In [2]:
!ls -ltr /data/retail_db/orders
total 2932
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 part-00000
In [3]:
!head -5 /data/retail_db/orders/part-00000
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
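The same preview can be done in Python instead of the shell. The sketch below is one possible approach, not part of the original notebook: the helper name `head` is hypothetical, and UTF-8 encoding is an assumption about the file.

```python
from itertools import islice


def head(path, n=5):
    """Return the first n lines of a text file without line terminators."""
    # newline="" keeps \n, \r\n, and \r untranslated, so they are
    # stripped explicitly below. UTF-8 is an assumed encoding.
    with open(path, encoding="utf-8", newline="") as f:
        # islice stops after n lines, so the rest of the file is not read.
        return [line.rstrip("\r\n") for line in islice(f, n)]


# Example call, assuming the file from the listing above exists:
# for line in head("/data/retail_db/orders/part-00000"):
#     print(line)
```

Reading lazily matters for files like this one, which is close to 3 MB; only the requested lines are pulled from disk.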
Spend some time understanding the characteristics of the data in the files.
- There is no header in this file. Some files do contain a header, which gives information about the columns or attributes from the source.
- Each line contains values for 4 attributes, and these values are separated or delimited by `,`.
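To make these characteristics concrete, here is a minimal sketch that parses the five sample records with Python's standard `csv` module. Because the file has no header, the column names used here (`order_id`, `order_date`, `order_customer_id`, `order_status`) are assumptions chosen for readability; they do not come from the file itself.

```python
import csv
import io

# The first five records of the orders file, as shown in the head output.
sample = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
    "3,2013-07-25 00:00:00.0,12111,COMPLETE\n"
    "4,2013-07-25 00:00:00.0,8827,CLOSED\n"
    "5,2013-07-25 00:00:00.0,11318,COMPLETE\n"
)

# Assumed column names: the file carries no header, so these labels
# are hypothetical and supplied by hand.
columns = ["order_id", "order_date", "order_customer_id", "order_status"]

# csv.reader splits each line on the delimiter (a comma here).
reader = csv.reader(io.StringIO(sample), delimiter=",")
orders = [dict(zip(columns, row)) for row in reader]

for order in orders:
    print(order)

# Every record should carry exactly 4 values.
field_counts = {len(order) for order in orders}
print(field_counts)
```

All values come back as strings; converting the identifiers to integers or the timestamp to a date is a separate step once the layout is understood.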