Overview of Retail Data
We will be using data from a hypothetical retail application. Let us get an overview of the data sets.
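- We typically set up the data under `/data/retail_db`.
- There are 6 folders with the following names: `categories`, `customers`, `departments`, `order_items`, `orders` and `products`.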
!ls -ltr /data/retail_db
total 20128
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 categories
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 customers
-rw-rw-r-- 1 itversity itversity 1748 Mar 8 02:04 create_db_tables_pg.sql
-rw-rw-r-- 1 itversity itversity 10303297 Mar 8 02:04 create_db.sql
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 departments
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 order_items
-rw-rw-r-- 1 itversity itversity 10297372 Mar 8 02:04 load_db_tables_pg.sql
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 orders
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 products
!ls -ltr /data/retail_db/departments \
    /data/retail_db/categories \
    /data/retail_db/products \
    /data/retail_db/customers \
    /data/retail_db/orders \
    /data/retail_db/order_items
/data/retail_db/categories:
total 4
-rw-rw-r-- 1 itversity itversity 1029 Mar 8 02:04 part-00000

/data/retail_db/customers:
total 932
-rw-rw-r-- 1 itversity itversity 953719 Mar 8 02:04 part-00000

/data/retail_db/departments:
total 4
-rw-rw-r-- 1 itversity itversity 60 Mar 8 02:04 part-00000

/data/retail_db/order_items:
total 5284
-rw-rw-r-- 1 itversity itversity 5408880 Mar 8 02:04 part-00000

/data/retail_db/orders:
total 2932
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 part-00000

/data/retail_db/products:
total 172
-rw-rw-r-- 1 itversity itversity 174155 Mar 8 02:04 part-00000
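The same folder walk can also be done from Python. The following is a minimal sketch using `pathlib` from the standard library; the helper name `list_data_files` is made up for illustration, and the loop at the bottom only runs if `/data/retail_db` exists on your machine.

```python
from pathlib import Path


def list_data_files(base_dir):
    """Map each sub-folder of base_dir to a sorted list of (file name, size in bytes)."""
    base = Path(base_dir)
    listing = {}
    # Only sub-folders are considered; files such as create_db.sql are skipped.
    for folder in sorted(p for p in base.iterdir() if p.is_dir()):
        listing[folder.name] = sorted(
            (f.name, f.stat().st_size) for f in folder.iterdir() if f.is_file()
        )
    return listing


# Path used in this section; nothing is printed if it is not present.
base_dir = Path("/data/retail_db")
if base_dir.is_dir():
    for folder, files in list_data_files(base_dir).items():
        print(folder, files)
```

Each entry pairs a folder name with the files found in it, which is the same information `ls -ltr` shows per folder, minus permissions and timestamps.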
- Each folder contains a single file named `part-00000`.
- You can check the type of a file by using the `file` command. Typically, we see extensions such as `tsv` as part of the file names, but these files have none. Extensions are only informational.
!file /data/retail_db/orders/part-00000
/data/retail_db/orders/part-00000: CSV text
- As the file type is text, we can use commands such as `head` and `tail` to preview the data.
  - `head` is typically used to see the first few lines. It helps us validate whether the files have headers or not.
  - `tail` is typically used to see the last few lines.
  - `cat` is used to see the contents of the entire file.
- We typically use `head` or `tail` to preview the data in large files.
- Let us run the `tail` command on one of the files to see how the data is organized.
!tail -5 /data/retail_db/orders/part-00000
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
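If you prefer to stay in Python, the previews given by `head` and `tail` can be approximated with the standard library. This is a rough sketch, not a replacement for the shell commands; the function names `head` and `tail` are chosen here only to mirror them.

```python
from collections import deque
from itertools import islice
from pathlib import Path


def head(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def tail(path, n=5):
    """Return the last n lines; the deque keeps only n lines in memory at a time."""
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# Path used in this section; nothing is printed if it is not present.
orders_path = Path("/data/retail_db/orders/part-00000")
if orders_path.is_file():
    for line in tail(orders_path, 5):
        print(line)
```

Both helpers stream the file line by line, so they remain usable on large files, although `tail` still has to scan from the beginning.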
- You can run the `wc -l` command to see the number of lines in the file.
!wc -l /data/retail_db/orders/part-00000
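A line count can also be taken in Python by iterating over the file. The sketch below uses a hypothetical helper named `count_lines`. One caveat: `wc -l` counts newline characters, so if a file's last line has no trailing newline, this function reports one more line than `wc -l` does.

```python
from pathlib import Path


def count_lines(path):
    """Count lines by streaming the file instead of loading it into memory."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


# Path used in this section; nothing is printed if it is not present.
orders_path = Path("/data/retail_db/orders/part-00000")
if orders_path.is_file():
    print(count_lines(orders_path))
```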
Now let us understand the characteristics of the data.
- We have 68883 lines in the file.
- Each line has comma-separated values. They correspond to 4 different fields related to orders.
- These lines are also called records. As the attribute values in each record are delimited, or separated, by commas, they are known as comma-separated values.
- As the file `/data/retail_db/orders/part-00000` contains comma-separated values, it is known as a CSV file.
- All the files are text files that contain CSV records, so they are also known as CSV files.
- When we use Python libraries to perform I/O on these files, we can read them as text.
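To make the idea of records and fields concrete, here is a small sketch that parses the five order records shown by `tail` above with Python's built-in `csv` module. The sample is held in an in-memory string so the example runs without the data set; the helper name `read_records` is an assumption made for illustration.

```python
import csv
import io

# The five records previewed with tail earlier in this section.
sample = """68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
"""


def read_records(text_file):
    """Parse an open text file of comma-separated values into a list of records."""
    return list(csv.reader(text_file))


records = read_records(io.StringIO(sample))
print(len(records))     # 5
print(records[0])       # ['68879', '2014-07-09 00:00:00.0', '778', 'COMPLETE']
print(len(records[0]))  # 4
```

To read the actual file instead, open it in text mode, for example `with open("/data/retail_db/orders/part-00000", newline="") as f:`, and pass `f` to `read_records`. Every field comes back as a string, so numeric values need to be converted explicitly.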