Overview of Retail Data
We will be using data from a hypothetical retail application. Let us get an overview of the data sets.
- We typically set up the data under `/data/retail_db`.
- There are 6 folders with the following names:
- departments
- categories
- products
- customers
- orders
- order_items
In [1]:
!ls -ltr /data/retail_db
total 20128
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 categories
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 customers
-rw-rw-r-- 1 itversity itversity 1748 Mar 8 02:04 create_db_tables_pg.sql
-rw-rw-r-- 1 itversity itversity 10303297 Mar 8 02:04 create_db.sql
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 departments
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 order_items
-rw-rw-r-- 1 itversity itversity 10297372 Mar 8 02:04 load_db_tables_pg.sql
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 orders
drwxrwxr-x 2 itversity itversity 24 Mar 8 02:04 products
In [2]:
!ls -ltr /data/retail_db/departments \
/data/retail_db/categories \
/data/retail_db/products \
/data/retail_db/customers \
/data/retail_db/orders \
/data/retail_db/order_items
/data/retail_db/categories:
total 4
-rw-rw-r-- 1 itversity itversity 1029 Mar 8 02:04 part-00000

/data/retail_db/customers:
total 932
-rw-rw-r-- 1 itversity itversity 953719 Mar 8 02:04 part-00000

/data/retail_db/departments:
total 4
-rw-rw-r-- 1 itversity itversity 60 Mar 8 02:04 part-00000

/data/retail_db/order_items:
total 5284
-rw-rw-r-- 1 itversity itversity 5408880 Mar 8 02:04 part-00000

/data/retail_db/orders:
total 2932
-rw-rw-r-- 1 itversity itversity 2999944 Mar 8 02:04 part-00000

/data/retail_db/products:
total 172
-rw-rw-r-- 1 itversity itversity 174155 Mar 8 02:04 part-00000
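The same listing can also be produced from Python. The following is a minimal sketch using only the standard library; the helper name `list_data_files` is hypothetical, and the base path is the one used in this section.

```python
from pathlib import Path


def list_data_files(base_dir):
    """Return (relative path, size in bytes) for each file in every sub-folder of base_dir."""
    base = Path(base_dir)
    listing = []
    # Visit each sub-folder (departments, orders, ...) in alphabetical order.
    for folder in sorted(p for p in base.iterdir() if p.is_dir()):
        for data_file in sorted(f for f in folder.iterdir() if f.is_file()):
            relative_name = str(data_file.relative_to(base))
            listing.append((relative_name, data_file.stat().st_size))
    return listing


base_dir = "/data/retail_db"  # location used in this section; adjust for your setup
if Path(base_dir).is_dir():
    for name, size in list_data_files(base_dir):
        print(f"{size:>10} {name}")
```

Files that sit directly under the base folder, such as the `.sql` scripts, are skipped on purpose, because only the sub-folders hold the data files.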
- Each folder contains a file named `part-00000`.
- You can check the type of a file by using the `file` command. Typically, we see extensions such as `txt`, `csv`, or `tsv` as part of the file names, but these files have none. Extensions are only informational.
In [3]:
!file /data/retail_db/orders/part-00000
/data/retail_db/orders/part-00000: CSV text
- As the file type is text, we can use commands such as `cat`, `head`, and `tail` to preview the data.
- `head` is typically used to see the first few lines. It helps us check whether the files have headers.
- `tail` is typically used to see the last few lines.
- `cat` is used to see the contents of the entire file.
- We typically use `head` or `tail` to preview the data in large files.
- Run the `head` or `tail` command on one of the files to see how the data is organized.
In [4]:
!tail -5 /data/retail_db/orders/part-00000
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
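The previews given by `head` and `tail` can be approximated in plain Python. This is only a sketch: the function names `head` and `tail` are hypothetical helpers that mirror the shell commands, and UTF-8 encoding is an assumption about the files.

```python
import os
from collections import deque
from itertools import islice


def head(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(f, n)]


def tail(path, n=5):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        # A bounded deque keeps only the most recent n lines while streaming.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


orders_path = "/data/retail_db/orders/part-00000"  # path used in this section
if os.path.exists(orders_path):
    for line in tail(orders_path, 5):
        print(line)
```

Both helpers read the file line by line, so they work on large files without loading the whole content into memory. Unlike the `tail` command, this `tail` still scans the file from the beginning.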
- You can run the `wc -l` command to see the number of lines in the file.
In [5]:
!wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000
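A rough Python equivalent of `wc -l` counts the newline characters in the file. The helper name `count_lines` and the chunk size are illustrative choices, not part of the original material.

```python
import os


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading it in fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            count += chunk.count(b"\n")
    return count


orders_path = "/data/retail_db/orders/part-00000"  # path used in this section
if os.path.exists(orders_path):
    print(count_lines(orders_path), orders_path)
```

Like `wc -l`, this counts newline characters, so a final line without a trailing newline is not included in the total.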
Now let us understand the characteristics of the data.
- We have 68883 lines in the file.
- Each line has comma-separated values, which correspond to 4 different fields of an order.
- These lines are also called records. As the attribute values in each record are delimited (separated) by commas, they are known as comma-separated values.
- As the file `/data/retail_db/orders/part-00000` contains comma-separated values, it is known as a CSV file.
- All the files are text files which contain CSV records. They are also known as CSV files.
- When we use Python libraries to perform I/O on these files, we can read them as plain text.
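As an illustration of reading these records as text, here is a minimal sketch using Python's built-in `csv` module. The field names in `ORDER_FIELDS` and the helper `read_orders` are assumptions inferred from the sample records shown earlier; the data set itself does not name the fields here, and the sketch assumes the file has no header line.

```python
import csv
import os

# Hypothetical field names, inferred from the sample records; not taken from the data set.
ORDER_FIELDS = ["order_id", "order_date", "order_customer_id", "order_status"]


def read_orders(path):
    """Read comma-separated order records into a list of dictionaries."""
    # newline="" is the setting the csv module recommends for files passed to a reader.
    with open(path, "r", newline="", encoding="utf-8") as f:
        # Passing fieldnames means the first line is treated as data, not as a header.
        reader = csv.DictReader(f, fieldnames=ORDER_FIELDS)
        return list(reader)


orders_path = "/data/retail_db/orders/part-00000"  # path used in this section
if os.path.exists(orders_path):
    orders = read_orders(orders_path)
    print(len(orders))
    print(orders[-1])
```

All values are returned as strings. Converting fields such as the order ID to integers is left as a separate step.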