Sai

Overview of Retail Data

We will be using data from a hypothetical retail application. Let us get an overview of the data sets.

We typically set up the data under /data/retail_db.
It contains 6 folders with the following names:
- departments
- categories
- products
- customers
- orders
- order_items

In [1]:

!ls -ltr /data/retail_db

total 20128
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 categories
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 customers
-rw-rw-r-- 1 itversity itversity     1748 Mar  8 02:04 create_db_tables_pg.sql
-rw-rw-r-- 1 itversity itversity 10303297 Mar  8 02:04 create_db.sql
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 departments
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 order_items
-rw-rw-r-- 1 itversity itversity 10297372 Mar  8 02:04 load_db_tables_pg.sql
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 orders
drwxrwxr-x 2 itversity itversity       24 Mar  8 02:04 products

In [2]:

!ls -ltr /data/retail_db/departments \
    /data/retail_db/categories \
    /data/retail_db/products \
    /data/retail_db/customers \
    /data/retail_db/orders \
    /data/retail_db/order_items

/data/retail_db/categories:
total 4
-rw-rw-r-- 1 itversity itversity 1029 Mar  8 02:04 part-00000

/data/retail_db/customers:
total 932
-rw-rw-r-- 1 itversity itversity 953719 Mar  8 02:04 part-00000

/data/retail_db/departments:
total 4
-rw-rw-r-- 1 itversity itversity 60 Mar  8 02:04 part-00000

/data/retail_db/order_items:
total 5284
-rw-rw-r-- 1 itversity itversity 5408880 Mar  8 02:04 part-00000

/data/retail_db/orders:
total 2932
-rw-rw-r-- 1 itversity itversity 2999944 Mar  8 02:04 part-00000

/data/retail_db/products:
total 172
-rw-rw-r-- 1 itversity itversity 174155 Mar  8 02:04 part-00000
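
The same listing can be produced from Python. The following is a minimal sketch that uses only the standard library; the function name list_data_files is a hypothetical helper and not part of the original material.

```python
from pathlib import Path


def list_data_files(base_dir):
    """Map each sub-folder name to the sorted names of the files inside it."""
    base = Path(base_dir)
    return {
        folder.name: sorted(p.name for p in folder.iterdir() if p.is_file())
        for folder in sorted(base.iterdir())
        if folder.is_dir()  # skip plain files such as the .sql scripts
    }
```

Calling list_data_files("/data/retail_db") on the setup shown above should return one entry per folder, each holding the single file name part-00000.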

Each folder contains a single file named part-00000.
You can check the type of a file by using the file command. File names typically carry extensions such as txt, csv, or tsv, but extensions are only informational, and these files have none.

In [3]:

!file /data/retail_db/orders/part-00000

/data/retail_db/orders/part-00000: CSV text

As the file type is text, we can use commands such as cat, head, and tail to preview the data.
- head is typically used to see the first few lines. It helps us check whether the file has a header.
- tail is typically used to see the last few lines.
- cat is used to see the contents of the entire file.
- We typically use head or tail to preview the data in large files.
Run the head or tail command on one of the files to see how the data is organized.

In [4]:

!tail -5 /data/retail_db/orders/part-00000

68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
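
A similar preview is possible from Python without loading the whole file into memory. This is a rough sketch, assuming the file is UTF-8 encoded text; the helper names head and tail are hypothetical and simply mirror the shell commands.

```python
from collections import deque
from itertools import islice


def head(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in islice(f, n)]


def tail(path, n=5):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # a deque with maxlen keeps only the most recent n lines in memory
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

For example, tail("/data/retail_db/orders/part-00000", 5) should return the same five records as the tail -5 command above.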

You can run the wc -l command to see the number of lines in the file.

In [5]:

!wc -l /data/retail_db/orders/part-00000

68883 /data/retail_db/orders/part-00000
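
The line count can also be obtained from Python. Here is a small sketch; count_lines is a hypothetical helper name. Note that wc -l counts newline characters, so the two results differ by one only when the last line has no trailing newline.

```python
def count_lines(path):
    """Count the lines in a file by iterating over it once."""
    # binary mode avoids decoding the text, which is not needed for counting
    with open(path, "rb") as f:
        return sum(1 for _ in f)
```

Because the file is read line by line, this approach works on large files without holding them in memory.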

Now let us understand the characteristics of the data.

We have 68883 lines in the file.
Each line has comma-separated values for 4 different fields related to orders.
These lines are also called records. As the attribute values in each record are delimited, or separated, by commas, they are known as comma-separated values.
As the file /data/retail_db/orders/part-00000 contains comma-separated values, it is known as a CSV file.
All the files are text files that contain CSV records, so they are all CSV files.
When we use Python libraries to perform I/O on these files, we can read them as text.
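
As an illustration of reading these records as text, the sketch below uses the csv module from the Python standard library. The field names in ORDER_FIELDS are assumptions made for this example; the data itself has no header, and this section only states that each record has 4 order-related fields.

```python
import csv

# Assumed field names, for illustration only; the file has no header row.
ORDER_FIELDS = ("order_id", "order_date", "order_customer_id", "order_status")


def read_orders(path):
    """Yield each record as a dict that maps the assumed field names to string values."""
    # newline="" is the setting recommended by the csv module for file objects
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            yield dict(zip(ORDER_FIELDS, row))
```

All values come back as strings, so an ID such as "68883" would need int() before any numeric use. Because read_orders is a generator, the records are produced one at a time rather than loaded all at once.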
