Indexes on Tables

Let us go through the details related to indexes supported in RDBMS such as Postgres.

  • An index can be unique or non unique.

  • Unique Index - Data will be sorted in ascending order and uniqueness is enforced.

  • Non Unique Index - Data will be sorted in ascending order and uniqueness is not enforced.

  • Unless specified all indexes are of type B Tree.

  • For sparsely populated columns, we tend to create B Tree indexes. B Tree indexes are the most commonly used ones.

  • For densely populated columns such as gender, month etc with very few distinct values we can leverage bit map index. However bitmap indexes are not used quite extensively in typical web or mobile applications.

  • Write operations will become relatively slow as data have to be managed in index as well as table.

  • We need to be careful while creating indexes on the tables as write operations can become slow as more indexes are added to the table.

  • Here are some of the criteria for creating indexes.

    • Create unique indexes when you want to enforce uniqueness. If you define unique constraint or primary key constraint, it will create unique index internally.

    • If we are performing joins between 2 tables based on a value, then the foreign key column in the child table should be indexed.

      • Typically as part of order management system, we tend to get all the order details for a given order using order id.

      • In our case we will be able to improve the query performance by adding index on order_items.order_item_order_id.

      • However, write operation will become a bit slow. But it is acceptable and required to create index on order_items.order_item_order_id as we write once and read many times over the life of the order.

  • Let us perform tasks related to indexes.

    • Drop and recreate retail db tables.

    • Load data into retail db tables.

    • Compute statistics (Optional). It is typically taken care automatically by the schedules defined by DBAs.

    • Use code to randomly fetch 2000 orders and join with order_items - compute time.

    • Create index for order_items.order_item_order_id and compute statistics

    • Use code to randomly fetch 2000 orders and join with order_items - compute time.

  • Script to create tables and load data in case there are no tables in retail database.

psql -U itversity_retail_user 
  -h localhost 
  -p 5432 
  -d itversity_retail_db 
  -W

DROP TABLE order_items;
DROP TABLE orders;
DROP TABLE products;
DROP TABLE categories;
DROP TABLE departments;
DROP TABLE customers;

i /data/retail_db/create_db_tables_pg.sql
i /data/retail_db/load_db_tables_pg.sql

Copy to clipboard

!pip install psycopg2

Copy to clipboard

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: psycopg2 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (2.8.6)

Copy to clipboard

import psycopg2

Copy to clipboard

%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT * 
FROM orders o JOIN order_items oi 
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    order_id = randrange(1, 68883)
    cursor.execute(query, (order_id,))
    ctr += 1
cursor.close()
connection.close()

Copy to clipboard

CPU times: user 73.8 ms, sys: 31.4 ms, total: 105 ms
Wall time: 19.6 s

Copy to clipboard

%load_ext sql

Copy to clipboard

The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Copy to clipboard

%env DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db

Copy to clipboard

env: DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db

Copy to clipboard

%%sql

CREATE INDEX order_items_oid_idx
ON order_items(order_item_order_id)

Copy to clipboard

 * postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
Done.

Copy to clipboard

[]

Copy to clipboard

%%time

from random import randrange
connection = psycopg2.connect(
    host='localhost',
    port='5432',
    database='itversity_retail_db',
    user='itversity_retail_user',
    password='retail_password'
)
cursor = connection.cursor()
query = '''SELECT * 
FROM orders o JOIN order_items oi 
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id = %s
'''
ctr = 0
while True:
    if ctr == 2000:
        break
    order_id = randrange(1, 68883)
    cursor.execute(query, (order_id,))
    ctr += 1
cursor.close()
connection.close()

Copy to clipboard

CPU times: user 49.1 ms, sys: 32.9 ms, total: 82 ms
Wall time: 265 ms