[
Projecting and Filtering¶
Let us understand how to project as well as filter data in Data Frames.
In [1]:
%run 06_csv_to_pandas_data_frame.ipynb
In [2]:
orders
Out[2]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
… | … | … | … | … |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
In [3]:
order_items
Out[3]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
… | … | … | … | … | … | … |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
- Projecting data
In [4]:
orders.order_date
Out[4]:
0 2013-07-25 00:00:00.0 1 2013-07-25 00:00:00.0 2 2013-07-25 00:00:00.0 3 2013-07-25 00:00:00.0 4 2013-07-25 00:00:00.0 ... 68878 2014-07-09 00:00:00.0 68879 2014-07-13 00:00:00.0 68880 2014-07-19 00:00:00.0 68881 2014-07-22 00:00:00.0 68882 2014-07-23 00:00:00.0 Name: order_date, Length: 68883, dtype: object
In [5]:
orders['order_date']
Out[5]:
0 2013-07-25 00:00:00.0 1 2013-07-25 00:00:00.0 2 2013-07-25 00:00:00.0 3 2013-07-25 00:00:00.0 4 2013-07-25 00:00:00.0 ... 68878 2014-07-09 00:00:00.0 68879 2014-07-13 00:00:00.0 68880 2014-07-19 00:00:00.0 68881 2014-07-22 00:00:00.0 68882 2014-07-23 00:00:00.0 Name: order_date, Length: 68883, dtype: object
In [6]:
# Project order_item_order_id and order_item_subtotal
order_items[['order_item_order_id', 'order_item_subtotal']]
Out[6]:
order_item_order_id | order_item_subtotal | |
---|---|---|
0 | 1 | 299.98 |
1 | 2 | 199.99 |
2 | 2 | 250.00 |
3 | 2 | 129.99 |
4 | 4 | 49.98 |
… | … | … |
172193 | 68881 | 129.99 |
172194 | 68882 | 59.99 |
172195 | 68882 | 50.00 |
172196 | 68883 | 1999.99 |
172197 | 68883 | 150.00 |
172198 rows × 2 columns
- Filter for order_item_order_id 2
In [7]:
order_items.columns
Out[7]:
Index(['order_item_id', 'order_item_order_id', 'order_item_product_id', 'order_item_quantity', 'order_item_subtotal', 'order_item_product_price'], dtype='object')
In [8]:
order_items.order_item_order_id == 2
Out[8]:
0 False 1 True 2 True 3 True 4 False ... 172193 False 172194 False 172195 False 172196 False 172197 False Name: order_item_order_id, Length: 172198, dtype: bool
In [9]:
order_items[order_items.order_item_order_id == 2]
Out[9]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
In [10]:
order_items['order_item_order_id'] == 2
Out[10]:
0 False 1 True 2 True 3 True 4 False ... 172193 False 172194 False 172195 False 172196 False 172197 False Name: order_item_order_id, Length: 172198, dtype: bool
In [11]:
order_items[order_items['order_item_order_id'] == 2]
Out[11]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
In [12]:
order_items.query('order_item_order_id == 2')
Out[12]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
In [13]:
order_items[
(order_items.order_item_order_id == 2) &
((order_items.order_item_subtotal >= 150) &
(order_items.order_item_subtotal <= 250)
)]
Out[13]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
{note}
String passed to `query` API is broken into multiple lines for readability purposes.
In [14]:
order_items.query('order_item_order_id == 2 and ' +
'order_item_subtotal >= 150 and ' +
'order_item_subtotal <= 250')
Out[14]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
In [15]:
orders[orders.order_date == '2013-08-01 00:00:00.0']
Out[15]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
… | … | … | … | … |
57959 | 57960 | 2013-08-01 00:00:00.0 | 10177 | PENDING |
57960 | 57961 | 2013-08-01 00:00:00.0 | 835 | COMPLETE |
57961 | 57962 | 2013-08-01 00:00:00.0 | 10521 | PENDING_PAYMENT |
67446 | 67447 | 2013-08-01 00:00:00.0 | 8956 | COMPLETE |
67447 | 67448 | 2013-08-01 00:00:00.0 | 3697 | SUSPECTED_FRAUD |
246 rows × 4 columns
In [16]:
orders.query('order_date == "2013-08-01 00:00:00.0"')
Out[16]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
… | … | … | … | … |
57959 | 57960 | 2013-08-01 00:00:00.0 | 10177 | PENDING |
57960 | 57961 | 2013-08-01 00:00:00.0 | 835 | COMPLETE |
57961 | 57962 | 2013-08-01 00:00:00.0 | 10521 | PENDING_PAYMENT |
67446 | 67447 | 2013-08-01 00:00:00.0 | 8956 | COMPLETE |
67447 | 67448 | 2013-08-01 00:00:00.0 | 3697 | SUSPECTED_FRAUD |
246 rows × 4 columns
{note}
We can use the functions available as part of `str` usng `python` as engine.
In [17]:
order_date = '2013-08-01 00:00:00.0'
In [18]:
order_date.startswith?
Docstring: S.startswith(prefix[, start[, end]]) -> bool Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try. Type: builtin_function_or_method
In [19]:
order_date.startswith('2013-08')
Out[19]:
True
In [20]:
orders.query('order_date.str.startswith("2013-08")', engine='python')
Out[20]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
… | … | … | … | … |
68705 | 68706 | 2013-08-20 00:00:00.0 | 130 | COMPLETE |
68706 | 68707 | 2013-08-23 00:00:00.0 | 11730 | COMPLETE |
68707 | 68708 | 2013-08-26 00:00:00.0 | 8852 | ON_HOLD |
68708 | 68709 | 2013-08-30 00:00:00.0 | 4756 | COMPLETE |
68709 | 68710 | 2013-08-31 00:00:00.0 | 9685 | COMPLETE |
5680 rows × 4 columns
Task 1¶
Get all the orders placed by customer_id
In [21]:
orders[:10]
Out[21]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
5 | 6 | 2013-07-25 00:00:00.0 | 7130 | COMPLETE |
6 | 7 | 2013-07-25 00:00:00.0 | 4530 | COMPLETE |
7 | 8 | 2013-07-25 00:00:00.0 | 2911 | PROCESSING |
8 | 9 | 2013-07-25 00:00:00.0 | 5657 | PENDING_PAYMENT |
9 | 10 | 2013-07-25 00:00:00.0 | 5648 | PENDING_PAYMENT |
In [22]:
orders.query('order_customer_id == 12431')
Out[22]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
3773 | 3774 | 2013-08-16 00:00:00.0 | 12431 | CANCELED |
3869 | 3870 | 2013-08-17 00:00:00.0 | 12431 | PENDING_PAYMENT |
4031 | 4032 | 2013-08-17 00:00:00.0 | 12431 | ON_HOLD |
22811 | 22812 | 2013-12-12 00:00:00.0 | 12431 | PENDING |
22926 | 22927 | 2013-12-13 00:00:00.0 | 12431 | CLOSED |
25613 | 25614 | 2013-12-30 00:00:00.0 | 12431 | CLOSED |
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
45893 | 45894 | 2014-05-06 00:00:00.0 | 12431 | CLOSED |
46216 | 46217 | 2014-05-07 00:00:00.0 | 12431 | CLOSED |
49677 | 49678 | 2014-05-31 00:00:00.0 | 12431 | PENDING |
51864 | 51865 | 2014-06-15 00:00:00.0 | 12431 | PROCESSING |
63145 | 63146 | 2014-02-13 00:00:00.0 | 12431 | PENDING_PAYMENT |
67109 | 67110 | 2014-07-14 00:00:00.0 | 12431 | PENDING |
In [23]:
orders[orders.order_customer_id == 12431]
Out[23]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
3773 | 3774 | 2013-08-16 00:00:00.0 | 12431 | CANCELED |
3869 | 3870 | 2013-08-17 00:00:00.0 | 12431 | PENDING_PAYMENT |
4031 | 4032 | 2013-08-17 00:00:00.0 | 12431 | ON_HOLD |
22811 | 22812 | 2013-12-12 00:00:00.0 | 12431 | PENDING |
22926 | 22927 | 2013-12-13 00:00:00.0 | 12431 | CLOSED |
25613 | 25614 | 2013-12-30 00:00:00.0 | 12431 | CLOSED |
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
45893 | 45894 | 2014-05-06 00:00:00.0 | 12431 | CLOSED |
46216 | 46217 | 2014-05-07 00:00:00.0 | 12431 | CLOSED |
49677 | 49678 | 2014-05-31 00:00:00.0 | 12431 | PENDING |
51864 | 51865 | 2014-06-15 00:00:00.0 | 12431 | PROCESSING |
63145 | 63146 | 2014-02-13 00:00:00.0 | 12431 | PENDING_PAYMENT |
67109 | 67110 | 2014-07-14 00:00:00.0 | 12431 | PENDING |
In [24]:
orders[orders['order_customer_id'] == 12431]
Out[24]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
3773 | 3774 | 2013-08-16 00:00:00.0 | 12431 | CANCELED |
3869 | 3870 | 2013-08-17 00:00:00.0 | 12431 | PENDING_PAYMENT |
4031 | 4032 | 2013-08-17 00:00:00.0 | 12431 | ON_HOLD |
22811 | 22812 | 2013-12-12 00:00:00.0 | 12431 | PENDING |
22926 | 22927 | 2013-12-13 00:00:00.0 | 12431 | CLOSED |
25613 | 25614 | 2013-12-30 00:00:00.0 | 12431 | CLOSED |
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
45893 | 45894 | 2014-05-06 00:00:00.0 | 12431 | CLOSED |
46216 | 46217 | 2014-05-07 00:00:00.0 | 12431 | CLOSED |
49677 | 49678 | 2014-05-31 00:00:00.0 | 12431 | PENDING |
51864 | 51865 | 2014-06-15 00:00:00.0 | 12431 | PROCESSING |
63145 | 63146 | 2014-02-13 00:00:00.0 | 12431 | PENDING_PAYMENT |
67109 | 67110 | 2014-07-14 00:00:00.0 | 12431 | PENDING |
Task 2¶
Get all the orders placed by customer_id for a given month. Month is passed as yyyy-MM format.
In [25]:
orders.query('order_customer_id == 12431 and order_date.str.startswith("2014-01")', engine='python')
Out[25]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
In [26]:
orders[(orders.order_customer_id == 12431) & (orders.order_date.str.startswith('2014-01'))]
Out[26]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
Task 3¶
Get all the orders which are placed by customer with id 12431 in January 2014 and status is in PENDING_PAYMENT or PROCESSING
In [27]:
orders.query('order_customer_id == 12431 and ' +
'order_date.str.startswith("2014-01") and ' +
'order_status in ("PROCESSING", "PENDING_PAYMENT")',
engine='python'
)
Out[27]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
In [28]:
orders[(orders.order_customer_id == 12431) &
(orders.order_date.str.startswith('2014-01')) &
(orders.order_status.isin(['PROCESSING', 'PENDING_PAYMENT']))
]
Out[28]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
]