[
Performing Total Aggregations¶
Let us understand how to perform total or global aggregations using Pandas.
In [1]:
%run 06_csv_to_pandas_data_frame.ipynb
- Getting number of records in the Data Frame.
In [2]:
orders.shape
Out[2]:
(68883, 4)
In [3]:
orders.shape[0]
Out[3]:
68883
In [4]:
order_items.shape[0]
Out[4]:
172198
- Getting number of non np.NaN values in each attribute in a Data Frame
In [5]:
orders.count()
Out[5]:
order_id 68883 order_date 68883 order_customer_id 68883 order_status 68883 dtype: int64
In [6]:
type(orders.count())
Out[6]:
pandas.core.series.Series
In [7]:
orders.count()['order_id']
Out[7]:
68883
In [8]:
orders.order_id.count()
Out[8]:
68883
In [9]:
orders['order_id'].count()
Out[9]:
68883
- Getting basic statistics of numeric fields of a Data Frame
In [10]:
orders.describe()
Out[10]:
order_id | order_customer_id | |
---|---|---|
count | 68883.000000 | 68883.000000 |
mean | 34442.000000 | 6216.571099 |
std | 19884.953633 | 3586.205241 |
min | 1.000000 | 1.000000 |
25% | 17221.500000 | 3122.000000 |
50% | 34442.000000 | 6199.000000 |
75% | 51662.500000 | 9326.000000 |
max | 68883.000000 | 12435.000000 |
- Get revenue for order id 2 from order_items
In [11]:
order_items[order_items.order_item_order_id == 2]
Out[11]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
In [12]:
order_items[order_items.order_item_order_id == 2].order_item_subtotal
Out[12]:
1 199.99 2 250.00 3 129.99 Name: order_item_subtotal, dtype: float64
In [13]:
order_items[order_items.order_item_order_id == 2].order_item_subtotal.sum()
Out[13]:
579.98
In [14]:
order_items[order_items.order_item_order_id == 2]['order_item_subtotal'].sum()
Out[14]:
579.98
Task 1¶
Use orders and get total number of records for a given month (201401).
In [15]:
orders
Out[15]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
… | … | … | … | … |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
In [16]:
orders.order_date
Out[16]:
0 2013-07-25 00:00:00.0 1 2013-07-25 00:00:00.0 2 2013-07-25 00:00:00.0 3 2013-07-25 00:00:00.0 4 2013-07-25 00:00:00.0 ... 68878 2014-07-09 00:00:00.0 68879 2014-07-13 00:00:00.0 68880 2014-07-19 00:00:00.0 68881 2014-07-22 00:00:00.0 68882 2014-07-23 00:00:00.0 Name: order_date, Length: 68883, dtype: object
In [17]:
orders['order_date'].str.slice(0, 7)
Out[17]:
0 2013-07 1 2013-07 2 2013-07 3 2013-07 4 2013-07 ... 68878 2014-07 68879 2014-07 68880 2014-07 68881 2014-07 68882 2014-07 Name: order_date, Length: 68883, dtype: object
In [18]:
orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64')
Out[18]:
0 201307 1 201307 2 201307 3 201307 4 201307 ... 68878 201407 68879 201407 68880 201407 68881 201407 68882 201407 Name: order_date, Length: 68883, dtype: int64
In [19]:
orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64') == 201401
Out[19]:
0 False 1 False 2 False 3 False 4 False ... 68878 False 68879 False 68880 False 68881 False 68882 False Name: order_date, Length: 68883, dtype: bool
In [20]:
orders[orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64') == 201401]
Out[20]:
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
25875 | 25876 | 2014-01-01 00:00:00.0 | 3414 | PENDING_PAYMENT |
25876 | 25877 | 2014-01-01 00:00:00.0 | 5549 | PENDING_PAYMENT |
25877 | 25878 | 2014-01-01 00:00:00.0 | 9084 | PENDING |
25878 | 25879 | 2014-01-01 00:00:00.0 | 5118 | PENDING |
25879 | 25880 | 2014-01-01 00:00:00.0 | 10146 | CANCELED |
… | … | … | … | … |
68789 | 68790 | 2014-01-26 00:00:00.0 | 10302 | CLOSED |
68790 | 68791 | 2014-01-27 00:00:00.0 | 6524 | COMPLETE |
68791 | 68792 | 2014-01-28 00:00:00.0 | 9809 | CANCELED |
68792 | 68793 | 2014-01-30 00:00:00.0 | 5654 | COMPLETE |
68793 | 68794 | 2014-01-31 00:00:00.0 | 6873 | COMPLETE |
5908 rows × 4 columns
In [21]:
orders[orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64') == 201401]['order_id'].count()
Out[21]:
5908
Task 2¶
Use order_items data set and compute total revenue generated for a given product_id.
In [22]:
order_items
Out[22]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
… | … | … | … | … | … | … |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
In [23]:
order_items.query('order_item_product_id == 502')
Out[23]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
6 | 7 | 4 | 502 | 3 | 150.0 | 50.0 |
19 | 20 | 8 | 502 | 1 | 50.0 | 50.0 |
37 | 38 | 12 | 502 | 5 | 250.0 | 50.0 |
41 | 42 | 14 | 502 | 1 | 50.0 | 50.0 |
… | … | … | … | … | … | … |
172151 | 172152 | 68861 | 502 | 4 | 200.0 | 50.0 |
172173 | 172174 | 68871 | 502 | 4 | 200.0 | 50.0 |
172189 | 172190 | 68880 | 502 | 5 | 250.0 | 50.0 |
172195 | 172196 | 68882 | 502 | 1 | 50.0 | 50.0 |
172197 | 172198 | 68883 | 502 | 3 | 150.0 | 50.0 |
21035 rows × 6 columns
In [24]:
order_items.query('order_item_product_id == 502')['order_item_subtotal'].sum()
Out[24]:
3147800.0
In [25]:
order_items.query('order_item_product_id == 502').order_item_subtotal.sum()
Out[25]:
3147800.0
Task 3¶
Use order_items data set and get total number of items sold as well as total revenue generated for a given product_id.
In [26]:
order_items
Out[26]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
… | … | … | … | … | … | … |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
In [27]:
order_items_for_product_id = order_items.query('order_item_product_id == 502')
order_items_for_product_id
Out[27]:
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
6 | 7 | 4 | 502 | 3 | 150.0 | 50.0 |
19 | 20 | 8 | 502 | 1 | 50.0 | 50.0 |
37 | 38 | 12 | 502 | 5 | 250.0 | 50.0 |
41 | 42 | 14 | 502 | 1 | 50.0 | 50.0 |
… | … | … | … | … | … | … |
172151 | 172152 | 68861 | 502 | 4 | 200.0 | 50.0 |
172173 | 172174 | 68871 | 502 | 4 | 200.0 | 50.0 |
172189 | 172190 | 68880 | 502 | 5 | 250.0 | 50.0 |
172195 | 172196 | 68882 | 502 | 1 | 50.0 | 50.0 |
172197 | 172198 | 68883 | 502 | 3 | 150.0 | 50.0 |
21035 rows × 6 columns
In [28]:
order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']]
Out[28]:
order_item_quantity | order_item_subtotal | |
---|---|---|
2 | 5 | 250.0 |
6 | 3 | 150.0 |
19 | 1 | 50.0 |
37 | 5 | 250.0 |
41 | 1 | 50.0 |
… | … | … |
172151 | 4 | 200.0 |
172173 | 4 | 200.0 |
172189 | 5 | 250.0 |
172195 | 1 | 50.0 |
172197 | 3 | 150.0 |
21035 rows × 2 columns
In [29]:
order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']].sum()
Out[29]:
order_item_quantity 62956.0 order_item_subtotal 3147800.0 dtype: float64
In [30]:
tuple(order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']].sum())
Out[30]:
(62956.0, 3147800.0)
In [31]:
dict(order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']].sum())
Out[31]:
{'order_item_quantity': 62956.0, 'order_item_subtotal': 3147800.0}
Task 4¶
Create a collection with sales and commission percentage. Using that collection compute total commission amount. If the commission percent is None or not present, treat it as 0.
- Each element in the collection should be a tuple.
- First element is the sales amount and second element is commission percentage.
- Commission for each sale can be computed by multiplying commission percentage with sales (make sure to divide commission percentage by 100).
- Some of the records does not have commission percentage, in that case commission amount for that sale shall be 0
In [32]:
transactions = [(376.0, 8),
(548.23, 14),
(107.93, 8),
(838.22, 14),
(846.85, 21),
(234.84,),
(850.2, 21),
(992.2, 21),
(267.01,),
(958.91, 21),
(412.59,),
(283.14,),
(350.01, 14),
(226.95,),
(132.7, 14)]
In [33]:
sales = pd.DataFrame(transactions, columns=['sale_amount', 'commission_pct'])
In [34]:
sales
Out[34]:
sale_amount | commission_pct | |
---|---|---|
0 | 376.00 | 8.0 |
1 | 548.23 | 14.0 |
2 | 107.93 | 8.0 |
3 | 838.22 | 14.0 |
4 | 846.85 | 21.0 |
5 | 234.84 | NaN |
6 | 850.20 | 21.0 |
7 | 992.20 | 21.0 |
8 | 267.01 | NaN |
9 | 958.91 | 21.0 |
10 | 412.59 | NaN |
11 | 283.14 | NaN |
12 | 350.01 | 14.0 |
13 | 226.95 | NaN |
14 | 132.70 | 14.0 |
In [35]:
sales_filled = sales.fillna(0.0)
sales_filled
Out[35]:
sale_amount | commission_pct | |
---|---|---|
0 | 376.00 | 8.0 |
1 | 548.23 | 14.0 |
2 | 107.93 | 8.0 |
3 | 838.22 | 14.0 |
4 | 846.85 | 21.0 |
5 | 234.84 | 0.0 |
6 | 850.20 | 21.0 |
7 | 992.20 | 21.0 |
8 | 267.01 | 0.0 |
9 | 958.91 | 21.0 |
10 | 412.59 | 0.0 |
11 | 283.14 | 0.0 |
12 | 350.01 | 14.0 |
13 | 226.95 | 0.0 |
14 | 132.70 | 14.0 |
In [36]:
(sales_filled['sale_amount'] * (sales_filled['commission_pct'] / 100))
Out[36]:
0 30.0800 1 76.7522 2 8.6344 3 117.3508 4 177.8385 5 0.0000 6 178.5420 7 208.3620 8 0.0000 9 201.3711 10 0.0000 11 0.0000 12 49.0014 13 0.0000 14 18.5780 dtype: float64
In [37]:
(sales_filled['sale_amount'] * (sales_filled['commission_pct'] / 100)).sum()
Out[37]:
1066.5104000000001
In [38]:
(sales_filled['sale_amount'] * (sales_filled['commission_pct'] / 100)).sum().round(2)
Out[38]:
1066.51
]