As part of this session we will see advanced operations such as aggregations, ranking and windowing functions with in each group using clauses such as over, partition by, order by etc. We will also build a solution for problem and run it on multinode cluster.
Aggregations, Ranking and Windowing Functions
Problem Statement – Get top n products per day
Understanding over, partition by and order by clauses
Performing aggregations
Using windowing functions
Ranking with in each partition or group
Development Life Cycle
Development Life Cycle (daily product revenue)
Let us develop the application using Pycharm and run it on the cluster.
Make sure application.properties have required input path and output path along with execution mode
Read orders and order_items data into data frames
Filter for complete and closed orders
Join with order_items
Aggregate to get revenue for each order_date and order_item_product_id
Sort in ascending order by date and then descending order by revenue
Save the output as CSV format
Validate using Pycharm
Ship it to the cluster, run it on the cluster and validate.
Aggregations, Ranking and Windowing Functions
Let us understand APIs related to aggregations, ranking and windowing functions.
There are multiple clauses with in SQL to accomplish these
over
partition by
order by
All aggregate functions, rank functions and windowing functions can be used with over clause to get aggregations per partition or group
It is mandatory to specify over clause
e.g.: rank() over(spec) where spec can be partition by or order by or both
Aggregations – sum, avg, min, max etc
Ranking – rank, dense_rank, row_number etc
Windowing – lead, lag etc
We typically use partition by clause for aggregations and then partition by as well as order by for ranking and windowing functions.
Problem Statement – Get top n products per day
Let us define the problem statement and see the real usage of analytics function.
Problem Statement – Get top N Products Per day
Get daily product revenue code from previous topic
Use ranking functions and get the rank associated based on revenue for each day
Once we get rank, let us filter for top n products.
Understanding over, partition by and order by clauses
Let us understand different clauses required for analytics functions.
Typical syntax – function(argument) over (partition by groupcolumn [order by [desc] ordercolumn])
For aggregations we can define group by using partition by
For ranking or windowing we need to use partition by and then order by. partition by is to group the data and order by is to sort the data to assign rank.
We will not be able to use these any where except for select clause
If we have to filter on these derived fields in select clause, we need to nest the whole query into another query.
Performing aggregations
Let us see how to perform aggregations with in each group.
We have functions such as sum, avg, min, max etc which can be used to aggregate the data.
We need to use over (partition by) to get aggregations with in each group.
Some realistic use cases
Get average salary for each department and get all employee details who earn more than average salary
Get average revenue for each day and get all the orders who earn revenue more than average revenue
Get highest order revenue and get all the orders which have revenue more than 75% of the revenue
Using windowing functions
Let us see details about windowing functions with in each group
We have functions such as lead, lag etc
We need to use partition by and then order by for most of the windowing functions
Some realistic use cases
Salary difference between current and next/previous employee with in each department
Ranking with in each partition or group
Let us talk about ranking functions with in each group.
We have functions like rank, dense_rank, row_number, first, last etc
We need to use partition by and then order by for most of the windowing functions
Some realistic use cases
Assign rank to employees based on salary with in each department
Assign ranks to products based on revenue each day or month
Development Life Cycle
Let us talk about development life cycle.
Take the DailyProductRevenue code which gives us order_date, order_item_product_id and revenue
Add logic using function rank over partition by date and order by revenue in descending order. Make sure to give alias to this new field.
Nest the query into another query – e. g.: select required_fields from (query) query_alias
Add where clause on the query_alias.derived field name
Let us select and assign data related to order_date, order_item_product_id and revenue (after filtering on topN) to a new data frame