As part of this session, we will see advanced operations such as aggregations, ranking, and windowing functions within each group, using APIs such as over and partitionBy. We will also build a solution to the problem and run it on a multinode cluster.
Window Functions – APIs
Problem Statement – Get top n products per day
Creating Window Spec
Performing Aggregations
Using Windowing Functions
Ranking Functions
Development Life Cycle
Window Functions – APIs
Let us understand the APIs related to aggregations, ranking, and windowing functions.
Main package org.apache.spark.sql.expressions
It has classes such as Window and WindowSpec
Window has APIs such as partitionBy, orderBy, etc.
These APIs (such as partitionBy) return a WindowSpec object. We can pass the WindowSpec object to over on functions such as rank(), dense_rank(), sum(), etc.
Syntax: rank().over(spec) where spec = Window.partitionBy('ColumnName')
Aggregations – sum, avg, min, max etc
Ranking – rank, dense_rank, row_number etc
Windowing – lead, lag etc
[gist]33c094e471e6a8fe42a3dde30f73d692[/gist]
Problem Statement – Get top n products per day
Let us define the problem statement and see the real usage of analytic functions.
Take the daily product revenue code from the previous topic
Use ranking functions to assign a rank based on revenue for each day
Once we have the rank, filter for the top N products
[gist]3d5568660fafaf080071d0cc4a9e5682[/gist]
Creating Window Spec
Let us see how to create Window Spec.
Window has APIs such as partitionBy and orderBy
For aggregations, we can define the grouping using partitionBy
For ranking or windowing, we need to use partitionBy and then orderBy. partitionBy groups the data, and orderBy sorts it so that ranks can be assigned.
Both partitionBy and orderBy return a WindowSpec object
The WindowSpec object is then passed to over on ranking and aggregate functions.
Performing Aggregations
Let us see how to perform aggregations within each group.
We have functions such as sum, avg, min, max, etc., which can be used to aggregate the data.
We need to create WindowSpec object using partitionBy to get aggregations within each group.
[gist]bf06d68e6da2f2b188525277d2494712[/gist]
Some realistic use cases
Get the average salary for each department and get the details of all employees who earn more than that average salary
Get the average revenue for each day and get all the orders that generate more revenue than the daily average
Get the highest order revenue and get all the orders whose revenue is more than 75% of that highest revenue
Using Windowing Functions
Let us see the details of windowing functions within each group.
We have functions such as lead, lag, first, last etc
We need to create WindowSpec object using partitionBy and then orderBy for most of the windowing functions
lead and lag take the column whose value you want to fetch from the next or previous record, based on the partition and order columns.
Some realistic use cases
Compute the salary difference between the current and the next/previous employee within each department
[gist]4fddb09e33dde0d04f4d8786b7fed37f[/gist]
Ranking Functions
Let us talk about ranking functions within each group.
We have functions like rank, dense_rank, row_number etc
We need to create WindowSpec object using partitionBy and then orderBy for most of the ranking functions
Some realistic use cases
Assign rank to employees based on salary within each department
Assign ranks to products based on revenue for each day or month
[gist]5d59c30231a11330e88dfed62b3f598e[/gist]
Development Life Cycle
Let us talk about the development lifecycle.
Take the DailyProductRevenue code which gives us order_date, order_item_product_id, and revenue
Import Window and create a spec to partition by date and order by revenue in descending order.
Use withColumn and assign the rank
Filter data where rank is less than or equal to topN passed as an argument to the program
Drop the rank field, as we do not want it in the saved output, and then sort in ascending order by date and descending order by revenue