Spark SQL – Analytics Functions or Windowing Functions

As part of this session we will see advanced operations such as aggregations, ranking and windowing functions with in each group using clauses such as over, partition by, order by etc. We will also build a solution for problem and run it on multinode cluster.

  • Aggregations, Ranking and Windowing Functions
  • Problem Statement – Get top n products per day
  • Understanding over, partition by and order by clauses
  • Performing aggregations
  • Using windowing functions
  • Ranking with in each partition or group
  • Development Life Cycle

Development Life Cycle (daily product revenue)

Let us develop the application using Pycharm and run it on the cluster.

  • Make sure application.properties have required input path and output path along with execution mode
  • Read orders and order_items data into data frames
  • Filter for complete and closed orders
  • Join with order_items
  • Aggregate to get revenue for each order_date and order_item_product_id
  • Sort in ascending order by date and then descending order by revenue
  • Save the output as CSV format
  • Validate using Pycharm
  • Ship it to the cluster, run it on the cluster and validate.

Aggregations, Ranking and Windowing Functions

Let us understand APIs related to aggregations, ranking and windowing functions.

  • There are multiple clauses with in SQL to accomplish these
    • over
    • partition by
    • order by
  • All aggregate functions, rank functions and windowing functions can be used with over clause to get aggregations per partition or group
  • It is mandatory to specify over clause
  • e.g.: rank() over(spec) where spec can be partition by or order by or both
  • Aggregations – sum, avg, min, max etc
  • Ranking – rank, dense_rank, row_number etc
  • Windowing – lead, lag etc
  • We typically use partition by clause for aggregations and then partition by as well as order by for ranking and windowing functions.

Problem Statement – Get top n products per day

Let us define the problem statement and see the real usage of analytics function.

  • Problem Statement – Get top N Products Per day
  • Get daily product revenue code from previous topic
  • Use ranking functions and get the rank associated based on revenue for each day
  • Once we get rank, let us filter for top n products.

Understanding over, partition by and order by clauses

Let us understand different clauses required for analytics functions.

  • Typical syntax – function(argument) over (partition by groupcolumn [order by [desc] ordercolumn])
  • For aggregations we can define group by using partition by
  • For ranking or windowing we need to use partition by and then order by. partition by is to group the data and order by is to sort the data to assign rank.
  • We will not be able to use these any where except for select clause
  • If we have to filter on these derived fields in select clause, we need to nest the whole query into another query.

Performing aggregations

Let us see how to perform aggregations with in each group.

  • We have functions such as sum, avg, min, max etc which can be used to aggregate the data.
  • We need to use over (partition by) to get aggregations with in each group.
  • Some realistic use cases
    • Get average salary for each department and get all employee details who earn more than average salary
    • Get average revenue for each day and get all the orders who earn revenue more than average revenue
    • Get highest order revenue and get all the orders which have revenue more than 75% of the revenue

Using windowing functions

Let us see details about windowing functions with in each group

  • We have functions such as lead, lag etc
  • We need to use partition by and then order by for most of the windowing functions
  • Some realistic use cases
    • Salary difference between current and next/previous employee with in each department

Ranking with in each partition or group

Let us talk about ranking functions with in each group.

  • We have functions like rank, dense_rank, row_number, first, last etc
  • We need to use partition by and then order by for most of the windowing functions
  • Some realistic use cases
    • Assign rank to employees based on salary with in each department
    • Assign ranks to products based on revenue each day or month

Development Life Cycle

Let us talk about development life cycle.

  • Take the DailyProductRevenue code which gives us order_date, order_item_product_id and revenue
  • Add logic using function rank over partition by date and order by revenue in descending order. Make sure to give alias to this new field.
  • Nest the query into another query – e. g.: select required_fields from (query) query_alias
  • Add where clause on the query_alias.derived field name
  • Let us select and assign data related to order_date, order_item_product_id and revenue (after filtering on topN) to a new data frame
  • Save the data frame into file