Spark SQL – Analytics or Windowing Functions

As part of this session, we will look at advanced operations such as aggregations, ranking, and windowing functions within each group, using clauses such as OVER and PARTITION BY. We will also build a solution to a practical problem and run it on a multi-node cluster.

  • Window Functions – Overview
  • Problem Statement – Get top n products per day
  • Performing Aggregations
  • Using Windowing Functions
  • Ranking Functions
  • Development Life Cycle

Window Functions – Overview

Let us understand the functions related to aggregations, ranking, and windowing.

  • We use these functions in the SELECT clause.
  • Specification: function() OVER (PARTITION BY column [ORDER BY column])
  • PARTITION BY is used to group the data based on a column.
  • ORDER BY is used to sort the data within each group based on a column.
  • Example: rank() OVER (PARTITION BY department_id ORDER BY salary DESC)
  • Aggregate functions – sum, avg, min, max, etc.
  • Ranking functions – rank, dense_rank, row_number, etc.
  • Windowing functions – lead, lag, etc.
  • The window specification supports clauses such as PARTITION BY and ORDER BY.
  • For aggregations, we can define the group using PARTITION BY alone.
  • For ranking or windowing, we need PARTITION BY followed by ORDER BY: PARTITION BY groups the data, and ORDER BY sorts it within each group so that ranks can be assigned.
[gist]2557a0c9dd5a90fe6bbae78e4d6666ec[/gist]
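The gist above contains the session's Spark code. As a minimal standalone illustration of the OVER (PARTITION BY ... ORDER BY ...) specification, the sketch below runs the example query from the list through Python's built-in sqlite3 module (SQLite 3.25+ supports the same standard window-function syntax as Spark SQL). The employees table and its data are made up for the example.

```python
import sqlite3

# Hypothetical employees data, only to illustrate the window spec.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, department_id INT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 5000), (2, 10, 7000), (3, 10, 7000), (4, 20, 6000), (5, 20, 4000)],
)

# rank() restarts at 1 for every department (PARTITION BY) and is
# assigned by descending salary (ORDER BY ... DESC).
rows = conn.execute("""
    SELECT employee_id, department_id, salary,
           rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
    FROM employees
""").fetchall()
for row in rows:
    print(row)
```

Note how the two employees in department 10 who share a salary of 7000 both get rank 1, while the next employee in that department gets rank 3, not 2.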

Problem Statement – Get top n products per day

Let us define the problem statement and see the real usage of analytics functions.

  • Problem Statement – Get top N products per day
  • Start with the daily product revenue code from the previous topic
  • Use ranking functions to assign a rank based on revenue for each day
  • Once we have the rank, filter for the top N products.
  • We have already seen how to use join, group by, order by, etc. to get the daily product revenue
[gist]8da0ac33214f83705ccdd21a7f8fb51f[/gist]

Performing Aggregations

Let us see how to perform aggregations within each group.

  • We have functions such as sum, avg, min, max, etc. which can be used to aggregate the data.
  • We need to specify PARTITION BY to compute aggregations within each group.
[gist]537efb37a2c3955e596550abd5fe7880[/gist]
  • Some realistic use cases
    • Get the average salary for each department and get details of all employees who earn more than the average salary
    • Get the average revenue for each day and get all the orders whose revenue is more than the average
    • Get the highest order revenue and get all the orders whose revenue is more than 75% of it
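The gist above shows the Spark version. To sketch the first use case in a self-contained way, the example below uses Python's built-in sqlite3 (same standard window-function syntax as Spark SQL); the employees table and figures are hypothetical. Because the aggregate's OVER clause has no ORDER BY, avg(salary) is computed over the whole partition and repeated on every row, which lets us compare each employee against their department's average.

```python
import sqlite3

# Hypothetical employees data for the "above department average" use case.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, department_id INT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 5000), (2, 10, 7000), (3, 20, 6000), (4, 20, 4000)],
)

# avg(salary) OVER (PARTITION BY department_id) repeats each department's
# average on every row; the outer query keeps only employees above it.
above_avg = conn.execute("""
    SELECT employee_id, department_id, salary
    FROM (
        SELECT employee_id, department_id, salary,
               avg(salary) OVER (PARTITION BY department_id) AS dept_avg
        FROM employees
    ) q
    WHERE salary > dept_avg
""").fetchall()
print(above_avg)
```

A plain GROUP BY would collapse each department to one row; the window function keeps every employee row while still exposing the per-group aggregate, which is exactly why these use cases need it.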

Using Windowing Functions

Let us see the details of windowing functions within each group.

  • We have functions such as lead, lag, first, last, etc.
  • We need to create a window spec for most of these functions by specifying the grouping column under the PARTITION BY clause and the sorting column under the ORDER BY clause.
  • lead and lag take the column names or expressions from which we need to get the required information.
  • Some realistic use cases
    • Compute the salary difference between the current and the next/previous employee within each department
[gist]f669afabb48e14b874f611d37304ee7c[/gist]
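The gist above has the Spark code for this. As a standalone sketch of the use case, the example below computes the salary gap to the next-lower-paid employee within each department, again via Python's built-in sqlite3 with made-up data. lead(salary) looks one row ahead within the same partition and ordering, and returns NULL (None in Python) on the last row of each department.

```python
import sqlite3

# Hypothetical employees data for the lead/lag example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, department_id INT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 9000), (2, 10, 7000), (3, 10, 5000), (4, 20, 6000)],
)

# lead(salary) fetches the next row's salary within the department,
# ordered by salary descending; the last row per department gets NULL.
rows = conn.execute("""
    SELECT employee_id, department_id, salary,
           salary - lead(salary) OVER (
               PARTITION BY department_id ORDER BY salary DESC
           ) AS diff_to_next
    FROM employees
""").fetchall()
for row in rows:
    print(row)
```

Swapping lead for lag would instead compare against the previous (higher-paid) employee in the same ordering.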

Ranking Functions

Let us talk about ranking functions within each group.

  • We have functions such as rank, dense_rank, row_number, etc.
  • We need to create a window spec for most of the ranking functions by specifying the grouping column under the PARTITION BY clause and the sorting column under the ORDER BY clause.
  • Some realistic use cases
    • Assign rank to employees based on salary within each department
    • Assign ranks to products based on revenue each day or month
[gist]63d27cc497aaf4178739ecced26e8185[/gist]
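The gist above contains the Spark code. The three ranking functions differ only in how they treat ties, and the difference is easiest to see side by side; the sketch below compares them over one department with a salary tie, using Python's built-in sqlite3 and hypothetical data.

```python
import sqlite3

# Hypothetical department with a tie on the top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, department_id INT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 10, 7000), (2, 10, 7000), (3, 10, 5000)],
)

# With a tie on 7000: rank leaves a gap (1, 1, 3), dense_rank does not
# (1, 1, 2), and row_number always increases (1, 2, 3) -- though which of
# the two tied rows gets row_number 1 is not deterministic.
rows = conn.execute("""
    SELECT employee_id, salary,
           rank()       OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk,
           dense_rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS drnk,
           row_number() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
    FROM employees
""").fetchall()
for row in rows:
    print(row)
```

For "top N products per day", rank (or dense_rank, depending on how ties should count toward N) is usually the right choice; row_number would arbitrarily drop one of two tied products.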

Development Life Cycle

Let us talk about the development lifecycle.

  • Take the GetDailyProductRevenueSQL code, which gives us order_date, order_item_product_id, and revenue (dailyProductRevenue)
  • Register dailyProductRevenue as a temporary view named daily_product_revenue
  • Write a query against daily_product_revenue that partitions by order_date, orders by revenue in descending order, and assigns a rank using the rank function.
  • Nest that query and give it the alias inner.
  • Filter the data where rank is less than or equal to topN, passed as an argument to the program
  • Drop the rank field, as we do not want it in the saved output, and then sort in ascending order by date and descending order by revenue
  • Save the data frame to a file
[gist]64ec19d6057c707096a713f8f5fe93c7[/gist]
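The gist above is the actual Spark program. The query logic of the steps listed can also be sketched end to end on a small made-up daily_product_revenue table, again via Python's built-in sqlite3: the nested query assigns a rank per day, and the outer query filters to the top N, drops the rank column, and sorts. The subquery is aliased inner_q here because INNER is a reserved word in SQL; the table data and TOP_N value are hypothetical.

```python
import sqlite3

TOP_N = 2  # in the Spark program this is passed as a command-line argument

# Hypothetical daily product revenue data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_product_revenue (
        order_date TEXT, order_item_product_id INT, revenue REAL
    )
""")
conn.executemany(
    "INSERT INTO daily_product_revenue VALUES (?, ?, ?)",
    [
        ("2014-01-01", 101, 500.0), ("2014-01-01", 102, 300.0),
        ("2014-01-01", 103, 100.0), ("2014-01-02", 101, 250.0),
        ("2014-01-02", 104, 400.0),
    ],
)

# Inner query: rank products per day by revenue (descending).
# Outer query: keep rank <= TOP_N, drop the rank column, and sort by
# date ascending and revenue descending.
top_n_daily = conn.execute("""
    SELECT order_date, order_item_product_id, revenue
    FROM (
        SELECT order_date, order_item_product_id, revenue,
               rank() OVER (
                   PARTITION BY order_date ORDER BY revenue DESC
               ) AS rnk
        FROM daily_product_revenue
    ) inner_q
    WHERE rnk <= ?
    ORDER BY order_date, revenue DESC
""", (TOP_N,)).fetchall()
for row in top_n_daily:
    print(row)
```

In the Spark version, the final result would be written out with the DataFrame writer instead of printed.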