Data Frame Operations – Select Clause and Functions

As part of this session, we will understand how to select or project data from Data Frames while applying functions to extract the required information.

  • Getting Started
  • String Manipulation Functions
  • Using withColumn
  • Using selectExpr
  • Date Manipulation Functions
  • Dropping Columns
  • User Defined Functions – Simple

Getting Started

Before getting into the pre-defined functions available to process the data, let us make sure we have Data Frames to apply these functions on.

  • We can apply toDF on a Seq to create a Data Frame out of a typical Scala collection.
  • First, we will create a Data Frame named dual with a column dummy and value X.
  • Also, let us create a Data Frame for the orders data set.
  • Most of the Data Frame APIs such as select, where, groupBy etc. take column names either as strings or as col type.
  • Functions used in these APIs take column names of col type.
  • If we have to add a constant value to the existing values in a column, we need to use lit on top of the constant value.
  • We can also use $ instead of col.
  • We pass column names as strings if we are not applying any functions on them.
[gist]60c358a2591e0a0fb5a865714b455abe[/gist]
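Here is a minimal sketch of the steps above, assuming a Spark 2.x spark-shell where the SparkSession is available as spark; the path and column names used for the orders data set are assumptions and might differ in your environment.

  import org.apache.spark.sql.functions.{col, lit, concat}
  import spark.implicits._

  // Data Frame named dual with column dummy and value X
  val dual = Seq("X").toDF("dummy")

  // Data Frame for the orders data set (hypothetical path and assumed schema)
  val orders = spark.read.
    format("csv").
    load("/public/retail_db/orders").
    toDF("order_id", "order_date", "order_customer_id", "order_status")

  // Column names as strings, as col type and as $, plus lit for constants
  orders.select("order_id", "order_date").show(5)
  orders.select(col("order_id"), $"order_status").show(5)
  dual.select(concat($"dummy", lit(" - constant"))).show()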

String Manipulation Functions

Let us go through some of the important functions with which we can manipulate strings. We should spend enough time understanding how to manipulate strings using the available functions.

  • Case conversion functions – lower, upper, initcap
  • Trim functions – trim, rtrim, ltrim
  • Padding functions – lpad, rpad
  • Typecasting – we can use the Hive cast function as part of selectExpr to change the data type of a value to the appropriate type (e.g. the date might be stored as a string, but we can extract the year part and convert it into an integer)
  • Getting the length of a string – length (see the sketch after this list)
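A minimal sketch of these string functions, building on the dual and orders Data Frames created earlier (the column names are assumptions):

  import org.apache.spark.sql.functions.{lower, initcap, trim, lpad, length, lit}

  // Case conversion and length
  orders.select(lower($"order_status"), initcap(lower($"order_status")), length($"order_status")).show(5)

  // Trimming and padding
  dual.select(trim(lit("   Hello World   "))).show()
  orders.select(lpad($"order_id", 10, "0")).show(5)

  // Typecasting with Hive style cast inside selectExpr
  orders.selectExpr("cast(substring(order_date, 1, 4) as int) as order_year").show(5)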

Extracting Data

Let us see how we can extract data from the fields of Data Frame.

  • Extracting data from fixed length records – substring
  • Extracting data from variable length records – split
[gist]85e91cbef24f1e5e445486ba6b621697[/gist]
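A minimal sketch of substring and split on the same orders Data Frame; the sample comma separated line is made up for illustration.

  import org.apache.spark.sql.functions.{substring, split}

  // Fixed length records – substring(column, position, length); positions start at 1
  orders.select(substring($"order_date", 1, 7).alias("order_month")).show(5)

  // Variable length records – split on a delimiter returns an array column
  val lines = Seq("1,2013-07-25 00:00:00.0,11599,CLOSED").toDF("line")
  lines.select(split($"line", ",")(1).alias("order_date")).show(false)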

Using withColumn

We can use withColumn to transform data of a particular column within a Data Frame without impacting other columns.

  • If we want to select all the existing columns along with new columns derived by applying some transformation logic, we can use withColumn instead of specifying all the columns along with the expression in select.
  • All the other columns will remain untouched.
  • Syntax of withColumn – df.withColumn("column_name", EXPRESSION)
  • In the below example we are going to discard the timestamp from order_date while keeping the column name as order_date. The new Data Frame will still have 4 columns, but order_date will have dates without timestamps.
[gist]ac9470d1c7ba79a25c77d077af846e74[/gist]
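A minimal sketch of the example described above, assuming order_date is a string in the format yyyy-MM-dd HH:mm:ss.S:

  import org.apache.spark.sql.functions.substring

  // Discard the timestamp part of order_date while keeping the column name;
  // all the other columns remain untouched
  val ordersWithDate = orders.withColumn("order_date", substring($"order_date", 1, 10))
  ordersWithDate.printSchema()
  ordersWithDate.show(5)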

Using selectExpr

We can use selectExpr for advanced transformations such as case when.

  • Even though the functions available on Data Frames are robust, sometimes we might have to use a traditional SQL style approach while applying functions.
  • We can take care of it using selectExpr. Whatever functions we pass to selectExpr should follow Hive syntax. If we use split, we need to use [] to access elements from the array generated as the result of the split.
  • We also have expr, which works in a similar fashion while using withColumn or select.
  • selectExpr("EXPRESSION") is an alias for select(expr("EXPRESSION")).
[gist]098a77776b55657c1a195c30ea6e09ed[/gist]
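A minimal sketch of selectExpr and expr on the orders Data Frame; the case when categorisation is only for illustration.

  import org.apache.spark.sql.functions.expr

  // Hive style syntax inside selectExpr – case when and split with []
  orders.selectExpr(
    "order_id",
    "case when order_status in ('CLOSED', 'COMPLETE') then 'Processed' else 'Pending' end as status_flag",
    "split(order_date, ' ')[0] as order_date"
  ).show(5)

  // expr gives the same behaviour inside select or withColumn
  orders.withColumn("order_date", expr("split(order_date, ' ')[0]")).show(5)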

Date Manipulation Functions

We also have to deal with dates very often. There are several functions which can be leveraged to manipulate dates.

  • Date arithmetic – date_add, date_sub, datediff, next_day, last_day, months_between, add_months
  • Getting the first of the month or year – trunc
  • Extracting information – dayofmonth, dayofyear, dayofweek, year, month
  • Formatting date – date_format
  • Typecasting – we can use cast to convert the data type of the values of a particular column to the appropriate type.
[gist]b4a4e67a377fb7aad505c121af65733c[/gist]
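A minimal sketch of some of these date functions, run against the single row dual Data Frame so that the output is easy to read:

  import org.apache.spark.sql.functions.{current_date, date_add, add_months, trunc, year, date_format, col}

  dual.select(
    current_date().alias("today"),
    date_add(current_date(), 7).alias("next_week"),
    add_months(current_date(), 3).alias("next_quarter"),
    trunc(current_date(), "MM").alias("first_of_month"),
    year(current_date()).alias("year"),
    date_format(current_date(), "yyyyMM").alias("year_month")
  ).show()

  // Typecasting – convert the order_date string to date type
  orders.select(col("order_date").cast("date")).printSchema()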

Dropping Columns

Now let us see how we can drop columns from Data Frame.

  • We can use drop to drop one or more columns from a Data Frame.
  • When we use drop, it will create new Data Frame without the columns specified as part of drop.
  • Column names can be passed as string type or column type. Using column type, we can drop only one column at a time.
[gist]e5ceaa9ac7435c4557b438ddc81da33b[/gist]
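A minimal sketch of drop on the orders Data Frame:

  // Drop one or more columns by name (strings) – returns a new Data Frame
  val ordersWithoutStatus = orders.drop("order_status")
  ordersWithoutStatus.printSchema()

  // Drop using column type – only one column at a time
  orders.drop($"order_customer_id").printSchema()

  // The original Data Frame is untouched
  orders.printSchema()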

User Defined Functions – Simple

Let us explore how we can define simple user defined functions and use them as part of Data Frame Operations as well as Spark SQL.

  • There are simple UDFs as well as Aggregated UDFs. For now, we will focus on simple UDFs.
  • We can assign a function to a variable and convert it into a UDF using org.apache.spark.sql.functions.udf. It will return an object of type org.apache.spark.sql.expressions.UserDefinedFunction.
  • Once the UDF is created, we can use it as part of any transformation function such as select, where, filter, groupBy etc. using the Data Frame native approach.
  • However, if we want to use it as part of SQL style syntax, we need to register the UDF as a SQL function. We can do so by using spark.udf.register.
  • Let us see a demo by creating a UDF to extract the year from a date string and convert it to an Integer.
[gist]432197637572afbb7458a56ca23c3569[/gist]
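A minimal sketch of a simple UDF that extracts the year from the order_date string, used with both the Data Frame native approach and Spark SQL (the function and view names are made up for this example):

  import org.apache.spark.sql.functions.udf

  // Function to extract the year from a date string such as 2013-07-25 00:00:00.0
  val getYear = (date: String) => date.substring(0, 4).toInt
  val getYearUDF = udf(getYear)

  // Data Frame native approach
  orders.select(getYearUDF($"order_date").alias("order_year")).show(5)

  // Register the UDF so that it can be used in SQL style syntax
  spark.udf.register("get_year", getYear)
  orders.createOrReplaceTempView("orders")
  spark.sql("select get_year(order_date) as order_year, count(1) from orders group by get_year(order_date)").show()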