As part of this session, we will understand how to select or project data from Data Frames while applying functions to extract the required information.
Getting Started
String Manipulation Functions
Using withColumn
Using selectExpr
Date Manipulation Functions
Dropping Columns
User Defined Functions – Simple
Getting Started
Before getting into the pre-defined functions available to process data, let us make sure we have Data Frames on which to apply these functions.
We can apply toDF on a Seq to create a Data Frame out of a typical Scala collection.
First, we will create a Data Frame by name dual with column dummy and value X.
Also, let us create a Data Frame for the orders data set.
Most of the Data Frame APIs such as select, where, groupBy etc. take column names in the form of strings or of col type.
Functions used in these APIs take column names in the form of col type.
If we have to add a constant value to the existing values in a column, we need to use lit on top of the constant value.
We can also use $ instead of col.
We pass column names as strings when we are simply selecting columns without applying any functions on them.
[gist]60c358a2591e0a0fb5a865714b455abe[/gist]
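For illustration, here is a minimal Scala sketch along these lines, assuming spark-shell (so spark and its implicits are already available) and a tiny hand-made sample standing in for the orders data set:

import org.apache.spark.sql.functions.{col, lit, concat}

// Data Frame by name dual with column dummy and value X
val dual = Seq("X").toDF("dummy")

// Small sample standing in for the orders data set
val orders = Seq(
  (1, "2013-07-25 00:00:00.0", 11599, "CLOSED"),
  (2, "2013-07-25 00:00:00.0", 256, "PENDING_PAYMENT")
).toDF("order_id", "order_date", "order_customer_id", "order_status")

// Column names as strings, as col type and as $; lit wraps a constant value
orders.select("order_id", "order_date").show()
orders.select(col("order_id"), $"order_date").show()
dual.select(concat($"dummy", lit("_SUFFIX"))).show()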
String Manipulation Functions
Let us go through some of the important functions with which we can manipulate strings. We should spend enough time understanding how to manipulate strings using the available functions.
Case conversion functions – lower, upper, initcap
Trim functions – trim, rtrim, ltrim
Padding functions – lpad, rpad
Typecasting – we can use the Hive function cast as part of selectExpr to change the data type of a column (e.g., a date might be stored as a string, but we can extract the year part and convert it into an integer)
Getting the length of a string – length
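Here is a hedged sketch of these string functions, reusing the dual and orders Data Frames created earlier (the literals are just sample values):

import org.apache.spark.sql.functions.{lower, upper, initcap, trim, ltrim, rtrim, lpad, rpad, length, lit}

// Case conversion, trimming, padding and length on literal sample strings
dual.select(lower(lit("HELLO")), upper(lit("hello")), initcap(lit("hello world"))).show()
dual.select(trim(lit("  hello  ")), ltrim(lit("  hello")), rtrim(lit("hello  "))).show()
dual.select(lpad(lit("7"), 3, "0"), rpad(lit("US"), 5, "-"), length(lit("hello"))).show()

// Typecasting with the Hive cast function inside selectExpr
orders.selectExpr("cast(substring(order_date, 1, 4) as int) as order_year").show()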
Extracting Data
Let us see how we can extract data from the fields of Data Frame.
Extracting data from fixed-length records – substring
Extracting data from variable-length records – split
[gist]85e91cbef24f1e5e445486ba6b621697[/gist]
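A brief sketch of substring and split, again against the sample orders data (the delimited line and its delimiter are assumptions):

import org.apache.spark.sql.functions.{col, substring, split}

// Fixed length: the first 10 characters of order_date give the date part
orders.select(substring(col("order_date"), 1, 10).alias("order_date")).show()

// Variable length: split a delimited line and pick elements by position
val lines = Seq("1,2013-07-25 00:00:00.0,11599,CLOSED").toDF("line")
lines.select(split(col("line"), ",")(0).alias("order_id"),
  split(col("line"), ",")(3).alias("order_status")).show()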
Using withColumn
We can use withColumn to transform data of a particular column within a Data Frame without impacting other columns.
If we want to select all the existing columns as well as columns derived by applying some transformation logic, we can use withColumn instead of specifying every column along with the expression in select.
All the other columns will remain untouched.
Syntax of withColumn – df.withColumn("column_name", EXPRESSION)
In the example below we discard the timestamp from order_date and keep the column name as order_date. The new Data Frame will still have 4 columns, but order_date will have dates without timestamps.
[gist]ac9470d1c7ba79a25c77d077af846e74[/gist]
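A minimal sketch of this, assuming the sample orders Data Frame from earlier:

import org.apache.spark.sql.functions.{col, substring}

// Overwrite order_date with just the date part; the other columns stay untouched
val ordersWithDate = orders.withColumn("order_date", substring(col("order_date"), 1, 10))
ordersWithDate.printSchema()
ordersWithDate.show()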
Using selectExpr
We can use selectExpr for advanced transformations such as CASE WHEN.
Even though functions available on Data Frames are robust, sometimes we might have to use traditional SQL style approach while applying functions.
We can take care of this using selectExpr. Whatever functions we pass to selectExpr should follow Hive syntax. If we use split, we need to use [] to access elements from the array generated as a result of the split.
We also have expr, which works in a similar fashion when used with withColumn or select.
selectExpr("EXPRESSION") is an alias for select(expr("EXPRESSION"))
[gist]098a77776b55657c1a195c30ea6e09ed[/gist]
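A sketch of selectExpr and expr with Hive syntax, including split with [] and a CASE WHEN (the status values are assumptions about the sample data):

import org.apache.spark.sql.functions.expr

// Hive syntax: split with [] indexing and CASE WHEN inside selectExpr
orders.selectExpr(
  "order_id",
  "split(order_date, ' ')[0] as order_date",
  "case when order_status in ('CLOSED', 'COMPLETE') then 'DONE' else 'OPEN' end as status_flag"
).show()

// expr works in a similar fashion inside select or withColumn
orders.select(expr("split(order_date, ' ')[0] as order_date")).show()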
Date Manipulation Functions
We also have to deal with dates very often. There are several functions which can be leveraged to manipulate dates.
Date arithmetic – date_add, date_sub, datediff, next_day, last_day, months_between, add_months
Getting the first day of the month or year – trunc
Extracting information – dayofmonth, dayofyear, dayofweek, year, month
Formatting date – date_format
Typecasting – we can use cast to convert the values of a particular column back to their original data type (e.g., a date stored as a string can be cast to date).
[gist]b4a4e67a377fb7aad505c121af65733c[/gist]
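An illustrative sketch of some of these date functions (the literal date is just sample data):

import org.apache.spark.sql.functions._

val dates = Seq("2013-07-25").toDF("order_date")

dates.select(
  date_add($"order_date", 30).alias("plus_30_days"),
  date_sub($"order_date", 30).alias("minus_30_days"),
  datediff(current_date(), $"order_date").alias("days_elapsed"),
  add_months($"order_date", 3).alias("plus_3_months"),
  last_day($"order_date").alias("month_end"),
  next_day($"order_date", "Sunday").alias("next_sunday"),
  trunc($"order_date", "MM").alias("month_begin"),
  year($"order_date").alias("order_year"),
  month($"order_date").alias("order_month"),
  dayofmonth($"order_date").alias("day_of_month"),
  date_format($"order_date", "yyyyMM").alias("year_month"),
  $"order_date".cast("date").alias("as_date")
).show()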
Dropping Columns
Now let us see how we can drop columns from Data Frame.
We can use drop to drop one or more columns from a Data Frame.
When we use drop, it creates a new Data Frame without the columns specified as part of drop.
Column names can be passed as strings or as column type. Using column type, we can drop only one column at a time.
[gist]e5ceaa9ac7435c4557b438ddc81da33b[/gist]
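A quick sketch, continuing with the sample orders Data Frame:

// Multiple columns can be dropped when passed as strings
orders.drop("order_customer_id", "order_status").show()

// With column type only one column can be dropped at a time
orders.drop($"order_status").show()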
User Defined Functions – Simple
Let us explore how we can define simple user defined functions and use them as part of Data Frame Operations as well as Spark SQL.
There are simple UDFs as well as aggregate UDFs. For now, we will focus on simple UDFs.
We can assign a function to a variable and convert it into a UDF using org.apache.spark.sql.functions.udf. It returns an object of type org.apache.spark.sql.expressions.UserDefinedFunction.
Once the UDF is created, we can use it as part of any transformation function such as select, where, filter, groupBy etc. using the Data Frame native approach.
However, if we want to use it as part of SQL style syntax, we need to register the UDF as a SQL function. We can do so by using spark.udf.register.
Let us see a demo by creating a UDF that extracts the year from a date string and converts it to an Integer.
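A minimal sketch of such a UDF, assuming spark-shell and the sample orders Data Frame from earlier (where order_date looks like 2013-07-25 00:00:00.0):

import org.apache.spark.sql.functions.udf

// Plain Scala function that extracts the year from a date string and converts it to Int
val getYear = (date: String) => date.substring(0, 4).toInt

// Convert it into a UDF for Data Frame native APIs (select, where, groupBy etc.)
val getYearUDF = udf(getYear)
orders.select(getYearUDF($"order_date").alias("order_year")).show()

// Register it to use with SQL style syntax
spark.udf.register("get_year", getYear)
orders.createOrReplaceTempView("orders")
spark.sql("select get_year(order_date) as order_year, count(1) as order_count from orders group by get_year(order_date)").show()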