Let us see how we can use Data Frames to get data from Hive and from relational databases over JDBC, and then process that data using the operations available on Data Frames.
Recap of Hive
Reading and Writing Data – Hive tables
Reading and Writing Data – JDBC
Querying data from Hive tables
Hive DDL and DML – Data Frame Operations
Hive DDL and DML – Spark SQL
Hive Functions – Overview
We will be using labs for the demonstration of interacting with Hive. It is straightforward to connect to Hive from Spark when the two are integrated. If you want to explore Hive using Spark on your PC, I would recommend using Cloudera’s QuickStart VM or the Hortonworks Sandbox. Setting up Hive and integrating it with Spark is a bit tedious (especially on Windows).
Recap of Hive
Let us see a quick recap of Hive.
Hive is used to create databases and tables, and to process data in Big Data clusters.
It uses HDFS as its file system. A Hive database, table, or partition is nothing but a directory in HDFS.
Once we create a table, its metadata is stored in the RDBMS configured as the metastore. We have configured our Hive with MySQL.
Hive also has a query engine. A query is typically compiled into a Map Reduce job.
Once we have tables created, we can run queries using different query engines.
Hive
Tez
Impala
Spark SQL
Presto
and more
Hive supports DDL, batch data loads, and inserting query results, in addition to running SQL-style queries.
Let us create a Hive database, create a few tables, and process the data using standard Hive queries before we jump into Spark SQL.
[gist]4ead96242d6fe60b55f39a2a28680d75[/gist]
Reading and Writing Data – Hive tables
Let us see how we can read and write data using Hive Tables leveraging Data Frame Operations.
SparkSession has an API under read to get data directly from Hive tables.
We can use SparkSession’s (spark) read.table to read data from a Hive table.
Data from the Hive table will be loaded into a Data Frame. We can then leverage Data Frame Operations to process the data.
[gist]e1a07964492bf2e6df62cd82a01fec7e[/gist]
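Here is a minimal sketch of this, assuming a Hive table named orders with an order_status column (both names are illustrative):

// Read a Hive table into a Data Frame
val orders = spark.read.table("orders")

// Process it using Data Frame Operations
orders.
  groupBy("order_status").
  count().
  show()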
A processed Data Frame can be saved into a Hive table using multiple APIs available under write on the Data Frame
saveAsTable – creates a new table in Hive with the Parquet file format
insertInto – inserts the Data Frame into an existing table
We can use mode to append to or overwrite the table. The mode can be passed as a string or as an org.apache.spark.sql.SaveMode object.
[gist]0570c73439219954972e7b5970358297[/gist]
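As a hedged sketch, assuming the orders table from the previous example and a database named retail_db (both names are illustrative):

import org.apache.spark.sql.SaveMode

// Derive a Data Frame to save (assumes the orders table used above)
val orderCounts = spark.read.table("orders").groupBy("order_status").count()

// saveAsTable creates a new Hive table (Parquet file format by default)
orderCounts.write.mode(SaveMode.Overwrite).saveAsTable("retail_db.order_count_by_status")

// insertInto writes into an existing table; the mode can also be passed as a string
orderCounts.write.mode("append").insertInto("retail_db.order_count_by_status")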
Reading and Writing Data – JDBC
Let us see how we can read data from a remote database using Spark JDBC.
To connect to a database, we need
JDBC jar file, which can be passed using --packages or --jars or as part of the fat jar file
Server IP or DNS alias of the host on which the database is running
Credentials for authentication and authorization to the database
Database and table names we want to read the data from
We can read data from the remote database into a Data Frame using spark.read.jdbc
Database connectivity information can be passed as options.
We can also use spark.read.format("jdbc") along with options.
Once the data is processed, we can write it back to the remote database using write.jdbc or write.format
We can either append to an existing table or overwrite it, but we will not be able to update or merge the table over JDBC.
We can see the full list of options in the official documentation. Spark JDBC can replace Sqoop to get data from relational databases into HDFS and vice versa.
[gist]578f10203451f69032f76f08872a6918[/gist]
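Here is a minimal sketch, assuming a MySQL database; the host, database, table names, and credentials below are placeholders, and the MySQL JDBC driver jar has to be on the classpath (for example via --packages or --jars):

// Read a table from a remote database over JDBC
val orders = spark.read.
  format("jdbc").
  option("url", "jdbc:mysql://dbhost:3306/retail_db").
  option("driver", "com.mysql.jdbc.Driver").
  option("dbtable", "orders").
  option("user", "retail_user").
  option("password", "********").
  load()

// Write processed data back, appending to an existing table
orders.write.
  format("jdbc").
  option("url", "jdbc:mysql://dbhost:3306/retail_db").
  option("driver", "com.mysql.jdbc.Driver").
  option("dbtable", "orders_backup").
  option("user", "retail_user").
  option("password", "********").
  mode("append").
  save()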
Querying Data from Hive Tables
Apart from reading data from Hive Tables using Data Frame APIs, we can also use spark.sql to read data from Hive Tables as well as to write data to Hive Tables.
spark.sql can be used to issue any valid Hive Command or Query
It will always return a Data Frame
We need to use show to preview the data or collect to print the results
If we issue a complex query, that query will be executed and the results will be returned as a Data Frame.
Once the Data Frame is created, we can use all Data Frame Operations.
Let us run select * from orders using spark.sql and preview the results.
We can run any valid query using spark.sql. We will see more examples on Spark SQL in upcoming sections.
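As a quick sketch, assuming the orders table from the earlier examples:

// Run a query using spark.sql and preview the results
val ordersDF = spark.sql("select * from orders")
ordersDF.show()

// A more complex query is also executed and returned as a Data Frame
spark.sql("select order_status, count(1) as status_count from orders group by order_status").show()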
Hive DDL and DML – Spark
Let us see how we can perform DDL and DML on top of Hive tables using Spark.
We can use Data Frame Operations or Spark SQL to perform Hive DDL as well as DML.
For Data Frame Operations we can use write.saveAsTable or write.insertInto. We can also use mode to overwrite an existing table or append to it. Options can be used to define additional properties while saving into Hive tables.
We can take the queries that were used to create the tables and run them using spark.sql. We need to pass them as strings.
[gist]4ead96242d6fe60b55f39a2a28680d75[/gist]
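Here is a minimal sketch of running Hive DDL and DML through spark.sql; the database, table, and column names are illustrative and assume the orders table from earlier:

// Create a database and a table using Hive DDL via spark.sql
spark.sql("CREATE DATABASE IF NOT EXISTS retail_db_demo")
spark.sql("""CREATE TABLE IF NOT EXISTS retail_db_demo.order_status_count (
  order_status STRING,
  status_count BIGINT
) STORED AS PARQUET""")

// Insert query results into the table (DML)
spark.sql("""INSERT INTO retail_db_demo.order_status_count
  SELECT order_status, count(1) FROM orders GROUP BY order_status""")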
Hive Functions – Overview
Let us go through the details about Hive Functions. For most of the functions we have seen as part of org.apache.spark.sql.functions, we have counterparts in Hive.
We can get the list of functions by using show functions command.
We can get basic usage and syntax of a given function by using describe function command.
As mentioned earlier, we can run these commands using spark.sql
Make sure to use show(false) to get the complete details.
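For example (a hedged sketch, using substring as the function to describe):

// List the available functions; show(false) prevents truncation of the output
spark.sql("show functions").show(300, false)

// Get the basic usage and syntax of a given function
spark.sql("describe function substring").show(false)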
We can categorize the functions into the groups below.
String Manipulation Functions
trim, rtrim, ltrim
lpad, rpad
length
substring, split
Date Manipulation Functions
date_add, date_sub, datediff
date_format
trunc
Type Conversion Functions
cast as
CASE WHEN
and more
Syntax can differ between the functions in the package org.apache.spark.sql.functions and the functions made available via Hive.
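To illustrate the difference, here is a hedged sketch that derives an order month both ways, assuming an orders table with order_id and order_date columns (illustrative names):

import org.apache.spark.sql.functions.{col, substring}

// Using the Hive function through spark.sql
spark.sql("select order_id, substring(order_date, 1, 7) as order_month from orders").show()

// Using the counterpart from org.apache.spark.sql.functions
spark.read.table("orders").
  select(col("order_id"), substring(col("order_date"), 1, 7).alias("order_month")).
  show()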