Data Frames – Integration with Hive and JDBC

Let us see how we can use Data Frames to get data from Hive and Relational databases using JDBC and then process data using operations available on Data Frames.

  • Recap of Hive
  • Reading and Writing Data – Hive tables
  • Reading and Writing Data – JDBC
  • Querying data from Hive tables
  • Hive DDL and DML – Data Frame Operations
  • Hive DDL and DML – Spark SQL
  • Hive Functions – Overview
We will be using labs for the demonstration of interacting with Hive. It is straightforward to connect to Hive from Spark when both of them are integrated. If you want to explore Hive using Spark on your PC, I would recommend using Cloudera’s QuickStart VM or Hortonworks Sandbox. Setting up Hive and integrating it with Spark yourself is a bit tedious (especially on Windows).

Recap of Hive

Let us see a quick recap of Hive.

  • Hive is used to create databases and tables, and to process data in Big Data clusters.
  • It uses HDFS as the file system. A Hive database, table or partition is nothing but a directory in HDFS.
  • Once we create a table, its metadata is stored in the RDBMS configured as the metastore. Our Hive is configured with MySQL.
  • Hive also has a query engine. A query is typically compiled into a Map Reduce job.
  • Once we have tables created, we can run queries using different query engines.
    • Hive
    • Tez
    • Impala
    • Spark SQL
    • Presto
    • and more
  • Hive supports DDL and batch data loads as well as inserting query results, in addition to running SQL-style queries.
  • Let us create a Hive database, create a few tables and process data using standard Hive queries before we jump into Spark SQL.
[gist]4ead96242d6fe60b55f39a2a28680d75[/gist]

Reading and Writing Data – Hive tables

Let us see how we can read and write data using Hive Tables leveraging Data Frame Operations.

  • SparkSession has an API under read to get data directly from Hive tables.
  • We can use SparkSession’s (spark) read.table to read data from a Hive table.
  • Data from the Hive table will be loaded into a Data Frame. We can then leverage Data Frame Operations to process the data.
[gist]e1a07964492bf2e6df62cd82a01fec7e[/gist]
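
A minimal sketch of reading a Hive table through spark.read.table, in spark-shell style Scala. The database and table names (retail_db.orders, order_status) are hypothetical; substitute the ones available in your environment.

  // assumes Spark is integrated with Hive and retail_db.orders exists
  val orders = spark.read.table("retail_db.orders")

  // regular Data Frame Operations work on the loaded data
  orders.
    groupBy("order_status").
    count().
    show()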
  • A processed Data Frame can be saved into a Hive table using multiple APIs under the Data Frame’s write interface
    • saveAsTable – creates a new table in Hive with the Parquet file format
    • insertInto – inserts the Data Frame into an existing table
  • We can use mode to append to or overwrite the table. The mode can be passed as a string or as an org.apache.spark.sql.SaveMode object.
[gist]0570c73439219954972e7b5970358297[/gist]
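
A sketch of saving a Data Frame into Hive, assuming the hypothetical orders Data Frame and retail_db database from the read sketch above.

  import org.apache.spark.sql.SaveMode

  // saveAsTable creates a new Hive table (Parquet by default)
  orders.write.mode(SaveMode.Overwrite).saveAsTable("retail_db.orders_backup")

  // insertInto writes into an existing table; the mode can also be a string
  orders.write.mode("append").insertInto("retail_db.orders_backup")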

Reading and Writing Data – JDBC

Let us see how we can read data from a remote database using Spark JDBC.

  • To connect to a Database, we need
    • JDBC jar file, which can be passed using --packages or --jars or as part of a fat jar file
    • Server IP or DNS alias on which the database is running
    • Credentials for authentication and authorization to the Database
    • Database and Table Names we want to read the data from.
  • We can read data from the remote database into a Data Frame using spark.read.jdbc
  • Database connectivity information can be passed as options.
  • We can also use spark.read.format("jdbc") along with options.
  • Once data is processed, we can write it back to the remote database using write.jdbc or write.format.
  • We can either append to an existing table or overwrite it, but we will not be able to update or merge tables over JDBC.
  • We can see the list of options in official documentation. Spark JDBC can replace Sqoop to get data from Relational Databases into HDFS and vice-versa.
[gist]578f10203451f69032f76f08872a6918[/gist]
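
A sketch of reading from and writing to a remote MySQL database over JDBC. The host, port, database, credentials, table names and connector version below are placeholders; the matching JDBC jar has to be on the classpath (for example via --packages or --jars).

  // e.g. spark-shell --packages mysql:mysql-connector-java:5.1.48
  val jdbcUrl = "jdbc:mysql://dbserver.example.com:3306/retail_db"

  // read a table from the remote database into a Data Frame
  val orders = spark.read.
    format("jdbc").
    option("url", jdbcUrl).
    option("dbtable", "orders").
    option("user", "retail_user").
    option("password", "<password>").
    load()

  // write processed data back; only append and overwrite are supported
  orders.write.
    format("jdbc").
    option("url", jdbcUrl).
    option("dbtable", "orders_processed").
    option("user", "retail_user").
    option("password", "<password>").
    mode("append").
    save()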

Querying Data from Hive Tables

Apart from reading data from Hive tables using Data Frame APIs, we can also use spark.sql to read data from and write data to Hive tables.

  • spark.sql can be used to issue any valid Hive Command or Query
  • It will always return a Data Frame
  • We need to use show to preview the data or collect to print the results
  • If we issue a complex query, that query will be executed and the results will be returned as a Data Frame.
  • Once the Data Frame is created, we can use all Data Frame Operations.
  • Let us run select * from orders using spark.sql and preview the results.
  • We can run any valid query using spark.sql. We will see more examples on Spark SQL in upcoming sections.
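
A sketch of querying a Hive table with spark.sql, again assuming the hypothetical retail_db.orders table.

  // spark.sql always returns a Data Frame
  val ordersDF = spark.sql("select * from retail_db.orders")
  ordersDF.show()

  // a complex query is executed and the results come back as a Data Frame
  spark.sql("select order_status, count(1) as order_count " +
    "from retail_db.orders group by order_status").
    show()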

Hive DDL and DML – Spark

Let us see how we can perform DDL and DML on top of Hive tables using Spark.

  • We can use Data Frame Operations or Spark SQL to perform Hive DDL as well as DML.
  • For Data Frame Operations, we can use write.saveAsTable or write.insertInto. We can also use mode to overwrite an existing table or append to it. Options can be used to define additional properties while saving into Hive tables.
  • We can take the queries used earlier to create the tables and run them using spark.sql, passing each query as a string.
[gist]4ead96242d6fe60b55f39a2a28680d75[/gist]
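
A sketch of Hive DDL and DML issued through spark.sql; the database, table and column names are hypothetical.

  // DDL – create a database and a table
  spark.sql("create database if not exists retail_db_txt")
  spark.sql("""create table if not exists retail_db_txt.orders (
      order_id int,
      order_date string,
      order_customer_id int,
      order_status string
    ) row format delimited fields terminated by ','""")

  // DML – insert query results into the new table
  spark.sql("insert into table retail_db_txt.orders select * from retail_db.orders")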

Hive Functions – Overview

Let us go through the details about Hive functions. Most of the functions we have seen as part of org.apache.spark.sql.functions have counterparts in Hive.

  • We can get the list of functions by using show functions command.
  • We can get basic usage and syntax of a given function by using describe function command.
  • As mentioned earlier, we can run these commands via spark.sql (see the sketch at the end of this section).
  • Make sure to use show(false) so that the output is not truncated.
  • We can categorize the functions into the below groups.
    • String Manipulation Functions
      • trim, rtrim, ltrim
      • lpad, rpad
      • length
      • substring, split
    • Date Manipulation Functions
      • date_add, date_sub, datediff
      • date_format
      • trunc
    • Type Conversion Functions
      • cast as
    • CASE WHEN
    • and more
  • Syntax can be different between functions in the package org.apache.spark.sql.functions and functions made available via Hive.
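
A sketch of exploring Hive functions from spark.sql; substring is used only as an example function name.

  // list the available functions without truncating the output
  spark.sql("show functions").show(300, false)

  // basic usage and syntax of a given function
  spark.sql("describe function substring").show(false)

  // extended details, including examples where available
  spark.sql("describe function extended substring").show(false)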