Configure Hadoop Ecosystem components – Oozie, Pig, Sqoop and Hue

As part of this section, we will see how to set up Oozie, Pig, Sqoop and Hue, along with some of the key concepts related to each service.

  • Setup Oozie, Pig, Sqoop and Hue
  • Review Important Properties
  • Schedule an Oozie workflow
  • Run Pig Job
  • Validate Sqoop
  • Overview of Hue

Cluster Topology

We are setting up the cluster on 7+1 nodes – we start with 7 nodes and will add one more node later.

  • Gateway(s) and Management Service
    • bigdataserver-1 – Hue Server
  • Masters
    • bigdataserver-2
      • Zookeeper
      • Active/Standby Namenode
    • bigdataserver-3
      • Zookeeper
      • Active/Standby Namenode
      • Active/Standby Resource Manager
      • Impala State Store
      • Oozie Server
    • bigdataserver-4
      • Zookeeper
      • Active/Standby Resource Manager
      • Job History Server
      • Spark History Server
      • Hive Server and Hive Metastore
      • Impala Catalog
  • Slaves or Worker Nodes
    • bigdataserver-5 – Datanode, Node Manager, Impala Daemon
    • bigdataserver-6 – Datanode, Node Manager, Impala Daemon
    • bigdataserver-7 – Datanode, Node Manager, Impala Daemon

Learning Process

We will follow the same standard learning process that we use while adding any service.

  • Downloading and Installing
    • Downloading is already taken care of as part of adding hosts to the cluster. We will add all the services to the cluster using Cloudera Manager.
  • Configuration – We need to understand architecture and plan for the configuration.
    • Architecture – Oozie has three components – Repository, Server and Client. Pig and Sqoop are client-only tools. Hue has a server hosting a web application that provides a unified platform for all the high-level tools.
    • Components
      • Oozie Server is the Master Process
      • Repository to store workflow definitions and details
    • Configuration Files
      • Oozie – /etc/oozie/conf/oozie-default.xml and /etc/oozie/conf/oozie-site.xml
    • Log Files
      • Oozie – /var/log/oozie/
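
A quick way to inspect these locations, as a minimal sketch assuming the default CDH paths listed above:

    # Run on bigdataserver-3, where the Oozie Server will be running
    ls /etc/oozie/conf/                      # configuration files
    sudo tail -f /var/log/oozie/oozie.log    # server log (file name assumes the default oozie.log)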

Setup Oozie, Pig, Sqoop and Hue

Let us quickly set up all 4 tools using Cloudera Manager in our cluster.

Setup Oozie

First, let us go ahead and set up Oozie in our cluster.

  • Go to the host on which the Oozie Server is going to run (in this case, bigdataserver-3)
  • Install the MySQL JDBC connector – sudo yum install mysql-connector-java
  • Create a database named oozie in MySQL (already taken care of while creating the databases initially; a minimal sketch is included after these steps in case it is still needed).
  • Go to the Cloudera Manager Dashboard
  • Click on Add Service in drop down of the cluster
  • Choose Oozie
  • We will be using bigdataserver-3 as Oozie Server.
  • Provide the database server, database name, username and password details.
    • Data Base Server – bigdataserver-1.c.<Project-Name>
    • Database Name – oozie
    • Database Username – oozie
    • Password – ******
  • Use the test connection to verify the connection (optional step)
  • And then click on Continue.
  • Review properties (Oozie Server Data Directory and ShareLib Root Directory) and complete the setup process.
  • Oozie also has a Web UI. It depends on an external JavaScript library called Ext JS.
    • Download the Ext JS zip file from the Cloudera Archive on the server where the Oozie Server is running (bigdataserver-3).
    • Unzip the zip file
    • Copy to /var/lib/oozie
    • Change the ownership to oozie on the entire directory recursively
[gist]559bba25c479e64bdefbb8111c2abb44[/gist]
  • Once Ext JS is set up, we can enable the Oozie Web Console from Cloudera Manager and restart the server.
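
In case the oozie database still needs to be created, here is a minimal sketch of the MySQL commands; the grant scope and the password are placeholders and should match what you enter in the wizard:

    # Run on the MySQL host (bigdataserver-1); 'oozie_password' is a placeholder
    mysql -u root -p -e "CREATE DATABASE oozie DEFAULT CHARACTER SET utf8;
    GRANT ALL ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie_password';
    FLUSH PRIVILEGES;"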

Setup Pig and Sqoop

It is very straightforward to set up Pig and Sqoop in our cluster. Both use HDFS as the file system and Map Reduce as the processing engine. There are no server components for either of them.

  • Pig is automatically available on all the nodes in the cluster.
  • Sqoop 1 can be setup using “Add Service” Option and we need to configure gateway nodes only for Sqoop 1.
  • You need to make sure that the JDBC jar file is available on the gateway nodes so that Sqoop commands can connect to remote databases over JDBC (a minimal sketch is included after this list).
  • Sqoop 2 was intended to be better than Sqoop 1; however, it has been deprecated. It is not extensively used and hence you can ignore it.
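
Here is a hedged sketch of staging the MySQL JDBC driver on a gateway node; the target directory /var/lib/sqoop is the typical CDH location, but verify it for your version:

    # On each gateway node where Sqoop commands will be run
    sudo yum install -y mysql-connector-java
    # Link the driver where Sqoop picks it up (directory is an assumption based on typical CDH layouts)
    sudo ln -sf /usr/share/java/mysql-connector-java.jar /var/lib/sqoop/mysql-connector-java.jar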

Setup Hue

Hue is not a typical Big Data tool. It provides a web interface for Big Data tools such as Hive, Sqoop, Oozie, Spark etc. It is primarily used by the non-admin staff of an organization, such as developers and data scientists.

  • Go to the Cloudera Manager Dashboard
  • Make sure Hive is already installed
  • Click on Add Service in drop down of the cluster
  • Choose Hue
  • We will be using bigdataserver-1 as Hue Server.
  • Since we are installing only one instance of Hue Server, you can ignore the Load balancer for now.
  • Provide the database server, database name, username and password details.
    • Data Base Server – bigdataserver-1.c.<Project-Name>
    • Database Name – hue
    • Database Username – hue
    • Password – *****
  • Review properties and complete the setup process.

Review Important Properties

Let us review property files as well as important properties for all 4 services – Oozie, Pig, Sqoop and Hue.

  • Property Files – Standard Locations
    • Oozie – /etc/oozie/conf
    • Sqoop – /etc/sqoop/conf
    • Pig – /etc/pig/conf
    • Hue – /etc/hue/conf
    • However, with Cloudera Manager these locations contain only templates; the actual runtime property files are under /var/run/cloudera-scm-agent/process (a way to locate them is sketched after this list).
    • With respect to Oozie, you will see property files only on bigdataserver-3, where the Oozie Server is running.
  • For Pig, you can review the properties by going to the standard location. You will see properties in standard key-value pair format (not XML).
  • Review these properties from the Cloudera Manager UI.
    • oozie.base.url – http://<hostname>:11000/oozie
    • oozie.service.JPAService.jdbc.username – Username to the repository database
    • oozie.service.JPAService.jdbc.password – Password to the repository database.
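
To locate the runtime property files mentioned above, here is a minimal sketch; the process directory naming follows typical Cloudera Manager conventions and may differ across versions:

    # Run on bigdataserver-3; pick the most recent Oozie Server process directory
    OOZIE_CONF=$(sudo bash -c 'ls -dt /var/run/cloudera-scm-agent/process/*OOZIE_SERVER*' | head -1)
    sudo grep -A1 'oozie.base.url' "$OOZIE_CONF/oozie-site.xml"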

Run Sample Oozie Job

Here we will see how to run the default Map Reduce job using Oozie.

  • We can check the status of the Oozie Server by running this command – oozie admin -oozie http://bigdataserver-3:11000/oozie -status
  • Oozie has several sub-commands for different purposes – job, admin, etc.
  • Create directory oozie_demo under the home directory – /home/itversity
  • Copy the Oozie examples provided by Cloudera to oozie_demo under the home directory – /home/itversity/oozie_demo
  • Untar the examples tar file to get the sample Oozie job files.
  • Update the job.properties file with the Name Node and Resource Manager values along with examplesRoot (a sample sketch follows this list).
    • Get the nameNode URL from /etc/hadoop/conf/core-site.xml – copy the value of the property fs.defaultFS
    • Get the jobTracker URL from /etc/hadoop/conf/yarn-site.xml – copy the value of the property yarn.resourcemanager.address
    • Update job.properties – location /home/itversity/oozie_demo/examples/apps/map-reduce
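
The preparation steps and the job.properties update can be sketched as follows. The examples tarball path, the HA nameservice and the Resource Manager port below are assumptions based on typical CDH defaults – substitute the actual values copied from core-site.xml and yarn-site.xml:

    # Stage Cloudera's Oozie examples (tarball path may vary by CDH version)
    mkdir -p /home/itversity/oozie_demo
    cp /usr/share/doc/oozie-*/oozie-examples.tar.gz /home/itversity/oozie_demo/
    cd /home/itversity/oozie_demo && tar -xzf oozie-examples.tar.gz

Sample job.properties for examples/apps/map-reduce (all values are placeholders for illustration):

    nameNode=hdfs://nameservice1
    jobTracker=bigdataserver-3:8032
    queueName=default
    examplesRoot=oozie_demo/examples
    oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce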

Note: Make sure the user has an HDFS home directory (/user/<user-name>) before proceeding to the next steps.

  • Copy the oozie_demo directory (with the examples) from /home/itversity to the user's HDFS home – hadoop fs -put oozie_demo /user/itversity
  • Run the Oozie job and get the job id
  • Using the job id, we can get the job status
  • If the status shows SUCCEEDED, the job ran successfully. If not, you can troubleshoot the underlying map reduce job.
  • Validate the output data in the directory defined in the workflow.xml with the property mapred.output.dir
[gist]4f5f068023fc432cfcc8df97874cc678[/gist]
  • Now let us understand what happens when an Oozie job is submitted (a troubleshooting sketch follows this list).
    • One or more map reduce jobs will be created to run the Oozie Workflow
    • On top of the map reduce jobs that run the Oozie Workflow itself, we will also see map reduce jobs for the actions submitted.
  • To troubleshoot any issue, we need to look at both the map reduce jobs associated with the Oozie Workflow as well as those of its Actions.
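
For troubleshooting, here is a hedged sketch of how to inspect the workflow and correlate it with YARN applications; the job id is a placeholder for the id returned when the job was submitted:

    # Inspect the workflow, its actions and their statuses
    oozie job -oozie http://bigdataserver-3:11000/oozie -info <job-id>
    # List YARN applications; Oozie launcher jobs typically carry "oozie:launcher" in their names
    yarn application -list -appStates ALL | grep -i oozie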

Run Pig Job

Let us see how we can validate Pig job on our cluster. Pig uses HDFS for File System and Map Reduce to process the data.

  • Ensure you have data to validate (in our case, we have data in the local file system under /home/itversity/data).
  • Let us copy the data to HDFS (a sketch follows this list).
    • Create directory /user/itversity/data
    • Copy the whole local directory /home/itversity/data/retail_db to the HDFS location /user/itversity/data
  • Let us process the data using a sample Pig script.
  • Create a Pig script (named order_count_by_status.pig)
[gist]15f22e7a45e70cff8f199cd631cc8d15[/gist]
  • Run the pig script – pig order_count_by_status.pig
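
A minimal sketch of the copy step described above, using the same paths:

    # Copy the local retail_db data set to HDFS
    hadoop fs -mkdir -p /user/itversity/data
    hadoop fs -put /home/itversity/data/retail_db /user/itversity/data
    hadoop fs -ls /user/itversity/data/retail_db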

Validate Sqoop

Let us validate Sqoop by running a sample job. We typically use Sqoop to get data from remote RDBMS databases into HDFS and vice versa.

  • Sqoop uses JDBC to connect to remote databases. Hence, on the gateway node where we typically run Sqoop import/export commands, we need to have the JDBC jar file.
  • The JDBC jar depends on the database technology used (e.g., mysql-connector-java.jar for MySQL and an ojdbc jar for Oracle)
  • We can validate Database Connectivity using Sqoop commands such as list-databases and list-tables.
[gist]5d5f5d77bb68549ae117b69461a17437[/gist]
  • We can validate whether we are able to perform an import using the sqoop import command
[gist]529f1f034e5c0530bb11b576d23700a5[/gist]

Overview of Hue

Now that Hue is set up successfully, let us see an overview of Hue. We will set up an admin account, then add a user and run a Hive query using the Hue web interface.

Setup admin account

  • Once the Hue installation is completed, you can open the Hue UI at <installed-host-public-ip>:8888
  • Since we have not opened the firewall for port 8888 initially, go to Firewall rules by clicking on more options, click on “View Network Details”, and add the Hue port to the web ports rule that we created previously (a command-line alternative is sketched after this list).
  • When you open Hue for the first time, you will see the following message.
    • Since this is your first time logging in, pick any username and password. Be sure to remember these, as they will become your Hue superuser credentials.
    • Enter a username and password – this will be the administrator account for Hue.
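
If you prefer the command line over the console, here is a hedged sketch using gcloud; the rule name web-ports and the existing port list are assumptions based on the rule created earlier in this series:

    # --allow replaces the full allow list, so include the previously opened ports as well
    gcloud compute firewall-rules update web-ports --allow tcp:80,tcp:8888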

Setup user account

  • Add hue user
    • Login into Hue with the admin account
    • Click on the username on the right top and click on “Manage Users”
    • We can Manage both Users as well as Groups from this Page
    • First, let us create a Group and assign a few permissions related to Hive
    • Click on add user
    • Enter the username and password.
    • Select “Create Home Directory” (this will create a home directory for the user in HDFS)
    • Other details like first name, last name etc. are optional.
    • Typically, we assign Users to Groups. By default, Hue will try to add users to the default group. If we want to add a user to a different group, then you need to choose an appropriate group. All the permissions mapped to the Group will be assigned to the User.
    • Select “Super User” only if the new user needs admin permissions. Make sure not to give Super User access to non-admin accounts.
    • And click on create.

Run Hive Query using Hue

Let us also see how we can run Hive queries using Hue.

  • Hive is already set up in our cluster
  • We have created a retail database and then orders as well as order_items tables as part of setting up of Hive and Impala.
  • As we already have tables, we can run sample queries in Hive to validate that non-admin staff will be able to use Hue and interact with Big Data tools without struggling with the command line interface.
  • Sample Query: SELECT order_status, count(1) FROM orders GROUP BY order_status;
