As part of this section, we will see how to set up Oozie, Pig, Sqoop and Hue, and review some of the key concepts related to each service.
- Setup Oozie, Pig, Sqoop and Hue
- Review Important Properties
- Schedule an Oozie workflow
- Run Pig Job
- Validate Sqoop
- Overview of Hue
Cluster Topology
We are setting up the cluster on 7+1 nodes: we start with 7 nodes and add one more node later.
- Gateway(s) and Management Service
- bigdataserver-1 – Hue Server
- Masters
- bigdataserver-2
- Zookeeper
- Active/Standby Namenode
- bigdataserver-3
- Zookeeper
- Active/Standby Namenode
- Active/Standby Resource Manager
- Impala State Store
- Oozie Server
- bigdataserver-4
- Zookeeper
- Active/Standby Resource Manager
- Job History Server
- Spark History Server
- Hive Server and Hive Metastore
- Impala Catalog
- Slaves or Worker Nodes
- bigdataserver-5 – Datanode, Node Manager, Impala Daemon
- bigdataserver-6 – Datanode, Node Manager, Impala Daemon
- bigdataserver-7 – Datanode, Node Manager, Impala Daemon
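The topology above can also be written down as a small lookup table, which makes it easy to answer questions such as "which host runs the Oozie Server?". The following is only a minimal sketch: the role names are transcribed from the list above, while the `TOPOLOGY` dictionary and the `hosts_for` helper are hypothetical names introduced for illustration and are not part of Cloudera Manager.

```python
# Roles per host, transcribed from the topology list above.
# TOPOLOGY and hosts_for are illustrative names, not a Cloudera API.
TOPOLOGY = {
    "bigdataserver-1": ["Gateway", "Management Service", "Hue Server"],
    "bigdataserver-2": ["Zookeeper", "Namenode"],
    "bigdataserver-3": [
        "Zookeeper", "Namenode", "Resource Manager",
        "Impala State Store", "Oozie Server",
    ],
    "bigdataserver-4": [
        "Zookeeper", "Resource Manager", "Job History Server",
        "Spark History Server", "Hive Server", "Hive Metastore",
        "Impala Catalog",
    ],
    "bigdataserver-5": ["Datanode", "Node Manager", "Impala Daemon"],
    "bigdataserver-6": ["Datanode", "Node Manager", "Impala Daemon"],
    "bigdataserver-7": ["Datanode", "Node Manager", "Impala Daemon"],
}


def hosts_for(role):
    """Return the sorted list of hosts that run the given role."""
    return sorted(host for host, roles in TOPOLOGY.items() if role in roles)


if __name__ == "__main__":
    for role in ("Oozie Server", "Hue Server", "Zookeeper", "Datanode"):
        print(role, "->", ", ".join(hosts_for(role)))
```

Keeping the layout in one place like this is handy when deciding where a new service should go, for example when the eighth node is added later.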
Learning Process
We will follow the same standard process whenever we add a software-based service to the cluster.
- Downloading and Installing
- Downloading is already taken care of as part of adding hosts to the cluster. We will add all the services to the cluster using Cloudera Manager.
- Configuration – We need to understand the architecture and plan for the configuration.
- Architecture – Oozie has three components: Repository, Server and Client. Pig and Sqoop are client-only tools. Hue has a server that hosts a web application, providing a unified interface to all the high-level tools.
- Components
- Oozie Server is the Master Process
- Repository to store workflow definitions and details
- Configuration Files
- Oozie – /etc/oozie/conf/oozie-default.xml and /etc/oozie/conf/oozie-site.xml
- Log Files
- Oozie – /var/log/oozie/
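To review important properties, it helps to see how the two configuration files relate. Both use the Hadoop-style `<configuration>`/`<property>` XML layout, and values in oozie-site.xml are expected to override the defaults. The sketch below illustrates that merge in Python. The function names are hypothetical, and the sample property values (including the host and port in the URL) are illustrative assumptions rather than values read from this cluster.

```python
import xml.etree.ElementTree as ET


def parse_config(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site values override them."""
    merged = parse_config(default_xml)
    merged.update(parse_config(site_xml))
    return merged


# Illustrative stand-ins for oozie-default.xml and oozie-site.xml.
# The values below are assumptions made up for this example.
SAMPLE_DEFAULT = """<configuration>
  <property>
    <name>oozie.base.url</name>
    <value>http://localhost:11000/oozie</value>
  </property>
  <property>
    <name>oozie.systemmode</name>
    <value>NORMAL</value>
  </property>
</configuration>"""

SAMPLE_SITE = """<configuration>
  <property>
    <name>oozie.base.url</name>
    <value>http://bigdataserver-3:11000/oozie</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for name, value in sorted(effective_config(SAMPLE_DEFAULT, SAMPLE_SITE).items()):
        print(f"{name} = {value}")
```

On a real node, the same functions could be fed the contents of /etc/oozie/conf/oozie-default.xml and /etc/oozie/conf/oozie-site.xml, assuming both files are present and readable by the current user.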