As part of this section, we will see how to set up Hive in the Cloudera distribution. We will also cover important concepts related to Hive.
- Setup Hive and Impala
- Validating Hive and Impala
- Components and Properties of Hive
- Troubleshooting Hive issues
- Hive Commands and Queries – Overview
- Different Query Engines
- Components and Properties of Impala
- Running Queries using Impala – Overview
Hive QL is an abstraction over MapReduce, like Pig, but its interface is much closer to SQL. Impala is also a query engine, but instead of MapReduce it takes a different approach to processing data when running queries. Using Hive, we can accelerate application development in the Hadoop ecosystem by writing queries that generate MapReduce code to process the data.
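To make the idea of a query turning into MapReduce work more concrete, here is a minimal Python sketch of how a grouped count could be split into map, shuffle, and reduce phases. It only illustrates the concept and is not the code Hive actually generates; the query in the comment, the `orders` data, and the `status` column are hypothetical.

```python
from collections import defaultdict

# Hypothetical HiveQL that this sketch mimics:
#   SELECT status, COUNT(*) FROM orders GROUP BY status;


def map_phase(rows):
    """Emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["status"], 1


def shuffle_phase(pairs):
    """Group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce the final counts."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_status(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


# Hypothetical sample data standing in for a table.
orders = [
    {"order_id": 1, "status": "COMPLETE"},
    {"order_id": 2, "status": "PENDING"},
    {"order_id": 3, "status": "COMPLETE"},
]

print(count_by_status(orders))
```

With Hive, the developer writes only the one-line query; the phases sketched here are what the engine takes care of on the cluster.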
Cluster Topology
We are setting up the cluster on 7+1 nodes: we start with 7 nodes and add one more node later.
- Gateway(s) and Management Service
- bigdataserver-1
- Masters
- bigdataserver-2
- Zookeeper
- Active/Standby Namenode
- bigdataserver-3
- Zookeeper
- Active/Standby Namenode
- Active/Standby Resource Manager
- Impala State Store
- bigdataserver-4
- Zookeeper
- Active/Standby Resource Manager
- Job History Server
- Spark History Server
- Hive Server and Hive Metastore
- Impala Catalog
- bigdataserver-2
- Slaves or Worker Nodes
- bigdataserver-5 – Datanode, Node Manager, Impala Daemon
- bigdataserver-6 – Datanode, Node Manager, Impala Daemon
- bigdataserver-7 – Datanode, Node Manager, Impala Daemon
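The topology above can also be written down as a small data structure, which makes it easy to answer questions such as "which hosts run Zookeeper?". This is a minimal sketch, assuming the list is read as host names followed by their roles; the `TOPOLOGY` dictionary and the `hosts_with_role` helper are illustrative names, not part of any Cloudera tool.

```python
# Role assignments copied from the topology list above (our reading of its nesting).
TOPOLOGY = {
    "bigdataserver-1": ["Gateway", "Management Service"],
    "bigdataserver-2": ["Zookeeper", "Namenode"],
    "bigdataserver-3": [
        "Zookeeper",
        "Namenode",
        "Resource Manager",
        "Impala State Store",
    ],
    "bigdataserver-4": [
        "Zookeeper",
        "Resource Manager",
        "Job History Server",
        "Spark History Server",
        "Hive Server",
        "Hive Metastore",
        "Impala Catalog",
    ],
    "bigdataserver-5": ["Datanode", "Node Manager", "Impala Daemon"],
    "bigdataserver-6": ["Datanode", "Node Manager", "Impala Daemon"],
    "bigdataserver-7": ["Datanode", "Node Manager", "Impala Daemon"],
}


def hosts_with_role(role, topology=TOPOLOGY):
    """Return the sorted list of hosts that carry the given role."""
    return sorted(host for host, roles in topology.items() if role in roles)


print(hosts_with_role("Zookeeper"))
print(hosts_with_role("Impala Daemon"))
```

A lookup like this is a quick way to confirm, for example, that the Hive and Impala master roles sit on master nodes while every worker node carries an Impala Daemon.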
Learning Process
We need to follow the standard process while setting up any software.
- Downloading and Installing – already taken care of as part of adding hosts to the cluster.
- Configuration – we need to understand architecture and plan for the configuration.
- Hive has three components – DDL or Physical Modeling, Copying data (Load and Insert) and Querying data (Hive QL).
- Configure the Hive Metastore server – set up a database or use an existing one (the MySQL database was already created in the previous step).
- Configuration Files – hive-site.xml and .hiverc
- Impala has both master and slave components, as it does not use Map Reduce. Impala State Store and Impala Catalog Server are the master components, while the Impala Daemons (impalad) are the slaves.
- Service logs
- Wherever Hive is running, it creates a hive.log file under /tmp/<username>, where <username> is the user running Hive
- Service log files are saved under /var/log/hive
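The configuration file and log locations mentioned above can be explored with a short script. The sketch below assumes that hive-site.xml follows the usual Hadoop layout of `<property>` elements with `<name>` and `<value>` children; the sample XML, its host name and port, and the helper names `read_hadoop_style_config` and `client_log_path` are illustrative assumptions, not values taken from a real cluster.

```python
import posixpath
import xml.etree.ElementTree as ET


def read_hadoop_style_config(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style XML document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def client_log_path(username, tmp_dir="/tmp"):
    """Build the hive.log location described above: <tmp_dir>/<username>/hive.log."""
    return posixpath.join(tmp_dir, username, "hive.log")


# Hypothetical hive-site.xml content; the host and port are assumptions.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://bigdataserver-4:9083</value>
  </property>
</configuration>
"""

props = read_hadoop_style_config(SAMPLE_HIVE_SITE)
print(props["hive.metastore.uris"])
print(client_log_path("hive"))
```

On a real node, the text of the actual hive-site.xml would be passed in instead of the sample string, and the user name would be that of whoever launched Hive.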