Hive uses HDFS for storage and relies on separate processing engines, such as Map Reduce, Tez, or Spark running on YARN, to process the data. Because it builds on HDFS and these engines, Hive does not need any daemon processes of its own running on the worker nodes.
- Data storage – HDFS
- Processing Engine – Map Reduce, Tez, or Spark
- Metastore – Stores the structure (schema) of tables. The Metastore is also used by non-Hive query engines, such as:
  - Cloudera’s Impala
  - Hortonworks’s Tez
  - MapR Drill
  - Spark SQL
  - Presto
  - and more
- Metastore Server – to connect to the Metastore Database
- Hive Server – lets external applications connect to Hive and generate reports from the data in Hive tables.
- Query engine – jar files deployed on all the nodes that are added as gateways. The query engine:
  - Generates Java code at runtime using the APIs of the underlying distributed processing engine
  - Compiles the code and builds a jar at runtime for that engine
  - Submits the jar as one or more jobs that process the data on the underlying distributed engine
- For now, we will be focusing on Map Reduce. But we can run with any of the frameworks mentioned earlier.
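The choice of processing engine is controlled by the `hive.execution.engine` property, which accepts `mr`, `tez`, or `spark`. The following is a minimal sketch of how a script could pin a query to Map Reduce. The helper names (`engine_statement`, `build_session_script`) and the `orders` table are hypothetical and used only for illustration.

```python
# Engine names accepted by Hive's hive.execution.engine property.
SUPPORTED_ENGINES = ("mr", "tez", "spark")


def engine_statement(engine):
    """Build the HiveQL SET statement that selects a processing engine."""
    name = engine.strip().lower()
    if name not in SUPPORTED_ENGINES:
        raise ValueError(
            f"unsupported engine {engine!r}; expected one of {SUPPORTED_ENGINES}"
        )
    return f"SET hive.execution.engine={name};"


def build_session_script(engine, query):
    """Hypothetical helper: prefix a query with the engine selection."""
    normalized_query = query.rstrip().rstrip(";") + ";"
    return engine_statement(engine) + "\n" + normalized_query


# 'orders' is a placeholder table name, not one defined in this article.
script = build_session_script("mr", "SELECT count(1) FROM orders")
print(script)
```

The generated text can be saved to a file and run with the Hive CLI or Beeline. Whether `tez` or `spark` actually works depends on those engines being installed and configured on the cluster.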
Configuration Files and Important Properties
Now let us look into important configuration files related to Hive and some important properties.
- Because Hive uses HDFS for storage and can work with different frameworks to process the data, it inherits behavior from core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, etc.
- Hive also has its own properties file, hive-site.xml.
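To make the layering concrete, here is a small sketch that reads Hadoop-style XML configuration documents and merges them, with later files overriding earlier ones. The host names and values are placeholders, and the merge order is a simplification for illustration rather than a description of how Hive actually loads its configuration (for example, it ignores properties marked as final).

```python
import xml.etree.ElementTree as ET


def parse_hadoop_conf(xml_text):
    """Return {name: value} for each <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def merge_confs(*xml_texts):
    """Merge configs in order; later documents override earlier ones."""
    merged = {}
    for text in xml_texts:
        merged.update(parse_hadoop_conf(text))
    return merged


# Placeholder contents standing in for core-site.xml and hive-site.xml.
CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore.example.com:9083</value>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>mr</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>"""

conf = merge_confs(CORE_SITE, HIVE_SITE)
for key in sorted(conf):
    print(f"{key} = {conf[key]}")
```

On a real cluster the same functions could be pointed at the contents of the actual files. Their location varies by distribution, so no path is assumed here.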