Apache Hive Interview Questions: Mastering Data Warehousing on Hadoop
This comprehensive guide prepares you for Apache Hive interviews, covering topics from fundamental concepts to advanced techniques. It explores Hive's architecture, data types, query optimization strategies (partitioning, bucketing, indexing), and HiveQL, and provides detailed answers to frequently asked questions on managed vs. external tables, the Hive metastore, operating modes, data storage locations, collection data types, and common tasks such as renaming tables and managing metadata.
Top Apache Hive Interview Questions and Answers
What is Apache Hive?
Apache Hive is a data warehouse system built on top of Hadoop. It allows you to query and analyze large datasets stored in the Hadoop Distributed File System (HDFS) using SQL-like queries (HiveQL). Hive translates these queries into MapReduce (or, on newer versions, Tez or Spark) jobs, making it easy to work with big data without writing complex MapReduce code. It is open source and supports data definition (DDL), data manipulation (DML), and user-defined functions (UDFs).
When to Use Hive
Hive is well-suited for:
- Building data warehouse applications.
- Working with static (unchanging) data.
- Managing and querying very large datasets.
- Using SQL-like queries instead of writing MapReduce code.
- Applications with high latency tolerance.
Hive Client Application Support
Hive supports client applications written in Java, PHP, Python, C, Ruby, and other languages that can communicate with the Hive server.
Renaming a Hive Table
Use the following command to rename a table:
SQL Syntax
ALTER TABLE table_name RENAME TO new_table_name;
Types of Hive Tables
Hive offers two main table types:
- Managed tables: Hive manages both the data and schema.
- External tables: Hive manages the schema but not the data (data is stored externally).
Managed vs. External Tables in Hive
Managed Table | External Table |
---|---|
Hive manages both the data and the schema. Data is deleted when the table is dropped. | Hive manages only the schema; the data resides externally. Data is not deleted when the table is dropped. |
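For illustration, here is a minimal sketch of both table types; the table names (sales_managed, sales_ext), columns, and the HDFS path are hypothetical:

-- Managed table: Hive owns both the schema and the data under the warehouse directory
CREATE TABLE sales_managed (id INT, amount DOUBLE)
STORED AS ORC;

-- External table: Hive owns only the schema; the data stays at the given HDFS path
CREATE EXTERNAL TABLE sales_ext (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales';

-- DROP TABLE sales_managed;  -- removes metadata and data
-- DROP TABLE sales_ext;      -- removes metadata only; the files under /data/sales remain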
Hive Operating Modes
Hive operates in two primary modes depending on the Hadoop cluster's size and configuration:
- Local mode: Used for smaller datasets and testing, runs on a single machine.
- MapReduce mode: Used for larger datasets processed in parallel across a Hadoop cluster.
When to Use MapReduce Mode in Hive
Use MapReduce mode when:
- Processing large datasets.
- Working with a distributed Hadoop cluster.
- Requiring parallel query execution for better performance.
Hive and OLTP Systems
Hive is not suitable for OLTP (Online Transaction Processing) systems because it doesn't support row-level inserts or updates efficiently. It's optimized for analytical queries on large datasets (OLAP).
Hive Architecture Components
- User Interface (CLI, Beeline, etc.): How users interact with Hive.
- Driver: Manages the query execution process.
- Compiler: Translates HiveQL into executable plans.
- Optimizer: Improves the efficiency of the execution plan.
- Execution Engine (MapReduce, Tez, Spark): Executes the query plan.
- Metastore: Stores metadata about tables, databases, partitions, etc.
Hive Table Data Storage Location
By default, Hive table data is stored under /user/hive/warehouse in HDFS. This location can be customized via the hive.metastore.warehouse.dir configuration property.
Main Components of Hive
- Hive Clients: Tools for interacting with Hive (CLI, JDBC).
- Hive Services: The Hive server and related services.
- Hive Storage and Computing: HDFS and the execution engine (MapReduce, Tez, Spark).
Changing the Location of a Managed Table
You can set a custom location for a managed table with the LOCATION clause when creating it, or point an existing table at a new path with ALTER TABLE ... SET LOCATION, as sketched below.
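A minimal sketch (the table name and paths are hypothetical); note that SET LOCATION changes only the metadata and does not move existing files:

-- Set a custom location when creating the table
CREATE TABLE orders_managed (id INT, amount DOUBLE)
LOCATION '/custom/warehouse/orders_managed';

-- Point an existing table at a new path (usually given as a fully qualified HDFS URI)
ALTER TABLE orders_managed SET LOCATION 'hdfs://namenode:8020/custom/warehouse/orders_managed_v2';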
Hive Metastore
The Hive metastore is a database (typically Derby or MySQL) that stores metadata about Hive tables, partitions, and other objects.
Local vs. Remote Metastores
Local Metastore | Remote Metastore |
---|---|
Runs in the same JVM as the Hive server. | Runs in a separate JVM. |
Databases Supported by Hive
- Derby (for single-user mode).
- MySQL (for multi-user mode).
Metastore Sharing in Hive
An embedded (Derby-backed) metastore allows only one active session at a time, so it cannot be shared across multiple users. To share a metastore, configure a local or remote metastore backed by a database such as MySQL.
Why Hive Doesn't Store Metadata in HDFS
Hive uses a relational database for the metastore (like MySQL or Derby) to improve performance compared to storing metadata in HDFS. HDFS is optimized for large files and is less efficient for frequent read/write operations required for metadata.
Hadoop Modes in Which Hive Runs
Beyond its own local and MapReduce execution modes, Hive can run on Hadoop deployed in any of these modes:
- Local (standalone) mode
- Pseudo-distributed mode
- Fully distributed mode
Partitions in Hive
Partitions divide a table into smaller, manageable subsets based on a column value. This improves query performance by reducing the amount of data that needs to be scanned.
Benefits of Partitioning
Partitioning improves query performance by enabling Hive to scan only the relevant partitions, rather than the entire table.
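As a sketch (table and column names are hypothetical), each partition is stored in its own HDFS subdirectory, and a filter on the partition column restricts the scan to the matching directories:

CREATE TABLE page_views (user_id INT, url STRING)
PARTITIONED BY (view_date STRING);

-- Load data into a specific (static) partition
LOAD DATA INPATH '/staging/views_2024_01_01'
INTO TABLE page_views PARTITION (view_date = '2024-01-01');

-- Only the view_date='2024-01-01' partition is scanned
SELECT COUNT(*) FROM page_views WHERE view_date = '2024-01-01';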
Dynamic Partitioning in Hive
Dynamic partitioning automatically creates partitions during data loading. This is useful when you don't know all partition values in advance.
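A minimal sketch, reusing the hypothetical page_views table above and a hypothetical raw_views source table; the partition values are taken from the data itself once dynamic partitioning is enabled:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (view_date) comes from the last column of the SELECT,
-- so a partition is created automatically for each distinct date
INSERT OVERWRITE TABLE page_views PARTITION (view_date)
SELECT user_id, url, view_date FROM raw_views;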
Hive Collection Data Types
- ARRAY
- MAP
- STRUCT
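For example, all three collection types can appear in one table definition (a sketch with illustrative names):

CREATE TABLE employees (
  name          STRING,
  skills        ARRAY<STRING>,
  phone_numbers MAP<STRING, STRING>,
  address       STRUCT<city: STRING, zip: STRING>
);

-- Accessing collection elements
SELECT
  skills[0],              -- first element of the ARRAY
  phone_numbers['home'],  -- value for the 'home' key in the MAP
  address.city            -- field of the STRUCT
FROM employees;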
`SORT BY` vs. `ORDER BY` in Hive
`SORT BY` sorts the data within each reducer (a partial, per-reducer ordering), while `ORDER BY` guarantees a total ordering and therefore requires a single reducer. This makes `SORT BY` more efficient for very large datasets.
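A sketch against the hypothetical page_views table used earlier:

-- Total ordering of the full result: forces a single reducer
SELECT user_id, url FROM page_views ORDER BY user_id;

-- Each reducer sorts only its own portion of the data, so the work is parallelized
SELECT user_id, url FROM page_views SORT BY user_id;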
Hive TIMESTAMP Data Type
The `TIMESTAMP` data type stores data in the java.sql.Timestamp format.
Hive Variables
Hive variables allow you to define and use variables within your HiveQL scripts.
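A small sketch (the variable name and value are hypothetical); variables can be set with the SET command or the --hivevar command-line option and referenced with ${hivevar:name} substitution:

SET hivevar:run_date=2024-01-01;

SELECT * FROM page_views WHERE view_date = '${hivevar:run_date}';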
Executing Unix Shell Commands from Hive
You can run shell commands from the Hive CLI using the `!` prefix (e.g., `!ls`).
Executing Hive Queries from a Script
Use the `SOURCE` command to execute a HiveQL script from a file (e.g., `SOURCE /path/to/script.hql`).
Deleting DBPROPERTIES in Hive
There is no direct way to delete an individual key from DBPROPERTIES in Hive; ALTER DATABASE ... SET DBPROPERTIES can only add or overwrite properties.
`.hiverc` File
The `.hiverc` file contains commands that are automatically executed when you start Hive.
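A hedged example of typical .hiverc contents (the JAR path is hypothetical):

-- Executed automatically at CLI startup
SET hive.cli.print.header=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
ADD JAR /opt/hive/udfs/custom-udfs.jar;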
Schema on Read
Schema on read means schema enforcement happens only during data retrieval, not during data writing. This allows for more flexible data storage.
Checking for Partition Existence
Use `SHOW PARTITIONS table_name PARTITION (partition_column='value')` to check whether a partition exists.
Bucketing in Hive
Bucketing divides data into buckets based on the hash value of a column. This further optimizes queries, particularly for joins.
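A sketch with hypothetical table names; CLUSTERED BY defines the bucketing column and bucket count, and on older Hive releases hive.enforce.bucketing must be enabled so inserts honor it:

SET hive.enforce.bucketing = true;  -- implicit in newer Hive versions

CREATE TABLE users_bucketed (user_id INT, name STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS;

INSERT OVERWRITE TABLE users_bucketed
SELECT user_id, name FROM users_raw;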
Listing Databases Starting with 'C'
Use `SHOW DATABASES LIKE 'C%'`.
Hive Bucketing Formula
Hive uses the formula `hash(column) % num_buckets` to assign rows to buckets.
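For example, assuming an INT bucketing column (whose hash is, in effect, the value itself) and 8 buckets, a row with value 21 is assigned to bucket 21 % 8 = 5, while a value of 16 lands in bucket 0.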
Indexing in Hive
Indexing speeds up query performance by creating indexes on specific columns so that Hive can avoid scanning every row. Note that built-in indexes were removed in Hive 3.0 in favor of materialized views and columnar formats (such as ORC) that carry their own indexes.
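A hedged sketch of the pre-Hive-3.0 index syntax (the table, column, and index names are hypothetical):

-- COMPACT index on the user_id column; DEFERRED REBUILD postpones populating it
CREATE INDEX idx_users_user_id
ON TABLE users (user_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate (or refresh) the index
ALTER INDEX idx_users_user_id ON users REBUILD;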
Java Class for Input Record Encoding
The org.apache.hadoop.mapred.TextInputFormat class handles input record encoding.
Java Class for Output Record Encoding
The org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat class handles the encoding of output records written to the files that store Hive table data.
HCatalog in Hive
HCatalog (now part of Hive) provides a way to share Hive metadata and data with other Hadoop tools. It allows external systems to access Hive's data warehouse without needing to directly interact with Hive.
Hive vs. HBase: Key Differences
Feature | Hive | HBase |
---|---|---|
Type | Data warehouse system; query engine. | NoSQL database; key-value store. |
Query Language | SQL-like (HiveQL). | Doesn't use SQL; uses its own API. |
Data Processing | Batch processing. | Real-time processing. |
Data Model | Schema-based (schema defined before data is stored). | Schema-less (schema is flexible). |
Latency | Higher latency. | Lower latency. |
Use Cases | OLAP (Online Analytical Processing). | OLTP (Online Transaction Processing). |
Hive Variables
Hive variables store and reuse values within HiveQL scripts, which improves readability and maintainability. They are typically set with the SET command or the --hivevar / --hiveconf command-line options and referenced using ${hivevar:name} / ${hiveconf:name} substitution.
ObjectInspector in Hive
The `ObjectInspector` in Hive is used to inspect the structure of data, allowing Hive to understand complex data types and handle them appropriately. It is crucial for handling the various data formats and structures Hive supports.
UDFs (User-Defined Functions) in Hive
UDFs are custom functions written in Java (or other languages) that extend Hive's functionality. They enable you to perform operations not directly supported by Hive's built-in functions.
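A hedged sketch of registering and calling a UDF from HiveQL; the JAR path, Java class name (com.example.hive.udf.ToUpper), and function name are hypothetical placeholders for your own implementation:

-- Make the compiled UDF available to the session
ADD JAR /opt/hive/udfs/my-udfs.jar;

-- Register the Java class as a temporary function
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';

-- Use it like any built-in function
SELECT to_upper(name) FROM employees;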
Types of Joins in Hive
Hive supports various join types:
- JOIN (inner join)
- LEFT OUTER JOIN
- RIGHT OUTER JOIN
- FULL OUTER JOIN
Consider two tables, CUSTOMERS and ORDERS:
CUSTOMERS Table
ID | NAME | AGE | ADDRESS | SALARY |
---|---|---|---|---|
1 | Alex | 21 | New York | 2000.00 |
2 | Aryan | 22 | Delhi | 3000.00 |
3 | Neetu | 23 | Lucknow | 3500.00 |
4 | Raj | 24 | Kanpur | 2600.00 |
5 | Priya | 25 | Ludhiana | 3200.00 |
6 | Robert | 26 | London | 4000.00 |
7 | Julia | 27 | Paris | 2700.00 |
ORDERS Table
OID | DATE | CUSTOMER_ID | AMOUNT |
---|---|---|---|
102 | 2009-10-08 00:00:00 | 3 | 3000 |
100 | 2009-10-08 00:00:00 | 3 | 1500 |
101 | 2009-11-20 00:00:00 | 2 | 1560 |
103 | 2008-05-20 00:00:00 | 4 | 2060 |
Example Join Queries:
JOIN
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
LEFT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
RIGHT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
FULL OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);