Creating Tables in Hive: Internal vs. External Tables
Learn how to create tables in Hive, the data warehouse system built on Hadoop. This guide covers the differences between internal (managed) and external tables, explaining data storage, lifecycle management, and best practices for choosing the right table type.
Creating Tables in Hive
Introduction to Hive Tables
Hive, a data warehouse system built on top of Hadoop, provides a SQL-like interface for querying data. You create tables in Hive to organize and manage your data. Hive supports two main table types: internal and external tables.
Internal Tables (Managed Tables)
Internal tables, also known as managed tables, store data in a subdirectory within the Hive warehouse directory (typically `/user/hive/warehouse`). Hive manages the lifecycle of the data in these tables. If you drop an internal table, both the table's schema (structure) and its data are deleted. Internal tables are less flexible for sharing data with other tools because the data is owned and managed by Hive. They are often better suited for smaller datasets.
Creating an Internal Table
CREATE TABLE employees (
EmployeeID INT,
FirstName STRING,
LastName STRING,
Salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
This creates a table named 'employees'. The `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','` clause specifies that the data is comma-separated.
Adding Comments and Properties
CREATE TABLE employees (
EmployeeID INT COMMENT 'Employee ID',
FirstName STRING COMMENT 'First Name',
LastName STRING COMMENT 'Last Name',
Salary FLOAT COMMENT 'Salary'
)
COMMENT 'Employee Information'
TBLPROPERTIES ('createdBy'='John Doe', 'createdAt'='2024-03-08');
This adds comments to the table and columns and sets table properties.
Creating a Table Like Another Table
CREATE TABLE new_employees LIKE employees;
Creates a new table with the same schema as an existing table.
External Tables
External tables store data in a location *outside* of the Hive warehouse directory. Hive only manages the table's metadata (schema); it doesn't manage the data itself. If you drop an external table, only the table's metadata is removed; the data remains in its original location. External tables offer greater flexibility for data sharing and management because Hive doesn't control the underlying data.
Creating an External Table
- Create a directory on HDFS: (e.g., `hdfs dfs -mkdir /mydata`)
- Copy data to HDFS: (e.g., `hdfs dfs -put /path/to/mydata.csv /mydata`)
- Create the external table:
CREATE EXTERNAL TABLE external_employees (
EmployeeID INT,
FirstName STRING,
LastName STRING,
Salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/mydata';