Optimizing Hive Performance with Bucketing
Learn how bucketing in Hive organizes large datasets into smaller, manageable units for improved query performance. Understand how this technique complements partitioning and uses hashing to distribute data efficiently across buckets.
Bucketing in Hive
Understanding Bucketing in Hive
Bucketing in Hive is a technique for organizing large datasets into smaller, more manageable units called buckets. It's similar to partitioning but offers additional benefits, especially when partitioning alone isn't sufficient or practical. Bucketing can be used independently or in conjunction with partitioning (dividing partitions further into buckets).
How Bucketing Works
Bucketing uses a hashing technique to distribute data across buckets. The process involves calculating the remainder (modulo operation) of the hash of a specified column's values divided by the desired number of buckets. The result determines which bucket each row is assigned to.
Example: Creating and Using Bucketed Tables
- Select Database: Choose the database for your tables:
- Create Dummy Table: Create a temporary table to hold your data:
- Load Data: Load data into the dummy table. (Remember to replace the path with your actual data file location):
- Enable Bucketing: Enable bucketing in Hive:
- Create Bucketed Table: Create a bucketed table. This example uses the `Id` column for bucketing and creates 3 buckets:
- Insert Data: Insert data from the dummy table into the bucketed table:
HiveQL Command (Selecting Database)
hive> use showbucket;
HiveQL Command (Creating Dummy Table)
hive> CREATE TABLE emp_demo (Id INT, Name STRING, Salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
HiveQL Command (Loading Data)
hive> LOAD DATA LOCAL INPATH '/home/codegyani/hive/emp_details' INTO TABLE emp_demo;
HiveQL Command (Enabling Bucketing)
hive> SET hive.enforce.bucketing = true;
HiveQL Command (Creating Bucketed Table)
hive> CREATE TABLE emp_bucket (Id INT, Name STRING, Salary FLOAT)
CLUSTERED BY (Id) INTO 3 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
HiveQL Command (Inserting Data)
hive> INSERT OVERWRITE TABLE emp_bucket SELECT * FROM emp_demo;
Data is now distributed across the three buckets based on the hash of the `Id` column modulo 3. You can then query specific buckets using the modulo operation (e.g., rows with Id % 3 = 0
are in bucket 0).