TutorialsArena

Optimizing Hive Performance with Bucketing

Learn how bucketing in Hive organizes large datasets into smaller, manageable units for improved query performance. Understand how this technique complements partitioning and uses hashing to distribute data efficiently across buckets.



Bucketing in Hive

Understanding Bucketing in Hive

Bucketing in Hive is a technique for organizing large datasets into smaller, more manageable units called buckets. It's similar to partitioning but offers additional benefits, especially when partitioning alone isn't sufficient or practical. Bucketing can be used independently or in conjunction with partitioning (dividing partitions further into buckets).

How Bucketing Works

Bucketing uses a hashing technique to distribute data across buckets. The process involves calculating the remainder (modulo operation) of the hash of a specified column's values divided by the desired number of buckets. The result determines which bucket each row is assigned to.

Example: Creating and Using Bucketed Tables

  1. Select Database: Choose the database for your tables:
  2. HiveQL Command (Selecting Database)
    
    hive> use showbucket;
      
  3. Create Dummy Table: Create a temporary table to hold your data:
  4. HiveQL Command (Creating Dummy Table)
    
    hive> CREATE TABLE emp_demo (Id INT, Name STRING, Salary FLOAT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';
      
  5. Load Data: Load data into the dummy table. (Remember to replace the path with your actual data file location):
  6. HiveQL Command (Loading Data)
    
    hive> LOAD DATA LOCAL INPATH '/home/codegyani/hive/emp_details' INTO TABLE emp_demo;
      
  7. Enable Bucketing: Enable bucketing in Hive:
  8. HiveQL Command (Enabling Bucketing)
    
    hive> SET hive.enforce.bucketing = true;
      
  9. Create Bucketed Table: Create a bucketed table. This example uses the `Id` column for bucketing and creates 3 buckets:
  10. HiveQL Command (Creating Bucketed Table)
    
    hive> CREATE TABLE emp_bucket (Id INT, Name STRING, Salary FLOAT)
    CLUSTERED BY (Id) INTO 3 BUCKETS
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';
      
  11. Insert Data: Insert data from the dummy table into the bucketed table:
  12. HiveQL Command (Inserting Data)
    
    hive> INSERT OVERWRITE TABLE emp_bucket SELECT * FROM emp_demo;
      

Data is now distributed across the three buckets based on the hash of the `Id` column modulo 3. You can then query specific buckets using the modulo operation (e.g., rows with Id % 3 = 0 are in bucket 0).