TutorialsArena

Understanding ORDER BY and SORT BY in HiveQL

Learn the key differences between the `ORDER BY` and `SORT BY` clauses in HiveQL and how they impact query performance. This guide explains global sorting vs. partitioned sorting and provides best practices for optimizing your Hive queries.



HiveQL ORDER BY and SORT BY Clauses

Introduction to ORDER BY and SORT BY

In HiveQL (Hive's query language), both `ORDER BY` and `SORT BY` clauses sort query results. However, they differ significantly in how they perform sorting and their impact on query performance. Understanding this difference is crucial for optimizing your Hive queries.

ORDER BY Clause

The `ORDER BY` clause sorts the entire result set globally. All data is processed by a single reducer, meaning that a single reducer performs a global sort on the entire dataset, making it less efficient for large datasets. Using `LIMIT` with `ORDER BY` can improve performance for smaller result sets.

Syntax


SELECT column1, column2, ...
FROM table_name
ORDER BY column1 [ASC|DESC], column2 [ASC|DESC], ...;
      

Example using ORDER BY

Let's assume an 'Employees' table (replace with your table):

Employees Table

EmployeeID Name Salary
1Alice60000
2Bob50000
3Charlie70000

SELECT * FROM Employees ORDER BY Salary DESC;
      

EmployeeID | Name    | Salary
-----------+---------+--------
         3 | Charlie | 70000
         1 | Alice   | 60000
         2 | Bob     | 50000
      

SORT BY Clause

The `SORT BY` clause sorts data *locally* within each reducer. This means each reducer sorts its portion of the data independently. The final result may not be globally sorted, only sorted within each reducer's output. This approach is generally more efficient for large datasets because the sorting workload is distributed across multiple reducers.

Syntax


SELECT column1, column2, ...
FROM table_name
SORT BY column1 [ASC|DESC], column2 [ASC|DESC], ...;
      

Example using SORT BY


SELECT * FROM Employees SORT BY Salary DESC;
      

(The output will be partially sorted.  The order might vary depending on the number of reducers and data distribution.  It's not guaranteed to be fully sorted globally like ORDER BY.)