Indexing in DBMS: Optimizing Data Retrieval

Learn about the concept of indexing in database management systems (DBMS) and its importance in improving data retrieval performance. Explore the different types of indexes, including primary, secondary, clustered, and non-clustered indexes, and understand their characteristics and applications.



DBMS - Indexing

Data is stored in the form of records, each of which contains a key field for unique identification. Indexing is a data structure technique that enables efficient retrieval of records from database files based on specific attributes. It functions similarly to an index in a book, allowing for quicker searches.

Types of Indexing

Indexing is defined based on its attributes and can be classified into the following types:

Primary Index

A primary index is created on an ordered data file, where the ordering is based on a key field, typically the primary key of the relation.

Secondary Index

A secondary index may be generated from a candidate key, which has a unique value in every record, or from a non-key with duplicate values.

Clustering Index

A clustering index is defined on an ordered data file, but the ordering is based on a non-key field.

Ordered Indexing

Ordered indexing can be categorized into two types:

Dense Index

In a dense index, there is an index record for every search key value in the database, which enhances search speed. However, it requires more storage space for the index records, each containing a search key value and a pointer to the actual record on the disk.

Sparse Index

In a sparse index, index records are not created for every search key. Each index record contains a search key and a pointer to the actual data. To find a record, the system follows the index to the expected data location. If the record is not found, a sequential search is initiated until the desired data is located.

Multilevel Index

Multilevel indexing comprises search-key values and data pointers, stored on disk alongside the actual database files. As the database size increases, so does the index size. Keeping index records in main memory is crucial for speeding up search operations. A single-level index may become too large to fit in memory, resulting in multiple disk accesses.

Multilevel indexing helps break the index into smaller indices, allowing the outermost level to fit into a single disk block, which can easily reside in main memory.

B+ Tree

A B+ tree is a balanced binary search tree structured in a multi-level index format. Its leaf nodes represent actual data pointers, ensuring all leaf nodes are at the same height, which maintains balance. Additionally, the leaf nodes are linked in a linked list format, allowing for both random and sequential access.

Structure of B+ Tree

  • Internal Nodes:
    • Contain at least ⌈n/2⌉ pointers, except for the root node.
    • Can contain a maximum of n pointers.
  • Leaf Nodes:
    • Contain at least ⌈n/2⌉ record pointers and key values.
    • Can contain a maximum of n record pointers and key values.
    • Each leaf node has a block pointer pointing to the next leaf node, forming a linked list.

B+ Tree Operations

Insertion

B+ trees are filled from the bottom, with entries made at the leaf nodes. If a leaf node overflows, it is split into two parts:

  1. Partition at i = ⌊(m+1)/2⌋.
  2. First i entries are stored in one node, while the remaining entries are moved to a new node.
  3. The ith key is duplicated in the parent of the leaf.

If a non-leaf node overflows, it is split similarly:

  1. Partition at i = ⌈(m+1)/2⌉.
  2. Entries up to i are kept in one node; the rest are moved to a new node.

Deletion

Deletion in B+ trees occurs at the leaf nodes. The target entry is located and deleted. If the entry is in an internal node, it is deleted and replaced with an entry from the left position. After deletion, an underflow check is performed:

  • If underflow occurs, entries are distributed from the left node.
  • If distribution is not possible from the left, then it is attempted from the right.
  • If neither direction allows for distribution, the node merges with its left and right neighbors.