Data Recovery in DBMS: Restoring Data After Failures

Learn about the importance of data recovery in database management systems (DBMS). Explore different types of failures, including transaction failures and system crashes, and understand the techniques used to recover lost data and ensure data integrity.

DBMS - Data Recovery

Crash Recovery

A DBMS is a highly complex system that executes hundreds of transactions every second. The durability and robustness of a DBMS depend on its intricate architecture and the underlying hardware and system software. If the system fails or crashes amid transactions, it is expected to follow certain algorithms or techniques to recover lost data.

Failure Classification

To identify where the problem has occurred, failures can be categorized as follows:

Transaction Failure

A transaction is said to fail when it cannot execute or reaches a point from which it cannot progress further. This scenario, termed transaction failure, impacts only a few transactions or processes.

Reasons for transaction failure may include:

Logical Errors: Occur when a transaction cannot complete due to code errors or internal error conditions.
System Errors: Arise when the database system terminates an active transaction because it cannot be executed or must stop due to some system condition, such as deadlocks or resource unavailability.

System Crash

External problems can lead to abrupt system stops and crashes. For example, power supply interruptions may cause hardware or software failures, including operating system errors.

Disk Failure

In the early days of technology, disk failures were a common problem, with hard disk drives or storage drives failing frequently. Disk failures can include:

Formation of bad sectors
Inaccessibility to the disk
Disk head crashes

Storage Structure

The storage system can be divided into two categories:

Volatile Storage

As the name suggests, volatile storage cannot survive system crashes. These storage devices are typically located very close to the CPU and are often embedded onto the chipset itself. Examples include main memory and cache memory, which are fast but can only store a limited amount of information.

Non-Volatile Storage

Non-volatile storage is designed to survive system crashes and generally has a large data storage capacity, albeit with slower access speeds. Examples include hard disks, magnetic tapes, flash memory, and battery-backed RAM.

Recovery and Atomicity

During a crash, multiple transactions may be in progress and various files may be open for modifying data items. Transactions consist of atomic operations, which according to the ACID properties of DBMS, must maintain atomicity. This means either all operations are executed, or none are.

Upon recovering from a crash, the DBMS should ensure the following:

Check the states of all transactions that were being executed.
If a transaction was in the middle of an operation, the DBMS must ensure the atomicity of that transaction.
Determine whether the transaction can be completed or needs to be rolled back.
Ensure that no transactions leave the DBMS in an inconsistent state.

Two techniques can assist a DBMS in recovering while maintaining the atomicity of transactions:

Maintaining logs of each transaction and writing them onto stable storage before modifying the database.
Utilizing shadow paging, where changes are made in volatile memory before updating the actual database.

Log-based Recovery

A log is a sequence of records that maintains the actions performed by a transaction. It is crucial that logs are written prior to actual modifications and stored on stable, failsafe media.

Log-based recovery operates as follows:

The log file is kept on stable storage media.
When a transaction starts execution, it writes a log entry, such as <Tn, Start>.
When the transaction modifies an item X, it logs: <Tn, X, V1, V2>, indicating that Tn has changed the value of X from V1 to V2.
Upon transaction completion, it logs: <Tn, commit>.

The database can be modified through two approaches:

Deferred Database Modification: All logs are written to stable storage, and the database is updated when the transaction commits.
Immediate Database Modification: Each log entry is followed by an actual database modification, meaning the database is updated immediately after every operation.

Recovery with Concurrent Transactions

When multiple transactions are executed in parallel, their logs become interleaved. During recovery, it can be challenging for the recovery system to backtrack all logs. To address this, most modern DBMS utilize the concept of "checkpoints."

Checkpoint

Maintaining logs in real-time can consume all available memory space. Over time, log files may grow too large to manage effectively. A checkpoint is a mechanism that removes previous logs from the system and stores them permanently on a storage disk. It indicates a point before which the DBMS was in a consistent state, with all transactions committed.

Recovery Process

When a system with concurrent transactions crashes and recovers, the recovery process behaves as follows:

The recovery system reads the logs backward from the end to the last checkpoint.
It maintains two lists: an undo-list and a redo-list.
If the recovery system encounters a log entry <Tn, Start> followed by <Tn, Commit> or just <Tn, Commit>, it adds the transaction to the redo-list.
If it sees a log with <Tn, Start> but no commit or abort log, it places the transaction in the undo-list.

All transactions in the undo-list are undone, and their logs are removed. Conversely, all transactions in the redo-list are redone before their logs are saved.

Follow On

TutorialsArena