

17 Crash Tolerance

Crash tolerance is a feature, introduced in release 1.21, that can be enabled at compile time and used in environments with appropriate support from the OS and the filesystem. As of version 1.24, this means Linux kernel 5.12.12 or later and a filesystem that supports reflink copying, such as XFS, Btrfs, or OCFS2. If these prerequisites are met, the configure script enables the crash tolerance code automatically when building the package.
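To see whether a given machine and directory meet these prerequisites, a small probe along the following lines may help. It is only a sketch, not the check performed by the configure script: the function names (kernel_release_at_least, running_kernel_at_least, dir_supports_reflink) are hypothetical and not part of GDBM, and the probe assumes a Linux system whose headers define the FICLONE ioctl.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/utsname.h>
#include <linux/fs.h>           /* FICLONE (Linux-specific) */

/* Hypothetical helper: return 1 if RELEASE (e.g. "5.15.0-91-generic")
   is at least MAJ.MIN.PATCH, 0 if it is older, -1 if unparsable.
   A missing patch level is treated as 0. */
int
kernel_release_at_least (const char *release, int maj, int min, int patch)
{
  int a = 0, b = 0, c = 0;

  assert (release != NULL);
  if (sscanf (release, "%d.%d.%d", &a, &b, &c) < 2)
    return -1;
  if (a != maj)
    return a > maj;
  if (b != min)
    return b > min;
  return c >= patch;
}

/* Hypothetical helper: apply the same test to the running kernel. */
int
running_kernel_at_least (int maj, int min, int patch)
{
  struct utsname u;

  if (uname (&u) != 0)
    return -1;
  return kernel_release_at_least (u.release, maj, min, patch);
}

/* Hypothetical helper: return 1 if a reflink clone between two
   temporary files in DIR succeeds, 0 if the clone is refused,
   -1 if the temporary files cannot be set up. */
int
dir_supports_reflink (const char *dir)
{
  char src[4096], dst[4096];
  int sfd, dfd, result;

  if (snprintf (src, sizeof src, "%s/rl-src-XXXXXX", dir) >= (int) sizeof src
      || snprintf (dst, sizeof dst, "%s/rl-dst-XXXXXX", dir) >= (int) sizeof dst)
    return -1;

  sfd = mkstemp (src);
  if (sfd < 0)
    return -1;
  dfd = mkstemp (dst);
  if (dfd < 0)
    {
      close (sfd);
      unlink (src);
      return -1;
    }

  if (write (sfd, "x", 1) != 1)
    result = -1;
  else
    /* FICLONE makes DFD share the data blocks of SFD. */
    result = ioctl (dfd, FICLONE, sfd) == 0;

  close (sfd);
  close (dfd);
  unlink (src);
  unlink (dst);
  return result;
}
```

Calling running_kernel_at_least (5, 12, 12) and dir_supports_reflink on the directory that will hold the database gives a rough indication of whether crash tolerance can be used there. The probe has to run in that directory, not elsewhere, because reflink support is a property of the filesystem on which the files reside.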

The crash-tolerance mechanism, when used correctly, guarantees that a logically consistent (see Database consistency) recent state of application data can be recovered following a crash. Specifically, it guarantees that the state of the database file corresponding to the most recent successful gdbm_sync call can be recovered.

If the mechanism is used correctly, crashes such as power outages, OS kernel panics, and (some) application process crashes will be tolerated. Non-tolerated failures include physical destruction of storage devices and corruption due to bugs in application logic. For example, the mechanism won’t help if a pointer bug in your application corrupts GDBM’s private in-memory data, which in turn corrupts the database file.

The following sections describe how to enable crash tolerance in your application and what to do if a crash occurs.

The design rationale of the crash tolerance mechanism is described in detail in the article, Crashproofing the Original NoSQL Key-Value Store, by Terence Kelly, ACM Queue magazine, July/August 2021, available from the ACM Digital Library. If you have difficulty retrieving this paper, please contact the author at tpkelly@acm.org, tpkelly@cs.princeton.edu, or tpkelly@eecs.umich.edu.

