Single-Page-Recovery (SPR) is a novel feature to recover a single page without going through the entire restart procedure.
More...
Single-Page-Recovery (SPR) is a novel feature to recover a single page without going through the entire restart procedure.
Key Benefits
Suppose the situation where we have 999,999 pages that are intact and flushed since the previous checkpoint. Now, while running our database, we updated one page, flushed it to disk, evicted from bufferpool, then later re-read it from disk, finding that the page gets corrupted.
The time to recover the single page in traditional database would be hours, analyzing the logs, applying REDOs and UNDOs. Single-Page-Recovery, on the other hand, finishes in milliseconds. It fetches a single page from the backup, collects only relevant transactional logs for the page, and applies them to the page.
Terminology
TODO: A few words below should have more mathematic definitions.
- A page in this module refers to a fixable page. It does NOT refer to other types of pages (e.g., page-allocation bitmap pages, store-node pages).
- Page-LSN of page X, or PageLSN(X), refers to the log sequence number of the latest update to the logical page X, regardless of whether the page is currently in the buffer pool or its cleanliness.
- Recorded-Page-LSN of page X in data source Y, or RecordedPageLSN(X, Y), refers to the log sequence number of the latest update to a particular image of page X stored in data source Y (e.g., bufferpool, media, backup as of Wed, backup as of Thr, etc)
- Expected-Minimum-LSN of page X, or EMLSN(X), refers to the Page-LSN of X at the time of its latest eviction from the buffer pool when the page X was dirty. EM-LSN of X is stored in X's parent page (in media or bufferpool), except in the case of a root page.
Invariants
- EMLSN(X) <= PageLSN(X)
- EMLSN(X) = PageLSN(X) only when X had no updates since previous write-out.
- EMLSN(X) > 0
Algorithm Overview
Single page recovery is invoked only when a page is brought into the buffer pool from media. When we bring a page, X, into the buffer pool, we compare RecordedPageLSN(X, media) to EMLSN(X) in the parent page image in bufferpool. If we see that that EMLSN(X) > RecordedPageLSN(X, media), then we will invoke single page recovery.
Given a parent page P with a child page C, EMLSN(C) is updated only when the child page C is evicted from the buffer pool. This is in keeping with the invariant EMLSN(C) <= PageLSN(C). Given a parent page P with a child page C, updating EMLSN(C) is a logged system transaction that does change PageLSN(P).
The buffer pool will evict only complete subtrees. Evicting a page from the buffer pool means that all of that page's children must have already been evicted.
- The buffer pool evicts pages bottom-up (leaf pages first, then parents, then grandparents, etc).
- Single page recovery works top-down — a page cannot be recovered until all of its ancestors have been recovered.
- Conversely, bringing a page into the buffer pool requires that all of that page's ancestors must also be in the buffer pool.
Initially, the log records to be recovered will fit in memory. Ultimately, we intend to create a recovery index (so that the log records won't have to fit into memory).
Per-Page LSN Chain
To maintain the per-page LSN chain mentioned above, we have baseLogHeader::_page_prv in logrec_t and multi_page_log_t::_page2_prv. These properties are populated in log manager when we call xct_t::give_logbuf(). We use these fields to collect relevant logs in log_core::_collect_single_page_recovery_logs().
EMLSN in B-tree pages
We also have EMLSN fields for all page pointers in B-tree pages, which is required to tell "from where we should start collecting relevant per-page logs" in case of Single-Page-Recovery. See the setters/getters of btree_page_h linked to this HTML. We also run a system transaction to update the EMLSN whenever we evict the child page from bufferpool. See page_evict_t and log_page_evict().
References
More details are given in the following papers.
- G. Graefe et al., "Definition, Detection, and Recovery of Single-page Failures,
a Fourth Class of Database Failures". VLDB'12.
- G. Graefe et al., "Self-diagnosing and self-healing indexes". DBTest'12.
- G. Graefe et al., "Foster B-trees". TODS'12.
§ DECLARE_TLS()
| DECLARE_TLS |
( |
block_alloc< generic_page > |
, |
|
|
scratch_space_pool |
|
|
) |
| |
Page buffers used while Single-Page-Recovery as scratch space.
§ dump_page_lsn_chain() [1/4]
| void restart_thread_t::dump_page_lsn_chain |
( |
std::ostream & |
o, |
|
|
const PageID & |
pid, |
|
|
const lsn_t & |
max_lsn |
|
) |
| |
|
static |
Pretty-prints the content of log file to the given stream in a way we can easily debug single-page recovery.
Defined in log_spr.cpp. This is for debugging, so performance is not guaranteed and also not thread-safe.
- Parameters
-
| [in] | o | Stream to which to write the information. |
| [in] | pid | If given, we only dump logs relevant to the page. |
| [in] | max_lsn | If given, we only dump logs required to recover the page up to this LSN. We omit the logs after that. |
§ dump_page_lsn_chain() [2/4]
| void ss_m::dump_page_lsn_chain |
( |
std::ostream & |
o, |
|
|
const PageID & |
pid, |
|
|
const lsn_t & |
max_lsn |
|
) |
| |
|
static |
Pretty-prints the content of log file to the given stream in a way we can easily debug single-page recovery.
This is for debugging, so performance is not guaranteed and also not thread-safe.
- Parameters
-
| [in] | o | Stream to which to write the information. |
| [in] | pid | If given, we only dump logs relevant to the page. |
| [in] | max_lsn | If given, we only dump logs required to recover the page up to this LSN. We omit the logs after that. |
§ dump_page_lsn_chain() [3/4]
| void ss_m::dump_page_lsn_chain |
( |
std::ostream & |
o, |
|
|
const PageID & |
pid |
|
) |
| |
|
static |
Pretty-prints the content of log file to the given stream in a way we can easily debug single-page recovery.
This is for debugging, so performance is not guaranteed and also not thread-safe.
- Parameters
-
| [in] | o | Stream to which to write the information. |
| [in] | pid | If given, we only dump logs relevant to the page. |
§ dump_page_lsn_chain() [4/4]
| void ss_m::dump_page_lsn_chain |
( |
std::ostream & |
o | ) |
|
|
static |
Pretty-prints the content of log file to the given stream in a way we can easily debug single-page recovery.
This is for debugging, so performance is not guaranteed and also not thread-safe.
- Parameters
-
| [in] | o | Stream to which to write the information. |
§ emlsn_address()
Returns the pointer to Expected Child LSN of pid0, foster-child, or real child.
§ get_foster_emlsn()
| const lsn_t & btree_page_h::get_foster_emlsn |
( |
| ) |
const |
|
inline |
Returns the Expected Child LSN of foster-child.
§ get_pid0_emlsn()
| const lsn_t & btree_page_h::get_pid0_emlsn |
( |
| ) |
const |
|
inline |
Returns the Expected Child LSN of pid0.
§ page_prev_lsn()
| const lsn_t & logrec_t::page_prev_lsn |
( |
| ) |
const |
|
inline |
Returns the LSN of previous log that modified this page.
§ set_emlsn_general()
Sets the Expected Child LSN of pid0, foster-child, or real child.
This method is not protected by exclusive latch but still safe because EMLSN are not viewed/updated by multi threads.
§ set_page_prev_lsn()
| void logrec_t::set_page_prev_lsn |
( |
const lsn_t & |
lsn | ) |
|
|
inline |
Sets the LSN of previous log that modified this page.
§ _child_emlsn
| lsn_t btrec_t::_child_emlsn |
|
private |
Expected child LSN for the _child (only in interior node).
§ _page2_prv
| lsn_t multi_page_log_t::_page2_prv |
_page_prv for another page touched by the operation.
§ _page_prv
| lsn_t baseLogHeader::_page_prv |
For per-page chains of log-records. Note that some types of log records (split, merge) impact two pages. The page_prev_lsn is for the "primary" page.
§ btree_foster_emlsn
| lsn_t btree_page_data::btree_foster_emlsn |
|
protected |
Expected-Minimum LSN for the foster-child pointer. 0 if this page doesn't have foster child.
§ btree_pid0_emlsn
| lsn_t btree_page_data::btree_pid0_emlsn |
|
protected |
Expected-Minimum LSN for the first child pointer. 0 if this page is leaf or left-most.