Checkpoints

Tree walk order

Checkpointer walks OrioleDB trees in LNR-order. The walk is divided into steps. The result of each step is a message, which determines the next step. The possible messages are given below.

WalkDownwards – the last step found an in-memory downlink within an internal page. The next step should be to visit a page on the lower level and start processing it.
WalkUpwards – the last step finished processing the page. The next step should continue processing the parent.
WalkContinue – continue working with the current page after releasing a lock. That happens when the checkpointer has to wait for the concurrent operation.

The picture below is the example of OrioleDB B-tree walking by the checkpointer.

Checkpoint walk the tree

Checkpointer comes to the root page n1 with the WalkDownwards message (step 1), then checks the first downlink l1. Since l1 is the in-memory downlink, the checkpointer moves to the leaf page n2 with the WalkDownwards message (step 2). After flushing n2, checkpointer comes back to n1 with the WalkUpwards message (step 3) and continues iteration over n1 downlinks. Similarly, it walks down to the n3 and back via the l2 downlink (steps 4 and 5). l3 appears to be IO in-progress downlink, and the checkpointer has to unlock the n1 page, and wait till IO is completed and continue with WalkContinue message (step 6). After relocking n1, checkpointer finds l3 to be an on-disk downlink and copies it "as is". Finally, checkpointer walks down to the n4 and back via the l4 downlink (steps 7 and 8). Then n1 is done, the checkpointer finishes the walk with the WalkUpwards message.

Checkpoint state

While the checkpointer is writing children of non-leaf page, concurrent splits and merges could happen. Therefore, the checkpoint state contains images of non-leaf pages under checkpointing as its reconstructed as the checkpointer visits the downlinks. If there are no concurrent changes to the non-leaf page, the reconstructed state finally matches the page state. Otherwise, the reconstructed page state could not even match to page state at any moment of time but always match the history of the checkpointer tree walk.

Note that the reconstructed state does not contain in-memory downlinks. In-memory downlinks are replaced with on-disk downlinks as we wrote the children's pages.

The picture below represents an example of a checkpoint state.

Checkpoint state 1

The root page n1, internal page n3, and leaf page n8 are currently under checkpointing. The downlink l2 is written to the reconstructed state. The downlink l3 is the next downlink in the reconstructed state. Its key is known, but the link is not because the corresponding children were not written yet. Similarly, downlink l6 is written, downlink l7 is the next downlink, and downlink l8 is not processed yet.

Let us imagine that the following event happened:

Page n7 was written by checkpointer.
Page n8 was split into n8 and n9.
Pages n6 and n7 were merged. The result is marked as n67.
Checkpointer has written page n8 and started processing page n9.

The resulting state is given in the picture below. Note that the reconstructed page image contains links l6 and l7 (as we visited them before the merge) but contains l8 and the next downlink corresponding to l9 (as we visited those downlinks after the split).

Checkpoint state 1

Autonomous non-leaf pages

If the non-leaf page under checkpointing gets modified concurrently, it becomes an "autonomous" non-leaf page. Autonomous pages work with the rules below.

If the page is marked as "autonomous", all its parents to the root are also marked as "autonomous".
If the page has an associated on-disk location, this association is cleared. The corresponding location is marked as free space at the current checkpoint.
The autonomous page will be processed until its hikey is met, disregarding how many pages will be visited to meet this target (due to concurrent insertion, it could be many pages).
Even if the initial page corresponding to the autonomous page has been split. The page holding the initial hikey is tracked. The merge, which would remove that hikey, is prevented.
If the autonomous page is full, but the corresponding hikey is not yet met, current contents are flushed to the disk (and parent got the corresponding downlink with WalkUpwards), but processing of the autonomous page continues till the hikey is met.
When flushing autonomous, the corresponding "on-disk" location is marked as free for the future checkpoint.

Checkpointer messages

Consider more details regarding the checkpointer messages we enumerated above.

WalkDownwards

This message has the parameters below.

The number and change count of the in-memory page to be visited.
Low key ("lokey"). The lokey is from the parent page downlink, or it is the parent page lokey if the downlink is the first on the page.

The checkpointer has to process the referenced page. After the page is processed, the WalkUpwards message must be returned. If the referenced page is non-leaf, more messages will be issued during its processing, but finally, there must be WalkUpwards of the referenced page.

There might be a failure due to concurrent operations: the in-memory page might have a different change count. In this case, the corresponding WalkUpwards should return the invalid downlink. Also, in a failure case, WalkUpwards message should go just after WalkDownwards: once we start processing the non-leaf page, we must finish it.

WalkUpwards

This message has the parameters below.

On-disk downlink. This link might be invalid, as described above.
Next key. That is actually a hikey of the page written. It might not match the subsequent downlink of the parent page due to concurrent splits and merges. On mismatch, the parent page must be marked "autonomous".
Flag indicating that parent page must be marked as "dirty". This flag is set when the page has been written to the new place after the previous checkpoint. This flag is not set if the page and its children will not be modified then. The parent must be marked as "dirty" to be written and reflect the new on-disk downlink.
The flag indicates that we must save the existing next downlink on the parent page. That happens to the autonomous page when the current reconstructed image is finished: we have the page written and need to insert a new downlink to the parent, but we still need to visit the same next downlink.

This message indicates that the child page has been processed, and the parent needs to add the downlink. If there is no parent, we have processed the root and now have a pointer to the new root on-disk location.

WalkContinue

This message has no parameters. It just indicates that checkpointer must continue processing the same page with the same next downlink. That happens when the checkpointer has to wait for the concurrent operation. Such as meeting IO in-process downlink and having to release the log and wait till the IO is finished.

Sequential buffers

What are sequential buffers?

Sequential buffers (seq bufs) are lightweight, file-backed streaming I/O abstractions used by the checkpointer and by ordinary backends that write B-tree pages. Instead of holding all checkpoint metadata in shared memory, seq bufs stream data to and from on-disk files using two in-memory OrioleDB pages as a double-buffer. While one page is being filled (or drained) by the caller, the other can be flushed to disk or pre-fetched in the background, giving sequential throughput without occupying large amounts of shared memory.

Each seq buf is identified by a SeqBufTag – a (datoid, relnode, checkpointNumber, type) tuple. The type field distinguishes the two kinds of files:

'm' (map file) – the checkpoint map that records on-disk page locations.
't' (temporary file) – the temporary tracking file used during the checkpoint walk.

The in-memory state that must be shared between the checkpointer and writer backends lives in SeqBufDescShared, which is embedded directly in the B-tree meta page (BTreeMetaPage). Per-backend state such as the open file descriptor lives in SeqBufDescPrivate, which is stored in the tree descriptor (BTreeDescr) and is private to each backend.

Why are sequential buffers used?

Each checkpointable B-tree keeps three groups of seq bufs:

freeBuf – On entry to a new checkpoint the checkpointer opens this buffer to read the list of free disk extents recorded by the previous checkpoint. As the current checkpoint writes pages it can reuse those extents, avoiding unnecessary file growth. There is exactly one freeBuf per tree (no dual array) because it is replaced atomically at the start of each checkpoint: the new file is put in place before the old one is removed.
nextChkp[2] – The checkpointer opens one of these two slots to write the checkpoint map file for the current checkpoint. Every time a B-tree page is flushed to disk, its location is appended to the map. The next checkpoint will read this map via a freeBuf so that it knows where each page lives without having to walk the whole tree again.
tmpBuf[2] – The checkpointer opens one of these two slots to write a temporary file that tracks every page written during the checkpoint walk. After the walk finishes, this file is sorted, duplicates are removed, and it drives the hole-punching pass that reclaims unused space inside the data file. The file is deleted once the checkpoint is complete.

Why are there two slots (the `[2]` arrays)?

OrioleDB checkpoints run concurrently with normal DML. To avoid serialising checkpoint N against the initialisation of checkpoint N+1, each of the arrays (nextChkp, tmpBuf, datafileLength, partsInfo) is indexed by checkpointNumber % 2.

At any moment exactly one slot is "active" – the slot currently being written by the in-progress checkpoint – while the other slot either still holds data from the previous checkpoint (needed for recovery until the new checkpoint is verified) or is idle.

Checkpoint N   → uses slot  N % 2
Checkpoint N+1 → uses slot (N+1) % 2   (the other slot)

This ping-pong scheme means checkpoint N+1 can begin allocating pages and initialising its seq buf files while checkpoint N is still finalising, without any shared state collision.

The two dirty flags on the meta page (dirtyFlag1, dirtyFlag2) support the same scheme. dirtyFlag1 is cleared at the start of the checkpoint. dirtyFlag2 provides an extra generation so that a modification racing the clear of dirtyFlag1 is never silently lost. If both flags are false when the checkpointer is about to start processing a tree, the tree has not changed since the last checkpoint; the freshly initialised seq buf files are closed and removed immediately without writing any data.

When are sequential buffers removed?

Seq buf in-memory pages are returned to the page pool as soon as the buffer is finalised. This happens in one of the following situations:

Checkpoint completes successfully – tmpBuf pages are freed after post-processing; nextChkp pages are freed after the map file header is written and the file is renamed.
Tree descriptor is evicted – when a tree descriptor is reclaimed from the descriptor cache, btree_finalize_private_seq_bufs() flushes and frees all in-memory pages belonging to that descriptor's active seq bufs.
Tree is dropped – checkpointable_tree_free() closes all seq buf file descriptors, freeing the OS resources.

Seq buf on-disk files have a longer lifetime than the in-memory pages:

The tmp file ('t') is deleted once the checkpoint that created it has finished post-processing.
The map file ('m') from checkpoint N is kept until checkpoint N+1 has been completed and its own map file is in place. This ensures that point-in-time recovery can always find the latest clean checkpoint.
The free-extent file is replaced atomically: the new file is written and renamed into place before the old one is unlinked.
If a tree was not modified between two checkpoints (both dirty flags are false), the seq buf files initialised for that checkpoint are closed immediately and removed without writing any data.

Checkpoints

Tree walk order​

Checkpoint state​

Autonomous non-leaf pages​

Checkpointer messages​

WalkDownwards​

WalkUpwards​

WalkContinue​

Sequential buffers​

What are sequential buffers?​

Why are sequential buffers used?​

Why are there two slots (the [2] arrays)?​

When are sequential buffers removed?​