Ordered Insertion Optimization in OrioleDB

· 5 min read
Alexander Korotkov
Creator of OrioleDB

When many sessions try to insert into the same B-tree leaf page, classic exclusive page locking serializes progress and wastes time on sleep/wake cycles. We’re introducing a batch page insertion path that lets the session holding the page lock insert for itself and its neighbors. The result: dramatically reduced lock waits and big gains at high client counts (a 2X throughput boost starting from 64 clients in our benchmark).

The problem: hot leaves and lock convoys

In OrioleDB beta12, inserts into a B-tree leaf are performed under an exclusive page lock. If several backends target the same leaf page, they queue, sleep, and wake up in turn.


That’s correct and simple — but not cheap:

  • Every process waiting for the lock pays the cost of sleep → wake → re-check just to insert a tiny tuple.
  • If the leaf is hot (e.g., time-ordered keys, skew, or tight key ranges), we get a huge queue on that single page.
  • The lock hand-off cadence, not raw CPU, becomes the bottleneck.
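The idea behind the batch path resembles a classic combining pattern: a backend that cannot get the lock enqueues its request, and whichever backend does hold the lock drains the queue and inserts on behalf of its neighbors. A minimal single-process sketch under that assumption (all names here, such as `BatchInsertPage`, are hypothetical; the real implementation coordinates PostgreSQL backends through shared memory, not Python threads):

```python
import threading
from queue import Queue, Empty

class BatchInsertPage:
    """Toy model of a leaf page with a combining (batch) insert path."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pending = Queue()   # insert requests awaiting service
        self.tuples = []

    def insert(self, tup):
        done = threading.Event()
        self.pending.put((tup, done))
        # Either become the combiner, or wait until whoever holds the
        # lock services our request along with its own.
        while not done.is_set():
            if self.lock.acquire(blocking=False):
                try:
                    # Drain the queue: insert for ourselves and neighbors.
                    while True:
                        try:
                            t, ev = self.pending.get_nowait()
                        except Empty:
                            break
                        self.tuples.append(t)
                        ev.set()
                finally:
                    self.lock.release()
            else:
                done.wait(timeout=0.001)
```

Each waiter either becomes the combiner or has its tuple inserted by one, so the page lock changes hands far less often than once per tuple.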

OrioleDB fastpath search

· 4 min read
Alexander Korotkov
Creator of OrioleDB

When you optimize the CPU time of a transactional database management system, it comes down to one question: how fast can you read a page without breaking consistency? In this post, we explore how OrioleDB avoids locks, trims memory copies, and — starting with beta12 — even bypasses both copying and tuple deforming altogether for fixed-length types during intra-page search. This means that not only are memory copies skipped, but the overhead of reconstructing tuples is also eliminated. The result: an even faster read path, with no manual tuning required.

Why page copies matter … and why they hurt

Every time a PostgreSQL backend descends an OrioleDB B-tree, it needs a consistent view of the target page. Instead of locking the whole page, OrioleDB keeps a 32-bit state word in the page header.

The low bits represent a change count that increments with every data modification; the high bits hold lightweight flags for "exclusive" and "read-blocked" states. A reader copies the necessary bytes, re-reads the state word, and retries if the change count has moved. No locks are taken, yet the reader always observes a consistent page image.
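As an illustration, the state word could be packed like this (the bit positions and helper names are assumptions for this sketch, not OrioleDB's actual layout):

```python
# Hypothetical 32-bit state word: two flags in the high bits,
# the change count in the remaining low bits.
EXCLUSIVE_FLAG    = 1 << 31  # a writer holds the page exclusively
READ_BLOCKED_FLAG = 1 << 30  # readers must not copy the page right now
CHANGE_COUNT_MASK = (1 << 30) - 1

def get_change_count(state: int) -> int:
    return state & CHANGE_COUNT_MASK

def state_is_blocked(state: int) -> bool:
    return bool(state & (EXCLUSIVE_FLAG | READ_BLOCKED_FLAG))

def bump_change_count(state: int) -> int:
    # Wraps within the counter bits, leaving the flag bits untouched.
    return (state & ~CHANGE_COUNT_MASK) | ((state + 1) & CHANGE_COUNT_MASK)
```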

The following pseudo-code illustrates how to copy a consistent page image using a state variable for synchronization.

    def read_page(page, image):
        while True:
            state = read_state(page)
            if state_is_blocked(state):
                continue

            copy_data(image, page)

            newState = read_state(page)
            if state_is_blocked(newState) or \
               get_change_count(newState) != get_change_count(state):
                continue

            return
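The writer side of the protocol is symmetric: it blocks readers, modifies the page, then bumps the change count so that any reader who copied mid-modification will retry. The toy below puts both halves together (a sketch only; the names are hypothetical, and Python lacks the atomic operations and memory-ordering guarantees the real C code depends on):

```python
class OptimisticPage:
    """Toy page with a seqlock-style state word (illustrative only)."""

    BLOCKED = 1 << 31  # stands in for the read-blocked flag

    def __init__(self, size=8):
        self.state = 0
        self.data = [0] * size

    def write(self, values):
        # Block readers, mutate, then bump the change count and unblock.
        self.state |= self.BLOCKED
        self.data[:] = values
        self.state = (self.state + 1) & ~self.BLOCKED

    def read(self):
        while True:
            state = self.state
            if state & self.BLOCKED:
                continue
            image = list(self.data)        # copy the page image
            new_state = self.state
            if new_state & self.BLOCKED or new_state != state:
                continue                   # writer interfered: retry
            return image
```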

OrioleDB beta12: features and benchmarks

· 5 min read
Alexander Korotkov
Creator of OrioleDB

Since our last public update, OrioleDB has continued to evolve with a series of new releases. These updates refine the core engine, extend functionality, and improve performance across a range of workloads. Together, they move us closer to a beta release and lay the groundwork for broader adoption.

What is OrioleDB?

OrioleDB is a PostgreSQL storage extension that implements a custom Table Access Method as a drop‑in replacement for the default Heap storage engine. It is designed to address scalability bottlenecks in PostgreSQL’s buffer manager and to reduce WAL volume, enabling better utilization of modern multi-core CPUs and high‑performance storage systems.

By rethinking core components such as MVCC, page caching, and checkpoints, OrioleDB improves throughput and predictability in transactional workloads without altering PostgreSQL’s user-facing behavior.

What’s New in OrioleDB?

Building on this foundation, recent releases have introduced several user-facing enhancements:

  • Support for non‑B-tree index types on OrioleDB tables.
  • Support for rewinding recent changes in the database.
  • Support for tablespaces.
  • fillfactor support for OrioleDB tables and indexes.
  • orioledb_tree_stat() SQL function for space utilization statistics.
  • Support for tables with more than 32 columns.

These additions improve OrioleDB/PostgreSQL compatibility and provide more flexibility for workloads with diverse schema and indexing requirements.

The differences between OrioleDB and Neon

· 5 min read
Alexander Korotkov
Creator of OrioleDB

In a recent Hacker News discussion, there was some confusion about the differences between OrioleDB and Neon. Both look alike at first glance. Both promise a "next‑gen Postgres". Both have support for cloud‑native storage.

This post explains how the two projects differ in practice. And importantly, OrioleDB is more than an undo log for PostgreSQL.

The Core Differences

OrioleDB

OrioleDB is a Postgres extension. It implements a Table Access Method to replace the default storage method (Heap), providing the following key features:

  1. MVCC based on UNDO, which prevents bloat as much as possible;
  2. IO-friendly copy‑on‑write checkpoints with very compact row‑level WAL;
  3. An effective shared-memory caching layer based on swizzled pointers.
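Pointer swizzling, mentioned in the third point, means a parent page refers to a cached child by a direct in-memory pointer rather than by an on-disk page number, skipping the buffer-mapping lookup on every descent. A hedged sketch (hypothetical names, not OrioleDB's actual data structures):

```python
class Downlink:
    """Toy 'swizzled' child reference: either a direct in-memory pointer
    or an on-disk page number (illustrative only)."""

    def __init__(self, disk_pageno):
        self.disk_pageno = disk_pageno
        self.in_memory = None            # set once the child page is cached

    def resolve(self, cache, storage):
        if self.in_memory is not None:   # fast path: no mapping-table lookup
            return self.in_memory
        page = cache.get(self.disk_pageno)
        if page is None:
            page = storage.read(self.disk_pageno)
            cache[self.disk_pageno] = page
        self.in_memory = page            # swizzle: remember the pointer
        return page
```

After the first resolution, subsequent descents follow the direct pointer and never consult the cache mapping again (until the page is evicted and the link is unswizzled).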

Neon

Neon uses the default Table Access Method (Heap) and replaces the storage layer. The WAL is written to the safekeepers, and blocks are read from page servers backed by object storage, providing instant branching and scale-to-zero capabilities.

Bridged Indexes in OrioleDB: architecture, internals & everyday use

· 4 min read
Alexander Korotkov
Creator of OrioleDB

Since version beta10, OrioleDB has supported building indexes other than B-tree. Bridged indexes are the mechanism that makes these index types work on OrioleDB tables.

1. Why OrioleDB needs a “bridge”

OrioleDB stores its table rows inside a B-tree built on the table's primary key and keeps MVCC information in an undo log, so it can’t simply plug PostgreSQL’s existing Index Access Methods (GiST, GIN, SP-GiST, BRIN, …) into that structure, because those Index Access Methods:

  • reference rows by a 6-byte ctid (block number and offset in the heap), not by a logical key;
  • keep every live version of a row in the index, leaving visibility checks to the executor;
  • support only inserts in the index and rely on VACUUM for physical deletion.

OrioleDB indexes, in contrast, are MVCC-aware: they point to rows via primary-key values and support logical updates/deletes directly in the index. To remain heap-free while still letting users draw on the rich ecosystem of non-B-tree indexes, OrioleDB introduces a bridge index layer.
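Conceptually, the bridge gives each row a small surrogate identifier that plays the role of ctid for the non-B-tree index, plus a "bridge index" mapping that identifier back to the primary key. A toy sketch under that reading (names like `BridgedTable` are hypothetical, and a plain dict stands in for the auxiliary GIN/GiST index):

```python
import itertools

class BridgedTable:
    """Toy model of the bridge layer (illustrative, not OrioleDB internals)."""

    def __init__(self):
        self._next_bridge_id = itertools.count(1)
        self.rows = {}          # primary key -> row (the table B-tree)
        self.bridge = {}        # bridge id  -> primary key (the bridge index)
        self.aux_index = {}     # value -> bridge id (stands in for GIN/GiST/...)

    def insert(self, pk, row, indexed_value):
        bid = next(self._next_bridge_id)     # surrogate 'ctid' for the aux index
        self.rows[pk] = row
        self.bridge[bid] = pk
        self.aux_index[indexed_value] = bid

    def lookup(self, indexed_value):
        bid = self.aux_index[indexed_value]  # aux index yields a bridge id...
        pk = self.bridge[bid]                # ...the bridge maps it to the PK...
        return self.rows[pk]                 # ...and the table B-tree has the row
```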

Illustration of a bridge index

Why PostgreSQL needs a better API for alternative table engines

· 9 min read
Alexander Korotkov
Creator of OrioleDB

For a long time now, PostgreSQL has had an extensible Index Access Method API (called AM), which has stood the test of time and enabled numerous robust extensions to provide their own index types. For example: rum, pgvector, bloom, zombodb and others. PostgreSQL 12 introduced the Table AM API, promising equivalent flexibility for table access methods.

Despite PostgreSQL's Table AM API being available since version 12 and ongoing criticisms of its built-in storage engine — particularly the MVCC model ([1], [2], [3]) — it remains surprising that no fully featured transactional storage engine has yet emerged purely as an extension.

Since the table AM and index AM APIs are tightly coupled, this is an issue for both implementations.

The features most in demand for alternative PostgreSQL table engines are:

  1. Alternative MVCC implementations, e.g., UNDO-log-based storage. The motivation for this is well discussed in Uber's blog post, Andy Pavlo's blog post, and many other sources.
  2. Non-heap-like storage. For example, in index-organized tables, the index is not an optional addition that speeds up queries but a necessary layer the table storage uses internally. Consequently, a table tuple is part of a complex data structure and cannot be addressed by a fixed-length address such as a page number and offset number; it must instead be addressed by a variable-length identifier such as an index key.

The API extension to provide #2 is more or less straightforward: replace ctid with an arbitrary sequence of bytes throughout the APIs. However, #1 is rather complex and requires much clarification.

OrioleDB serves as an example illustrating the motivation for changes to the table and index AM APIs. It is an extension providing a table access method that we developed to address many well-known shortcomings of the built-in storage engine. However, OrioleDB is not yet a drop-in extension; it still requires several patches to the PostgreSQL core.

Besides these two things, which will be discussed below, there are numerous needs for further API improvements, like pointer swizzling and alternative WAL-logging, that are outside the scope of this post.

OrioleDB beta7: Benchmarks

· 7 min read
Alexander Korotkov
Creator of OrioleDB
Pavel Borisov
PostgreSQL contributor

OrioleDB is a storage extension for PostgreSQL which uses PostgreSQL's pluggable storage system. Designed as a drop-in replacement for PostgreSQL's existing Heap storage, OrioleDB aims to overcome scalability bottlenecks and fully utilize modern hardware capabilities. By integrating seamlessly with PostgreSQL, it offers improved performance, efficiency, and scalability without sacrificing the robustness and reliability that PostgreSQL is known for.

Today we’re releasing OrioleDB version beta7. This marks a significant step in delivering a high-performance, next-generation storage engine for Postgres users. OrioleDB is designed to extract the full potential of modern hardware, offering superior performance, efficiency, and scalability.

OrioleDB design choices

OrioleDB is built from the ground up to leverage modern hardware, reduce maintenance needs, and enhance distributed capabilities. The key technical decisions forming the foundation of OrioleDB are:

  1. Elimination of Buffer Mapping and Lock-less Page Reading: In OrioleDB, in-memory pages are directly linked to storage pages, eliminating the need for buffer mapping and its associated bottlenecks. Additionally, in-memory page reading is performed without atomic operations, enabling lock-less access. Together, these design choices significantly elevate vertical scalability for PostgreSQL.
  2. MVCC Based on the UNDO Log Concept: OrioleDB employs a Multi-Version Concurrency Control (MVCC) mechanism based on an undo log. Old versions of tuples are evicted into undo logs—forming undo chains—rather than causing bloat in the main storage system. Page-level undo records allow the system to promptly reclaim space occupied by deleted tuples. Combined with page merging, these mechanisms eliminate bloat in most cases. As a result, dedicated vacuuming of tables is unnecessary, removing a common cause of system performance degradation and database outages.
  3. Copy-on-Write Checkpoints and Row-Level WAL: OrioleDB utilizes copy-on-write checkpoints to provide structurally consistent snapshots of data at all times. This approach is friendly to modern SSDs and enables row-level Write-Ahead Logging (WAL). Row-level WAL is easy to parallelize (already implemented), compact, and suitable for active-active multi-master configurations (planned).
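The undo-chain idea from the second point can be sketched as follows: the newest tuple version stays in place, older versions are chained off it into the undo log, and a reader walks the chain until it finds a version visible to its snapshot (a simplified model with hypothetical names; the real visibility rules are far more involved):

```python
class UndoVersion:
    def __init__(self, value, xid, prev=None):
        self.value, self.xid, self.prev = value, xid, prev

class MVCCTuple:
    """Toy undo-chain MVCC: the newest version lives in place; older
    versions are evicted to an undo chain instead of bloating storage."""

    def __init__(self, value, xid):
        self.head = UndoVersion(value, xid)

    def update(self, value, xid):
        # Push the current version onto the undo chain rather than
        # leaving a dead tuple behind in the main storage.
        self.head = UndoVersion(value, xid, prev=self.head)

    def read(self, snapshot_xid):
        # Walk the chain to the newest version visible to the snapshot.
        v = self.head
        while v is not None and v.xid > snapshot_xid:
            v = v.prev
        return v.value if v is not None else None
```

Once no snapshot can see the older versions, the undo records are discarded wholesale, which is why no dedicated vacuuming is needed.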

Benchmarking OrioleDB vs PostgreSQL Heap

To illustrate the performance characteristics of OrioleDB we used TPC-C benchmarking, a complex test simulating real database workloads that is considered a modern standard in database applications.

PostgreSQL: No More VACUUM, No More Bloat

· 5 min read
Alexander Korotkov
Creator of OrioleDB

PostgreSQL, a powerful open-source object-relational database system, has been lauded for its robustness, functionality, and flexibility. However, it is not without its challenges, one of which is the notorious VACUUM process. The dawn of a new era is upon us with OrioleDB, a novel engine for PostgreSQL that promises to eliminate the need for the resource-consuming VACUUM.

OrioleDB is now in Beta

· One min read
Alexander Korotkov
Creator of OrioleDB
Pavel Borisov
PostgreSQL contributor

Long story short: the OrioleDB alpha version was released more than a year ago, and more than 200 bugs have been fixed since then. OrioleDB has now reached the beta stage, which means we recommend it for pre-production testing. The most interesting workloads for testing include high transaction throughput, a high volume of updates, a high volume of in-memory operations, lock bottlenecks, and other extreme cases.