
Decoupled storage and compute

One of the features OrioleDB provides is the ability to decouple storage and compute: data is kept in a separate storage layer, such as S3, while the compute layer runs on a separate instance. This gives more flexibility in scaling and managing data and compute resources independently.

Experimental feature

This feature is currently experimental and should be used with caution.

S3 storage is not required to use OrioleDB. It is an optional feature that can be enabled by setting the appropriate configuration parameters.

S3 database storage

OrioleDB has experimental support for storing all table and materialized view data in an S3 bucket. This is useful for separating compute and data storage instances, increasing data safety, and scaling or changing the architecture of compute instances while preserving all data.

Local storage acts as a cache for the most frequently accessed data. It also ensures that inserts and updates run at the speed of local writes rather than at the S3 transfer rate; data is synced to S3 asynchronously. Nevertheless, data integrity is guaranteed for all data in S3 storage as well. This means you can connect another empty PostgreSQL instance (initialized with the utility described below), with the OrioleDB extension configured to use the same S3 bucket, and get back all the data from S3 in the state of the last PostgreSQL checkpoint.

To use the S3 functionality, the following parameters should be set before creating orioledb tables and materialized views (see the example configuration sketch after the list):

  • orioledb.s3_mode -- whether to use S3 mode. It can be on or off. The default is off.
  • archive_library = 'orioledb' -- set it to use S3 mode.
  • archive_mode = on -- set it to use S3 mode.
  • orioledb.s3_region -- the S3 region where the S3 bucket is created.
  • orioledb.s3_host -- the access endpoint address of the S3 bucket (without the https:// prefix), e.g. mybucket.s3-accelerate.amazonaws.com.
  • orioledb.s3_accesskey -- the AWS access key used to authenticate to the bucket.
  • orioledb.s3_secretkey -- the AWS secret key used to authenticate to the bucket.
  • orioledb.s3_num_workers -- the number of workers syncing data to the S3 bucket. More workers can make syncing faster; 20 is a recommended value that is enough in most cases.
  • orioledb.s3_desired_size -- the total desired size of OrioleDB tables in local storage. Once this limit is exceeded, OrioleDB's background workers begin evicting local data to the S3 bucket. This mechanism ensures efficient use of local storage and seamless data transfer to S3. Effective support for this limit requires a filesystem that supports sparse files.
  • max_worker_processes -- the PostgreSQL limit on the maximum number of background workers. It should accommodate the extra orioledb.s3_num_workers on top of all other Postgres workers; as a starting point, set it to orioledb.s3_num_workers plus the previous max_worker_processes value.
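For reference, a minimal postgresql.conf sketch with these parameters might look like the following. The endpoint, region, credentials, and size values are placeholders chosen for illustration, not recommended defaults.

# Illustrative postgresql.conf snippet -- replace the placeholders with your own values
orioledb.s3_mode = on
archive_library = 'orioledb'
archive_mode = on
orioledb.s3_region = 'us-east-1'                            # placeholder region
orioledb.s3_host = 'mybucket.s3-accelerate.amazonaws.com'   # placeholder endpoint
orioledb.s3_accesskey = '<your access key>'
orioledb.s3_secretkey = '<your secret key>'
orioledb.s3_num_workers = 20
orioledb.s3_desired_size = '20GB'                           # assumed size notation; adjust to your local storage
max_worker_processes = 28                                   # previous value (e.g. 8) + orioledb.s3_num_workers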

After setting the GUC parameters above, restart the postmaster. All tables and materialized views created using orioledb will then be synced with the S3 bucket.
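For example, with a pg_ctl-managed instance (the data directory path below is a placeholder), the restart could look like this:

# Restart the server so the new settings take effect
pg_ctl -D /path/to/data restart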

CREATE TABLE s3_test
(
    id int8 NOT NULL,
    value1 float8 NOT NULL,
    value2 text NOT NULL,
    PRIMARY KEY(id)
) USING orioledb;
In S3 mode, all tables and materialized views created using orioledb are synchronized with S3 incrementally; that is, only modified blocks are put into the S3 bucket. Tables and materialized views not created using orioledb are saved completely at every checkpoint. Therefore, S3 mode is recommended when the majority of your data is stored using the OrioleDB engine.
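As a quick sanity check of which relations use the OrioleDB engine, a sketch like the following query against the standard PostgreSQL catalogs (not an OrioleDB-specific interface) lists tables and materialized views together with their table access method:

-- List tables and materialized views with their table access method
SELECT c.relname, c.relkind, am.amname
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind IN ('r', 'm')   -- 'r' = ordinary table, 'm' = materialized view
ORDER BY am.amname, c.relname;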

For best results, it's recommended to turn on Transfer Acceleration in the general AWS S3 bucket settings (the endpoint address will then be given with the s3-accelerate.amazonaws.com suffix) and to keep the bucket and the compute instance within the same AWS region. Even better is to use an AWS directory bucket within the same AWS region and Availability Zone as the compute instance.
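If you manage the bucket with the AWS CLI, Transfer Acceleration can be enabled on a general purpose bucket with a command along these lines (mybucket is a placeholder name):

# Enable Transfer Acceleration on the bucket
aws s3api put-bucket-accelerate-configuration --bucket mybucket --accelerate-configuration Status=Enabled

# Verify that acceleration is enabled
aws s3api get-bucket-accelerate-configuration --bucket mybucket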

As mentioned above, S3 mode is currently experimental. The major limitations of this mode are the following:

  1. While OrioleDB tables and materialized views are stored incrementally in the S3 bucket, the history is kept forever. There is currently no mechanism to safely remove the old data.
  2. In a primary/replica setup, each instance must have its own S3 bucket.
  3. The user is responsible for ensuring that only one instance works with a given S3 bucket. Connecting multiple database instances to the same bucket simultaneously leads to data corruption.

All of the limitations above are temporary and will be removed in future releases.

S3 loader utility

The S3 loader utility downloads data from the S3 bucket into a specified directory on any local machine.

To use it, install boto3 and testgres into your Python environment:

pip install boto3 testgres

Run the script with the same parameter values as in your S3-enabled Postgres cluster config:

  • AWS_ACCESS_KEY_ID - same as orioledb.s3_accesskey
  • AWS_SECRET_ACCESS_KEY - same as orioledb.s3_secretkey
  • AWS_DEFAULT_REGION - same as orioledb.s3_region
  • --endpoint - same as orioledb.s3_host, but as a full URL with the https:// prefix, e.g. --endpoint=https://mybucket.s3-accelerate.amazonaws.com or --endpoint=https://mybucket.s3.amazonaws.com
  • --bucket-name - the S3 bucket name from orioledb.s3_host, e.g. --bucket-name=mybucket
  • --data-dir - the destination directory on the local machine to write data to, e.g. --data-dir=mydata/
  • --verbose - optionally print extended info.

AWS_ACCESS_KEY_ID=<your access key> \
AWS_SECRET_ACCESS_KEY='<your secret key>' \
AWS_DEFAULT_REGION=<your region> \
python orioledb_s3_loader.py \
    --endpoint=https://<your-bucket-endpoint> \
    --bucket-name=<your-bucket-name> \
    --data-dir='orioledb_data' \
    --verbose