
s3q: S3-backed queue

s3q is a queue backed by AWS S3. It is available as a Rust and Python library.


This article is an introduction to s3q, a tool I have been working on. s3q is a queue backed by AWS S3. It stores queue state in a SQLite database file, caches that file on local disk, and uploads the updated file to S3 after every durable write.

```mermaid
flowchart LR
    app[Application]
    subgraph s3q[s3q]
        direction LR
        cache[Local SQLite Cache]
        s3[(Object Storage<br/>S3)]
        cache --> s3
    end
    app -->|uses| cache
```

There are many systems built with object storage as the primary durable layer.

Common techniques include batching or group commit, append-oriented write paths, and background compaction or reconciliation to keep write cost and write amplification under control.

These techniques are required because object stores have relatively high write latency and non-trivial per-operation cost. The advantage is that object stores provide excellent durability and scale out of the box.

The Turbopuffer team also built an internal queue system using object storage.

I created s3q to understand the constraints of using object storage as the durable store for a write-heavy workload like a queue.

Architecture

SQLite as the file format

The Turbopuffer team used a JSON file as the file format. I chose SQLite as both the query engine and the file format. The main design decisions are:

  • s3q reads and writes the whole SQLite file to object storage.
  • The SQLite database file is cached on local disk.
  • The S3 ETag is used to verify that the local copy has the most recent changes.
  • snapshot() reads the latest version from object storage.
  • sync() writes a new version to object storage using Compare-And-Set (CAS).
  • There are two durability modes:
    • Local: The application is responsible for keeping the local copy in sync with the remote copy.
    • Durable:
      • Reads are guaranteed to be from the latest version.
      • Writes are guaranteed to be durable.
    • More details are provided in the next section.
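The snapshot()/sync() cycle above can be modeled with a toy in-memory object store. `FakeStore` and `put_if_match` are illustrative names, not s3q's API; on real S3 the same precondition is expressed with a conditional PUT.

```python
import uuid

class FakeStore:
    """Toy in-memory object store with ETag compare-and-set semantics."""
    def __init__(self):
        self.body = None
        self.etag = None

    def get(self):
        # snapshot(): return the latest revision and its ETag
        return self.body, self.etag

    def put_if_match(self, body, expected_etag):
        # sync(): conditional PUT; fails if another writer won the race
        if self.etag != expected_etag:
            raise RuntimeError("precondition failed: stale ETag")
        self.body = body
        self.etag = uuid.uuid4().hex
        return self.etag

store = FakeStore()
etag = store.put_if_match(b"rev1", None)        # first publish
etag = store.put_if_match(b"rev2", etag)        # CAS with current ETag succeeds
try:
    store.put_if_match(b"rev3", "stale-etag")   # CAS with a stale ETag fails
except RuntimeError as e:
    print(e)  # precondition failed: stale ETag
```

The CAS precondition is what makes it safe for multiple writers to share one S3 object: a writer holding a stale revision cannot silently overwrite a newer one.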

I chose SQLite for a couple of reasons:

  • I had already built a queue using SQLite in pgqrs.
  • SQLite as the foundation makes it easier to support enhancements such as:
    • durable execution.
    • CDC to store changes rather than shipping the complete file on every write.
    • health checks, visibility timeouts, and other enhanced features inspired by pgmq.

Durability Modes

A change to the queue is durable if the SQLite file with the change has been written to object storage successfully.

s3q exposes two S3 durability modes:

  • Durable: write operations synchronize to S3 before returning
  • Local: writes stay in the local SQLite cache until the application explicitly calls sync()

Durable

In Durable mode, a write operation returns once the new SQLite revision has been written to object storage. The operation mutates local SQLite state, then publishes the updated database file to S3 using the previous ETag as a CAS precondition.

If that upload fails or conflicts, the API call returns an error. So a successful enqueue/dequeue in Durable mode means the new queue state is already in object storage.

```mermaid
sequenceDiagram
    participant s3q
    participant Local as Local SQLite cache
    participant S3 as S3 object
    s3q->>Local: write transaction
    Local-->>s3q: local commit
    s3q->>S3: conditional PUT with previous ETag
    alt PUT succeeds
        S3-->>s3q: new ETag
        s3q-->>s3q: mark local state clean
        s3q-->>s3q: return success
    else PUT fails or conflicts
        S3-->>s3q: error
        s3q-->>s3q: return error
    end
```

This is the simpler mode for an application because a successful API call is guaranteed to be durable. It is slower because every write pays object storage latency.
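A minimal sketch of the Durable write path, assuming a hypothetical `publish` callback that stands in for the conditional PUT (it would raise on an ETag conflict); `durable_enqueue` is not s3q's real API.

```python
import sqlite3

def durable_enqueue(db, payload, publish):
    """Commit locally, then publish the new revision.
    The write is durable only if publish() returns without raising."""
    db.execute("INSERT INTO queue(payload) VALUES (?)", (payload,))
    db.commit()
    publish(db)  # conditional PUT of the whole file; raises on conflict
    return "durable"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE queue(id INTEGER PRIMARY KEY, payload TEXT)")

published = []
durable_enqueue(db, "job-1", lambda d: published.append(
    d.execute("SELECT COUNT(*) FROM queue").fetchone()[0]))
print(published)  # [1] -> one revision published per durable write
```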

Local

Local mode is an experimental mode that gives applications explicit control over replication boundaries. It is useful when a workload pattern is not yet supported by the other mode: for example, at the time of writing, group commit is not supported, but an application can batch its own commits in this mode to improve throughput.

A write to the queue returns immediately without writing to object storage. The application must call sync() to publish the latest revision to object storage.

A read does not fetch the latest ETag, so it cannot detect whether the local copy is stale.

```mermaid
sequenceDiagram
    participant Writer
    participant Local as Writer local SQLite cache
    participant S3 as S3 object
    participant Follower
    Writer->>Local: write transaction
    Local-->>Writer: local commit
    Writer-->>Writer: return success immediately
    Note over Writer,S3: remote state is still unchanged
    Writer->>S3: sync() with previous ETag
    S3-->>Writer: new ETag
    Follower->>S3: snapshot()
    S3-->>Follower: latest SQLite revision
```

Two details are important:

  • sync() publishes only if the local state is dirty, and it uses ETag compare-and-set to avoid a blind overwrite.
  • snapshot() refuses to run if the local store has unsynced writes. That prevents a refresh from silently discarding local changes.
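These two guards can be sketched with simple dirty-state tracking. `LocalModeGuard` and its method names are illustrative, not s3q's actual API.

```python
class LocalModeGuard:
    """Illustrative dirty tracking for Local mode."""
    def __init__(self):
        self.dirty = False
        self.published = 0

    def write(self):
        self.dirty = True            # local commit only; remote unchanged

    def sync(self):
        if not self.dirty:
            return False             # nothing to publish; skip the PUT
        self.published += 1          # conditional PUT with previous ETag
        self.dirty = False
        return True

    def snapshot(self):
        if self.dirty:
            # refusing prevents a refresh from discarding local writes
            raise RuntimeError("unsynced local writes; call sync() first")
        return "latest revision"

q = LocalModeGuard()
q.write()
try:
    q.snapshot()
except RuntimeError as e:
    print(e)        # unsynced local writes; call sync() first
q.sync()            # publishes once
q.sync()            # no-op: state is already clean
print(q.published)  # 1
```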

Where This Queue Fits

An object-storage-backed queue is usable when the coordination tax of queue operations stays small relative to the work being dispatched. As a practical rule of thumb, based on the queue overhead fraction:

  • < 5%: strong fit
  • 5% to 20%: tradeoff if portability or operational simplicity matters
  • > 20%: poor fit
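The rule of thumb above is easy to state as a function; the threshold values come straight from the list, while the function name is just for illustration.

```python
def classify_fit(queue_time_ms, service_time_ms):
    """Classify workload fit by the queue overhead fraction."""
    overhead = queue_time_ms / (queue_time_ms + service_time_ms)
    if overhead < 0.05:
        return "strong fit"
    if overhead <= 0.20:
        return "tradeoff"
    return "poor fit"

print(classify_fit(27, 1000))   # strong fit  (~2.6% overhead)
print(classify_fit(100, 1000))  # tradeoff    (~9.1% overhead)
print(classify_fit(270, 500))   # poor fit    (~35% overhead)
```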

Queue Operations

The primary queue operations are:

  • enqueue
  • dequeue or claim
  • archive or complete

In Durable mode, each of these operations requires a durable write to object storage. A fully completed job therefore has at least 3 durable mutations. Retries, visibility extensions, and conflicts will push that number higher.

The important characteristics of object storage as a durable layer that determine the overhead of queue operations are:

  • Write latency of object storage.
  • Cost or number of PUT/GET/HEAD API calls.
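The second characteristic is easy to put in dollar terms. The sketch below assumes an illustrative PUT price of $0.005 per 1,000 requests; check current S3 pricing, and note that bytes transferred also grow with the SQLite file size since every publish rewrites the whole object.

```python
def s3_put_cost_per_million_jobs(puts_per_job, put_price_per_1000=0.005):
    """Request cost (in dollars) for one million completed jobs.
    The PUT price is an illustrative figure, not a quoted rate."""
    return puts_per_job * 1_000_000 / 1000 * put_price_per_1000

# three durable mutations per job: unbatched vs. a batch of 10
print(s3_put_cost_per_million_jobs(3.0))  # 15.0 -> $15 per million jobs
print(s3_put_cost_per_million_jobs(0.3))  # 1.5  -> $1.50 per million jobs
```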

Fit Calculator

The calculator below treats workload fit as a function of:

  • mutations per completed job
  • producers and consumers mutating the shared queue state
  • batch commit size
  • S3 latency envelope
  • SQLite object size
  • arrival rate
  • dispatch budget
  • useful work time per job

The simulator uses a simple model that is good enough to understand the main trade-offs:

```
queue_time_per_job = (mutations / batch_commit_size) * s3_latency * contention_penalty
capacity = 1000 / queue_time_per_job
queue_overhead_fraction = queue_time_per_job / (queue_time_per_job + service_time)
load_ratio = arrival_rate / capacity
```
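The model translates directly to Python. The parameter values below (90 ms S3 latency, 1,000 ms of useful work, 5.2 jobs/s arriving, batch of 10) are assumptions chosen to roughly reproduce the example readout, not measured numbers.

```python
def fit_model(mutations, batch, s3_latency_ms, service_ms, arrival_rate,
              contention_penalty=1.0):
    """Simple fit model: queue time, capacity, overhead, load, PUTs/job."""
    queue_time = (mutations / batch) * s3_latency_ms * contention_penalty
    capacity = 1000 / queue_time                       # jobs/s
    overhead = queue_time / (queue_time + service_ms)  # queue overhead fraction
    load = arrival_rate / capacity                     # load ratio
    puts_per_job = mutations / batch
    return queue_time, capacity, overhead, load, puts_per_job

qt, cap, ov, load, puts = fit_model(mutations=3, batch=10, s3_latency_ms=90,
                                    service_ms=1000, arrival_rate=5.2)
print(f"{qt:.0f} ms, {cap:.1f} jobs/s, {ov:.1%}, load {load:.2f}, "
      f"{puts:.2f} PUTs/job")
# 27 ms, 37.0 jobs/s, 2.6%, load 0.14, 0.30 PUTs/job
```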

Simulated Fit Explorer

When does an S3-backed queue make sense?

This is a lightweight calculator for understanding which queue workloads are a fit. Use the controls to vary mutations, writers, batching, arrival rate, dispatch budget, and S3 latency, then inspect how queue overhead, capacity, and write cost move.

An example readout from the calculator:

  • Fit Classification: tradeoff. Quick verdict on whether queue overhead stays small enough to justify the design.
  • Queue Time / Job: 27 ms. Delay added by queue mechanics before useful work can begin.
  • Capacity: 37.0 jobs/s. Throughput ceiling before backlog stops being occasional and becomes normal.
  • Queue Overhead: 2.6%. Share of each job spent paying queue tax instead of doing useful work.
  • Load Ratio: 0.14. How close the incoming workload is to permanently saturating the queue.
  • PUTs / Job: 0.30. Direct proxy for per-job S3 write cost and bytes moved.

Queue Overhead vs Useful Work

The same queue becomes much more reasonable once useful work dominates queue time. Lines show batch sizes `1`, `10`, and `50`.


Load Ratio vs Arrival Rate

This shows how close the queue is to saturation. Below `0.7` is healthy, `0.7` to `0.9` is a tradeoff zone, and `1.0` means backlog must grow.


Workload Fit Map

Each cell combines queue overhead, load pressure, and dispatch budget. The live crosshair marks the selected workload.


Importance of Group Commit

Batching, or group commit, is the most important feature. This is in line with every system that uses object storage as the primary durable layer.

Without batching:

  • every job pays S3 latency on its own
  • queue time per job stays high
  • capacity stays low
  • small, fast jobs look like a bad fit

With moderate batching:

  • one durable publish can cover multiple jobs
  • queue time per job drops quickly
  • capacity goes up
  • the queue starts to look reasonable once useful work is at least 10x to 20x the queue time
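The leverage of batching falls straight out of the model: one conditional PUT covers every commit in the batch, so per-job queue time divides by the batch size. The 90 ms latency below is an assumed figure.

```python
def queue_time_ms(mutations, batch, s3_latency_ms):
    """Per-job queue time when one durable publish covers `batch` commits."""
    return mutations / batch * s3_latency_ms

for batch in (1, 10, 50):
    print(f"batch {batch:>2}: {queue_time_ms(3, batch, 90):.1f} ms/job")
# batch  1: 270.0 ms/job
# batch 10: 27.0 ms/job
# batch 50: 5.4 ms/job
```

At batch size 1 a 1-second job pays about 21% queue overhead; at batch size 10 the same job pays under 3%, which is why batching moves workloads from "poor fit" to "tradeoff" or better.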

In practice

Good fits:

  • coarse durable jobs such as exports, ETL partitions, report generation, and media processing
  • moderate-rate async jobs where batching is possible and payloads are mostly references

Poor fits:

  • real-time dispatch
  • hot shared coordination points
  • many-writer queues with tiny units of work
  • workloads whose dispatch budget is already close to raw object storage latency

Future enhancements

The current system design of s3q is intentionally simple:

  • It is a library
  • It overwrites the SQLite database for every durable write.
  • It leaves group commit, the most important feature, to the application.

Immediate improvements to make s3q more usable are:

  • Support group commits in the library.
  • Add a service that producers and consumers can use for queue operations.
    • This reduces the number of writers.
    • This increases the scope for larger group commits.
  • If the size of the SQLite database becomes large, investigate writing CDC streams.