This article is an introduction to s3q, a tool I have been working on.
s3q is a queue backed by AWS S3. It stores its state in a SQLite database file,
caches that file on local disk, and uploads the updated file to S3 on every
durable write.
```mermaid
flowchart LR
    app[Application]
    subgraph s3q[s3q]
        direction LR
        cache[Local SQLite Cache]
        s3[(Object Storage<br/>S3)]
        cache --> s3
    end
    app -->|uses| cache
```

There are many systems built with object storage as the primary durable layer.
Common techniques include batching or group commit, append-oriented write paths, and background compaction or reconciliation to keep write cost and write amplification under control.
These techniques are required because object stores have relatively high write latency and non-trivial per-operation cost. The advantage is that object stores provide excellent durability and scale out of the box.
The Turbopuffer team also built an internal queue system using object storage.
I created s3q to understand the constraints of using object storage as the durable store for a write-heavy workload like a queue.
Architecture
SQLite as the file format
The Turbopuffer team used a JSON file as the file format. I chose SQLite as the query engine and file format. The main design decisions are:
- s3q reads and writes the whole SQLite file to object storage.
- The SQLite database file is cached on local disk.
- The `ETag` is used to verify that the local copy has the most recent changes.
- `snapshot()` reads the latest version from object storage.
- `sync()` writes a new version to object storage using Compare-And-Set (CAS).
- There are two durability modes:
  - Local: The application is responsible for making sure it is in sync with the remote copy.
  - Durable:
    - Reads are guaranteed to be from the latest version.
    - Writes are guaranteed to be durable.
    - More details are provided in the next section.
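The snapshot/sync contract above can be sketched with an in-memory stand-in for the S3 object. The `FakeS3` and `Cache` classes and their method shapes are illustrative, not s3q's actual API:

```python
import hashlib

class FakeS3:
    """In-memory stand-in for a single S3 object with ETag semantics."""
    def __init__(self):
        self.body, self.etag = b"", None

    def get(self):
        return self.body, self.etag

    def put_if_match(self, body, expected_etag):
        # Conditional PUT: succeeds only if the caller saw the latest version.
        if expected_etag != self.etag:
            raise RuntimeError("precondition failed: stale ETag")
        self.body = body
        self.etag = hashlib.md5(body).hexdigest()
        return self.etag

class Cache:
    """Local copy of the database file plus the ETag it was read at."""
    def __init__(self, remote):
        self.remote = remote
        self.body, self.etag = None, None

    def snapshot(self):
        # Read the latest version from object storage.
        self.body, self.etag = self.remote.get()

    def sync(self, new_body):
        # Publish a new version with the previous ETag as a CAS precondition.
        self.etag = self.remote.put_if_match(new_body, self.etag)
        self.body = new_body

remote = FakeS3()
a, b = Cache(remote), Cache(remote)
a.snapshot(); b.snapshot()
a.sync(b"v1")                  # a publishes successfully
try:
    b.sync(b"v1-conflict")     # b is stale; the CAS rejects the write
    conflict = False
except RuntimeError:
    conflict = True
```

Real S3 expresses the same precondition with conditional writes (`If-Match` on PUT); the in-memory version only models the success/conflict behavior.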
I chose SQLite for a couple of reasons:
- I had already built a queue using SQLite in pgqrs.
- SQLite as the foundation makes it easier to support enhancements such as:
- durable execution.
- CDC to store changes rather than shipping the complete file on every write.
- health checks, visibility timeouts, and other enhanced features inspired by pgmq.
Durability Modes
A change to the queue is durable if the SQLite file with the change has been written to object storage successfully.
s3q exposes two S3 durability modes:
- Durable: write operations synchronize to S3 before returning.
- Local: writes stay in the local SQLite cache until the application explicitly calls `sync()`.
Durable
In Durable mode, a write operation returns once the new SQLite revision has
been written to object storage.
The operation mutates local SQLite state, then publishes the updated database
file to S3 using the previous ETag as a CAS precondition.
If that upload fails or conflicts, the API call returns an error.
So a successful enqueue/dequeue in Durable mode means the new queue state is already in object storage.
```mermaid
sequenceDiagram
    participant s3q
    participant Local as Local SQLite cache
    participant S3 as S3 object
    s3q->>Local: write transaction
    Local-->>s3q: local commit
    s3q->>S3: conditional PUT with previous ETag
    alt PUT succeeds
        S3-->>s3q: new ETag
        s3q-->>s3q: mark local state clean
        s3q-->>s3q: return success
    else PUT fails or conflicts
        S3-->>s3q: error
        s3q-->>s3q: return error
    end
```

This is the simpler mode for an application because a successful API call is guaranteed to be durable. It is slower because every write pays object storage latency.
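The steps above can be sketched end to end. `durable_enqueue` and the `Remote` stand-in are illustrative simplifications: the real path ships the actual SQLite file with a conditional PUT, while this sketch dumps the database to text and uses a counter as the ETag:

```python
import sqlite3

class StaleEtag(Exception):
    pass

class Remote:
    """Minimal stand-in for the S3 object: body plus a version counter as ETag."""
    def __init__(self):
        self.body, self.etag = b"", 0

    def conditional_put(self, body, expected_etag):
        if expected_etag != self.etag:
            raise StaleEtag
        self.body, self.etag = body, self.etag + 1
        return self.etag

def durable_enqueue(db, remote, etag, payload):
    # 1. Mutate local SQLite state.
    db.execute("INSERT INTO queue(payload) VALUES (?)", (payload,))
    db.commit()
    # 2. Publish the updated database, using the previous ETag as precondition.
    body = "\n".join(db.iterdump()).encode()  # stand-in for shipping the file
    try:
        # Success returns the new ETag: the local state is now clean.
        return remote.conditional_put(body, etag)
    except StaleEtag:
        # Failure surfaces as an error: the write is NOT durable.
        raise RuntimeError("enqueue not durable: remote changed underneath us")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE queue(id INTEGER PRIMARY KEY, payload TEXT)")
remote = Remote()
new_etag = durable_enqueue(db, remote, 0, "job-1")  # state is now in "S3"
```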
Local
Local mode is an experimental mode that gives applications explicit control
over replication boundaries. An application can take that control when its
workload pattern is not yet supported by Durable mode.
For example, at the time of writing this post, group commits are not supported.
The application has the opportunity to group commits in this mode to improve throughput.
A write to the queue returns immediately and is not written to object storage.
The application has to call `sync()` to write the latest revision to object storage.
A read does not fetch the latest ETag, so it cannot tell whether the local copy is stale.
```mermaid
sequenceDiagram
    participant Writer
    participant Local as Writer local SQLite cache
    participant S3 as S3 object
    participant Follower
    Writer->>Local: write transaction
    Local-->>Writer: local commit
    Writer-->>Writer: return success immediately
    Note over Writer,S3: remote state is still unchanged
    Writer->>S3: sync() with previous ETag
    S3-->>Writer: new ETag
    Follower->>S3: snapshot()
    S3-->>Follower: latest SQLite revision
```

Two details are important:

- `sync()` publishes only if the local state is dirty, and it uses `ETag` compare-and-set to avoid a blind overwrite.
- `snapshot()` refuses to run if the local store has unsynced writes. That prevents a refresh from silently discarding local changes.
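Both invariants are easy to state in code. A minimal sketch with illustrative names (`LocalModeStore` is not s3q's API, and the ETag CAS is omitted):

```python
class LocalModeStore:
    """Sketch of the two Local-mode invariants: dirty-only sync, safe snapshot."""
    def __init__(self):
        self.local, self.remote = b"", b""
        self.dirty = False
        self.puts = 0  # durable publishes performed

    def write(self, data):
        # Local-mode write: returns immediately, remote is untouched.
        self.local = data
        self.dirty = True

    def sync(self):
        # Publish only if dirty; a real implementation also does an ETag CAS.
        if self.dirty:
            self.remote = self.local
            self.dirty = False
            self.puts += 1

    def snapshot(self):
        # Refuse to refresh over unsynced writes instead of discarding them.
        if self.dirty:
            raise RuntimeError("unsynced local writes; call sync() first")
        self.local = self.remote

store = LocalModeStore()
store.write(b"rev1")
try:
    store.snapshot()   # rejected: would silently discard the local write
    refused = False
except RuntimeError:
    refused = True
store.sync()           # publishes rev1
store.sync()           # no-op: state is already clean
```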
Where This Queue Fits
An object-storage-backed queue is usable when the coordination tax of queue operations stays small relative to the work being dispatched. As a practical rule of thumb for the queue overhead fraction:
- < 5%: strong fit
- 5% to 20%: tradeoff if portability or operational simplicity matters
- > 20%: poor fit
Queue Operations
The primary queue operations are:
- `enqueue`
- `dequeue` or `claim`
- `archive` or `complete`
In Durable mode, each of these operations requires a durable write to object storage.
A fully completed job therefore has at least 3 durable mutations.
Retries, visibility extensions, and conflicts will push that number higher.
The important characteristics of object storage as a durable layer that determine the overhead of queue operations are:
- Write latency of object storage.
- Cost or number of PUT/GET/HEAD API calls.
Fit Calculator
The calculator below treats workload fit as a function of:
- mutations per completed job
- producers and consumers mutating the shared queue state
- batch commit size
- S3 latency envelope
- SQLite object size
- arrival rate
- dispatch budget
- useful work time per job
The simulator uses a simple model, which is good enough to understand the main trade-offs:

```
queue_time_per_job      = (mutations / batch_commit_size) * s3_latency * contention_penalty
capacity                = 1000 / queue_time_per_job
queue_overhead_fraction = queue_time_per_job / (queue_time_per_job + service_time)
load_ratio              = arrival_rate / capacity
```
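The model is small enough to run directly. A minimal Python sketch; the inputs (3 mutations per job, batches of 10, 60 ms S3 latency, a 1.5x contention penalty, 1 s of useful work, 5.2 jobs/s arriving) are illustrative assumptions, not the calculator's actual defaults:

```python
def fit_model(mutations, batch_commit_size, s3_latency_ms,
              contention_penalty, service_time_ms, arrival_rate):
    # Queue mechanics paid per job, amortized over the batch.
    queue_time = (mutations / batch_commit_size) * s3_latency_ms * contention_penalty
    capacity = 1000 / queue_time                  # jobs/s throughput ceiling
    overhead = queue_time / (queue_time + service_time_ms)
    load_ratio = arrival_rate / capacity
    puts_per_job = mutations / batch_commit_size  # durable PUTs amortized per job
    return queue_time, capacity, overhead, load_ratio, puts_per_job

qt, cap, ov, load, puts = fit_model(
    mutations=3, batch_commit_size=10, s3_latency_ms=60,
    contention_penalty=1.5, service_time_ms=1000, arrival_rate=5.2)
# qt = 27.0 ms/job, cap ≈ 37.0 jobs/s, ov ≈ 2.6%, load ≈ 0.14, puts = 0.3
```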
Simulated Fit Explorer
When does an S3-backed queue make sense?
This is a lightweight calculator for understanding which queue workloads are a fit. Use the controls to vary mutations, writers, batching, arrival rate, dispatch budget, and S3 latency, then inspect how queue overhead, capacity, and write cost move.
- Fit Classification: Tradeoff. A quick verdict on whether queue overhead stays small enough to justify the design.
- Queue Time / Job: 27 ms. Delay added by queue mechanics before useful work can begin.
- Capacity: 37.0 jobs/s. The throughput ceiling before backlog stops being occasional and becomes normal.
- Queue Overhead: 2.6%. The share of each job spent paying queue tax instead of doing useful work.
- Load Ratio: 0.14. How close the incoming workload is to permanently saturating the queue.
- PUTs / Job: 0.30. A direct proxy for per-job S3 write cost and bytes moved.
Queue Overhead vs Useful Work
The same queue becomes much more reasonable once useful work dominates queue time. Lines show batch sizes `1`, `10`, and `50`.
Load Ratio vs Arrival Rate
This shows how close the queue is to saturation. Below `0.7` is healthy, `0.7` to `0.9` is a tradeoff zone, and `1.0` means backlog must grow.
Workload Fit Map
Each cell combines queue overhead, load pressure, and dispatch budget. The live crosshair marks the selected workload.
Importance of Group Commit
Batching, or group commit, is the most important feature. This is in line with every system that uses object storage as the primary durable layer.
Without batching:
- every job pays S3 latency on its own
- queue time per job stays high
- capacity stays low
- small, fast jobs look like a bad fit
With moderate batching:
- one durable publish can cover multiple jobs
- queue time per job drops quickly
- capacity goes up
- the queue starts to look reasonable once useful work is at least `10x` to `20x` the queue time
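The same model makes the batching effect concrete. Assuming 3 durable mutations per job, 60 ms per publish, and 1 s of useful work (illustrative numbers, no contention penalty):

```python
def queue_overhead(mutations, batch_size, s3_latency_ms, service_time_ms):
    # One durable publish covers `batch_size` jobs, so each job pays only
    # its amortized share of the S3 round trips.
    queue_time = (mutations / batch_size) * s3_latency_ms
    return queue_time / (queue_time + service_time_ms)

no_batch = queue_overhead(3, 1, 60, 1000)   # every job pays S3 latency alone
batched  = queue_overhead(3, 10, 60, 1000)  # one publish covers ten jobs
# no_batch ≈ 15.3% (tradeoff zone), batched ≈ 1.8% (strong fit):
# moderate batching alone moves the workload across the 5% threshold.
```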
In practice
Good fits:
- coarse durable jobs such as exports, ETL partitions, report generation, and media processing
- moderate-rate async jobs where batching is possible and payloads are mostly references
Poor fits:
- real-time dispatch
- hot shared coordination points
- many-writer queues with tiny units of work
- workloads whose dispatch budget is already close to raw object storage latency
Future enhancements
The current system design of s3q is intentionally simple:
- It is a library.
- It overwrites the SQLite database for every durable write.
- It leaves group commit, the most important feature, to the application.
Immediate improvements to make s3q more usable are:
- Support group commits in the library.
- Add a service that producers and consumers can use for queue operations.
  - This reduces the number of writers.
  - This increases the scope for larger group commits.
- If the size of the SQLite database becomes large, investigate writing CDC streams.