April 28, 2026 15 min read

Optimizing a handrolled S3 client

Building s3z from scratch in Rust. The wins, the reverts, and every wall I hit between.

#s3 #rust #perf

S3, AWS’s object storage, is proprietary. The wire protocol, on the other hand, is just HTTP plus SigV4 plus a smattering of XML. So at some point I asked myself, in the cocky way only a Rust programmer can be cocky: how hard can this be? Surely I can build a high-performance S3 client from scratch.

Spoiler: the protocol is the easy part. Everything else (connection pools, multipart races, signing on retry, the syscall pattern your kernel actually likes) is where the bodies are buried. This is the story of s3z: a from-scratch Rust S3 client and library with Python and Node bindings, benchmarked against mc, s5cmd, and aws-cli across four S3-compatible backends.

If you’d rather skim than read, the repo has the plots. They are humbling, and pretty.

The smallest possible thing that could fail

The very first commit was types and config. No networking, no IO, nothing. The whole point was to nail down the shape of the API before any byte ever hit a socket.

1d7ee4e  feat: add core types, config, auth, and error handling

ObjectKey, Bucket, Config, CredentialSource, a typed error tree. The plan was simple:

Core types and a sane error story. No anyhow in the library. Only typed errors. anyhow is fine for CLIs; library users want to match.
An HTTP layer that signs and retries. SigV4 is fiddly but well-specified.
A transfer engine on top. Multipart, batch, fan-out. The actual interesting bit.
A CLI, mostly to dogfood and partly because I wanted to type s3z up ./dir and have it Just Work.
Bindings. Once Rust works, expose it to Python and Node.
A benchmark harness. Because every README claims “blazing fast” and almost none of them prove it.

The HTTP layer came next: percent-encoded ObjectKey (RFC 3986; S3 is very particular here), signed request builders for both standard and UNSIGNED-PAYLOAD streaming uploads, XML parsing, and exponential backoff retry classifying 5xx and 429 as retryable. Nothing surprising; everything tedious.

8c53943  feat: add HTTP layer, ObjectKey encoding, request signing, response parsing, retry
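To make the two fiddliest pieces concrete, here is a minimal sketch of key encoding and retry classification. The names encode_key and is_retryable are made up for this example and are not taken from s3z. The encoding assumes the usual SigV4 convention: RFC 3986 unreserved characters and the path separator pass through, and every other byte becomes an uppercase %XX escape.

```rust
/// Percent-encode an object key for use in a request path.
/// Assumption: unreserved characters (RFC 3986) and '/' pass through;
/// every other byte becomes %XX with uppercase hex digits.
pub fn encode_key(key: &str) -> String {
    let mut out = String::with_capacity(key.len());
    for &byte in key.as_bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' | b'/' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Classify an HTTP status as worth retrying: 429 and any 5xx.
pub fn is_retryable(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}
```

Encoding byte by byte (rather than char by char) is what makes multi-byte UTF-8 keys come out right: each byte of the encoded character gets its own escape.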

The transfer engine, or how I learned to stop worrying and love bounded channels

The commit I’m proudest of from the early work:

c5aa1d4  feat: add transfer engine and upload operation

A naive S3 uploader looks like this:

// One PUT per file, strictly in sequence: each iteration waits a full round-trip.
for file in files {
    client.put_object(file).await?;
}

That works. It is also approximately useless against a remote bucket, because you’re paying one full round-trip per file, sequentially, no pipelining, no concurrency. The upload engine in s3z is a producer-consumer pipeline:

A part scheduler plans multipart uploads up front. File size in, part offsets and sizes out.
A bounded mpsc channel feeds parts to a fixed pool of upload workers. Bounded so the scheduler can’t outrun the network and blow memory; fixed pool so we have a hard ceiling on concurrent in-flight requests.
An AbortGuard wraps every multipart upload. If anything panics, returns Err, or the future gets dropped mid-flight, the guard calls AbortMultipartUpload on the way down. S3 will happily charge you for the storage of orphaned parts otherwise. Don’t be that person.
Batch upload walks directories recursively and feeds the engine.
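The scheduler-plus-bounded-queue shape can be sketched with nothing but the standard library. This is an illustration, not s3z's code: the real engine is async and uses tokio, while this version uses OS threads and std::sync::mpsc::sync_channel as a stand-in. The names Part, plan_parts, and upload_all are hypothetical, and the upload closure stands in for the actual UploadPart call.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// One planned part: 1-based part number, byte offset, and length.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Part {
    pub number: u32,
    pub offset: u64,
    pub len: u64,
}

/// Part scheduler: file size in, part offsets and sizes out.
pub fn plan_parts(file_size: u64, part_size: u64) -> Vec<Part> {
    assert!(part_size > 0, "part_size must be non-zero");
    let mut parts = Vec::new();
    let mut offset = 0;
    while offset < file_size {
        let len = part_size.min(file_size - offset);
        parts.push(Part { number: parts.len() as u32 + 1, offset, len });
        offset += len;
    }
    parts
}

/// Feed parts through a bounded queue to a fixed pool of workers.
/// Returns the part numbers that `upload` reported, sorted.
pub fn upload_all<F>(parts: Vec<Part>, workers: usize, queue_depth: usize, upload: F) -> Vec<u32>
where
    F: Fn(Part) -> u32 + Send + Sync + 'static,
{
    assert!(workers > 0, "workers must be non-zero");
    let (tx, rx) = sync_channel::<Part>(queue_depth);
    let rx = Arc::new(Mutex::new(rx));
    let upload = Arc::new(upload);
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            let upload = Arc::clone(&upload);
            thread::spawn(move || {
                let mut done = Vec::new();
                loop {
                    // The lock is held only while receiving, not while uploading.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(part) => done.push(upload(part)),
                        Err(_) => break, // sender dropped and queue drained
                    }
                }
                done
            })
        })
        .collect();
    for part in parts {
        tx.send(part).unwrap(); // blocks while the queue is full: this is the backpressure
    }
    drop(tx); // signal "no more parts" so the workers exit
    let mut all: Vec<u32> = handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    all.sort_unstable();
    all
}
```

The blocking send is the whole point of the bound: the scheduler can never get more than queue_depth parts ahead of the workers.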

This sounds clean. It is mostly clean. It also broke in several creative ways before I could trust it.
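The guard idea itself is plain RAII. The sketch below is a hypothetical, synchronous version: the closure stands in for the AbortMultipartUpload call, and disarm is an assumed name for "the upload completed, don't abort". How the real guard issues a network call from Drop in async code is not shown in this post, so that part is deliberately left out.

```rust
/// Runs `abort` when dropped, unless `disarm` was called first.
pub struct AbortGuard<F: FnOnce()> {
    abort: Option<F>,
}

impl<F: FnOnce()> AbortGuard<F> {
    /// Arm the guard right after the multipart upload has been created.
    pub fn new(abort: F) -> Self {
        Self { abort: Some(abort) }
    }

    /// Call once the upload has completed successfully.
    pub fn disarm(mut self) {
        // Dropping the closure without calling it cancels the abort.
        self.abort = None;
    }
}

impl<F: FnOnce()> Drop for AbortGuard<F> {
    fn drop(&mut self) {
        // Reached on early return, `?` propagation, panic unwinding, or a dropped future.
        if let Some(abort) = self.abort.take() {
            abort();
        }
    }
}
```

Because the cleanup lives in Drop, every exit path is covered without the caller remembering to handle each one.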

Pitfall #1: the part the scheduler dropped on the floor

Once I had real benchmarks pushing real load, this fell out:

c90f1c7  fix: multipart race, stale SigV4 on retry, and hardening

The original worker loop used try_recv in a hot poll. The intent was “grab a job if there is one, otherwise check for completions.” What actually happened: under load, the scheduler pushed parts into the channel faster than the workers could drain, and try_recv’s racy poll would occasionally see “empty” right after a push because of cross-thread visibility timing. Parts got dropped. Uploads silently completed with missing chunks. S3 returned 400 InvalidPart.

Fix: replace try_recv with tokio::select! racing job reception against task completion. No more racy polling; the runtime now actually waits for either side to be ready.
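The original loop isn't reproduced here, so the following is not a reconstruction of the bug. It is a std-only sketch (the real code is tokio) of the general hazard with polling receives: Empty means "nothing queued right now", not "nothing ever", and any loop that acts on Empty as if it were final can walk away from work. The function names are invented for the example.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Fragile: treats "nothing queued right now" as "no more work".
pub fn drain_polling(rx: &Receiver<u32>) -> Vec<u32> {
    let mut jobs = Vec::new();
    loop {
        match rx.try_recv() {
            Ok(job) => jobs.push(job),
            // Empty only means the producer has not pushed yet.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    jobs
}

/// Robust: blocks until a job arrives or every sender is gone.
pub fn drain_blocking(rx: &Receiver<u32>) -> Vec<u32> {
    let mut jobs = Vec::new();
    while let Ok(job) = rx.recv() {
        jobs.push(job);
    }
    jobs
}
```

A waiting receive only ends when the channel is actually closed, which is the property the select!-based loop relies on.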

The same commit fixed an even more embarrassing bug. SigV4 signatures include a timestamp. The retry loop was re-using the same signed request on each attempt, so if the first try took 6 seconds and got a 503, the retry sent a request with a stale timestamp, which S3 helpfully rejected as RequestTimeTooSkewed. The retry-of-the-retry failed for the same reason. Fix: re-sign on every attempt. Obvious in hindsight. Obvious in every retry loop I will ever write again.
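A sketch of the shape of that fix, under stated assumptions: SignedRequest and send_with_resign are hypothetical names, the sign closure stands in for the SigV4 signer, and send returns a bare status code. Backoff sleeps are omitted to keep the example short. The only point is that signing happens inside the loop.

```rust
/// Hypothetical signed request: only the timestamp matters for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SignedRequest {
    pub signed_at: u64,
}

/// Try up to `max_attempts` times, calling `sign` again before every attempt
/// so each request carries a fresh timestamp. Ok holds a 2xx status;
/// Err holds the last non-success status seen.
pub fn send_with_resign<S, T>(max_attempts: u32, mut sign: S, mut send: T) -> Result<u16, u16>
where
    S: FnMut() -> SignedRequest,
    T: FnMut(&SignedRequest) -> u16,
{
    assert!(max_attempts > 0, "max_attempts must be non-zero");
    let mut last_status = 0;
    for _ in 0..max_attempts {
        // Never reuse a request that was signed for an earlier attempt.
        let request = sign();
        last_status = send(&request);
        let retryable = last_status == 429 || (500..=599).contains(&last_status);
        if !retryable {
            return if (200..300).contains(&last_status) {
                Ok(last_status)
            } else {
                Err(last_status)
            };
        }
    }
    Err(last_status)
}
```

Passing a signer instead of a pre-signed request makes the stale-timestamp mistake hard to write in the first place.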

Pitfall #2: my connection pool was a benchmark

At some point I shipped this:

f445a68  perf(core): warm connection pool automatically on client init

The idea: pre-establish 256 TCP connections at client construction so the first wave of uploads doesn’t stall on sequential TLS handshakes. I felt clever. I made the constructor async. I patted myself on the back.

A handful of commits later:

39cf333  perf(core): remove eager connection pool warmup

The warmup added ~250ms to ~1s of startup overhead on every single invocation. For a CLI tool you might run a hundred times an hour, that’s a tax you pay forever. Worse, it produced bimodal timing distributions across every operation in the benchmark. Half the runs warmed cleanly, half got stuck behind a flaky handshake, and the bench output looked like a Rorschach test.

Reqwest’s pool grows on demand at ~2ms per new connection, which is laughably small compared to S3’s round-trip latency. So I ripped my own optimization out. The lesson: measure before optimizing, and measure after. The bimodal spikes vanished. Throughput was unaffected. The constructor went back to being sync.

This is the first of two “I was wrong, here is the revert” commits I want to highlight rather than hide. Performance work is mostly being wrong in interesting ways.

Pitfall #3: 1.5 GiB of RAM for a single-file upload

62aeede  perf(core): stream single-put uploads from disk instead of buffering

For files below the multipart threshold (50 MiB at the time), single_put was doing fs::read(path) and shoving the whole thing into a Vec<u8>. With 32 workers, against a remote endpoint where each upload sits in flight for hundreds of milliseconds, peak RSS hit 1,566 MiB. For uploading some 40 MB files. Embarrassing.

Switched to a 256 KiB ReaderStream matching the multipart path. Peak RSS: 85 MiB. No throughput regression. The signing scheme changed too: single-put now uses UNSIGNED-PAYLOAD (the entire chunked-payload SigV4 dance is genuinely annoying, and UNSIGNED-PAYLOAD is fine over HTTPS, which you’re using, right?).
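The memory difference comes down to reading through one fixed buffer instead of materializing the file. A minimal synchronous sketch of that idea follows; s3z's actual path is an async ReaderStream, and stream_chunks with its sink callback is a hypothetical stand-in for "hand this chunk to the HTTP body".

```rust
use std::io::{self, Read};

/// Read `reader` in fixed-size chunks and hand each one to `sink`.
/// Memory use is bounded by `chunk_size`, not by the input size.
/// Returns the total number of bytes streamed.
pub fn stream_chunks<R, F>(mut reader: R, chunk_size: usize, mut sink: F) -> io::Result<u64>
where
    R: Read,
    F: FnMut(&[u8]) -> io::Result<()>,
{
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut buf = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => return Ok(total), // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        sink(&buf[..n])?;
        total += n as u64;
    }
}
```

With a chunk size of 256 * 1024 and a std::fs::File as the reader, each in-flight upload holds one 256 KiB buffer rather than the whole file.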

Pitfall #4: TCP, Nagle, and other things you forgot existed

71773f8  perf(core): fix worker starvation, add retry jitter, tune TCP

Four problems in one commit, each its own minor headache:

Worker starvation again. Even after the try_recv fix, there was a path where workers could idle while completions queued up. recv was blocking on the job channel without simultaneously polling the completion side. Unified into a single select! racing both. Concurrency slots stayed saturated.
Thundering herds on retry. When many workers hit a throttling response at once, they backed off by the same exponential amount and retried in lockstep, getting throttled together again. Added equal-jitter backoff. No new dependency; it’s let jitter = rng.gen_range(0..=base); and the retry math.
Nagle and delayed ACK. With 256 KiB streaming writes, Nagle’s algorithm and delayed ACK conspire to stall every write by ~40ms while the kernel hopes more data shows up. tcp_nodelay on.
ALPN negotiation on cold connections. http1_only skips the ALPN dance. S3 is HTTP/1.1 anyway; there’s nothing to negotiate.
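For the jitter item, a dependency-free sketch of equal-jitter backoff: half of the capped exponential delay is fixed and the other half is random. The function name, the cap parameter, and passing the random value in as an argument (instead of calling an RNG, as the real code does) are all choices made for this example so it stays deterministic and std-only.

```rust
use std::time::Duration;

/// Equal-jitter backoff: half of the capped exponential delay is fixed,
/// the other half is scaled by `random_unit`, a caller-supplied value in [0, 1].
pub fn equal_jitter_delay(base: Duration, cap: Duration, attempt: u32, random_unit: f64) -> Duration {
    // base * 2^attempt, saturating instead of overflowing, then capped.
    let factor = 2u32.saturating_pow(attempt);
    let exp = base.saturating_mul(factor).min(cap);
    let half = exp / 2;
    half + half.mul_f64(random_unit.clamp(0.0, 1.0))
}
```

Because each worker draws its own random value, workers that were throttled together spread out across the upper half of the window instead of retrying in lockstep.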

None of these are exotic. All of them are the kind of thing you only find by actually running the workload under load and watching what the kernel is doing.

Pitfall #5: the tokio blocking pool is not your write thread

974411a  perf(transfer): replace spawn_blocking with dedicated writer thread

For multipart downloads, every range part writes its chunks back to the destination file via pwrite. The first implementation dispatched each pwrite through tokio::task::spawn_blocking. That works. It is also, for a 1 GB download, ~10,000 spawn_blocking calls through tokio’s shared blocking pool. A pool that other parts of your program (DNS, filesystem ops, anything else) are also sharing.

When the OS page cache decided to flush under concurrent write pressure, the whole thing cascaded. 5 GB download times were bimodally distributed between 4 seconds and 29 seconds. Same input, same network, wildly different timings.

The fix is structural: route all pwrite calls for a given file through a single dedicated std::thread per file. Async download tasks send buffered chunks over a bounded std::sync::mpsc channel; the writer thread drains the channel with sequential pwrites. No cross-thread scheduling per write, no contention with tokio’s shared pool, no cascading stalls.
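A minimal sketch of that structure, assuming a Unix target (write_all_at from std::os::unix::fs::FileExt is the standard library's positional write, backed by pwrite). Chunk and spawn_writer are hypothetical names; in the real client the senders would be async download tasks rather than plain threads.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // positional writes (Unix only)
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// A chunk of downloaded bytes and the file offset it belongs at.
pub struct Chunk {
    pub offset: u64,
    pub data: Vec<u8>,
}

/// Spawn one writer thread that owns `file` and applies chunks in arrival order.
/// Returns the sending half and a handle that yields the first I/O error, if any.
pub fn spawn_writer(
    file: File,
    queue_depth: usize,
) -> (SyncSender<Chunk>, JoinHandle<io::Result<()>>) {
    let (tx, rx) = sync_channel::<Chunk>(queue_depth);
    let handle = thread::spawn(move || -> io::Result<()> {
        // The loop ends once every sender has been dropped.
        for chunk in rx {
            file.write_all_at(&chunk.data, chunk.offset)?;
        }
        Ok(())
    });
    (tx, handle)
}
```

Chunks can arrive in any order because each carries its own offset; the bounded queue keeps fast downloaders from piling up unwritten data in memory.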

5 GB download variance after the change: 4 to 6 seconds. Stable.

A related win in the same area:

ac47c1f  perf(transfer): buffer pwrite calls and uncap download part size

Buffering 512 KiB of network chunks before issuing pwrite cut syscalls from ~64K/GB to ~2K/GB. And I removed the 256 MiB upper cap on download part sizes. S3 imposes no Range limit, so a 10 GB file at concurrency=8 now downloads as 8 × 1.25 GB streams instead of 40 × 256 MB streams. That means fewer Range requests, and every request avoided saves its own signing and round-trip cost. Throughput up.
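The syscall numbers imply network chunks of roughly 16 KiB (1 GiB / 64K), which is an inference rather than something stated above. Under that assumption, the coalescing step can be sketched as a small buffer that only calls the writer once a threshold is reached. CoalescingBuffer and its write callback are hypothetical names; the callback stands in for the pwrite.

```rust
/// Accumulates small network chunks and flushes them as one large write.
pub struct CoalescingBuffer {
    buf: Vec<u8>,
    threshold: usize,
    start_offset: u64, // file offset of buf[0]
}

impl CoalescingBuffer {
    pub fn new(start_offset: u64, threshold: usize) -> Self {
        assert!(threshold > 0, "threshold must be non-zero");
        Self { buf: Vec::with_capacity(threshold), threshold, start_offset }
    }

    /// Append a chunk; calls `write(offset, bytes)` once the threshold is reached.
    pub fn push<W: FnMut(u64, &[u8])>(&mut self, chunk: &[u8], mut write: W) {
        self.buf.extend_from_slice(chunk);
        if self.buf.len() >= self.threshold {
            self.flush(&mut write);
        }
    }

    /// Write out whatever is buffered. Call once more at end of stream.
    pub fn flush<W: FnMut(u64, &[u8])>(&mut self, mut write: W) {
        if self.buf.is_empty() {
            return;
        }
        write(self.start_offset, &self.buf);
        self.start_offset += self.buf.len() as u64;
        self.buf.clear();
    }
}
```

Feeding 64 chunks of 16 KiB through a 512 KiB threshold produces two writes instead of 64, which is the same 32× reduction as the per-GB figures.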

Pitfall #6: my part sizing was a constant when it should have been a function

e3ae055  feat(transfer): dynamic part sizing based on file size and concurrency
f2a72e5  refactor(transfer): split part sizing for upload and download
c8e27aa  perf(transfer): dynamic per-file concurrency based on size

The original part size was hardcoded to 50 MiB. Fine for some files. Terrible for others:

A 50 MB file gets one part. No pipelining benefit. Should just be a single PUT.
A 10 GB file at 50 MB parts gets 200 parts. Mostly fine, but with 8 workers you’ve got a 200-deep queue and no benefit beyond ~16 in-flight.
A 500 GiB file would blow past the S3 10,000-part limit: at 50 MiB per part it needs 10,240 parts. (Asserted with a panic now: assert!(num_parts <= 10_000), because silently producing invalid multipart uploads is the kind of UX I refuse to ship.)

The heuristic landed as: target concurrency × 2 parts per file for uploads (variable PUT latency benefits from queue depth), and concurrency × 1 for downloads (the server pushes data immediately; fewer Range requests wins). Clamped 8 MiB to 256 MiB on the upload side.
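As a rough sketch of that heuristic, assuming ceiling division for the target size and treating concurrency as the per-file stream count: upload_part_size and download_part_size are hypothetical names, and the floor of 1 byte on the download side exists only to avoid a zero-sized part in this example.

```rust
const MIB: u64 = 1024 * 1024;
const MAX_PARTS: u64 = 10_000; // S3 multipart upload limit

/// Ceiling division without relying on newer standard-library helpers.
fn ceil_div(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Upload: aim for `concurrency * 2` parts, clamped to 8 MiB..=256 MiB.
pub fn upload_part_size(file_size: u64, concurrency: u64) -> u64 {
    assert!(concurrency > 0, "concurrency must be non-zero");
    let target_parts = concurrency * 2;
    let size = ceil_div(file_size, target_parts).clamp(8 * MIB, 256 * MIB);
    // Fail loudly instead of planning an upload S3 would reject.
    assert!(ceil_div(file_size, size) <= MAX_PARTS, "file needs more than 10,000 parts");
    size
}

/// Download: one ranged stream per unit of concurrency, with no upper cap.
pub fn download_part_size(file_size: u64, concurrency: u64) -> u64 {
    assert!(concurrency > 0, "concurrency must be non-zero");
    ceil_div(file_size, concurrency).max(1)
}
```

For a 10 GiB upload at concurrency 8 the target of 640 MiB hits the 256 MiB ceiling, giving 40 parts; the same file as a 10 GB download splits into eight 1.25 GB ranges.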

Then a separate observation: concurrency itself shouldn’t be fixed per-file either. A 128 MB upload with 8 streams just allocates 8 buffers and underutilizes them. Tiered the per-file concurrency: ≤256 MiB → 2 streams, 256 MiB to 2 GiB → 4, >2 GiB → 8. Small-file upload RSS dropped from a range of 79 to 92 MB down to 30 to 39 MB, with no throughput regression. Big files still saturate the link.
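The tiers translate directly into a small lookup. The function name streams_for is made up for this sketch; the boundaries are the ones listed above, with each boundary value falling into the lower tier.

```rust
/// Per-file stream count by size tier:
/// up to 256 MiB -> 2, up to 2 GiB -> 4, anything larger -> 8.
pub fn streams_for(file_size: u64) -> usize {
    const MIB: u64 = 1024 * 1024;
    const GIB: u64 = 1024 * MIB;
    if file_size <= 256 * MIB {
        2
    } else if file_size <= 2 * GIB {
        4
    } else {
        8
    }
}
```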

Pitfall #7: a backend that demanded a Content-Length

This one is just funny.

63ff803  fix: set explicit Content-Length and sign empty bodies correctly

I’d set up a local docker-compose with four S3-compatible backends (MinIO, RustFS, SeaweedFS, and Garage), specifically so I could catch protocol assumptions before they shipped. RustFS rejected every PUT I sent it with UnexpectedContent. Took me an embarrassing amount of time to realize it wanted an explicit Content-Length header on every signed request, even for bodies where reqwest would have filled it in automatically downstream.

Worse, my empty-body signing was lazily treating empty bodies as UNSIGNED-PAYLOAD. RustFS’s stricter SigV4 validator wanted the real SHA256 of zero bytes (e3b0c4..., burn that hash into your memory if you do any S3-adjacent work). Both fixed. All four backends green.

I’d also love to tell you all four backends stayed in the benchmark forever. They did not:

55d77d3  chore(bench): remove rustfs due to high instability

RustFS hung randomly under bench load. After a half-day of trying to characterize the hang, I removed it. The benchmark pinned all four images, but stability is a feature you can’t add by pinning. Bench profile down to MinIO, SeaweedFS, Garage. Move on.

The benchmark harness was its own project

002e7a0  feat(bench): add benchmark framework with pluggable tools and operations
85ee761  feat(bench): regression detection and pluggable operations

I genuinely did not expect to spend this much time on the bench. But the project’s whole thesis is “this is faster”, so the bench has to be trustworthy.

Design decisions that paid for themselves:

Tool auto-discovery. Drop a bench/tools/aws.py and it’s picked up. Same for operations (bench/operations/upload.py, download.py, list.py). Adding s5cmd was one file with no orchestrator changes.
Warmup runs. Cold-start noise is a real problem for short benchmarks. One warmup per cell.
Welch’s t-test plus noise floor gate for regression detection. bench:compare exits non-zero if the post-change distribution is statistically different from baseline and the change exceeds the measured noise floor. Tight inner loop: bench:save to snapshot, edit, bench:dev, bench:compare.
Adaptive sample counts. High-variance cells get more samples until the CI tightens; low-variance cells finish quickly. Wall-clock for a dev run stays at ~3 to 5 min, full at ~10 to 15 min.
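The regression gate combines two conditions, and the arithmetic is short enough to sketch. The harness itself is Python; Rust is used here only to match the rest of the code in this post. This version compares the Welch t statistic against a critical value supplied by the caller instead of computing a p-value, and expresses the noise floor as a relative change in the mean. Both are simplifying assumptions, and all names are hypothetical.

```rust
/// Sample mean and unbiased sample variance.
fn mean_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
pub fn welch_t(a: &[f64], b: &[f64]) -> (f64, f64) {
    assert!(a.len() >= 2 && b.len() >= 2, "need at least two samples per side");
    let (mean_a, var_a) = mean_var(a);
    let (mean_b, var_b) = mean_var(b);
    // Squared standard error contributed by each sample.
    let (se_a, se_b) = (var_a / a.len() as f64, var_b / b.len() as f64);
    let t = (mean_b - mean_a) / (se_a + se_b).sqrt();
    let df = (se_a + se_b).powi(2)
        / (se_a.powi(2) / (a.len() as f64 - 1.0) + se_b.powi(2) / (b.len() as f64 - 1.0));
    (t, df)
}

/// Flag a change only if it is statistically distinguishable from baseline
/// AND larger than the noise floor (a relative change in the mean).
pub fn is_significant_change(
    baseline: &[f64],
    candidate: &[f64],
    t_critical: f64,
    noise_floor: f64,
) -> bool {
    let (t, _df) = welch_t(baseline, candidate);
    let (mean_base, _) = mean_var(baseline);
    let (mean_cand, _) = mean_var(candidate);
    let relative_change = (mean_cand - mean_base).abs() / mean_base;
    t.abs() > t_critical && relative_change > noise_floor
}
```

The second condition is what keeps a tiny but statistically real shift from failing the run: with enough samples almost any difference becomes "significant", so the noise floor decides whether it matters.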

The same setup almost lied to me about one tool:

0083423  fix(bench): mc region signing, dev profile tuning, dynamic plot labels

mc (MinIO’s client) was being unfairly penalized in the bench because Garage’s strict region check was rejecting requests where AWS_REGION wasn’t set in mc’s environment. The other tools picked it up via different env vars. I almost shipped a chart showing mc “losing” because of a config bug on my side. Lesson: if your tool looks anomalously bad, suspect your harness first.

Bindings: where the borrow checker meets the FFI boundary

90b3deb  feat(python): add PyO3 bindings for s3z
6a8ffd8  feat(node): add NAPI-RS bindings for s3z

PyO3 and NAPI-RS are both excellent. They are also both opinionated about what types cross the FFI boundary, and the rough edges live exactly where you’d expect: anywhere a Rust type holds a lifetime.

The clearest example is the list paginator. I’d originally written:

pub struct ListPaginator<'a> {
    client: &'a S3Client,
    continuation_token: Option<String>,
    // ...
}

This is clean Rust. It is also unusable across FFI, because Python and Node have no way to reason about that 'a. The fix:

2b0548f  feat(core): add paginated list API with owned ListPaginator

Owned copies of client internals, no lifetime. Safe to hold across await points, safe to pass to Python, safe to hand to NAPI. The cost is a couple of Arc::clones on construction. The gain is the API works in every language.
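A sketch of what "owned, no lifetime" looks like in practice. ClientInner is a stand-in invented for this example, with canned pages instead of real ListObjectsV2 calls, and the continuation token is a page index rather than the opaque string S3 returns. The part that matters is that the paginator holds an Arc instead of a borrow.

```rust
use std::sync::Arc;

/// Stand-in for the shared client state (hypothetical; not s3z's real type).
pub struct ClientInner {
    /// Each inner Vec plays the role of one page of listed keys.
    pub pages: Vec<Vec<String>>,
}

/// Owns an Arc to the client state, so the type has no lifetime parameter.
pub struct ListPaginator {
    client: Arc<ClientInner>,
    continuation_token: Option<usize>,
    done: bool,
}

impl ListPaginator {
    pub fn new(client: &Arc<ClientInner>) -> Self {
        // One Arc clone on construction is the entire cost of owning the client.
        Self { client: Arc::clone(client), continuation_token: None, done: false }
    }

    /// Fetch the next page, or None once the listing is exhausted.
    pub fn next_page(&mut self) -> Option<Vec<String>> {
        if self.done {
            return None;
        }
        let index = self.continuation_token.unwrap_or(0);
        let page = self.client.pages.get(index).cloned();
        if index + 1 < self.client.pages.len() {
            self.continuation_token = Some(index + 1);
        } else {
            self.done = true;
        }
        page
    }
}
```

The paginator keeps working even after the caller drops its own handle to the client, which is the guarantee a Python or Node object needs.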

The Node binding also had a sharp edge that took an afternoon to track down:

block_on_async helper to avoid nested-runtime panics when called from bun/node async contexts.

NAPI-RS runs on its own tokio runtime. If you call a function that internally calls tokio::runtime::Runtime::new().block_on(...), you get a panic about nested runtimes. The fix is a small helper that detects an existing runtime and dispatches accordingly.

Also: npm rejected the package name s3z as “too similar to existing packages.” It’s now scoped as @jae_aeich/s3z. Naming things, twice.

The reverts and the things I almost shipped

To keep myself honest, here are the “I was wrong” moments, in order:

Eager connection pool warmup (f445a68 → 39cf333). Shipped a “perf” change that introduced bimodal startup latency on every invocation. Removed it. The on-demand pool is faster in the only metric that matters: end-to-end CLI time.
try_recv for the worker poll loop (c5aa1d4 → c90f1c7). Looked like a clever way to avoid blocking. Was actually a way to silently drop parts. Switched to tokio::select!.
Reusing signed requests across retry attempts (8c53943 → c90f1c7). Stale SigV4 timestamps. Re-sign every attempt.
spawn_blocking for every pwrite (7ca495f → 974411a). Worked. Caused 7× variance in tail latency. Dedicated writer thread per file.
Buffering entire files in memory for single-put uploads (c5aa1d4 → 62aeede). 1.5 GiB RSS for a workload that should fit in 100 MB.
Fixed 50 MiB part size (c5aa1d4 → e3ae055 → f2a72e5). Worked for some files, was terrible for the rest. Dynamic sizing with separate upload and download heuristics.
Blanket clippy::restriction (initial commits → d2533ac). Enabling ~80 lints wholesale meant 35 #[expect] annotations suppressing standard Rust idioms. Down to 8 carefully chosen restriction lints, 8 expects.

There’s also one I almost shipped and didn’t. A version where run_pool(workers=0) returned Ok(vec![]) because of an if workers == 0 { return ... } early-return I’d written defensively. Caught at 2am, suspicious about a benchmark cell:

9bbdf29  fix(core): panic on zero workers/concurrency, fix misleading docs

If a caller passes workers=0 they have a bug. Silently doing nothing hides it. Assert and panic with a clear message. The same commit fixed upload_multipart(concurrency=0) silently sending invalid XML to S3, same class of bug.

This is the only category of “defensive” code I actually believe in: turn invalid inputs into loud failures, not quiet no-ops.

What the numbers actually say

The committed reference run lives at benchmarks/2026-04-26_2031_full_55d77d3/, and the plots are regenerated from it on every mise run bench:plot. Roughly:

Upload throughput. s3z saturates the dockerized links and matches or beats s5cmd (the fastest of the established tools) across MinIO, SeaweedFS, and Garage.
Download throughput. Same result, with the dedicated-writer-thread change making the variance a more interesting story than the median.
Listing. Paginated ListObjectsV2 with the channel-driven work pool. The API matters more here than raw speed, but the numbers are fine.
Memory. Peak RSS for the small-file workload sits at ~30 to 40 MB after the streaming and per-file-concurrency fixes.

Three numbers I find satisfying:

Change	Before	After
Single-put RSS (40 MB files, 32 workers)	1,566 MiB	85 MiB
Small-upload RSS (128 MB workload)	79 to 92 MB	30 to 39 MB
5 GB download variance	4 to 29 s (bimodal)	4 to 6 s (stable)

None of these came from cleverness. They came from running the bench, reading the chart, and asking “why is this distribution bimodal?” repeatedly until it wasn’t.

Things I’d tell past-me

Write the bench before you write the third optimization. I wrote it after the fifth. I would have caught the connection-pool warmup regression sooner if I’d had it sooner.
Bounded channels everywhere. If your producer is faster than your consumer, your only choices are bound it, buffer it forever, or drop it. Unbounded buffers are not actually a choice; they’re “drop it, but later, in a memory allocator stall.”
Re-sign on every retry. I will never write a retry loop again without first asking what the request body and headers depend on.
Running multiple backends in your local stack pays for itself almost immediately. The Content-Length thing would have shipped without RustFS in the loop, and some user, somewhere, would have hit it on a backend I’d never tested against.
Permissive licenses or bust. s3z is MIT. So is every dependency I pulled in. I checked.

It’s the friends we made along the way

The repo is at github.com/jaeaeich/s3z. The CLI installs with one curl. The library is on crates.io. Python wheels are on PyPI. The Node package is @jae_aeich/s3z on npm. There’s an Axum example, a FastAPI example, and an Elysia example, all three speaking the same protocol over Scalar-rendered OpenAPI docs and using s3z directly (Rust) or via FFI (Python, Node).

I learned more writing s3z than I have building anything in a long time, mostly because every optimization I shipped immediately taught me something else was wrong. If you’ve made it this far and want to break s3z in a creative new way, please do. Issues welcome.