Running large-scale compute on Kubernetes

I started Poiesis after looking at TESK during GSoC and thinking “how hard could it be?” Pretty hard, it turns out. The story of v0.1, what it cost me, and why v0.2 deletes most of it.

#kubernetes #tes #ga4gh #python #architecture

I started looking at TESK during a Google Summer of Code project. TESK is the reference Kubernetes implementation of the GA4GH Task Execution Service. Roughly, it is “Kubernetes Jobs, but with a stable, spec-compliant REST API on top so workflow engines like Nextflow and Snakemake can submit work to a cluster without caring how that cluster is wired.”

Reading TESK’s source, I had the same thought I always have when I read someone else’s code: I could do this differently. Most of the time that thought is wrong; sometimes it’s right; either way, the only honest way to find out is to write it.

So I wrote Poiesis. v0.1 shipped, ran tasks, and worked well enough to be useful. v0.2, which is what this post is about, deletes most of the v0.1 architecture and replaces it with something genuinely smaller. This is the story of how the first design met reality, what it cost, and why the new one collapses most of the moving parts into a single Kubernetes Pod.

What a TES task actually is

Before any of the architecture makes sense, the contract:

That’s the whole thing. Strict sequencing, shared filesystem between Executors, inputs in front, outputs at the back, predictable terminal states. The complication is that you’re running all of this on Kubernetes, which has its own model of containers and lifecycles that doesn’t quite line up with the TES model.
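To make the contract concrete, here is a minimal sketch of a TES task as a plain Python dict, with two small helpers that spell out the properties above. The field names (`inputs`, `outputs`, `executors`, `resources.disk_gb`) and the state names follow the GA4GH TES schema; the URLs, the image, and the helpers `validate_task` and `execution_order` are made up for illustration and are not Poiesis code.

```python
# Common terminal states of a TES task (names from the TES schema; not exhaustive).
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}

# Hypothetical task: one input staged in, two Executors run in order, one output staged out.
task = {
    "name": "md5-then-count",
    "inputs": [{"url": "s3://example-bucket/in.txt", "path": "/data/in.txt"}],
    "outputs": [{"path": "/data/out.txt", "url": "s3://example-bucket/out.txt"}],
    "executors": [
        {"image": "alpine:3.20", "command": ["sh", "-c", "md5sum /data/in.txt > /data/sum.txt"]},
        {"image": "alpine:3.20", "command": ["sh", "-c", "wc -c /data/sum.txt > /data/out.txt"]},
    ],
    "resources": {"cpu_cores": 1, "ram_gb": 1.0, "disk_gb": 2.0},
}


def validate_task(task: dict) -> list[str]:
    """Return a list of problems; an empty list means these sketch-level checks pass."""
    problems = []
    if not task.get("executors"):
        problems.append("a task needs at least one executor")
    for i, ex in enumerate(task.get("executors", [])):
        if not ex.get("image"):
            problems.append(f"executor {i} has no image")
        if not ex.get("command"):
            problems.append(f"executor {i} has no command")
    return problems


def execution_order(task: dict) -> list[str]:
    """Inputs first, Executors strictly in list order, outputs last."""
    steps = ["stage-inputs"]
    steps += [f"executor-{i}" for i in range(len(task["executors"]))]
    steps.append("stage-outputs")
    return steps


print(validate_task(task))    # []
print(execution_order(task))  # ['stage-inputs', 'executor-0', 'executor-1', 'stage-outputs']
```

The second Executor reads a file the first one wrote, which is why the shared filesystem is part of the contract and not an implementation detail.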

v0.1: the obvious design

When I sat down to write Poiesis, the obvious thing to build was a small fleet of components that each owned one part of the lifecycle. Anyone who has been around batch systems will recognise the pattern: an orchestrator, a couple of workers, a message broker, and a database.

The v0.1 fleet:

Drawing that as a diagram in your head: API talks to Mongo and creates a Torc. Torc creates a PVC, launches TIF, waits on Redis. TIF runs, publishes done. Torc launches Texam, waits on Redis. Texam launches N executor Jobs, watches each, publishes done. Torc launches TOF, waits on Redis. TOF runs, publishes done. Torc writes terminal state to Mongo and exits.

It worked. It ran real tasks. I added a Helm chart, OIDC auth, dynamic K8s job config, glob-style input/output paths, multi-arch images, and at some point a Nextflow integration guide. Tasks went in; results came out.

But the design had a tax I kept paying.

What v0.1 actually cost

Three things, mostly:

1. Pod count per task was 4 + N. A single task with three Executors started seven Pods: Torc + TIF + Texam + 3 executor Jobs + TOF. Every one of those paid its own scheduling delay, image-pull delay, and container-start delay. For a workflow engine submitting hundreds of small tasks, this was the difference between “interactive” and “batchy.” For an operator, it meant the cluster’s scheduler and admission quotas were being pressured by orchestration overhead, not by actual user work.

2. Redis was on the critical path. Every task waited on Redis pub/sub messages to advance through its lifecycle. If Redis hiccuped, tasks froze. Redis is fine software, but it is stateful software, and adding stateful software to the hot path of a thing whose whole job is to schedule other software is a particular kind of mistake. It also meant operators had to run a Redis they understood, monitor it, back it up if they cared, and add it to the compliance surface if they were in a regulated environment. Nobody installs a TES because they wanted to also install a Redis.

3. Torc dying mid-task was an outage class nobody owned. Torc was the only process that knew a task was in flight. If a Torc Pod died after launching Texam but before Texam published a terminal message, the children kept running, finished, published into the void, and the task record in Mongo sat in RUNNING forever. There was no reconciler. I wrote a “monitor” Job that papered over the worst cases, but the truth is that the architecture had a hole and the hole was structural.
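The Pod count in the first point is simple arithmetic, but it is worth seeing at workflow scale. A small sketch follows; the 200-task workflow is an invented figure, and the function names are hypothetical.

```python
# The four fixed v0.1 Pods that every task paid for, regardless of size.
V01_FIXED_PODS = ("torc", "tif", "texam", "tof")


def pods_per_task_v01(n_executors: int) -> int:
    """v0.1 Pod count: four fixed Pods plus one Job Pod per Executor."""
    if n_executors < 1:
        raise ValueError("a TES task has at least one Executor")
    return len(V01_FIXED_PODS) + n_executors


def pods_for_workflow_v01(executors_per_task: list[int]) -> int:
    """Total Pods scheduled for a workflow, given Executors per task."""
    return sum(pods_per_task_v01(n) for n in executors_per_task)


print(pods_per_task_v01(3))              # 7, the example from the text
print(pods_for_workflow_v01([1] * 200))  # 1000 Pods for 200 single-Executor tasks
```

In that invented workflow, 800 of the 1000 Pods are orchestration and staging, and only 200 run user work.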

There were smaller paper cuts too. MongoDB could not enforce the TES schema, so subtle drift in document shape accumulated over months. The auth story was bolted on. Cancellation took effect “when the next pub/sub message arrived,” which was sometimes never. Adding Kueue for batch admission was a non-starter because Kueue admits Jobs, not orchestration trees.

None of this made v0.1 broken. It made v0.1 expensive to operate, which over the lifetime of the project is the same thing.

The realisation that changed everything

I’d been mentally accepting the 4 + N Pod count as a cost of doing business, because how else would you run N sequenced containers with inputs and outputs? You’d need to launch them in sequence, you’d need something watching, you’d need somewhere to coordinate from. That’s just what orchestration looks like.

Except it isn’t. Kubernetes has a primitive for “run a list of containers strictly sequentially, abort on first failure, share a filesystem between them.” It’s called init containers.

And once I saw that, I couldn’t unsee the rest:

The match was not approximate. It was exact. The init-container semantics of the platform Poiesis was running on were already the execution semantics of the spec Poiesis was implementing. I’d been writing a distributed orchestration layer to do, badly and over Redis, what one Pod manifest could do natively.
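A minimal sketch of that mapping, using plain Python dicts in place of a Kubernetes client. The `initContainers` entry fields used here (`name`, `image`, `command`, `workingDir`, `env`, `volumeMounts`) are standard Pod spec, and `workdir` and `env` are TES Executor fields. The container names, the `/data` mount path, the `task-volume` volume name, and the `build_init_containers` helper are assumptions for illustration, not Poiesis's actual manifest.

```python
import json


def build_init_containers(executors: list[dict], mount_path: str = "/data") -> list[dict]:
    """Map TES Executors onto Pod init containers, preserving list order.

    Kubernetes runs init containers one at a time, in order, and only starts
    the next one after the previous one has exited successfully.
    """
    containers = []
    for i, ex in enumerate(executors):
        container = {
            "name": f"executor-{i}",
            "image": ex["image"],
            "command": ex["command"],
            # Every Executor mounts the same volume: the shared filesystem.
            "volumeMounts": [{"name": "task-volume", "mountPath": mount_path}],
        }
        if "workdir" in ex:
            container["workingDir"] = ex["workdir"]
        if "env" in ex:
            # TES env is a string-to-string map; the Pod spec wants name/value pairs.
            container["env"] = [{"name": k, "value": v} for k, v in ex["env"].items()]
        containers.append(container)
    return containers


executors = [
    {"image": "alpine:3.20", "command": ["sh", "-c", "echo one > /data/a"]},
    {"image": "alpine:3.20", "command": ["cat", "/data/a"], "workdir": "/data", "env": {"LANG": "C"}},
]
print(json.dumps(build_init_containers(executors), indent=2))
```

For "abort on first failure" to hold, the Pod itself would need `restartPolicy: Never`; with any other policy the kubelet retries a failed init container instead of failing the Pod.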

There was one missing piece. I still needed something inside the Pod that could watch the init containers progress and write state transitions to the database, because the API server outside the Pod couldn’t observe init-container statuses in real time. A normal sidecar wouldn’t work, because sidecars start after init containers finish, and by then the executors are already done.

But Kubernetes 1.29 had turned on native sidecar containers by default: init containers with restartPolicy: Always. A native sidecar starts before subsequent init containers, stays alive for the whole Pod, and is signalled to terminate after the main containers exit. It is exactly the shape of “a thing that watches the executors progress and records what happened.” And by 2026, a K8s version floor of 1.29 is no longer aspirational; any cluster a TES deployment realistically targets has been there for over a year.
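In manifest terms, the only thing that distinguishes a native sidecar from an ordinary init container is the container-level `restartPolicy: Always` field and its position in the list. A sketch, again with plain dicts; the `recorder` name, the image string, and the `with_native_sidecar` helper are hypothetical.

```python
def with_native_sidecar(init_containers: list[dict], sidecar_image: str) -> list[dict]:
    """Return a new initContainers list with a native sidecar placed first.

    Because init containers start in list order, putting the sidecar at index 0
    means it is already running when the first Executor container starts.
    """
    sidecar = {
        "name": "recorder",
        "image": sidecar_image,
        # Container-level field: this is what makes an init container a native sidecar.
        "restartPolicy": "Always",
    }
    return [sidecar, *init_containers]


executor_containers = [
    {"name": "executor-0", "image": "alpine:3.20", "command": ["true"]},
    {"name": "executor-1", "image": "alpine:3.20", "command": ["true"]},
]
init_containers = with_native_sidecar(executor_containers, "example.org/recorder:dev")
print([c["name"] for c in init_containers])  # ['recorder', 'executor-0', 'executor-1']
```

Kubernetes does not wait for a sidecar to exit before moving on to the next init container, only for it to start, which is why the Executors can run while the sidecar stays up.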

That was the whole redesign.

v0.2: the TaskPod

In v0.2, one task is one Kubernetes Pod, wrapped in a Job. The Pod is composed of:

All containers share one Task PVC sized to TesResources.disk_gb, owned by the Job via ownerReferences so Kubernetes itself garbage-collects the PVC when the Job is deleted. No application-level cleanup code on the disposal path; if Kubernetes is up, cleanup happens.
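A sketch of what such a PVC manifest could look like. The `ownerReferences` fields (`apiVersion`, `kind`, `name`, `uid`) are the ones Kubernetes requires; the naming scheme, the `ReadWriteOnce` access mode, rounding `disk_gb` up to whole `Gi`, and the `build_task_pvc` helper are all assumptions for illustration.

```python
import math


def build_task_pvc(task_id: str, disk_gb: float, job_name: str, job_uid: str) -> dict:
    """Build a PVC manifest owned by the task's Job.

    When the owning Job is deleted, the Kubernetes garbage collector deletes
    the PVC as a dependent, so no application code has to remember to.
    """
    size_gi = max(1, math.ceil(disk_gb))  # assumption: round up, at least 1Gi
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": f"task-{task_id}",
            "ownerReferences": [
                {
                    "apiVersion": "batch/v1",
                    "kind": "Job",
                    "name": job_name,
                    "uid": job_uid,  # must be the UID of the live Job object
                }
            ],
        },
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


pvc = build_task_pvc("abc123", 2.5, "task-abc123", "0f3c-example-uid")
print(pvc["spec"]["resources"]["requests"]["storage"])  # 3Gi
```

Since the owner's `uid` only exists once the Job has been created, a manifest like this implies creating the Job first and the PVC right after it, in the same namespace.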

Outside the per-task Pod, there is exactly one global piece left:

TCtl is not on the happy path. The TRec writes terminal state itself in the normal case. TCtl is there to cover the cases the TRec physically cannot self-report: the node died, the TRec itself got OOM-killed, the Pod was evicted before it could finalise. Every comparable Kubernetes-native batch system (Argo Workflows, Tekton, Volcano, the native Job controller) has this same shape: a per-task agent that does the work, and a global backstop reconciler for the cases the agent can’t speak for itself.

The new Pod count per task is one. The new components map cleanly: API + TaskPod + TCtl. The new dependencies are Kubernetes and Postgres. Nothing else.

What got deleted

In the same redesign, three components left the codebase:

The repo has a single commit titled refactor: delete v0.1 scaffolding flagged by vulture. Vulture is a Python dead-code detector. After the rewrite, it found a lot of dead code, because most of the v0.1 control-flow scaffolding no longer had anything to do.

Postgres, and why the document-store thing finally bit me

The other half of v0.2 was the persistence rewrite.

MongoDB was fine. MongoDB ran. MongoDB stored documents. The trouble was that MongoDB had no opinion about the shape of those documents. The GA4GH TES schema is precise (Tasks have specific fields, Executors are sequenced and have specific fields, Logs nest in specific ways), and at no point during v0.1 was anything actually enforcing that schema at the persistence layer. The Pydantic models in the API enforced it on the way in; nothing enforced it on storage. Subtle drift accumulated. An auditor asked me, late in v0.1’s life, what the source of truth for the schema was, and the honest answer was “the application code, which has had three contributors and four refactors.” That answer is not acceptable in a regulated environment, and the entire point of GA4GH compliance is that you can hand the system to someone in a regulated environment.

v0.2 is on Postgres, with a fully relational schema, foreign keys, NOT NULL constraints, enum types for terminal states, and a task_log / executor_log / system_log split that matches the spec one-to-one. Migrations are real migrations, not “the code probably writes the new shape eventually.” Concurrent writes are now safe by virtue of the database’s existing concurrency story, not by a layer of optimistic application-side logic.
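To show what "the database enforces the shape" means in practice, here is a small runnable sketch. It uses SQLite from the Python standard library as a stand-in so it runs without a server; it is not Postgres, and a `CHECK` constraint substitutes for a Postgres enum type. The `task` and `executor` tables and their columns are hypothetical and much smaller than a real TES schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE task (
    id    TEXT PRIMARY KEY,
    name  TEXT,
    -- Stand-in for a Postgres enum: only known states are storable.
    state TEXT NOT NULL CHECK (state IN (
        'QUEUED', 'INITIALIZING', 'RUNNING',
        'COMPLETE', 'EXECUTOR_ERROR', 'SYSTEM_ERROR', 'CANCELED'
    ))
);
CREATE TABLE executor (
    task_id  TEXT    NOT NULL REFERENCES task(id) ON DELETE CASCADE,
    position INTEGER NOT NULL,  -- sequencing is part of the schema
    image    TEXT    NOT NULL,
    PRIMARY KEY (task_id, position)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(SCHEMA)
conn.execute("INSERT INTO task (id, name, state) VALUES ('t1', 'demo', 'QUEUED')")
conn.execute("INSERT INTO executor (task_id, position, image) VALUES ('t1', 0, 'alpine:3.20')")


def rejected(sql: str) -> bool:
    """True if the database itself refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# An unknown state, an Executor without a task, and an Executor without an image.
print(rejected("INSERT INTO task (id, state) VALUES ('t2', 'DONE-ISH')"))
print(rejected("INSERT INTO executor (task_id, position, image) VALUES ('nope', 0, 'x')"))
print(rejected("INSERT INTO executor (task_id, position, image) VALUES ('t1', 1, NULL)"))
```

All three writes are refused by the storage layer, with no application code involved. That is the property a document store without schema validation does not provide.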

This was unglamorous work. It is also the thing I will be quietly grateful for every time an auditor asks for a schema diagram.

The non-obvious decisions

Some calls along the way that mattered more than they look:

The TRec gets RBAC to watch its own Pod, and nothing else. Specifically, resourceNames: [<this-pod-name>] on a Role scoped to its namespace. A user-submitted task cannot use the TRec’s credentials to enumerate the rest of the cluster. The blast radius is exactly one Pod, by Kubernetes’ own enforcement.
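A sketch of a Role with that shape, as a Python dict. The `rules` fields are standard `rbac.authorization.k8s.io/v1`; the Role naming scheme, the choice of `get` and `watch` as verbs, and the `build_recorder_role` helper are assumptions, not the project's actual manifest.

```python
def build_recorder_role(pod_name: str, namespace: str) -> dict:
    """Build a namespaced Role that can read exactly one Pod by name."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": f"recorder-{pod_name}", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],            # "" is the core API group, where Pods live
                "resources": ["pods"],
                "resourceNames": [pod_name],  # the blast-radius limit
                "verbs": ["get", "watch"],    # assumption: read-only access is enough
            }
        ],
    }


role = build_recorder_role("task-abc123-x7k2p", "tes")
print(role["rules"][0]["resourceNames"])  # ['task-abc123-x7k2p']
```

One detail from the Kubernetes RBAC documentation is worth knowing here: a `watch` restricted by `resourceNames` is only authorized when the request carries a matching `metadata.name` field selector, so the watcher has to ask for its own Pod by name and cannot open a namespace-wide watch.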

The TCtl doesn’t rely on a heartbeat. An earlier sketch had the TRec emitting heartbeats and a sweeper killing stale tasks. That was wrong: a database-only sweeper has no way to know why a task terminated. OOMKilled versus Evicted versus Error matters to the caller, and only the K8s API knows the difference. The TCtl asks Kubernetes directly via an informer, and writes the real reason into Postgres. The happy path never touches TCtl; the unhappy path gets a better answer than a heartbeat could give.
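The information a heartbeat cannot carry is sitting in the Pod's `status`. Below is a sketch of reading it from a Pod object represented as a dict. The fields (`status.phase`, `status.reason`, `initContainerStatuses[].state.terminated`) are what the Kubernetes API reports; the `termination_reason` and `tes_state_for` helpers, and in particular the mapping from reasons to TES states, are assumptions for illustration and not the TCtl's actual logic.

```python
from typing import Optional


def termination_reason(pod: dict) -> Optional[str]:
    """Return why a Pod failed, or None if it has not failed."""
    status = pod.get("status", {})
    if status.get("phase") != "Failed":
        return None
    # Evictions are reported at Pod level, e.g. status.reason == "Evicted".
    if status.get("reason"):
        return status["reason"]
    # Otherwise find the first init container that exited non-zero.
    for cs in status.get("initContainerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return terminated.get("reason", "Error")  # e.g. "OOMKilled" or "Error"
    return "Unknown"


def tes_state_for(reason: str) -> str:
    """Hypothetical mapping: the task's own failures vs. the platform's."""
    return "EXECUTOR_ERROR" if reason in {"OOMKilled", "Error"} else "SYSTEM_ERROR"


oom_pod = {
    "status": {
        "phase": "Failed",
        "initContainerStatuses": [
            {"name": "executor-0", "state": {"terminated": {"exitCode": 0, "reason": "Completed"}}},
            {"name": "executor-1", "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}}},
        ],
    }
}
evicted_pod = {"status": {"phase": "Failed", "reason": "Evicted"}}

print(termination_reason(oom_pod), tes_state_for(termination_reason(oom_pod)))
print(termination_reason(evicted_pod), tes_state_for(termination_reason(evicted_pod)))
```

A sweeper looking only at a stale heartbeat timestamp would see these two Pods as the same event.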

pause as the main container. This is a small, slightly silly detail, but: Kubernetes requires containers: to be non-empty. All the real work happens in init containers. The main container is gcr.io/google_containers/pause, which sleeps forever doing nothing, and a small ack from the TRec causes the Job to complete after the last init container finishes. I am not the first person to use pause this way, but it makes me happy every time I see it in the manifest.

ttlSecondsAfterFinished on the Job, ownerReferences from PVC to Job. Cleanup is two Kubernetes mechanisms, configured once in the manifest. No reaper code, no janitor cronjob, no “deleted from the application but is the PVC still there?” failure modes. The platform owns it.
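The Job half of that pair is one field. A sketch of the Job wrapper as a Python dict: `ttlSecondsAfterFinished`, `backoffLimit`, and `template` are standard `batch/v1` Job fields, while the one-hour TTL, `backoffLimit: 0`, the naming scheme, and the `build_task_job` helper are assumed values chosen for illustration.

```python
def build_task_job(task_id: str, pod_spec: dict, ttl_seconds: int = 3600) -> dict:
    """Wrap a Pod spec in a Job that Kubernetes deletes on its own after it finishes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"task-{task_id}"},
        "spec": {
            # The TTL controller deletes the Job this long after it completes or fails.
            "ttlSecondsAfterFinished": ttl_seconds,
            "backoffLimit": 0,  # assumption: a failed task is not retried
            # Job Pod templates must use restartPolicy Never or OnFailure.
            "template": {"spec": {**pod_spec, "restartPolicy": "Never"}},
        },
    }


job = build_task_job("abc123", {"containers": [{"name": "main", "image": "example.org/noop:dev"}]}, ttl_seconds=600)
print(job["spec"]["ttlSecondsAfterFinished"])  # 600
```

Once the TTL expires and the Job is deleted, its Pod and anything else that lists the Job in `ownerReferences` follow through ordinary garbage collection.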

FastAPI replaced connexion. Boring rewrite, no perf story. The motivation was that connexion’s OpenAPI-first model was costing more in friction than it was buying in spec adherence, and FastAPI’s Pydantic-native shape matched how the rest of the codebase was already structured. Sometimes a refactor is just paying back the cost of an early framework choice.

What v0.2 buys, concretely

For the operator:

For the TES client:

For me:

The thing I’d tell past-me

It would have been very easy to keep adding features to v0.1. The architecture worked. The hot patches all landed. The operators using it weren’t complaining loudly. They were quietly putting up with the operational tax because TES on Kubernetes wasn’t a crowded space and the alternatives weren’t obviously better.

The right call was to stop and ask: if I were starting today, knowing what the K8s platform now gives me, would I build this the same way? The answer was no, and the gap between “no” and “yes” was so large that the right move was to redesign, not patch.

Most rewrites are mistakes. The rewrites that aren’t mistakes are the ones where the platform you’re built on has gained a primitive that subsumes one of your components. Init containers were always there. Native sidecars (K8s 1.29) closed the last gap. Once both existed, what Poiesis v0.1 was doing was no longer “engineering”. It was “carrying around the old way of doing it.”

The repo is at github.com/jaeaeich/poiesis, Apache 2.0. Helm chart is in deployment/helm.

If you’ve made it this far and you run TES on Kubernetes, or you have an opinion about how you’d run it, open an issue. I am genuinely interested in the cases I haven’t seen yet.