Running large-scale compute on Kubernetes
I started Poiesis after looking at TESK during GSoC and thinking “how hard could it be?” Pretty hard, it turns out. The story of v0.1, what it cost me, and why v0.2 deletes most of it.
I started looking at TESK during a Google Summer of Code project. TESK is the reference Kubernetes implementation of the GA4GH Task Execution Service. Roughly, it is “Kubernetes Jobs, but with a stable, spec-compliant REST API on top so workflow engines like Nextflow and Snakemake can submit work to a cluster without caring how that cluster is wired.”
Reading TESK’s source, I had the same thought I always have when I read someone else’s code: I could do this differently. Most of the time that thought is wrong; sometimes it’s right; either way, the only honest way to find out is to write it.
So I wrote Poiesis. v0.1 shipped, ran tasks, and worked well enough to be useful. v0.2, which is what this post is about, deletes most of the v0.1 architecture and replaces it with something genuinely smaller. This is the story of how the first design met reality, what it cost, and why the new one collapses most of the moving parts into a single Kubernetes Pod.
What a TES task actually is
Before any of the architecture makes sense, the contract:
- A Task is a sequenced list of Executors.
- An Executor is “run this container image with this command and this env, with these inputs available on disk, and capture stdout/stderr.”
- Executors run strictly sequentially. Executor 2 must not start until Executor 1 has finished successfully. If Executor 1 fails, Executors 2..N must not run.
- The Task also has inputs to be staged in from object storage (S3, HTTP) before any Executor runs, and outputs to be uploaded back after they finish.
- The Task terminates in one of four states: COMPLETE, EXECUTOR_ERROR, SYSTEM_ERROR, CANCELED.
That’s the whole thing. Strict sequencing, shared filesystem between Executors, inputs in front, outputs at the back, predictable terminal states. The complication is that you’re running all of this on Kubernetes, which has its own model of containers and lifecycles that doesn’t quite line up with the TES model.
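The sequencing rule is small enough to write down as a toy model. The sketch below is not Poiesis code: the `Executor` dataclass, `run_task`, and the `run_one` callback are hypothetical names, and the mapping of an unexpected exception to SYSTEM_ERROR is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class State(Enum):
    COMPLETE = "COMPLETE"
    EXECUTOR_ERROR = "EXECUTOR_ERROR"
    SYSTEM_ERROR = "SYSTEM_ERROR"
    CANCELED = "CANCELED"


@dataclass
class Executor:
    image: str
    command: List[str]
    env: Dict[str, str] = field(default_factory=dict)


def run_task(
    executors: List[Executor],
    run_one: Callable[[Executor], int],
) -> Tuple[State, List[str]]:
    """Run executors strictly in order and stop at the first failure.

    run_one is a stand-in for "start this container and return its exit code".
    Returns the terminal state and the images that were actually started.
    """
    started: List[str] = []
    for executor in executors:
        started.append(executor.image)
        try:
            exit_code = run_one(executor)
        except Exception:
            # Assumption: anything that is not an exit code counts as a
            # platform failure rather than an executor failure.
            return State.SYSTEM_ERROR, started
        if exit_code != 0:
            # Executors after the failing one must not run.
            return State.EXECUTOR_ERROR, started
    return State.COMPLETE, started
```

If the second of three executors exits non-zero, the third never appears in `started`, which is the whole contract in one line.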
v0.1: the obvious design
When I sat down to write Poiesis, the obvious thing to build was a small fleet of components that each owned one part of the lifecycle. Anyone who has been around batch systems will recognise the pattern: an orchestrator, a couple of workers, a message broker, and a database.
The v0.1 fleet:
- API server. FastAPI on top of (originally) connexion, talking GA4GH TES REST. Stateless, horizontally scalable. Wrote Task records to MongoDB.
- Torc (Task Orchestrator). A long-lived Pod, one per task, that owned the task lifecycle. Created the PVC, then launched the three worker Jobs in sequence, then cleaned up.
- TIF (Task Input Filer). A K8s Job that staged inputs from S3/HTTP onto the shared task PVC.
- Texam (Task Executor and Monitor). A K8s Job that itself launched one K8s Job per TES Executor, in sequence, and watched each one.
- TOF (Task Output Filer). A K8s Job that uploaded outputs from the PVC after all executors finished.
- MongoDB. Document store for Task records, Executor records, logs.
- Redis. Pub/sub. Each worker published a terminal message on a channel keyed by task ID; Torc subscribed and blocked until it received the messages it expected.
Drawing that as a diagram in your head: API talks to Mongo and creates a Torc. Torc creates a PVC, launches TIF, waits on Redis. TIF runs, publishes done. Torc launches Texam, waits on Redis. Texam launches N executor Jobs, watches each, publishes done. Torc launches TOF, waits on Redis. TOF runs, publishes done. Torc writes terminal state to Mongo and exits.
It worked. It ran real tasks. I added a Helm chart, OIDC auth, dynamic K8s job config, glob-style input/output paths, multi-arch images, and at some point a Nextflow integration guide. Tasks went in; results came out.
But the design had a tax I kept paying.
What v0.1 actually cost
Three things, mostly:
1. Pod count per task was 4 + N. A single task with three Executors started seven Pods: Torc + TIF + Texam + 3 executor Jobs + TOF. Every one of those paid its own scheduling delay, image-pull delay, and container-start delay. For a workflow engine submitting hundreds of small tasks, this was the difference between “interactive” and “batchy.” For an operator, it meant the cluster’s scheduler and admission quotas were being pressured by orchestration overhead, not by actual user work.
2. Redis was on the critical path. Every task waited on Redis pub/sub messages to advance through its lifecycle. If Redis hiccuped, tasks froze. Redis is fine software, but it is stateful software, and adding stateful software to the hot path of a thing whose whole job is to schedule other software is a particular kind of mistake. It also meant operators had to run a Redis they understood, monitor it, back it up if they cared, and add it to the compliance surface if they were in a regulated environment. Nobody installs a TES because they wanted to also install a Redis.
3. Torc dying mid-task was an outage class nobody owned. Torc was the only process that knew a task was in flight. If a Torc Pod died after launching Texam but before Texam published a terminal message, the children kept running, finished, published into the void, and the task record in Mongo sat in RUNNING forever. There was no reconciler. I wrote a “monitor” Job that papered over the worst cases, but the truth is that the architecture had a hole and the hole was structural.
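The Pod arithmetic in the first item is easy to check. A throwaway sketch, where the function names are hypothetical and the 200-task workload is a made-up figure for illustration, not a measurement:

```python
def pods_v01(n_executors: int) -> int:
    # Torc + TIF + Texam + TOF, plus one Job Pod per Executor.
    return 4 + n_executors


def pods_v02(n_executors: int) -> int:
    # One TaskPod per task, however many Executors it has.
    return 1


# Hypothetical workload: 200 tasks with three Executors each.
tasks, executors_per_task = 200, 3
total_v01 = tasks * pods_v01(executors_per_task)
total_v02 = tasks * pods_v02(executors_per_task)
print(f"v0.1 schedules {total_v01} Pods, v0.2 schedules {total_v02}")
```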
There were smaller paper cuts too. MongoDB could not enforce the TES schema, so subtle drift in document shape accumulated over months. The auth story was bolted on. Cancellation took effect “when the next pub/sub message arrived,” which was sometimes never. Adding Kueue for batch admission was a non-starter because Kueue admits Jobs, not orchestration trees.
None of this made v0.1 broken. It made v0.1 expensive to operate, which over the lifetime of the project is the same thing.
The realisation that changed everything
I’d been mentally accepting the 4 + N Pod count as a cost of doing business, because how else would you run N sequenced containers with inputs and outputs? You’d need to launch them in sequence, you’d need something watching, you’d need somewhere to coordinate from. That’s just what orchestration looks like.
Except it isn’t. Kubernetes has a primitive for “run a list of containers strictly sequentially, abort on first failure, share a filesystem between them.” It’s called init containers.
And once I saw that, I couldn’t unsee the rest:
- TES says Executors run sequentially. K8s init containers run sequentially.
- TES says abort on first failure. K8s init containers abort on first failure.
- TES Executors share a working volume. Init containers in the same Pod share volumes.
- TES has inputs that must arrive before Executors, and outputs that must be uploaded after. Init containers run in declared order, so a stage-in container can go first and a stage-out container last.
The match was not approximate. It was exact. The init-container semantics of the platform Poiesis was running on were already the execution semantics of the spec Poiesis was implementing. I’d been writing a distributed orchestration layer to do, badly and over Redis, what one Pod manifest could do natively.
There was one missing piece. I still needed something inside the Pod that could watch the init containers progress and write state transitions to the database, because the API server outside the Pod couldn’t observe init-container statuses in real time. A normal sidecar wouldn’t work, because sidecars start after init containers finish, and by then the executors are already done.
But Kubernetes 1.29 had turned on native sidecar containers by default (the feature went beta in that release): init containers with restartPolicy: Always. A native sidecar starts before subsequent init containers, stays alive for the whole Pod, and is signalled to terminate after the main containers exit. It is exactly the shape of “a thing that watches the executors progress and records what happened.” And by 2026, the K8s version floor of 1.29 is no longer aspirational; any cluster a TES deployment realistically targets has been there for over a year.
That was the whole redesign.
v0.2: the TaskPod
In v0.2, one task is one Kubernetes Pod, wrapped in a Job. The Pod is composed of:
- TIF init container(s) that stage inputs onto a single Task PVC.
- N executor init containers, one per TES Executor, sequenced by Kubernetes’ init-container ordering.
- TOF init container(s) that upload outputs.
- The TRec (“Task Recorder”). A native sidecar (the K8s 1.29 trick) that runs for the lifetime of the Pod, watches its own Pod through the K8s API, and writes state transitions and per-executor logs to Postgres as the init containers progress.
- A pause main container that exists only because Kubernetes requires containers: to be non-empty, even when all the real work happens in init containers. It does nothing.
All containers share one Task PVC sized to TesResources.disk_gb, owned by the Job via ownerReferences so Kubernetes itself garbage-collects the PVC when the Job is deleted. No application-level cleanup code on the disposal path; if Kubernetes is up, cleanup happens.
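To make the composition concrete, here is a rough sketch of how such a Pod manifest could be assembled as a plain Python dict. It is not the Poiesis implementation: `build_task_pod`, the `example/...` image names, the `/data` mount path, and the choice to list the TRec first are all assumptions for illustration. Only the pause image, the label key, and the restartPolicy: Always trick come from the text above. The Job wrapper is left out.

```python
from typing import Dict, List


def build_task_pod(task_id: str, executors: List[Dict], pvc_name: str) -> Dict:
    """Assemble a TaskPod-shaped manifest (hypothetical helper)."""
    mount = [{"name": "task", "mountPath": "/data"}]  # assumed mount path

    init_containers = [
        # Native sidecar: an init container with restartPolicy Always.
        # Listed first here (an assumption) so it is up before anything runs.
        {"name": "trec", "image": "example/trec:dev", "restartPolicy": "Always"},
        # Input staging runs to completion before any executor starts.
        {"name": "tif", "image": "example/tif:dev", "volumeMounts": mount},
    ]
    # One init container per TES Executor, in declared order.
    for index, executor in enumerate(executors):
        init_containers.append({
            "name": f"executor-{index}",
            "image": executor["image"],
            "command": executor["command"],
            "volumeMounts": mount,
        })
    # Output upload runs only if every executor before it succeeded.
    init_containers.append(
        {"name": "tof", "image": "example/tof:dev", "volumeMounts": mount}
    )

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"task-{task_id}",
            "labels": {"poiesis.io/task": task_id},
        },
        "spec": {
            "restartPolicy": "Never",
            "initContainers": init_containers,
            # The do-nothing main container that keeps containers: non-empty.
            "containers": [
                {"name": "main", "image": "gcr.io/google_containers/pause"}
            ],
            "volumes": [
                {"name": "task", "persistentVolumeClaim": {"claimName": pvc_name}}
            ],
        },
    }
```

The ordering of the initContainers list is the entire orchestration: Kubernetes starts each entry only after the previous non-sidecar entry has exited successfully.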
Outside the per-task Pod, there is exactly one global piece left:
- TCtl (“Task Controller”). A small Deployment, 2 or 3 replicas, leader-elected via coordination.k8s.io/leases. Runs a K8s informer on Pods labelled poiesis.io/task. If a TaskPod reaches a terminal Pod event and Postgres has no terminal write for it, TCtl writes one, including the K8s-reported termination reason (OOMKilled, Evicted, Error) as a first-class field.
TCtl is not on the happy path. The TRec writes terminal state itself in the normal case. TCtl is there to cover the cases the TRec physically cannot self-report: the node died, the TRec itself got OOM-killed, the Pod was evicted before it could finalise. Every comparable Kubernetes-native batch system (Argo Workflows, Tekton, Volcano, the native Job controller) has this same shape: a per-task agent that does the work, and a global backstop reconciler for the cases the agent can’t speak for itself.
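The backstop decision can be sketched as a pure function over the Pod status that the Kubernetes API returns. This is an illustration, not the TCtl code: `termination_reason` and `reconcile` are hypothetical names, and mapping every failed Pod to SYSTEM_ERROR is a simplifying assumption. The status field names (phase, reason, initContainerStatuses, state.terminated) are the ones the Pod API uses.

```python
from typing import Dict, Optional


def termination_reason(pod: Dict) -> Optional[str]:
    """Pull the most specific failure reason out of a Pod status dict."""
    status = pod.get("status", {})
    # Pod-level reasons such as "Evicted" are set on the status itself.
    if status.get("reason"):
        return status["reason"]
    container_statuses = (
        status.get("initContainerStatuses", [])
        + status.get("containerStatuses", [])
    )
    for container in container_statuses:
        terminated = container.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            # Container-level reasons include "OOMKilled" and "Error".
            return terminated.get("reason", "Error")
    return None


def reconcile(pod: Dict, has_terminal_write: bool) -> Optional[Dict]:
    """Return the terminal record to write, or None if nothing is needed."""
    phase = pod.get("status", {}).get("phase")
    if has_terminal_write or phase not in ("Succeeded", "Failed"):
        # Happy path, or the Pod is still running: leave the record alone.
        return None
    if phase == "Succeeded":
        return {"state": "COMPLETE", "reason": None}
    # Assumption: the backstop reports every failure as SYSTEM_ERROR.
    return {"state": "SYSTEM_ERROR", "reason": termination_reason(pod) or "Unknown"}
```

In a real controller this function would sit behind the informer's event handler, with `has_terminal_write` answered by a database lookup.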
The new pod count per task is one. The new components map cleanly: API + TaskPod + TCtl. The new dependencies are Kubernetes and Postgres. Nothing else.
What got deleted
In the same redesign, three components left the codebase:
- Torc. Its only job was launching things in sequence and waiting on Redis messages. Init-container ordering does both of those, for free, with stronger guarantees.
- Texam. Its only job was launching one Job per Executor and watching them. Same story. Each Executor is now just an entry in the init-container list of the TaskPod.
- Redis. With no cross-Pod coordination remaining, there is nothing for Redis to do. It is gone from the chart, the docs, the security surface, and the operator’s mental model.
The repo has a single commit titled refactor: delete v0.1 scaffolding flagged by vulture. Vulture is a Python dead-code detector. After the rewrite, it found a lot of dead code, because most of the v0.1 control-flow scaffolding no longer had anything to do.
Postgres, and why the document-store thing finally bit me
The other half of v0.2 was the persistence rewrite.
MongoDB was fine. MongoDB ran. MongoDB stored documents. The trouble was that MongoDB had no opinion about the shape of those documents. The GA4GH TES schema is precise (Tasks have specific fields, Executors are sequenced and have specific fields, Logs nest in specific ways), and at no point during v0.1 was anything actually enforcing that schema at the persistence layer. The Pydantic models in the API enforced it on the way in; nothing enforced it on storage. Subtle drift accumulated. An auditor asked me, late in v0.1’s life, what the source of truth for the schema was, and the honest answer was “the application code, which has had three contributors and four refactors.” That answer is not acceptable in a regulated environment, and the entire point of GA4GH compliance is that you can hand the system to someone in a regulated environment.
v0.2 is on Postgres, with a fully relational schema, foreign keys, NOT NULL constraints, enum types for terminal states, and a task_log / executor_log / system_log split that matches the spec one-to-one. Migrations are real migrations, not “the code probably writes the new shape eventually.” Concurrent writes are now safe by virtue of the database’s existing concurrency story, not by a layer of optimistic application-side logic.
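What “enforced at the persistence layer” buys can be shown with a small stand-in. The sketch below uses SQLite from the Python standard library so that it runs anywhere. It is not the Poiesis schema: the column names are hypothetical, the state list is deliberately partial, and a CHECK constraint substitutes for the Postgres enum type.

```python
import sqlite3

# Stand-in schema. Postgres would use an enum type instead of the CHECK.
SCHEMA = """
CREATE TABLE task (
    id    TEXT PRIMARY KEY,
    state TEXT NOT NULL CHECK (state IN (
        'RUNNING', 'COMPLETE', 'EXECUTOR_ERROR', 'SYSTEM_ERROR', 'CANCELED'))
);
CREATE TABLE executor_log (
    task_id   TEXT NOT NULL REFERENCES task(id),
    position  INTEGER NOT NULL,
    exit_code INTEGER,
    PRIMARY KEY (task_id, position)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def rejects(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Return True if the database refuses the write."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False
```

With this in place, a task row with a state outside the allowed set and an executor log pointing at a non-existent task are both refused by the database, regardless of what the application code believes.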
This was unglamorous work. It is also the thing I will be quietly grateful for every time an auditor asks for a schema diagram.
The non-obvious decisions
Some calls along the way that mattered more than they look:
The TRec gets RBAC to watch its own Pod, and nothing else. Specifically, resourceNames: [<this-pod-name>] on a Role scoped to its namespace. A user-submitted task cannot use the TRec’s credentials to enumerate the rest of the cluster. The blast radius is exactly one Pod, by Kubernetes’ own enforcement.
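A Role of that shape could look roughly like the following, written as a Python dict. The `trec_role` helper and the Role naming scheme are hypothetical, and the verb list is an assumption about what a watcher minimally needs.

```python
from typing import Dict


def trec_role(namespace: str, pod_name: str) -> Dict:
    """Build a namespaced Role limited to a single named Pod (sketch)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": f"trec-{pod_name}", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # core API group, where Pods live
                "resources": ["pods"],
                # The restriction that keeps the blast radius to one Pod.
                "resourceNames": [pod_name],
                "verbs": ["get", "watch"],  # assumed minimal verb set
            }
        ],
    }
```

One caveat worth knowing: Kubernetes only honours resourceNames on a watch when the request itself carries a metadata.name field selector, so the watcher has to ask for exactly its own Pod rather than filter client-side.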
The TCtl doesn’t rely on a heartbeat. An earlier sketch had the TRec emitting heartbeats and a sweeper killing stale tasks. That was wrong: a database-only sweeper has no way to know why a task terminated. OOMKilled versus Evicted versus Error matters to the caller, and only the K8s API knows the difference. The TCtl asks Kubernetes directly via an informer, and writes the real reason into Postgres. The happy path never touches TCtl; the unhappy path gets a better answer than a heartbeat could give.
pause as the main container. This is a small, slightly silly detail, but: Kubernetes requires containers: to be non-empty. All the real work happens in init containers. The main container is gcr.io/google_containers/pause, which sleeps forever doing nothing, and a small ack from the TRec causes the Job to complete after the last init container finishes. I am not the first person to use pause this way, but it makes me happy every time I see it in the manifest.
ttlSecondsAfterFinished on the Job, ownerReferences from PVC to Job. Cleanup is two Kubernetes mechanisms, configured once in the manifest. No reaper code, no janitor cronjob, no “deleted from the application but is the PVC still there?” failure modes. The platform owns it.
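Both mechanisms are plain manifest fields. A sketch of the two objects as Python dicts follows. The helper names, the one-hour TTL default, the ReadWriteOnce access mode, and backoffLimit of 0 are assumptions. The owner's uid is only known once the Job exists, so in practice the PVC's ownerReferences would be filled in after the Job has been created.

```python
from typing import Dict


def task_job(task_id: str, pod_spec: Dict, ttl_seconds: int = 3600) -> Dict:
    """Wrap a Pod spec in a Job that Kubernetes deletes after it finishes."""
    labels = {"poiesis.io/task": task_id}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"task-{task_id}", "labels": labels},
        "spec": {
            # The TTL controller removes the finished Job after this delay.
            "ttlSecondsAfterFinished": ttl_seconds,
            "backoffLimit": 0,  # assumed: a failed task is not retried
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }


def task_pvc(task_id: str, disk_gb: float, job_name: str, job_uid: str) -> Dict:
    """Build a PVC owned by the Job, so deleting the Job collects the PVC."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": f"task-{task_id}",
            "labels": {"poiesis.io/task": task_id},
            "ownerReferences": [
                {
                    "apiVersion": "batch/v1",
                    "kind": "Job",
                    "name": job_name,
                    "uid": job_uid,  # must be the uid of the created Job
                }
            ],
        },
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            # TesResources.disk_gb mapped onto a decimal-gigabyte quantity.
            "resources": {"requests": {"storage": f"{disk_gb}G"}},
        },
    }
```

The chain is then TTL expiry deletes the Job, and garbage collection follows the owner reference to the PVC.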
FastAPI replaced connexion. Boring rewrite, no perf story. The motivation was that connexion’s OpenAPI-first model was costing more in friction than it was buying in spec adherence, and FastAPI’s Pydantic-native shape matched how the rest of the codebase was already structured. Sometimes a refactor is just paying back the cost of an early framework choice.
What v0.2 buys, concretely
For the operator:
- One Pod per task at peak, not 4 + N.
- Kubernetes and Postgres. Nothing else stateful.
- No Redis on the critical path.
- All resources labelled poiesis.io/task=<id>. Find any task’s work with one selector.
- The chart enforces the K8s 1.29 floor via kubeVersion in Chart.yaml. Installs fail fast on incompatible clusters.
For the TES client:
- Tasks start faster, because there is one Pod scheduling event, not seven.
- GET /tasks/{id} reflects reality within seconds, because the TRec writes through as init containers progress.
- CANCELED actually means cancelled, including mid-executor, with audit-trail-accurate intent.
- SYSTEM_ERROR carries a real termination reason from the K8s API, not a guess from the application.
For me:
- Half the code. The half that’s left is doing real work.
- A schema I can point at.
- A test surface I can reason about, because there’s no Redis to mock and the orchestration model is “Kubernetes ran this Pod.”
The thing I’d tell past-me
It would have been very easy to keep adding features to v0.1. The architecture worked. The hot patches all landed. The operators using it weren’t complaining loudly. They were quietly putting up with the operational tax because TES on Kubernetes wasn’t a crowded space and the alternatives weren’t obviously better.
The right call was to stop and ask: if I were starting today, knowing what the K8s platform now gives me, would I build this the same way? The answer was no, and the gap between “no” and “yes” was so large that the right move was to redesign, not patch.
Most rewrites are mistakes. The rewrites that aren’t mistakes are the ones where the platform you’re built on has gained a primitive that subsumes one of your components. Init containers were always there. Native sidecars (K8s 1.29) closed the last gap. Once both existed, what Poiesis v0.1 was doing was no longer “engineering”. It was “carrying around the old way of doing it.”
The repo is at github.com/jaeaeich/poiesis, Apache 2.0. Helm chart is in deployment/helm.
If you’ve made it this far and you run TES on Kubernetes, or you have an opinion about how you’d run it, open an issue. I am genuinely interested in the cases I haven’t seen yet.