I built a TES server because I wanted to understand one
Poiesis v0.1 is out. A GA4GH Task Execution Service that runs on Kubernetes, written from scratch in Python. Here is what it does, why I bothered, and the design I landed on.
I spent the summer working on TESK, the reference Kubernetes implementation of the GA4GH Task Execution Service, as part of a Google Summer of Code project at ELIXIR Cloud. The work was good. The codebase taught me a lot. By the end of it I had also accumulated a stack of opinions about how I would do things differently if I were starting from scratch.
There is exactly one way to find out whether those opinions are any good, which is to start from scratch.
So I did. The result is Poiesis, a GA4GH TES implementation written in Python, designed Kubernetes-native from the first commit, released this week as v0.1.0. This is the story of what TES is, why anyone should care, and the architecture I landed on.
What is TES, and why bother
If you have never run a bioinformatics workflow you can probably skip this section. If you have, you already know the pain.
A research group runs an aligner, a variant caller, three QC steps, and an annotator. Each step is a container. The steps are wired together by a workflow engine (Nextflow, Snakemake, Cromwell, what have you). The whole pipeline needs to run on whatever cluster the institution has: a Slurm queue, some other HPC scheduler, a Kubernetes cluster in a cloud, sometimes all three depending on funding cycles.
The workflow engine and the cluster do not speak the same language. The engine knows about “tasks”. The cluster knows about jobs, pods, slurm scripts. Every workflow engine has historically grown its own per-backend adaptor for every cluster manager, which is a combinatorial explosion that nobody actually wants to maintain.
GA4GH (the Global Alliance for Genomics and Health) standardised the Task Execution Service to put a single REST API in the middle. The engine submits “tasks” to a TES server. The TES server is responsible for running those tasks on whatever cluster it sits on top of. Add one TES server per cluster type, and any workflow engine that speaks TES can target any cluster, without N×M adaptors.
The contract is small and precise:
- A Task is a sequenced list of Executors.
- An Executor is “run this container image with this command and this env, with these inputs available on disk, and capture stdout/stderr.”
- Executors run strictly sequentially. Executor 2 must not start until Executor 1 has finished successfully. If Executor 1 fails, Executors 2..N must not run.
- The Task also has inputs to be staged in from remote storage (S3, HTTP) before any Executor runs, and outputs to be uploaded back after they finish.
- The Task terminates in one of four states: COMPLETE, EXECUTOR_ERROR, SYSTEM_ERROR, CANCELED.
That’s it. Strict sequencing, shared filesystem between Executors, inputs in front, outputs at the back, predictable terminal states.
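To make the contract concrete, here is a rough sketch of a task in the shape the TES schema describes, followed by the sequencing rule in a few lines of Python. The field names (name, inputs, outputs, executors, image, command, url, path) follow the TES schema. The bucket URLs, image names, and the run_executors helper with its injected run_one callback are invented for illustration and are not Poiesis code.

```python
from typing import Callable

# A task in roughly the shape the TES schema describes.
# URLs, image names, and paths are made up for illustration.
task = {
    "name": "align-and-sort",
    "inputs": [{"url": "s3://example-bucket/reads.fastq", "path": "/data/reads.fastq"}],
    "outputs": [{"url": "s3://example-bucket/out/sorted.bam", "path": "/data/sorted.bam"}],
    "executors": [
        {"image": "example/aligner:1.0", "command": ["align", "/data/reads.fastq"]},
        {"image": "example/sorter:1.0", "command": ["sort", "/data/aligned.bam"]},
    ],
}


def run_executors(executors: list[dict], run_one: Callable[[dict], int]) -> str:
    """Run executors in order; stop at the first non-zero exit code."""
    for executor in executors:
        if run_one(executor) != 0:
            return "EXECUTOR_ERROR"  # later executors never start
    return "COMPLETE"
```

The run_one callback stands in for "run this container and give me its exit code", which is exactly the part that gets complicated on a real cluster.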
The complication, of course, is that you’re running it on Kubernetes. And Kubernetes has its own model of containers and lifecycles, which does not quite line up with the TES model, which is where the actual engineering happens.
Why not just use TESK
The honest answer: you can. TESK works. It runs in production at multiple institutions. If you have a TES-shaped problem today and need a thing that runs, install TESK.
But while reading the codebase, I kept noticing things I wanted to do differently. The data model carried some legacy from earlier iterations. The Kubernetes integration was conservative in a few places where I thought it could lean into newer primitives. The operator surface (Helm chart, secrets, RBAC) was something I wanted to design from first principles rather than evolve.
None of those are criticisms of TESK. They are the kinds of itches you accumulate when you spend months inside a codebase. The right thing to do with that kind of itch is to either send patches or build the thing you wish existed, learn what was actually hard, and bring the lessons back. I chose to build, because I wanted to see what the design space looked like with no constraints from continuity.
The name Poiesis comes from the Greek for “the act of bringing something into being”. It seemed like a reasonable name for a thing whose job is to bring tasks into being on a cluster. It is also a name that nobody else had taken on PyPI, which is half the battle with naming software.
The architecture
Building a TES server on Kubernetes means answering, concretely, every one of these questions:
- Where do tasks live while they are running?
- How do executors share a filesystem?
- How is sequential execution enforced?
- Who watches the executors and writes their state somewhere?
- How does the API server know what is happening inside a running task?
- What happens if any of those moving parts dies?
Here is the v0.1 answer.
A task runs as a small fleet of Kubernetes Jobs, coordinated by a long-lived orchestrator process.
When the API receives a CreateTask, it writes a Task record to MongoDB, allocates a PVC sized to the task’s requested disk, and creates a Pod called Torc (Task Orchestrator). Torc owns the task lifecycle from this point on.
Torc’s job is to launch three workers in sequence:
- TIF (Task Input Filer). A Kubernetes Job that mounts the task PVC and stages all declared inputs onto it. Pulls from S3, HTTP, or the inline content field, depending on the input’s URL scheme. When TIF finishes successfully, Torc moves on.
- Texam (Task Executor and Monitor). A Kubernetes Job whose only purpose is to launch and watch the user’s executors. Texam reads the task’s executor list and creates one Kubernetes Job per executor, in sequence, mounting the same PVC. Each executor runs the user’s container image, executes their command, writes to the shared volume, and exits. Texam waits for each executor to terminate before launching the next, and aborts immediately if any executor returns non-zero. When the last executor finishes, Texam exits.
- TOF (Task Output Filer). Mirror image of TIF. Mounts the PVC, walks the task’s output list, uploads files to wherever the URL says they should go. Glob patterns are expanded here. When TOF finishes, Torc writes the terminal state to Mongo and exits, and Kubernetes garbage-collects the rest.
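The three-phase sequence can be sketched as a plain loop. This is a simplified model, not the Torc source: each callable stands in for "create a Kubernetes Job and wait for it to finish", and the mapping from a failing phase to a terminal state is my assumption for the sketch. It also assumes the sequence stops at the first failing phase.

```python
from typing import Callable

# Phase names come from the post; the callables are stand-ins for
# "create a Kubernetes Job and wait for it to finish".
Phase = tuple[str, Callable[[], bool]]

# Assumption: which terminal state each failing phase maps to.
FAILURE_STATE = {
    "TIF": "SYSTEM_ERROR",
    "Texam": "EXECUTOR_ERROR",
    "TOF": "SYSTEM_ERROR",
}


def orchestrate(phases: list[Phase]) -> tuple[str, list[str]]:
    """Run phases in order and return (terminal_state, phases_that_ran)."""
    ran: list[str] = []
    for name, run in phases:
        ran.append(name)
        if not run():
            # Assumption: stop at the first failing phase.
            return FAILURE_STATE[name], ran
    return "COMPLETE", ran
```

Calling orchestrate with a failing Texam phase returns EXECUTOR_ERROR and never invokes the TOF callable, which is the phase boundary the design is built around.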
Coordination between Torc and its children happens over Redis pub/sub. Each child publishes a terminal message on a channel keyed by the task ID. Torc subscribes and blocks until it gets the messages it expects. Redis was the natural choice because it is operationally cheap, supports the exact “fan-in messages from N workers” pattern we need, and the existing TES ecosystem already runs alongside Redis for similar reasons.
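The fan-in pattern looks roughly like the sketch below. To keep it runnable without a Redis server, an in-memory queue stands in for the pub/sub channel. The channel naming scheme and the message fields (phase, status) are hypothetical; the post only commits to "a channel keyed by the task ID". One difference worth knowing: real Redis pub/sub does not buffer messages for subscribers that are not yet listening, while this stand-in does.

```python
import queue
from collections import defaultdict


class InMemoryBus:
    """Stand-in for Redis pub/sub: one queue per channel name."""

    def __init__(self) -> None:
        self._channels: dict[str, queue.Queue] = defaultdict(queue.Queue)

    def publish(self, channel: str, message: dict) -> None:
        self._channels[channel].put(message)

    def wait_for(self, channel: str, timeout: float) -> dict:
        # Blocks until a message arrives; raises queue.Empty on timeout.
        return self._channels[channel].get(timeout=timeout)


def channel_for(task_id: str) -> str:
    # Hypothetical naming scheme, chosen for this sketch.
    return f"task:{task_id}"


def wait_for_phases(bus: InMemoryBus, task_id: str,
                    expected: list[str], timeout: float = 5.0) -> bool:
    """Block until every expected phase reports, or one reports failure."""
    pending = set(expected)
    while pending:
        msg = bus.wait_for(channel_for(task_id), timeout)
        if msg["status"] != "ok":
            return False
        pending.discard(msg["phase"])
    return True
```

The timeout matters: a subscriber that blocks forever on a message that was never delivered is how "Redis hiccups" turns into "tasks freeze".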
The component vocabulary is intentional. Torc is the orchestrator, TIF is the input filer, Texam is the executor-and-monitor, TOF is the output filer. They are nouns, they appear in logs, they appear in pod names, they appear in metric labels. Once you have the vocabulary the architecture is easy to talk about, which matters more than the code itself once a team is operating the thing.
The pieces I am proud of
A few decisions feel right.
Sequencing happens in Torc, not in the Kubernetes manifest. I considered shoving inputs, executors, and outputs into one big Pod with N init containers, but the executor count is dynamic and user-supplied, and there are a lot of corners (cleanup, restarts, executor-level failure surface) where having each phase as its own Kubernetes resource gives me sharper edges to reason about. Splitting it into Jobs costs Pod startup time, but it buys clean, observable phase boundaries and a per-phase failure story. I will take observable over fast on a system whose tasks usually run for minutes to hours anyway.
The data model is Pydantic from end to end. Every request, every internal model, every persistence object goes through Pydantic. The GA4GH TES schema is the source of truth, code-generated where possible, and the API will not accept a task that doesn’t validate. This catches more bugs than I want to admit.
Authentication is OIDC-first. Poiesis acts as a generic OIDC resource server. It works with Keycloak, with any institutional IdP, and with the hyperscalers. There is a dummy auth mode for local development that is deliberately noisy in the logs so nobody ships it by accident. No bespoke username/password story; identity is somebody else’s job, and the institutions Poiesis is built for already have an IdP.
Helm chart from day one. Not “we will have a Helm chart in a future release”. The chart is the supported install path. Bare manifests are a development convenience. This forced me to think about config surface, secrets, RBAC, and resource limits as first-class design decisions instead of afterthoughts.
Filer strategies. Inputs and outputs go through a FilerStrategy abstraction. S3Filer, HttpFiler, ContentFiler are the v0.1 implementations. Adding a new storage backend is one file. The first contributor who shows up wanting GCS support has a clearly-shaped place to put their work.
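The shape of that abstraction is roughly the following. Only the class names FilerStrategy, S3Filer, and HttpFiler come from Poiesis; the download method, the FILERS registry, and filer_for are illustrative names for this sketch, and the bodies return a description of the action instead of touching the network. ContentFiler is left out because inline content has no URL to dispatch on.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FilerStrategy(ABC):
    """One storage backend. Method names here are illustrative."""

    @abstractmethod
    def download(self, url: str, dest_path: str) -> str:
        """Stage `url` to `dest_path`; returns a description of the action."""


class S3Filer(FilerStrategy):
    def download(self, url: str, dest_path: str) -> str:
        parsed = urlparse(url)
        # A real implementation would call an S3 client here.
        key = parsed.path.lstrip("/")
        return f"s3 get bucket={parsed.netloc} key={key} -> {dest_path}"


class HttpFiler(FilerStrategy):
    def download(self, url: str, dest_path: str) -> str:
        # A real implementation would stream the response body to disk.
        return f"http get {url} -> {dest_path}"


# Adding a backend means one new class and one new entry here.
FILERS: dict[str, type[FilerStrategy]] = {
    "s3": S3Filer,
    "http": HttpFiler,
    "https": HttpFiler,
}


def filer_for(url: str) -> FilerStrategy:
    """Pick a filer from the URL scheme."""
    scheme = urlparse(url).scheme
    try:
        return FILERS[scheme]()
    except KeyError:
        raise ValueError(f"no filer registered for scheme {scheme!r}") from None
```

In this sketch a GCS backend would be a GcsFiler class plus a "gs" entry in the registry, and nothing else changes.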
Glob patterns in outputs. A real bioinformatics task often produces “everything under /work/results/”, not a known list of paths. Outputs accept glob patterns and expand them at upload time. This is the kind of feature that sounds small until you try to use a TES server without it, at which point it is the entire ballgame.
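A minimal sketch of upload-time expansion, using only the standard library. The function name and its parameters are mine, not the Poiesis API; the idea is simply "match files on the shared volume, then rebuild each remote key from the path below a known prefix".

```python
import glob
import os


def expand_output(pattern: str, url_prefix: str, path_prefix: str) -> list[tuple[str, str]]:
    """Expand a glob into (local_path, destination_url) pairs.

    `path_prefix` is stripped from each match so the remote key keeps
    the directory layout below it.
    """
    pairs = []
    for local in sorted(glob.glob(pattern, recursive=True)):
        if not os.path.isfile(local):
            continue  # directories match globs too; only files are uploaded
        relative = os.path.relpath(local, path_prefix).replace(os.sep, "/")
        pairs.append((local, f"{url_prefix.rstrip('/')}/{relative}"))
    return pairs
```

With a pattern like /work/results/** and a path prefix of /work/results, a file at /work/results/qc/report.txt would map to qc/report.txt under the destination URL.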
Dynamic Kubernetes job configuration. Resources, node selectors, tolerations, security contexts can all be parameterised per task through the API’s backend_parameters. Operators can lock down defaults in the chart; users can request overrides within whatever envelope the operator allows. This is the kind of thing that sounds like a yak-shave until the first time someone needs to run a GPU executor on a tainted node, and then it is suddenly the only thing that matters.
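The "envelope" idea reduces to a merge with an allow-list. The sketch below is one way to express it; the function name, the config keys, and the choice to reject (instead of silently dropping) a disallowed override are all assumptions made for illustration, not the behaviour of the Poiesis chart.

```python
def resolve_job_config(defaults: dict, allowed_overrides: set[str],
                       backend_parameters: dict) -> dict:
    """Merge user overrides into operator defaults.

    Anything outside the operator's allow-list is rejected outright.
    """
    rejected = sorted(set(backend_parameters) - allowed_overrides)
    if rejected:
        raise PermissionError(f"overrides not permitted: {', '.join(rejected)}")
    return {**defaults, **backend_parameters}


# Hypothetical operator defaults and allow-list.
defaults = {"node_selector": {}, "tolerations": [], "run_as_non_root": True}
allowed = {"node_selector", "tolerations"}

# A user asking to land on a tainted GPU node.
config = resolve_job_config(
    defaults, allowed, {"tolerations": [{"key": "gpu", "operator": "Exists"}]}
)
```

Here the toleration is accepted and run_as_non_root keeps the operator's value; a request that tried to set run_as_non_root would raise instead of being applied.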
The pieces I know are going to bite me
I am not naive about the architecture. A few things are going to need attention:
The pod count per task is 4 + N. Torc, TIF, Texam, TOF, plus one per executor. A user submitting a task with five executors pays for nine Pod scheduling events. Cluster operators with tight admission quotas are going to feel this. I do not have a clean answer yet; the natural alternative (one Pod per task with init containers) has its own constraints around the executor watcher’s lifetime.
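The arithmetic is trivial, but it is worth seeing how it scales with submission rate. The 10,000 tasks per minute figure is the one used later in this post; the single executor per task is my assumption for the example.

```python
FIXED_PODS = 4  # Torc, TIF, Texam, TOF


def pods_per_task(executors: int) -> int:
    """Pods scheduled for one task with the given number of executors."""
    return FIXED_PODS + executors


def scheduling_events_per_minute(tasks_per_minute: int, executors_per_task: int) -> int:
    """Total Pod scheduling events the cluster sees per minute."""
    return tasks_per_minute * pods_per_task(executors_per_task)


print(pods_per_task(5))                         # 9
print(scheduling_events_per_minute(10_000, 1))  # 50000
```

For single-executor tasks, four of every five Pods are overhead, which is why short tasks are where the design hurts most.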
Redis is a stateful dependency on the critical path. If Redis hiccups, tasks freeze. Operators have to run a Redis they understand and add it to their compliance surface. I am paying this cost because the alternative (a custom controller, or per-task heartbeats into the database) would have been substantially more code in v0.1. It is a real cost, and at some point a serious operator is going to ask me to remove it.
MongoDB is not enforcing the TES schema. Pydantic enforces it on the way in. Nothing enforces it on the way back out. If the schema drifts in code, persisted documents drift quietly along with it. The audit-trail story is “trust the application code”, which is a fine answer for a v0.1 demo and a less-fine answer for a regulated environment.
Cancellation latency is whatever the next Redis message is. If you cancel a task during a long executor, the cancel takes effect when that executor finishes, not when you pressed the button. There is a mitigation in the code, but the mitigation is “best effort”, which is the polite English for “sometimes.”
Kueue integration is a non-starter. Kueue admits Jobs. The Poiesis unit-of-work at the moment is “the Torc Pod and everything it eventually creates”, which is not a single Job. Operators who want Kueue-style batch admission cannot have it in v0.1. This bothers me more than the other items because it is structural, not incremental.
I am writing these down on the public internet partly because they are real and partly because the next time I sit with this codebase I want to read this list and decide which of them is worth fixing first.
What v0.1 is and isn’t
It is:
- Spec-compliant against the GA4GH TES 1.1 schema.
- Tested against Nextflow as a TES backend, in a real cluster, with a real pipeline.
- Installable with one helm install and a Mongo connection string.
- Documented with end-to-end deploy guides for MinIO-backed local clusters and a generic OIDC story.
- Apache 2.0 licensed.
It isn’t:
- Battle-hardened. v0.1 means v0.1. The thing has run real tasks; it has not yet run a million.
- Optimised for tiny tasks. The 4 + N Pod count is what it is. If you submit 10,000 one-second tasks per minute, you are going to have a bad time. Submit 100 minute-to-hours-long tasks per minute and you will be fine.
- Multi-tenant in the strict sense. There is one Postgres-less Mongo. There is one task namespace. Operators wanting strict per-tenant isolation should run multiple Poiesis instances.
The repo lives at github.com/jaeaeich/poiesis. The Helm chart is in deployment/helm. The docs are linked from the README. The Nextflow guide is the fastest way to convince yourself this thing works.
If you run TES on Kubernetes, or you are thinking about it, or you have opinions about how it should work, please open an issue. I have spent enough months alone inside this codebase that I am genuinely interested in hearing where my taste is wrong.