Building a BPMN engine on Temporal

11 min read
Richard Bízik
Founder of QuantumBPM

When we started designing QuantumBPM's process engine, we ran into the same wall every workflow tool runs into: how do you make execution durable? Survive worker crashes, network partitions, weeks-long timers, replay after restart, exactly-once activity semantics? The honest answer turned out to be: don't build that yourself.

This post is about the bet we made — running BPMN on top of Temporal — what it cost us, what it bought us, and why we think Temporal developers are the audience that benefits the most from a BPMN layer sitting above their existing investment.

The bet

A BPMN engine is two distinct pieces of software glued together. One half is a durable execution substrate — the code that survives crashes, replays workflow history deterministically, schedules timers across days and weeks, fans out to workers, retries with backoff, and keeps state consistent under concurrent execution. The other half is the BPMN-specific machinery — element semantics, scope hierarchies, compensation handlers, message correlation, the modeler, the operations UI, the replay scrubber, version migration, RBAC, audit history.

Camunda's bet was to build both halves. Camunda 8's Zeebe is a Raft-based event-sourced workflow broker built on top of Atomix — a from-scratch durable execution engine designed specifically for BPMN. It's a defensible bet, but it isn't a free one.

Two consequences of that bet are worth being specific about, because they shaped our thinking.

The substrate is yours forever, including the parts you'd rather not maintain. Zeebe's consensus layer is a Java fork of the original Atomix. Upstream Atomix moved to Go some years ago and is now an entirely different project — leaving Camunda effectively the maintainer of the Java consensus library underneath their engine. Every layer you build yourself is a layer you maintain yourself, indefinitely. The unglamorous layers (consensus, log replication, storage formats) accumulate maintenance the same way the glamorous ones do.

Decoupling the engine from observability creates structural drift. Camunda 8's runtime is decoupled from Operate, their separate operations product, and synced through Elasticsearch exporters. Those exporters are eventually consistent: events flow from the broker through exporters into Elasticsearch, and Operate reads from Elasticsearch. Operators regularly report Operate falling behind reality, showing stale state, or missing events when the broker has already moved on. That isn't a bug — it's a structural property of an architecture that splits the engine and the operational view into separate products communicating asynchronously.

Our bet was different:

Durable execution is a solved problem. BPMN semantics are not. Build on Temporal and put our engineering effort into the layer where the actual product lives.

Temporal is funded, well-staffed, and runs in production at Uber, Snap, Coinbase, HashiCorp, Stripe, and Datadog among many others. The storage layer, the replay engine, the event sourcing, the SDK story across seven languages — none of that is ours to maintain. Our team gets to spend its time on the BPMN spec, on the modeler, on the operations UI, on simulation, on DMN integration. The layers users actually see.

The shape of the architecture also avoids the observability drift problem. Our operations UI reads BPMN process state by replaying directly from Temporal's workflow history (GetProcessStateFromHistory) — there's no separate exporter pipeline, no Elasticsearch sync, no "the engine says one thing and the UI says another." If Temporal has the event, we have the event. If we don't, neither does Temporal.
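
The idea of deriving the operational view from history can be sketched in plain Go, with no Temporal SDK dependency. This is a minimal illustration, not QuantumBPM's GetProcessStateFromHistory and not Temporal's history format: the Event and ProcessState types and the stateAt function are hypothetical names invented for this example. The point it shows is that state is a pure fold over an ordered event list, so replaying a prefix of the history yields the state at that point in time.

```go
package main

import "fmt"

// EventKind is a hypothetical, simplified set of BPMN-level events.
type EventKind int

const (
	NodeEntered EventKind = iota
	NodeCompleted
	VariableSet
)

// Event is one recorded step in an instance's history (illustrative only).
type Event struct {
	Kind  EventKind
	Node  string // node ID for NodeEntered / NodeCompleted
	Key   string // variable name for VariableSet
	Value string // variable value for VariableSet
}

// ProcessState is the BPMN-level view derived from history.
type ProcessState struct {
	Active    map[string]bool   // nodes currently holding a token
	Completed []string          // finished nodes, in completion order
	Variables map[string]string // latest variable values
}

// stateAt folds the first n events into a ProcessState. n is clamped to the
// length of the history. The same prefix always yields the same state.
func stateAt(history []Event, n int) ProcessState {
	if n > len(history) {
		n = len(history)
	}
	if n < 0 {
		n = 0
	}
	s := ProcessState{Active: map[string]bool{}, Variables: map[string]string{}}
	for _, e := range history[:n] {
		switch e.Kind {
		case NodeEntered:
			s.Active[e.Node] = true
		case NodeCompleted:
			delete(s.Active, e.Node)
			s.Completed = append(s.Completed, e.Node)
		case VariableSet:
			s.Variables[e.Key] = e.Value
		}
	}
	return s
}

func main() {
	history := []Event{
		{Kind: NodeEntered, Node: "start"},
		{Kind: NodeCompleted, Node: "start"},
		{Kind: NodeEntered, Node: "review"},
		{Kind: VariableSet, Key: "approved", Value: "true"},
		{Kind: NodeCompleted, Node: "review"},
	}
	// Scrub through the timeline one event at a time.
	for n := 0; n <= len(history); n++ {
		s := stateAt(history, n)
		fmt.Printf("after %d events: active=%v completed=%v vars=%v\n",
			n, s.Active, s.Completed, s.Variables)
	}
}
```

Because there is only one source of truth (the event list), a view built this way cannot disagree with the engine about what happened; it can only be asked about an earlier or later point in the same history.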

That's the bet. The rest of this post is what fell out of it.

What Temporal gives us, for free

Mapping BPMN concepts onto Temporal primitives turned out to be remarkably clean:

  • A BPMN process instance is a Temporal workflow.
  • Service tasks, user tasks, and script tasks become activities or signal-receiving blocks inside the workflow.
  • Messages and signals become Temporal signals.
  • Timer events become Temporal timers.
  • Call activities become Temporal child workflows.
  • Variables live in workflow-local state, replayed deterministically with the workflow.
  • Boundary events and event subprocesses are scoped goroutines (workflow.Go) tied to the parent's cancellation context.

Once that mapping clicked, a long list of capabilities came along essentially for free:

  • Durable execution. A worker crash mid-process is invisible to the BPMN layer: another worker replays the instance's history and continues from the last recorded event.
  • Long-running is the default. A process instance waiting two weeks for a customer document is just persisted history plus a pending timer or signal; it does not need a worker goroutine or an open database transaction to stay alive. The millionth waiting instance is handled the same way as the first.
  • Event-sourced history. Every step is recorded. This is what powers our replay scrubber in the operations UI — given any process instance, we can reconstruct the BPMN-level state at any point in its execution from GetProcessStateFromHistory.
  • Polyglot workers. A service task can be implemented as a Temporal activity in any of Temporal's seven SDK languages (Go, Java, Python, TypeScript, .NET, PHP, Ruby), or — for languages without a Temporal SDK, or workers that need to live outside the Temporal pool — as an HTTP poll-based external worker in any language with an HTTP client.
  • Standard storage. Postgres, MySQL, or Cassandra — pick one. There is no proprietary log format to operate, no custom storage engine to babysit.

That's the substrate. None of it is novel; Temporal has been doing it well for years. The point is what we don't have to build.

What we had to build on top

Temporal alone is not a BPMN engine. It is a primitive — a set of building blocks for writing durable workflows in code. Turning it into a BPMN runtime required a non-trivial amount of work:

  • Full BPMN 2.0 element coverage. Every event type (timer, message, signal, error, escalation, compensation, link, terminate), interrupting and non-interrupting boundary events, event subprocesses, ad-hoc subprocesses, call activities, multi-instance, standard loops.
  • Scope tree and event bubbling. BPMN compensation, escalation, and errors target logical scopes. They traverse the active ancestry chain, not the static graph — so a re-entered subprocess resolves correctly to the currently-live ancestor. A naive Temporal implementation can't handle this; we maintain a deterministic scope hierarchy on every process instance.
  • Cancellable scope goroutines. workflow.Go does not inherit cancellation from arbitrary parent contexts — a footgun we hit early. Our BPMNContext derives node-scoped contexts that are cancelled correctly when an interrupting boundary event fires.
  • Concurrency-safe state for query handlers. Temporal query handlers run on a separate goroutine from the workflow's main execution and can read workflow state concurrently with mutations from inside the workflow itself. We protect InterpreterState with sync.RWMutex so an in-flight getProcessState query never observes a partially-mutated history slice or scope-tree map. This is what makes our replay scrubber UI safe to poll continuously against a running instance.
  • Live instance modification. Insert a token before a node, or cancel a token, on a running instance without restarting the workflow. Exposed via the API today.
  • Version migration. Explicit migration plans move in-flight instances to a newer process-definition version. This is the part of Patch/GetVersion that everyone struggles with, made into a first-class operation on a typed graph.
  • Replay slider UI. Reconstructs BPMN state at any point in the Temporal history and lets operators scrub through the timeline of an instance.
  • Inline DMN integration. Business-rule tasks invoke registered DMN definitions in-process — no external decision-service round-trip. The decision evaluation is recorded in the BPMN execution history alongside the activity that called it.
  • External worker queue. An HTTP poll API for service tasks that need to live outside the Temporal worker pool: languages without a Temporal SDK, workers behind a firewall, or teams who prefer the polling model.
  • The platform layer. Visual modeler, projects, RBAC, audit history, multi-tenancy, OpenTelemetry, Prometheus metrics — what wraps the engine into a product.
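
To make the scope-tree item above more concrete, here is a rough sketch in plain Go of how an error might bubble through the active ancestry chain. It is an assumption-laden illustration, not QuantumBPM's implementation: Scope, enter, and bubble are hypothetical names, and real BPMN resolution also covers escalation and compensation, which this omits. The one idea it shows is that each scope activation points at its currently live parent, so lookup walks live instances rather than the static diagram.

```go
package main

import "fmt"

// Scope is one live activation of a BPMN scope (process, subprocess, task).
// Parent points at the scope instance that is live right now, not at a
// position in the static diagram.
type Scope struct {
	ID       string
	Parent   *Scope
	Handlers map[string]string // error code -> handler node ID
}

// enter creates a child scope activation under the currently live parent.
func enter(parent *Scope, id string, handlers map[string]string) *Scope {
	return &Scope{ID: id, Parent: parent, Handlers: handlers}
}

// bubble walks the active ancestry chain from the throwing scope upward and
// returns the nearest scope that handles the error code, plus its handler.
func bubble(from *Scope, code string) (*Scope, string, bool) {
	for s := from; s != nil; s = s.Parent {
		// Reading from a nil map is safe in Go and reports "not found".
		if h, ok := s.Handlers[code]; ok {
			return s, h, true
		}
	}
	return nil, "", false
}

func main() {
	root := enter(nil, "order-process",
		map[string]string{"PAYMENT_FAILED": "notify-customer"})
	sub := enter(root, "payment-subprocess",
		map[string]string{"CARD_DECLINED": "retry-payment"})
	task := enter(sub, "charge-card", nil)

	for _, code := range []string{"CARD_DECLINED", "PAYMENT_FAILED", "UNKNOWN"} {
		if s, h, ok := bubble(task, code); ok {
			fmt.Printf("%s handled by %s in scope %s\n", code, h, s.ID)
		} else {
			fmt.Printf("%s has no handler in any live ancestor\n", code)
		}
	}
}
```

If the subprocess is left and entered again, a second call to enter produces a fresh Scope value with its own parent pointer, which is how a re-entered subprocess would resolve to the ancestor that is live at that moment.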

The split is clean. Temporal owns durability and execution. QuantumBPM owns the BPMN spec, the user-facing surfaces, and the operational controls.
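
One item on the QuantumBPM side of that split, read/write locking around interpreter state, can be illustrated with a small plain-Go sketch. The InterpreterState type below is a hypothetical stand-in that only borrows the name from the text; its fields, methods, and the Snapshot type are assumptions made for this example, and it runs outside Temporal's runtime. It shows the general pattern: writers mutate history and the active set under one write lock, and readers receive a copy taken under a read lock.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// InterpreterState is a hypothetical, simplified model of per-instance state.
type InterpreterState struct {
	mu      sync.RWMutex
	history []string
	active  map[string]bool
}

func NewInterpreterState() *InterpreterState {
	return &InterpreterState{active: map[string]bool{}}
}

// Enter records a node activation. History and the active set change under
// the same write lock, so a reader never sees one updated without the other.
func (s *InterpreterState) Enter(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.history = append(s.history, "enter:"+node)
	s.active[node] = true
}

// Complete records a node completion under the write lock.
func (s *InterpreterState) Complete(node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.history = append(s.history, "complete:"+node)
	delete(s.active, node)
}

// Snapshot is an independent copy handed to query callers.
type Snapshot struct {
	History []string
	Active  []string // sorted node IDs
}

// Snapshot copies the state under a read lock; callers may keep or modify
// the result without affecting the live state.
func (s *InterpreterState) Snapshot() Snapshot {
	s.mu.RLock()
	defer s.mu.RUnlock()
	snap := Snapshot{History: append([]string(nil), s.history...)}
	for n := range s.active {
		snap.Active = append(snap.Active, n)
	}
	sort.Strings(snap.Active)
	return snap
}

// Consistent checks that the number of entered-but-not-completed events in
// the history matches the size of the active set.
func (snap Snapshot) Consistent() bool {
	open := 0
	for _, e := range snap.History {
		if strings.HasPrefix(e, "enter:") {
			open++
		} else {
			open--
		}
	}
	return open == len(snap.Active)
}

func main() {
	st := NewInterpreterState()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // writer: stands in for the workflow's main execution
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			node := fmt.Sprintf("task-%d", i)
			st.Enter(node)
			st.Complete(node)
		}
	}()
	torn := 0
	for i := 0; i < 1000; i++ { // reader: stands in for a polling query
		if !st.Snapshot().Consistent() {
			torn++
		}
	}
	wg.Wait()
	fmt.Println("inconsistent snapshots observed:", torn)
}
```

Returning a copy rather than the live slice and map is the design choice that matters here: a poller can hold on to its snapshot for as long as it likes without blocking the writer or seeing later mutations.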

What this gives Temporal developers back

If you're already running Temporal, you are not the audience that needs to be convinced BPMN is "an executable diagram language" — you already write executable workflows. The question is what BPMN gives you that your current setup doesn't.

The honest answer, from our side:

1. A diagram for the things engineers don't own. Temporal Web is a great tool — for engineers. It shows event histories, workflow histories, activity logs. It does not give a product manager, an ops engineer, a compliance officer, or a support rep a way to see what an instance is doing. A BPMN diagram does. The same artefact you ship to engineering becomes the artefact you give to the business.

2. Versioning that operators can reason about. Most Temporal teams have a war story about Patch/GetVersion. Our migration plans turn version bumps into a typed operation on a typed graph: "instances at node X move to node Y in the new version, instances at node A stay where they are, this branch becomes that branch." Still hard, but no longer line-by-line code archaeology.

3. Compensation and saga as a primitive, not a pattern. BPMN compensation handlers have spec-defined semantics that have been sharpened over fifteen years. If you've ever shipped a Temporal saga with a subtly broken compensation order, BPMN's named-error + compensation-handler model is a real upgrade. You can keep your existing Temporal workflows; just model the saga-shaped parts in BPMN and call them as child workflows.

4. Human-in-the-loop without rolling your own. User tasks are a first-class BPMN element with a typed completion API. We hold the token; you build a UI on top via our REST API. No need to invent your own task-list table, claim/release/reassign protocol, or escalation timer wiring.

5. DMN as a primitive your Temporal workflows can call. This is genuinely useful even without adopting BPMN. Define your business rules — pricing tiers, eligibility checks, risk scoring — in a versioned DMN model, and call them from inside an existing Temporal activity using our SDK. You get versioning, audit trails, and the ability for non-engineers to edit the rules without touching your Go workflow code.

6. An operations UI sized for non-engineers. Running instances list, incident resolution, message/signal correlation, token modification, replay scrubbing — all in a UI that doesn't assume the user reads workflow histories for breakfast.
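
To illustrate point 2, a migration plan can be thought of as a mapping from node IDs in the old definition to node IDs in the new one, validated against the target graph before any instance moves. The Go sketch below is a simplified assumption of that shape, not QuantumBPM's API: Definition, MigrationPlan, and Apply are hypothetical names, and a real plan would also have to deal with variables, scopes, and pending timers.

```go
package main

import "fmt"

// Definition is a hypothetical process-definition version, reduced to the
// set of node IDs it contains.
type Definition struct {
	Version int
	Nodes   map[string]bool
}

// MigrationPlan maps node IDs in the source version to node IDs in the
// target version. Nodes that are not listed keep their ID.
type MigrationPlan struct {
	Target  Definition
	Mapping map[string]string
}

// Apply returns the token positions after migration. It fails without
// moving anything if a token would land on a node missing from the target.
func (p MigrationPlan) Apply(tokens []string) ([]string, error) {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		dst, ok := p.Mapping[t]
		if !ok {
			dst = t // unmapped tokens stay where they are
		}
		if !p.Target.Nodes[dst] {
			return nil, fmt.Errorf("token at %q maps to %q, which is not in version %d",
				t, dst, p.Target.Version)
		}
		out = append(out, dst)
	}
	return out, nil
}

func main() {
	v2 := Definition{Version: 2, Nodes: map[string]bool{
		"manual-review": true, "auto-check": true, "ship": true,
	}}
	plan := MigrationPlan{
		Target:  v2,
		Mapping: map[string]string{"review": "manual-review"},
	}

	moved, err := plan.Apply([]string{"review", "ship"})
	fmt.Println(moved, err)

	_, err = plan.Apply([]string{"legacy-step"})
	fmt.Println(err)
}
```

Treating the plan as data means it can be checked against the target definition up front, which is the difference from discovering a bad version branch only when a replayed workflow hits it.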

The integration story matters too: a Temporal team running on a self-hosted cluster or on Temporal Cloud doesn't need a second durable execution engine; QuantumBPM is designed to run against the Temporal you already operate.

What this isn't

To be honest about scope: this isn't an "operations UI for any Temporal workflow you already wrote." Today our scrubber understands BPMN-shaped workflows specifically — the visual reconstruction depends on the BPMN scope tree we maintain. Generalising the replay UI to arbitrary Temporal workflows is on our minds, but it is not what ships today.

What ships today is a BPMN engine that runs as a Temporal application, integrates with the Temporal infrastructure you have, and gives you the BPMN-shaped layer on top. If your problem is shaped like a business process — onboarding, claims, approvals, fulfilment, KYC, case management, anything human-and-system-and-time — that's the right shape to model in BPMN, and Temporal is the right substrate underneath.

Where to start

If you're a Temporal developer curious about whether the BPMN layer fits your team:

  • QuantumBPM comparison — the architectural deep-dive on what we built on top of Temporal.
  • BPMN vs. Temporal — the honest decision framework: when raw Temporal is right, when a BPMN engine on top earns its keep.
  • BPMN overview — what's supported in the engine.
  • Platform overview — quickstart for self-hosting locally with docker-compose.

The BPMN engine is one half of QuantumBPM. The other half is a DMN 1.5 decision engine that runs in the same product and shares the same project, identity, and audit history. The two compose naturally — BPMN's business-rule task calls DMN — and that pairing is, for us, the actual reason BPMN-on-Temporal earns its keep over raw Temporal alone.

If you have feedback or war stories about BPMN, Temporal, or the line between them, we'd genuinely like to hear them. Reach us at support@quantumbpm.com.