Orchestrating AI Agents with BPMN: Durable, Auditable Agentic Workflows

15 min read
Richard Bízik
Founder of QuantumBPM

If you've built anything agentic in the last year, you know the shape of it: an agent that plans steps, calls tools, reflects on the result, and loops until it hits a goal. It's defined in code - as a graph of nodes, a crew of roles, or a hand-rolled loop around an LLM call and a tool list. It works in the demo. Then you try to ship it.

The wall everyone hits is the same one: the agent's autonomy is exactly what makes it ungovernable. What tools is it allowed to call? When does a human get to approve before it does something irreversible? What does the trace look like six weeks later when compliance asks what happened on instance #48213? And what happens to the half-finished run when the worker crashes while the agent is waiting three days for a human to click "approve"?

This post is about answering those questions with a BPMN engine - not by replacing the agent, but by giving it a process to live inside. Where it's useful, we'll be concrete about what QuantumBPM gives you for each piece, because "a BPMN engine could do this" and "here is the endpoint you call" are very different levels of promise.

Not BPMN or agents. BPMN and agents.

It's worth being precise about what's deterministic and what isn't, because the whole design follows from that line.

An LLM agent is good at the fuzzy parts: interpreting an ambiguous request, deciding which of several tools fits, summarising a messy document, drafting a reply. It is bad - structurally, not fixably - at the parts that need to be reliable, auditable, and identical every time: "this approval must happen before that payment," "retry this exactly three times then escalate," "wait fourteen days for the document, then time out."

A workflow engine is the mirror image. BPMN is excellent at explicit steps, state, timers, retries, error boundaries, and a recorded history of what happened. It is not in the business of open-ended reasoning.

So the useful framing isn't "should I use an agent or a process engine." It's this: put the agent where the work is fuzzy, and put deterministic orchestration around it where the work must be reliable. Guardrails and autonomy, not one or the other. The agent gets latitude inside a box you drew, and the box is auditable, resumable, and reviewable by people who don't read stack traces.

That box is a BPMN process. Here's how an agent fits in it.

The agent as a task

In BPMN, the unit of work is a task. The trick to making agents governable is to stop thinking of "the agent" as the whole program and start thinking of it as a task inside a larger process you control.

There are two ways to model it, and you'll use both.

A single agent step is a service task. When you have a bounded, fuzzy sub-problem - "classify this incoming ticket," "extract the line items from this invoice," "draft a response to this complaint" - that's one task in the diagram. The task calls the LLM, the LLM does the reasoning, and the result becomes a process variable the next steps can branch on. Everything before and after it - validation, routing, the human approval, the timeout - is ordinary deterministic BPMN.

A reasoning loop is an ad-hoc sub-process. BPMN is mostly known for structured, ordered flows, but the spec has a deliberately unstructured construct: the ad-hoc sub-process. Inside it, there's no fixed sequence - a set of activities is available, and something decides at runtime which to run and in what order. That "something" is your agent. The available activities are the agent's tools, modelled as tasks. The agent plans, picks a task, sees the result, picks another, and the loop is bounded by an explicit completion condition. The difference from a code-only loop is that the tool catalogue is in the model - visible, versioned, and access-controlled - rather than implied by a list somewhere in a source file.

Either way, the win is the same: the agent's freedom is scoped to a region of a diagram, and everything outside that region is deterministic, recorded, and the same every time.

What "the agent decides at runtime" means concretely

The ad-hoc sub-process is the part most engines wave at and few actually execute, so it's worth being exact about how it works in QuantumBPM, because this is where the pattern stops being a metaphor.

When a token enters an ad-hoc sub-process, the engine parks it and exposes the inner activities as separately triggerable steps. Your agent drives the loop in one of two ways:

  • It names the next tool through an API. The agent decides "look up the order," and your worker calls the engine's trigger endpoint for that activity. The activity runs, its output lands back in the scope's variables, and the agent reads them to decide the next move.
  • It sets a variable the model reads. The ad-hoc sub-process can carry a FEEL expression listing which activities are currently eligible; update the variable that expression reads, and the engine activates that set. The "which tools are allowed right now" decision is itself a FEEL expression in the model, not a branch buried in agent code.

Either way, a FEEL completion condition ends the loop - done = true, attempts >= 3, whatever you write - and the engine drains any in-flight activities and moves the token on. And when a sub-problem does need a fixed order, you're not stuck: draw ordinary sequence flows between the activities inside the ad-hoc sub-process and that stretch runs in sequence, while the rest stays agent-driven.
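
To make the shape of that loop concrete, here is a minimal in-memory sketch. It is not the QuantumBPM API: `AdHocScope`, `Planner`, and `runAdHoc` are hypothetical names, and plain functions stand in for the FEEL eligibility expression and the FEEL completion condition. The point is only the division of labour: the agent names a tool, the scope decides whether that tool is eligible, and the completion condition ends the loop.

```typescript
// Hypothetical in-memory model of an ad-hoc scope. None of these names are QuantumBPM API.
type Vars = Record<string, unknown>;
type Tool = (vars: Vars) => Vars;

interface AdHocScope {
  tools: Record<string, Tool>;
  eligible: (vars: Vars) => string[];  // stands in for the FEEL eligibility expression
  isComplete: (vars: Vars) => boolean; // stands in for the FEEL completion condition
}

// The "agent": given the scope variables and the eligible tools, name the next one.
type Planner = (vars: Vars, eligible: string[]) => string;

function runAdHoc(scope: AdHocScope, plan: Planner, initial: Vars): { vars: Vars; trace: string[] } {
  let vars: Vars = { ...initial, attempts: 0 };
  const trace: string[] = [];
  while (!scope.isComplete(vars)) {
    const eligible = scope.eligible(vars);
    const choice = plan(vars, eligible);
    if (eligible.indexOf(choice) === -1) {
      throw new Error(`tool "${choice}" is not eligible right now`);
    }
    // The tool's output lands back in the scope's variables.
    const output = scope.tools[choice](vars);
    vars = { ...vars, ...output, attempts: (vars.attempts as number) + 1 };
    trace.push(choice);
  }
  return { vars, trace };
}

const supportScope: AdHocScope = {
  tools: {
    "lookup-order": () => ({ order: { id: "A-1", status: "shipped" } }),
    "draft-reply": (v) => ({
      draft: `Your order is ${(v.order as { status: string }).status}.`,
      done: true,
    }),
  },
  // draft-reply only becomes eligible once an order has been looked up.
  eligible: (v) => (v.order === undefined ? ["lookup-order"] : ["lookup-order", "draft-reply"]),
  // Mirrors "done = true, attempts >= 3".
  isComplete: (v) => v.done === true || (v.attempts as number) >= 3,
};

// A trivial stand-in for the LLM: draft as soon as drafting is allowed.
const planner: Planner = (_vars, eligible) =>
  eligible.indexOf("draft-reply") !== -1 ? "draft-reply" : "lookup-order";

const adHocResult = runAdHoc(supportScope, planner, { ticketId: "T-100" });
console.log(adHocResult.trace);
```

Note that the planner never gets to run a tool the scope has not made eligible, and the loop cannot run past the completion condition, whatever the planner wants.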

How does the agent find the instances that need it? Make the brain a task too. In the worked example below, the ad-hoc's first activity is an agent-plan service task whose worker long-polls for agent-plan jobs exactly like any other worker - and that is the discovery: a job is an instance that has reached the ad-hoc, handed to the worker with its workflowID and only the context it needs. The agent reasons, triggers the tools it wants (addressing them by that workflowID), reads their results from the instance's scope, and finishes by completing its own job. Scaling to ten thousand live conversations is just more workers polling one queue - not a loop scanning ten thousand instances for "which ones are at the ad-hoc step." That distinction is the difference between a demo and a system.
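
The queue-versus-scan distinction can be sketched with a toy in-memory queue. Everything here is an assumption made for illustration (`JobQueue`, its `enqueue` and `poll` methods, and the worker names are hypothetical, and a synchronous `poll` stands in for a real long-poll): the engine enqueues a job when an instance reaches the ad-hoc sub-process, and workers only ever ask the queue for the next job of their type.

```typescript
// Hypothetical in-memory stand-in for the engine's job queue. Not the QuantumBPM API.
interface Job {
  id: number;
  type: string;
  workflowID: string;
  variables: Record<string, unknown>;
}

class JobQueue {
  private pending: Job[] = [];
  private nextId = 1;

  // Called by the "engine" when an instance reaches a task of the given type.
  enqueue(type: string, workflowID: string, variables: Record<string, unknown>): void {
    this.pending.push({ id: this.nextId++, type, workflowID, variables });
  }

  // Stand-in for a long-poll: hand out the oldest pending job of this type, if any.
  poll(type: string): Job | undefined {
    for (let i = 0; i < this.pending.length; i++) {
      if (this.pending[i].type === type) {
        return this.pending.splice(i, 1)[0];
      }
    }
    return undefined;
  }
}

const jobQueue = new JobQueue();

// Three instances reach the ad-hoc sub-process; one reaches an unrelated task.
jobQueue.enqueue("agent-plan", "wf-1", { subject: "Where is my order?" });
jobQueue.enqueue("send-reply", "wf-9", { draft: "Thanks for waiting." });
jobQueue.enqueue("agent-plan", "wf-2", { subject: "Refund status" });
jobQueue.enqueue("agent-plan", "wf-3", { subject: "Wrong item" });

// Two agent-plan workers take turns polling the same queue until it has nothing for them.
const handledBy: Record<string, string[]> = { "worker-a": [], "worker-b": [] };
const workerNames = ["worker-a", "worker-b"];
let turn = 0;
let job = jobQueue.poll("agent-plan");
while (job !== undefined) {
  handledBy[workerNames[turn % workerNames.length]].push(job.workflowID);
  turn++;
  job = jobQueue.poll("agent-plan");
}
console.log(handledBy);
```

No worker ever enumerates instances. Adding capacity means adding another name to `workerNames`, and the `send-reply` job stays in the queue for the worker that handles that type.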

How the agent actually plugs in

A service task that "calls the LLM" has to call it from somewhere. In QuantumBPM that somewhere is an external worker, and the mechanism is deliberately boring - which is the point.

When the process reaches an agent step, the engine creates an external job tagged with a task type (e.g. classify-ticket). Your worker - a small program you write - long-polls for jobs of that type, runs the LLM call, and reports back:

import { Worker } from "@quantumbpm/sdk";

// baseUrl, projectId, token, and the llm client are assumed to be defined elsewhere in your app.
const worker = new Worker({ baseUrl, projectId, token });

worker.handle("classify-ticket", async (job) => {
  const { subject, body } = job.variables;
  const result = await llm.classify(subject, body); // the fuzzy part
  return { category: result.category, urgency: result.urgency, confidence: result.confidence };
});

await worker.run(); // poll, lock, dispatch, report - the SDK owns this loop

A few things fall out of this design that matter specifically for agents:

  • Long LLM calls don't break the model. Each job is leased with a lock, and the SDK sends heartbeats to extend it while your slow reasoning step runs, so a 90-second tool call or a model that pauses to think doesn't get the job handed to another worker.
  • "The model returned garbage" is a first-class outcome. A worker can resolve a job or raise a BPMN error with a code. That error code can match a boundary error event on the agent task - so "the LLM failed three times" routes down a modelled recovery path instead of throwing an unhandled exception. Technical failures (your worker crashed) are distinct from business errors (the agent couldn't do it), and the model treats them differently.
  • You write workers in the language your AI stack already lives in. The worker runtime ships as an SDK in JavaScript/TypeScript, Python, Go, and Java - Python being where most agent tooling already is. The four element types that can hand out external jobs (service task, send task, intermediate throw event, message end event) all use the same poll/complete/error/heartbeat contract.
  • A worker gets only the variables it needs. Each task's input mapping scopes its job payload, so the draft-reply worker receives the looked-up order and the ticket subject - not the whole process state. Smaller payloads, no accidental coupling to some variable that happened to be in scope, and nothing leaks to a worker that has no business seeing it.
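
The split between business errors and technical failures can be sketched as a small decision function. This is a sketch under stated assumptions, not the SDK's types: `JobOutcome`, `nextStep`, `classifyOutcome`, the `LLM_OUTPUT_INVALID` code, and the retry-then-incident policy for technical failures are all hypothetical. What it shows is that a BPMN error is routed by matching its code against the boundary events modelled on the task, while a technical failure never consults the model at all.

```typescript
// Hypothetical shapes. The real SDK's types and the engine's retry policy may differ.
type JobOutcome =
  | { kind: "completed"; variables: Record<string, unknown> }
  | { kind: "bpmnError"; code: string }   // business error: "the agent couldn't do it"
  | { kind: "failed"; message: string };  // technical failure: "the worker broke"

// What an engine might do next, given the boundary error codes modelled on the task.
function nextStep(outcome: JobOutcome, boundaryCodes: string[], retriesLeft: number): string {
  switch (outcome.kind) {
    case "completed":
      return "continue";
    case "bpmnError":
      return boundaryCodes.indexOf(outcome.code) !== -1
        ? `boundary:${outcome.code}`
        : "propagate-to-parent-scope";
    case "failed":
      return retriesLeft > 0 ? "retry" : "incident";
  }
}

// Worker side: turn unusable model output into a business error instead of throwing.
function classifyOutcome(raw: string): JobOutcome {
  try {
    const parsed = JSON.parse(raw) as { category?: unknown };
    if (typeof parsed.category !== "string") {
      return { kind: "bpmnError", code: "LLM_OUTPUT_INVALID" };
    }
    return { kind: "completed", variables: { category: parsed.category } };
  } catch {
    // Not JSON at all (or not an object): still a business outcome, not a crash.
    return { kind: "bpmnError", code: "LLM_OUTPUT_INVALID" };
  }
}

const boundaries = ["LLM_OUTPUT_INVALID"];
const goodRun = nextStep(classifyOutcome('{"category":"billing"}'), boundaries, 3);
const garbageRun = nextStep(classifyOutcome("I think it's billing?"), boundaries, 3);
const crashedRun = nextStep({ kind: "failed", message: "worker lost connection" }, boundaries, 0);
console.log(goodRun, garbageRun, crashedRun);
```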

The agent is just a worker. The engine doesn't know or care that there's an LLM behind the task - it knows there's a step that takes a context in, produces variables out, and might fail. That ignorance is exactly what keeps the orchestration deterministic while the work inside the box is not.
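
The "context in" half of that contract is what the input mapping controls. As a rough sketch (the mapping format, `buildJobPayload`, and the variable names below are assumptions for illustration, not QuantumBPM's actual mapping syntax), scoping a job payload amounts to copying only the named process variables:

```typescript
// Hypothetical mapping format: job variable name -> process variable name.
type ProcessVars = Record<string, unknown>;
type InputMapping = Record<string, string>;

// Build the payload a worker sees: only mapped variables, under their mapped names.
function buildJobPayload(processVars: ProcessVars, mapping: InputMapping): ProcessVars {
  const payload: ProcessVars = {};
  for (const target of Object.keys(mapping)) {
    payload[target] = processVars[mapping[target]];
  }
  return payload;
}

const processState: ProcessVars = {
  ticketSubject: "Where is my order?",
  lookedUpOrder: { id: "A-1", status: "shipped" },
  customerEmail: "jane@example.com",
  internalRiskScore: 0.12,
};

// The draft-reply worker receives the order and the subject, and nothing else.
const draftReplyPayload = buildJobPayload(processState, {
  subject: "ticketSubject",
  order: "lookedUpOrder",
});
console.log(Object.keys(draftReplyPayload));
```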

What the process gives the agent

Once the agent is a task inside a process rather than the whole program, a list of hard problems turns into things you get from the engine.

  • A human-in-the-loop gate is a user task. The single most common production requirement for agents - "a person reviews this before it goes out" - is a first-class BPMN element. The agent drafts, a user task holds the token, a person approves, edits, or rejects. In QuantumBPM that user task is backed by a real API: list the open tasks, list mine, claim or reassign by assignee / candidate user / candidate group, and complete with variables that flow straight back into the process and select the next branch. You don't build a task-list table, a claim/release protocol, or escalation timers - you model an approval, wire your UI to the completion endpoint, and hang a boundary timer off the task for the SLA. (That escalation timer is a modelled element, not a cron job you maintain.)

  • Durable waiting is the default. An agent that waits three days for that approval is the canonical durable-execution scenario, and it's exactly where code-only agent runtimes struggle: something has to hold the state, survive a restart, and not pin a thread for three days. QuantumBPM runs on Temporal, so a process instance waiting two weeks for a human or a document costs nothing unusual - it resumes from its history when the event arrives, whether that's in a second or a fortnight. (We wrote about that substrate in Building a BPMN engine on Temporal.)

  • The deterministic guardrails are DMN, not prompt-wrangling. "Is this refund within the agent's authority?" "Which tier does this customer fall into?" "Does this case need a second approver?" You do not want those decisions living inside an LLM's reasoning where they vary run to run. Model them as a DMN decision table - versioned, audited, editable by a domain expert without touching code - and call it as a business-rule task before or after the agent step. The result lands in a process variable and a gateway branches on it. The agent proposes; DMN decides what's allowed. This is the single most effective way to keep an agent inside its lane - and because QuantumBPM lets you simulate a decision against historical executions, you can test a tightened guardrail against the agent runs you already have before you deploy it.

A DMN decision table in QuantumBPM with the CodeMirror FEEL editor showing live LSP completion and diagnostics

  • Errors and compensation have spec-defined semantics. When an agent step fails - a tool errors, the model returns garbage, a downstream call rejects - you don't want an unhandled exception unwinding an undefined amount of work. BPMN error boundary events and compensation handlers have fifteen years of sharpened semantics for "this failed, undo these specific prior steps in this specific order." Saga-shaped agent workflows get that as a primitive, and (per the section above) a worker's BPMN error code is what lights up the matching boundary.

  • Every step is recorded - and you can watch it back. The agent's task, its inputs and outputs, the tool calls modelled as tasks, the DMN evaluation with the rules that fired, the human's decision - all of it lands in the process instance's history as it executes. When compliance asks what happened, you don't grep a log file: you open the instance and scrub a replay slider that lights up the executed path on the diagram, or read the same run as a timestamped event log - every node enter and leave, its inputs and outputs, and how long it took. For agentic systems, where the whole anxiety is "I don't know what the thing did," that auditability is the point.

Visual replay in QuantumBPM: the replay slider scrubbed to a step, with the executed path highlighted on the process diagram

The completed instance as a timestamped event history in QuantumBPM - every step, with durations
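
If you export a run as an event log, per-step durations fall out of pairing each node's enter and leave events. The sketch below assumes a hypothetical event shape (`nodeId`, `kind`, and an ISO-8601 `at` timestamp); the real history format may differ.

```typescript
// Hypothetical event shape; field names are assumptions, not the engine's export format.
interface HistoryEvent {
  nodeId: string;
  kind: "enter" | "leave";
  at: string; // ISO-8601 timestamp
}

// Sum the time spent in each node, in milliseconds, by pairing enter/leave events.
function nodeDurations(events: HistoryEvent[]): Record<string, number> {
  const enteredAt: Record<string, number> = {};
  const durations: Record<string, number> = {};
  for (const e of events) {
    const t = Date.parse(e.at);
    if (e.kind === "enter") {
      enteredAt[e.nodeId] = t;
    } else if (enteredAt[e.nodeId] !== undefined) {
      durations[e.nodeId] = (durations[e.nodeId] || 0) + (t - enteredAt[e.nodeId]);
      delete enteredAt[e.nodeId];
    }
  }
  return durations;
}

const runEvents: HistoryEvent[] = [
  { nodeId: "classify-ticket", kind: "enter", at: "2024-01-01T10:00:00Z" },
  { nodeId: "classify-ticket", kind: "leave", at: "2024-01-01T10:00:02Z" },
  { nodeId: "human-review", kind: "enter", at: "2024-01-01T10:00:03Z" },
  { nodeId: "human-review", kind: "leave", at: "2024-01-04T10:00:03Z" },
];

const durationsMs = nodeDurations(runEvents);
console.log(durationsMs);
```

In this made-up run the classifier took two seconds and the human review took three days, which is the kind of contrast an auditor usually wants to see first.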

A support agent, end to end

Take a support-automation process - concrete enough to be real, simple enough to fit in your head. It's also the process you can download and deploy below, so the rest of this section describes something you can actually run.

The support-automation process modelled in the QuantumBPM modeler

  1. A ticket arrives. A service task calls an agent to classify it: category, urgency, suggested resolution. Fuzzy work, perfect for the LLM - and, mechanically, an external job a Python worker picks up.
  2. A business-rule task runs a DMN table on the agent's output: low-urgency known categories are auto-resolvable, anything touching billing, anything high-urgency, or anything the agent flagged low-confidence requires a human. Deterministic, versioned, the same every time.
  3. A gateway branches on the DMN result.
    • Auto path: an ad-hoc sub-process whose first activity, agent-plan, is the brain. It triggers tools from a small catalogue - look up the order, search the knowledge base, draft a reply - reads each result, and when it's satisfied hands off down a sequence flow to a deterministic finalize step that validates the draft, records the resolution, and sets resolved = true to close the sub-process.
    • Human path: a user task routes the draft to an agent (the human kind) who edits and approves via the completion API. A boundary timer escalates if nobody picks it up within the SLA.
  4. Both paths converge on a service task that sends the reply, and the instance ends.
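
The routing table in step 2 is small enough to sketch as code. This is an illustration of the rules as described, not the downloadable DMN file: the `Classification` shape, the list of known categories, the 0.8 confidence threshold, and the fail-safe default rule are all assumptions. Rules are checked top to bottom and the first match wins, which corresponds to DMN's "first" hit policy.

```typescript
// Hypothetical inputs and thresholds; the downloadable DMN table may differ.
interface Classification {
  category: string;
  urgency: "low" | "high";
  confidence: number; // 0..1, as reported by the classify-ticket worker
}

type Route = "auto" | "human";

interface RoutingRule {
  description: string;
  matches: (c: Classification) => boolean;
  route: Route;
}

const KNOWN_CATEGORIES = ["shipping", "returns"]; // assumption
const MIN_CONFIDENCE = 0.8; // assumption

// Evaluated top to bottom; the first matching rule wins (DMN "first" hit policy).
const routingRules: RoutingRule[] = [
  { description: "billing needs a human", matches: (c) => c.category === "billing", route: "human" },
  { description: "high urgency needs a human", matches: (c) => c.urgency === "high", route: "human" },
  { description: "low confidence needs a human", matches: (c) => c.confidence < MIN_CONFIDENCE, route: "human" },
  {
    description: "low-urgency known category is auto-resolvable",
    matches: (c) => c.urgency === "low" && KNOWN_CATEGORIES.indexOf(c.category) !== -1,
    route: "auto",
  },
  { description: "default: anything else goes to a human", matches: () => true, route: "human" },
];

// Returns the route plus the rule that fired, so the decision is explainable afterwards.
function decideRoute(c: Classification): { route: Route; firedRule: string } {
  for (const rule of routingRules) {
    if (rule.matches(c)) {
      return { route: rule.route, firedRule: rule.description };
    }
  }
  throw new Error("unreachable: the default rule always matches");
}

console.log(decideRoute({ category: "shipping", urgency: "low", confidence: 0.95 }));
console.log(decideRoute({ category: "billing", urgency: "low", confidence: 0.99 }));
```

Ordering the "needs a human" rules first means a confident, low-urgency billing ticket still goes to a person, which is the behaviour step 2 asks for.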

Run it yourself. This isn't a sketch - download the BPMN process and the DMN routing table, deploy both, and back the task types with workers: classify-ticket, agent-plan (the brain inside the ad-hoc), the tools lookup-order / search-knowledge-base / draft-reply, finalize-resolution (the deterministic validate-and-record step), and send-reply. The fuzzy, LLM-backed roles are classify-ticket and agent-plan; everything else is deterministic glue, and the routing DMN is the guardrail between them.

Every fuzzy decision is an LLM call. Every reliable decision is DMN or a gateway. Every irreversible action sits behind an explicit step you can put a human in front of. And the whole thing is one diagram you can hand to support ops, to compliance, and to the engineer on call - the same artefact, read three different ways.

That's the difference between an agent you demoed and an agent you can operate.

Where the line sits

To be honest about scope: not every agent belongs in a process engine. A single-shot classifier, a chatbot with no side effects, a research assistant that only ever reads - wrapping those in BPMN buys you ceremony you don't need. The framing earns its keep precisely when the agent acts: when it spends money, sends things to customers, touches records of consequence, or runs long enough that a human and a clock get involved.

When the work is shaped like a business process - onboarding, claims, KYC, fulfilment, case management, anything human-and-system-and-time with an agent doing the fuzzy parts - that's when modelling it as BPMN, with the agent as a task and the guardrails as DMN, turns a clever prototype into something you can put in front of an auditor.

Where to start

  • What is BPMN? - the 20% of the spec you'll actually use, including the ad-hoc sub-process and user tasks the agent pattern leans on.
  • What is DMN? - the decision-table layer that keeps the guardrails out of the agent's reasoning.
  • Building a BPMN engine on Temporal - the durable-execution substrate that makes "wait three days for a human" a non-event.
  • External workers - how you wire an LLM call (or any code) to a service task in JS/TS, Python, Go, or Java.
  • Platform overview - spin it up locally with docker-compose and model the support example above.

If you're putting agents into production and wrestling with the governance, audit, or human-in-the-loop side of it - we'd like to hear how you're drawing the line. Reach us at support@quantumbpm.com.