Operations

This page covers running an Enterprise deployment day-to-day. For initial setup, see Deployment; for tunable settings, see Configuration.

Observability

Set ENABLE_OTEL=true to enable Prometheus metrics. They're served from a separate HTTP server on REST_METRICS_PORT (default 9464), not on the public API port. That server has no authentication, so bind it to a private interface with REST_METRICS_ADDR or firewall the port.
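
As a quick way to confirm that the metrics server is reachable from inside your private network, a Python sketch along these lines fetches and parses the endpoint. It assumes the server listens on 127.0.0.1 at the default port 9464, that the path is /metrics, and that samples carry no trailing timestamps. The names `parse_metrics` and `scrape` are illustrative, not part of the platform.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition into {sample_key: float}.

    Keys keep the label set verbatim, e.g. 'http_requests_total{code="200"}'.
    Assumes no trailing timestamps on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comments.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        key, _, value = line.rpartition(" ")
        try:
            samples[key] = float(value)
        except ValueError:
            continue  # Not a sample line we understand; ignore it.
    return samples


def scrape(url="http://127.0.0.1:9464/metrics", timeout=5.0):
    """Fetch the endpoint and return the parsed samples (URL is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `scrape()` returns a dictionary of samples. Filtering its keys by a prefix such as `pgxpool_` is a simple way to eyeball one family.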

What's exported today

Two metric stacks contribute to /metrics:

OpenTelemetry (via the quantum meter) — the BPMN engine's per-tenant instruments are wired into the platform's OTEL meter on every engine. All instruments are tagged with engine.name and tenant.id (the project schema), plus per-metric labels listed below.

Instrument	Type	Labels	What it covers
`bpmn.tokens.processed`	Counter	—	BPMN execution tokens processed. The throughput pulse of the engine.
`bpmn.activity.queue.depth`	UpDownCounter	—	Tokens currently queued for processing. Climbs when workers can't keep up.
`bpmn.incidents.total`	Counter	`incident.error_type`	Incidents created, broken down by error type.
`bpmn.gateway.evaluation.duration`	Histogram (ms)	`gateway.type`	Latency of gateway condition evaluation, per gateway type.
`bpmn.process_instances.created`	Counter	`process.id`, `process.version`	Process instance starts.
`bpmn.process_instances.completed`	Counter	`process.id`, `process.version`, `process.status`	Process instance terminations, by terminal status.
`bpmn.process_instance.duration`	Histogram (ms)	`process.id`, `process.version`, `process.status`	End-to-end instance duration.
`bpmn.jobs.processed`	Counter	`job.type`	Service-task jobs processed, by job type.
`bpmn.job.duration`	Histogram (ms)	`job.type`	Service-task job execution latency.
`bpmn.continue_as_new.total`	Counter	`reason`	ContinueAsNew rotations. Currently the only `reason` value is `history_budget`.

Prometheus (direct registration) — additional families registered through the global Prometheus registry:

Family	What it covers
`http_requests_total` / `http_request_duration_seconds`	REST API request totals and latency.
`pgxpool_*`	PostgreSQL pool — connections in use, acquires, waits, idle.
`bpmn_complete_signal_seconds` / `bpmn_complete_db_seconds`	Per-call breakdown of the `/complete` endpoint (Temporal SignalWorkflow time vs. DB UPDATE time). Labeled `single` vs. `batch`.
`bpmn_poll_total_seconds` / `bpmn_poll_jobs_returned`	External-job polling — total wall time per poll (by outcome) and distribution of jobs returned per non-empty poll.
Temporal SDK metrics	Standard Temporal client metrics (workflow tasks, activity tasks, polling). Wired in when `ENABLE_OTEL=true`.

Distributed tracing (optional)

Tracing is off by default and independent of metrics: leaving OTLP_ENDPOINT empty keeps everything above working with no collector. To export traces, set OTLP_ENDPOINT to an OTLP gRPC collector (Jaeger, Tempo, otel-collector). OTLP_INSECURE and OTLP_SAMPLE_RATIO tune the connection and head sampling (see Configuration → Observability).

What you get when it's on:

Inbound HTTP spans — each REST request is rooted as a span (the external-job poll/complete endpoints included).
Temporal spans — a span per process-instance run and per activity (service / business-rule tasks), with call-activity children linked. Wired through the Temporal client interceptor, so per-tenant workers inherit it automatically.
BPMN attributes — service-task job spans (bpmn.job.execute) carry bpmn.node_id, bpmn.element_type, bpmn.element_name, bpmn.task_type, bpmn.tenant, bpmn.process_instance_id, bpmn.execution_key, and (when available) bpmn.definition_id / bpmn.process_version, so traces filter per definition, per node, and per instance.

Two caveats worth knowing:

It's an inspector, not an analytics store. Traces are sampled and short-retention. Don't build counts or percentiles off them — that's what the metrics above are for.
Long-running instances fragment. A ContinueAsNew rotation starts a new root span (linked), and instances that wait days on timers/external tasks produce multi-day traces that trace backends handle poorly. For deep per-instance inspection of long runs, the Temporal UI is usually the better lens.

External worker propagation (tracing into your own SDK workers across the poll/complete protocol) is not included — external work shows as a wait gap in the trace.

Performance tuning

The settings that most often need attention:

Database connections

DB_MAX_CONNS (default 50): the pgxpool max. Every BPMN poll, job completion, process start, and DMN evaluation takes a connection from the pool, so pool exhaustion shows up as requests queuing. Watch the pgxpool_* metrics and raise the cap if acquire waits keep climbing.
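
One way to turn "acquire waits climbing" into a concrete check is to compare successive scrapes of the cumulative wait counter. The sketch below is only an illustration: this page does not list the exact pgxpool_* family name, so the functions take raw counter values, and the threshold and window are placeholder numbers to tune for your workload.

```python
def acquire_wait_rate(prev_count, curr_count, interval_seconds):
    """Return new acquire waits per second between two scrapes of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # Counter went backwards: the process restarted, so count from zero.
        delta = curr_count
    return delta / interval_seconds


def should_raise_pool_cap(rates, threshold=1.0, sustained=3):
    """True when the last `sustained` rates all exceed `threshold`.

    Both defaults are hypothetical; a single spike does not trigger it.
    """
    recent = rates[-sustained:]
    return len(recent) == sustained and all(r > threshold for r in recent)
```

Requiring several consecutive high readings avoids raising DB_MAX_CONNS in response to a one-off burst.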

BPMN workers per tenant

Each "worker" is a Temporal worker (workflow + activity poller pair) running against one tenant's namespace. BPMN_DEFAULT_WORKERS (default 1) is the global default; override it per project in YAML:

bpmnWorkers:
  defaultWorkers: 1
  overrides:
    "<project-uuid>": 4

Multiple workers per tenant give you more poll concurrency against Temporal but increase steady-state load. Start at the default, raise specific tenants when their throughput tells you to.
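
To make the lookup order concrete, here is a small Python sketch of how a per-project worker count could be resolved from the YAML above once it is loaded into a dictionary. It assumes that a per-project override wins over `defaultWorkers`, and that `defaultWorkers` wins over the environment default. That precedence is an assumption, as are the function name and the sample UUID.

```python
def workers_for(project_id, bpmn_workers, env_default=1):
    """Resolve the worker count for one project.

    bpmn_workers mirrors the `bpmnWorkers` YAML mapping; env_default stands in
    for BPMN_DEFAULT_WORKERS. Precedence here is assumed, not documented.
    """
    default = bpmn_workers.get("defaultWorkers", env_default)
    overrides = bpmn_workers.get("overrides") or {}
    return overrides.get(project_id, default)


config = {
    "defaultWorkers": 1,
    # Hypothetical project UUID used only for illustration.
    "overrides": {"11111111-2222-3333-4444-555555555555": 4},
}
busy = workers_for("11111111-2222-3333-4444-555555555555", config)
other = workers_for("some-other-project", config)
```

Here `busy` resolves to 4 and `other` falls back to 1.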

FEEL evaluation caps

FEEL_MAX_DEPTH, FEEL_MAX_ITERATIONS, and FEEL_DEFAULT_TIMEOUT cap what a single FEEL expression can do. The platform aborts any expression that exceeds them. Defaults protect against pathological or hostile input — normal expressions don't come close.

Access cache

In the Enterprise binary the access cache is a thin read-through cache over the projects.schema_name column — given a project UUID from the URL, return the PostgreSQL schema name to scope DB queries to. Role and membership are read from the JWT on every request and are not cached.

What this means in practice:

The cache only saves a single PostgreSQL lookup per project per cache window. The contents change only when projects are created or renamed.
ACCESS_CACHE_TTL is a pure performance knob — raising it reduces DB load with no functional downside. Lowering it costs DB queries with no benefit.
There is no in-platform role revocation to wait on. If you need to revoke a user's access, do it at your IdP (stop issuing the project claim) and wait for their existing token to expire.

The defaults (100 entries, 30s TTL) are fine for most deployments. If you have many projects and want to eliminate the cold-miss latency, raise ACCESS_CACHE_MAX_SIZE to comfortably exceed your project count and ACCESS_CACHE_TTL to something well above your typical request cadence.
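
The interaction between the size cap and the TTL is easier to see in code. The following Python sketch shows a generic read-through cache with the same two knobs. It is not the platform's implementation: the class name, the least-recently-used eviction policy, and the `load` callback (standing in for the projects.schema_name lookup) are all assumptions made for illustration.

```python
import time
from collections import OrderedDict


class ReadThroughCache:
    """Read-through cache with a TTL and a maximum entry count (LRU eviction assumed)."""

    def __init__(self, load, max_size=100, ttl_seconds=30.0, clock=time.monotonic):
        self._load = load              # called on a miss, e.g. a DB lookup
        self._max_size = max_size      # analogous to ACCESS_CACHE_MAX_SIZE
        self._ttl = ttl_seconds        # analogous to ACCESS_CACHE_TTL
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            self._entries.move_to_end(key)  # mark as recently used
            return hit[0]
        # Miss or expired entry: go to the backing store and refresh.
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used
        return value
```

With a cap below the number of active projects, entries are evicted before their TTL expires and every evicted project pays the lookup again, which is why the cap should comfortably exceed your project count.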

License monitoring

Three things happen automatically:

When	What
Startup	License signature, issuer, and expiry are verified. Invalid or expired → backend exits non-zero.
Every 24 hours	A background goroutine re-validates the license. Expiry triggers graceful shutdown.
Less than 60 days until expiry	A warning is logged on every check. Set up an alert on this log line.

Graceful shutdown on expiry: readiness probe goes false → HTTP server drains in-flight requests → BPMN service stops → process exits with code 1.
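
The decision made on each check can be summarized in a few lines. This Python sketch mirrors the table above under stated assumptions: the function name and return values are hypothetical, and treating the exact expiry instant as already expired is a guess about boundary behavior.

```python
from datetime import datetime, timedelta, timezone

WARN_WINDOW = timedelta(days=60)  # warning threshold from the table above


def license_action(expires_at, now):
    """Classify a license check as 'shutdown', 'warn', or 'ok'."""
    if now >= expires_at:
        return "shutdown"  # expired: graceful shutdown, exit code 1
    if expires_at - now < WARN_WINDOW:
        return "warn"      # less than 60 days left: log a warning
    return "ok"


expiry = datetime(2030, 1, 1, tzinfo=timezone.utc)
state = license_action(expiry, expiry - timedelta(days=59))
```

In this example `state` is "warn", which is the condition your log alert should catch.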

Backups

All durable state lives in PostgreSQL — the Enterprise binary writes no state files to disk. Back up Postgres with whatever tooling you already use (pg_dump, WAL archiving, managed-Postgres snapshots, etc.).

The Temporal cluster holds its own durable state (workflow histories, namespace metadata, search-attribute registrations) and needs its own backup procedure, which depends largely on the storage backend you use. See Temporal's self-hosted operations docs for the supported approach, or reach out to support@quantumbpm.com if you'd like guidance.

The license itself is held in an environment variable, so there's nothing to back up on the platform side. Store it wherever you manage secrets.

Upgrades

QuantumBPM follows semantic versioning. The upgrade procedure depends on whether the release is a minor/patch bump or a new major version.

Minor and patch releases

To upgrade, stop the old binary and start the new one. Schema migrations apply automatically on startup — no separate migration step.

Single-replica deployments will see a brief downtime window covering the new binary's startup time. The first start after a release also runs any pending migrations, which extends that window in proportion to the amount of data being migrated.

Multi-replica deployments should drain old replicas at the reverse proxy or load balancer before bringing new ones up. For HA-specific upgrade patterns tailored to your topology, contact support@quantumbpm.com.

Major releases

A new major version signals a change to the BPMN engine that long-running workflows started by the previous major version cannot safely cross over (typically a replay-determinism or snapshot-shape change). To preserve in-flight workflows, run both versions side-by-side during the transition and drain instances over with the migration CLI shipped alongside the release.

The rollout shape:

Deploy the new major alongside the old one. Each version stamps its own engine identity on the workflows it starts, and they execute on separate Temporal task queues — so v1 workflows keep running on the v1 binary while new starts route to v2.
Run the migration CLI. It enumerates workflows still tagged with the previous major's version label and migrates each to the new version at its next quiescent moment. The migration is asynchronous, idempotent, and auditable — partial runs can be resumed and each migrated workflow records an audit marker.
Retire the old binary once its queue depth reaches zero. The old binary must keep running until the drain completes — it performs the rotation handoff for each workflow. Retiring it early leaves workflows stuck.

Process definitions in active use must be present in the new binary's registry before the drain begins. The migration CLI verifies this before issuing rotations.
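
If you want to run a similar pre-flight check yourself before starting a drain, the idea reduces to a set difference. The Python sketch below is not the migration CLI's logic. It assumes you can list the (process id, version) pairs used by in-flight workflows and the pairs present in the new binary's registry, and the function names are hypothetical.

```python
def missing_definitions(in_use, registry):
    """Return (process_id, version) pairs used by in-flight workflows
    that are absent from the new binary's registry, sorted for stable output."""
    return sorted(set(in_use) - set(registry))


def can_start_drain(in_use, registry):
    """True only when every definition in active use is already registered."""
    return not missing_definitions(in_use, registry)


in_use = [("order", 3), ("invoice", 1), ("order", 3)]
registry = {("order", 3)}
gaps = missing_definitions(in_use, registry)
```

Here `gaps` contains the single pair `("invoice", 1)`, so that definition would need to be deployed to the new binary first.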

Logs

The backend writes structured logs to stdout. Pipe them into your existing log pipeline (Loki, Elasticsearch, CloudWatch, whatever). There is no on-disk log destination to configure.

Healthchecks

The backend exposes a readiness endpoint that flips to "not ready" before graceful shutdown. Wire your orchestrator's readiness probe to it so traffic stops being routed before connections drain.

When to call support

You don't need to debug all of this alone. Good reasons to reach out:

The <60-day license warning is firing and you need a renewed key.
You see Postgres pool exhaustion and want a sizing recommendation for your workload.
You're moving to Kubernetes / HA and want the supported pattern.
You see a metric or behavior that this page doesn't describe.

Contact: support@quantumbpm.com.
