Operations

This page covers running an Enterprise deployment day-to-day. For initial setup see Deployment, for what's tunable see Configuration.

Observability

Set ENABLE_OTEL=true to enable Prometheus metrics. They're served from a separate HTTP server on REST_METRICS_PORT (default 9464), not on the public API port. That server has no authentication, so bind it to a private interface with REST_METRICS_ADDR or firewall the port.

What's exported today

Two metric stacks contribute to /metrics:

OpenTelemetry (via the quantum meter) — the BPMN engine's per-tenant instruments are wired into the platform's OTEL meter on every engine. All instruments are tagged with engine.name and tenant.id (the project schema), plus per-metric labels listed below.

| Instrument | Type | Labels | What it covers |
| --- | --- | --- | --- |
| bpmn.tokens.processed | Counter | | BPMN execution tokens processed. The throughput pulse of the engine. |
| bpmn.activity.queue.depth | UpDownCounter | | Tokens currently queued for processing. Climbs when workers can't keep up. |
| bpmn.incidents.total | Counter | incident.error_type | Incidents created, broken down by error type. |
| bpmn.gateway.evaluation.duration | Histogram (ms) | gateway.type | Latency of gateway condition evaluation, per gateway type. |
| bpmn.process_instances.created | Counter | process.id, process.version | Process instance starts. |
| bpmn.process_instances.completed | Counter | process.id, process.version, process.status | Process instance terminations, by terminal status. |
| bpmn.process_instance.duration | Histogram (ms) | process.id, process.version, process.status | End-to-end instance duration. |
| bpmn.jobs.processed | Counter | job.type | Service-task jobs processed, by job type. |
| bpmn.job.duration | Histogram (ms) | job.type | Service-task job execution latency. |
| bpmn.continue_as_new.total | Counter | reason | ContinueAsNew rotations. Currently the only reason value is history_budget. |

Prometheus (direct registration) — additional families registered through the global Prometheus registry:

| Family | What it covers |
| --- | --- |
| http_requests_total / http_request_duration_seconds | REST API request totals and latency. |
| pgxpool_* | PostgreSQL pool: connections in use, acquires, waits, idle. |
| bpmn_complete_signal_seconds / bpmn_complete_db_seconds | Per-call breakdown of the /complete endpoint (Temporal SignalWorkflow time vs. DB UPDATE time). Labeled single vs. batch. |
| bpmn_poll_total_seconds / bpmn_poll_jobs_returned | External-job polling: total wall time per poll (by outcome) and distribution of jobs returned per non-empty poll. |
| Temporal SDK metrics | Standard Temporal client metrics (workflow tasks, activity tasks, polling). Wired in when ENABLE_OTEL=true. |
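To see which of these families your deployment actually exposes, you can scrape /metrics and group the samples by metric name. The sketch below is one way to do that with the Python standard library. It assumes the endpoint serves the standard Prometheus text exposition format; the URL in fetch_metrics assumes the default port on loopback, and the label names and the pgxpool_example_gauge metric in SAMPLE are made up for illustration.

```python
from collections import defaultdict
from urllib.request import urlopen


def fetch_metrics(url="http://127.0.0.1:9464/metrics"):
    """Fetch the raw exposition text. The URL is an assumption (default port, loopback)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def group_samples_by_family(exposition_text):
    """Return {metric_name: [sample values]} from Prometheus text exposition."""
    families = defaultdict(list)
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rfind("}") + 1 :].split()
        else:
            # name value [timestamp]
            parts = line.split()
            name, rest = parts[0], parts[1:]
        families[name].append(float(rest[0]))
    return dict(families)


# Illustrative input only: label names and the gauge name are hypothetical.
SAMPLE = """\
# HELP http_requests_total Example only.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
pgxpool_example_gauge 7
"""

families = group_samples_by_family(SAMPLE)
```

Against a live deployment you would call group_samples_by_family(fetch_metrics()) and look at the keys. Prometheus metric names cannot contain dots, so expect the OpenTelemetry instruments to appear under translated names rather than the dotted names in the table.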

Performance tuning

The settings that most often need attention:

Database connections

DB_MAX_CONNS (default 50) is the pgxpool maximum. Every BPMN poll, job completion, process start, and DMN evaluation grabs a connection, so pool exhaustion shows up as requests queuing. Watch the pgxpool_* metrics and raise the cap if you see acquire waits climbing.
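For a first estimate before you have metrics to go on, Little's law gives the average number of busy connections as the operation rate multiplied by the time each operation holds a connection. The sketch below applies that with a headroom factor. The function, its parameter names, and the 1.5 headroom are illustrative assumptions, not a platform formula; treat the result as a starting point and let the pgxpool_* metrics have the final say.

```python
import math


def estimate_pool_size(ops_per_second, mean_hold_seconds, headroom=1.5):
    """Rough DB_MAX_CONNS estimate via Little's law (busy = rate x hold time).

    All parameters are hypothetical inputs you would measure yourself.
    """
    busy = ops_per_second * mean_hold_seconds
    return max(1, math.ceil(busy * headroom))


# 100 connection-using operations per second, each holding a connection 250 ms:
# 25 busy on average, 38 with headroom, which fits under the default of 50.
light = estimate_pool_size(100, 0.25)

# 400 operations per second at the same hold time needs well over the default.
heavy = estimate_pool_size(400, 0.25)
```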

BPMN workers per tenant

Each "worker" is a Temporal worker (workflow + activity poller pair) running against one tenant's namespace. BPMN_DEFAULT_WORKERS (default 1) is the global default. Override it per project in YAML:

bpmnWorkers:
  defaultWorkers: 1
  overrides:
    "<project-uuid>": 4

Multiple workers per tenant give you more poll concurrency against Temporal but increase steady-state load. Start at the default, raise specific tenants when their throughput tells you to.
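The precedence described above (a per-project override wins, otherwise the global default applies) can be sketched as a plain lookup. This only illustrates the rule as documented; how the platform resolves it internally is not shown here, and the project UUIDs are invented.

```python
def workers_for_project(config, project_id):
    """Return the worker count for a project: override if present, else default."""
    bpmn = config.get("bpmnWorkers", {})
    default = bpmn.get("defaultWorkers", 1)
    return bpmn.get("overrides", {}).get(project_id, default)


# Mirrors the YAML shape above; the UUIDs are made up for the example.
config = {
    "bpmnWorkers": {
        "defaultWorkers": 1,
        "overrides": {"3f2b8c1e-0000-4000-8000-000000000001": 4},
    }
}

busy_tenant = workers_for_project(config, "3f2b8c1e-0000-4000-8000-000000000001")
other_tenant = workers_for_project(config, "3f2b8c1e-0000-4000-8000-000000000002")
```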

FEEL evaluation caps

FEEL_MAX_DEPTH, FEEL_MAX_ITERATIONS, and FEEL_DEFAULT_TIMEOUT cap what a single FEEL expression can do. The platform aborts any expression that exceeds them. Defaults protect against pathological or hostile input — normal expressions don't come close.
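To make the three caps concrete, here is a toy evaluator that charges a budget for every node it visits and aborts when the depth, iteration count, or deadline is exceeded. It is not the platform's FEEL engine and does not parse FEEL; it sums nested lists of numbers purely to show how such limits are typically enforced. Every name in it is hypothetical.

```python
import time


class EvaluationAborted(Exception):
    """Raised when an evaluation exceeds one of its caps."""


class EvalBudget:
    def __init__(self, max_depth, max_iterations, timeout_seconds, clock=time.monotonic):
        self.max_depth = max_depth
        self.max_iterations = max_iterations
        self.clock = clock
        self.deadline = clock() + timeout_seconds
        self.iterations = 0

    def step(self, depth):
        """Charge one unit of work and check all three caps."""
        self.iterations += 1
        if depth > self.max_depth:
            raise EvaluationAborted("max depth exceeded")
        if self.iterations > self.max_iterations:
            raise EvaluationAborted("max iterations exceeded")
        if self.clock() > self.deadline:
            raise EvaluationAborted("timeout exceeded")


def evaluate(node, budget, depth=1):
    """Sum a nested list of numbers, charging the budget for every node."""
    budget.step(depth)
    if isinstance(node, list):
        return sum(evaluate(child, budget, depth + 1) for child in node)
    return node


# A small expression finishes well inside generous caps.
total = evaluate([1, [2, 3]], EvalBudget(max_depth=10, max_iterations=100, timeout_seconds=1.0))
```

The same expression aborts if max_depth is 2 (it nests three levels) or max_iterations is 3 (it has five nodes).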

Access cache

In the Enterprise binary the access cache is a thin read-through cache over the projects.schema_name column — given a project UUID from the URL, return the PostgreSQL schema name to scope DB queries to. Role and membership are read from the JWT on every request and are not cached.

What this means in practice:

  • The cache only saves a single PostgreSQL lookup per project per cache window. The contents change only when projects are created or renamed.
  • ACCESS_CACHE_TTL is a pure performance knob — raising it reduces DB load with no functional downside. Lowering it costs DB queries with no benefit.
  • There is no in-platform role revocation to wait on. If you need to revoke a user's access, do it at your IdP (stop issuing the project claim) and wait for their existing token to expire.

The defaults (100 entries, 30s TTL) are fine for most deployments. If you have many projects and want to eliminate the cold-miss latency, raise ACCESS_CACHE_MAX_SIZE to comfortably exceed your project count and ACCESS_CACHE_TTL to something well above your typical request cadence.
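If it helps to picture what the two knobs control, the following is a minimal read-through cache with a size cap and a TTL. It is a sketch, not the platform's implementation: the class and parameter names are invented, the loader stands in for the projects.schema_name lookup, and least-recently-used eviction is an assumption, since this page doesn't say how entries are evicted.

```python
import time
from collections import OrderedDict


class ReadThroughCache:
    def __init__(self, loader, max_size=100, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader          # called on a miss, e.g. a DB lookup
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            self._entries.move_to_end(key)  # mark as recently used
            return hit[0]
        value = self._loader(key)           # miss or expired: one lookup
        self._entries[key] = (value, now + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used (assumed policy)
        return value


# Usage with a fake clock and a loader that records its calls.
lookups = []
fake_now = [0.0]


def load_schema(project_id):
    lookups.append(project_id)
    return "schema_" + project_id


cache = ReadThroughCache(load_schema, max_size=2, ttl_seconds=30.0, clock=lambda: fake_now[0])
first = cache.get("a")
second = cache.get("a")  # served from cache, no second lookup
```

With more projects than max_size, entries are evicted and re-fetched even inside the TTL, which is the cold-miss latency the paragraph above describes.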

License monitoring

Three things happen automatically:

| When | What |
| --- | --- |
| Startup | License signature, issuer, and expiry are verified. Invalid or expired → backend exits non-zero. |
| Every 24 hours | A background goroutine re-validates the license. Expiry triggers graceful shutdown. |
| Less than 60 days until expiry | A warning is logged on every check. Set up an alert on this log line. |

Graceful shutdown on expiry: readiness probe goes false → HTTP server drains in-flight requests → BPMN service stops → process exits with code 1.
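If you track the expiry date in your own tooling as well, the same thresholds are easy to mirror. The sketch below classifies a license by its expiry timestamp using the 60-day window from the table. The function name and return values are invented, and treating the exact expiry instant as already expired is an assumption.

```python
from datetime import datetime, timedelta, timezone

WARNING_WINDOW = timedelta(days=60)


def license_status(expires_at, now):
    """Classify a license as 'expired', 'warn' (under 60 days left), or 'ok'."""
    if now >= expires_at:
        return "expired"
    if expires_at - now < WARNING_WINDOW:
        return "warn"
    return "ok"


expiry = datetime(2025, 6, 1, tzinfo=timezone.utc)
far_out = license_status(expiry, datetime(2025, 1, 1, tzinfo=timezone.utc))
closing_in = license_status(expiry, datetime(2025, 5, 1, tzinfo=timezone.utc))
lapsed = license_status(expiry, datetime(2025, 6, 2, tzinfo=timezone.utc))
```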

Backups

All durable state lives in PostgreSQL — the Enterprise binary writes no state files to disk. Back up Postgres with whatever tooling you already use (pg_dump, WAL archiving, managed-Postgres snapshots, etc.).
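As one example of the pg_dump route, the sketch below builds a timestamped custom-format dump command and runs it. The connection string, output directory, and file naming are placeholders to adapt. It dumps the whole database, so every project schema is included.

```python
import subprocess
from datetime import datetime, timezone


def build_pg_dump_command(dsn, out_dir, now=None):
    """Build a pg_dump invocation; dsn and out_dir are placeholders you supply."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    out_file = f"{out_dir}/quantumbpm-{stamp}.dump"
    return [
        "pg_dump",
        "--format=custom",      # compressed archive, restorable with pg_restore
        f"--file={out_file}",
        f"--dbname={dsn}",      # accepts a connection string
    ]


def run_backup(dsn, out_dir):
    """Run the dump and raise if pg_dump exits non-zero."""
    subprocess.run(build_pg_dump_command(dsn, out_dir), check=True)


command = build_pg_dump_command(
    "postgresql://localhost/app",
    "/backups",
    now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```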

If you use Temporal, that cluster has its own state — workflow histories, namespace metadata, search-attribute registrations — and needs its own backup story per Temporal's documentation. The platform doesn't try to back Temporal up for you.

The license itself is held in an environment variable, so there's nothing to back up on the platform side. Store it wherever you manage secrets.

Upgrades

QuantumBPM follows semantic versioning. The upgrade procedure depends on whether the release is a minor/patch bump or a new major version.

Minor and patch releases

To upgrade, stop the old binary and start the new one. Schema migrations apply automatically on startup — no separate migration step.

Single-replica deployments will see a brief downtime window covering the new binary's startup time. The first start after a release also runs any pending migrations, which lengthens that window in proportion to the amount of data being migrated.

Multi-replica deployments should drain old replicas at the reverse proxy or load balancer before bringing new ones up. For HA-specific upgrade patterns tailored to your topology, contact support@quantumbpm.com.

Major releases

A new major version signals a change to the BPMN engine that long-running workflows started by the previous major version cannot safely cross over (typically a replay-determinism or snapshot-shape change). To preserve in-flight workflows, run both versions side-by-side during the transition and drain instances over with the migration CLI shipped alongside the release.

The rollout shape:

  1. Deploy the new major alongside the old one. Each version stamps its own engine identity on the workflows it starts, and they execute on separate Temporal task queues — so v1 workflows keep running on the v1 binary while new starts route to v2.
  2. Run the migration CLI. It enumerates workflows still tagged with the previous major's version label and migrates each to the new version at its next quiescent moment. The migration is asynchronous, idempotent, and auditable — partial runs can be resumed and each migrated workflow records an audit marker.
  3. Retire the old binary once its queue depth reaches zero. The old binary must keep running until the drain completes — it performs the rotation handoff for each workflow. Retiring it early leaves workflows stuck.

Process definitions in active use must be present in the new binary's registry before the drain begins. The migration CLI verifies this before issuing rotations.
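The "retire only after the drain completes" rule in step 3 lends itself to a simple gate in your rollout automation. The sketch below polls a queue-depth reading until it reaches zero or a poll limit is hit. Everything here is hypothetical: get_queue_depth is a callable you would supply, and where that number comes from in your deployment is outside the scope of this example.

```python
import time


def wait_for_drain(get_queue_depth, poll_seconds=30.0, max_polls=None, sleep=time.sleep):
    """Return True once the old version's queue depth is zero, False on giving up.

    get_queue_depth is a caller-supplied function returning the current depth.
    """
    polls = 0
    while True:
        if get_queue_depth() == 0:
            return True   # safe to retire the old binary
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False  # still draining: keep the old binary running
        sleep(poll_seconds)


# Simulated readings: the queue drains on the third poll.
readings = iter([3, 1, 0])
drained = wait_for_drain(lambda: next(readings), sleep=lambda seconds: None)
```

A False result means the old binary has to stay up; shutting it down at that point is exactly the early retirement the list above warns about.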

Logs

The backend writes structured logs to stdout. Pipe them into your existing log pipeline (Loki, Elasticsearch, CloudWatch, whatever). There is no on-disk log destination to configure.

Healthchecks

The backend exposes a readiness endpoint that flips to "not ready" before graceful shutdown. Wire your orchestrator's readiness probe to it so traffic stops being routed before connections drain.

When to call support

You don't need to debug all of this alone. Reasonable reasons to reach out:

  • The <60-day license warning is firing and you need a renewed key.
  • You see Postgres pool exhaustion and want a sizing recommendation for your workload.
  • You're moving to Kubernetes / HA and want the supported pattern.
  • You see a metric or behavior that this page doesn't describe.

Contact: support@quantumbpm.com.