Observability¶

Flux provides built-in observability through OpenTelemetry — metrics, distributed tracing, and log correlation. The feature is opt-in with zero overhead when disabled.

Installation¶

Observability requires optional dependencies:

pip install flux-core[observability]

Configuration¶

Add the [flux.observability] section to your flux.toml:

[flux.observability]
enabled = true
service_name = "flux"
prometheus_enabled = true
otlp_endpoint = "http://localhost:4317"  # Optional: OTLP collector
trace_sample_rate = 1.0
metric_export_interval = 60

Or use environment variables:

FLUX_OBSERVABILITY__ENABLED=true
FLUX_OBSERVABILITY__SERVICE_NAME=flux
FLUX_OBSERVABILITY__PROMETHEUS_ENABLED=true
FLUX_OBSERVABILITY__OTLP_ENDPOINT=http://localhost:4317
FLUX_OBSERVABILITY__TRACE_SAMPLE_RATE=1.0
FLUX_OBSERVABILITY__METRIC_EXPORT_INTERVAL=60

Configuration Reference¶

Field	Default	Description
`enabled`	`false`	Enable observability
`service_name`	`"flux"`	OpenTelemetry service name
`prometheus_enabled`	`true`	Expose `/metrics` endpoint
`otlp_endpoint`	`null`	OTLP collector gRPC endpoint
`trace_sample_rate`	`1.0`	Trace sampling rate (0.0 to 1.0)
`metric_export_interval`	`60`	OTLP push interval in seconds
`resource_attributes`	`{}`	Additional OTel resource attributes

Behavior¶

enabled: false (default) — nothing initializes, no overhead
enabled: true, no otlp_endpoint — Prometheus /metrics only
enabled: true with otlp_endpoint — both Prometheus and OTLP push

Metrics¶

Flux exposes 18 metric instruments accessible via the Prometheus /metrics endpoint at http://localhost:8000/metrics.

Workflow Metrics¶

Metric	Type	Labels	Description
`flux_workflow_executions_total`	Counter	`workflow_name`, `status`	Workflow executions by status (`started`, `completed`, `failed`, `cancelled`)
`flux_workflow_execution_duration_seconds`	Histogram	`workflow_name`	Worker-side workflow execution duration

Task Metrics¶

Metric	Type	Labels	Description
`flux_task_executions_total`	Counter	`workflow_name`, `task_name`, `status`	Task executions by status (`started`, `completed`, `failed`)
`flux_task_execution_duration_seconds`	Histogram	`workflow_name`, `task_name`	Per-task execution duration
`flux_task_retries_total`	Counter	`workflow_name`, `task_name`	Task retry attempts

Execution Pipeline Metrics¶

Metric	Type	Labels	Description
`flux_execution_queue_depth`	UpDownCounter	—	Executions waiting for workers
`flux_execution_schedule_to_start_seconds`	Histogram	—	Time from queued to worker claim
`flux_checkpoints_total`	Counter	`workflow_name`	Checkpoint events
`flux_checkpoint_duration_seconds`	Histogram	`workflow_name`	Checkpoint HTTP round-trip duration

Worker Metrics¶

Metric	Type	Labels	Description
`flux_workers_active`	UpDownCounter	—	Connected workers
`flux_worker_registrations_total`	Counter	`worker_name`	Registration events
`flux_worker_disconnections_total`	Counter	`worker_name`, `reason`	Disconnection events
`flux_worker_executions_active`	UpDownCounter	`worker_name`	Concurrent executions per worker
`flux_module_cache_total`	Counter	`result`	Module cache lookups (`hit`, `miss`)

Schedule Metrics¶

Metric	Type	Labels	Description
`flux_schedule_triggers_total`	Counter	`schedule_name`, `outcome`	Schedule trigger outcomes

HTTP Metrics¶

Metric	Type	Labels	Description
`flux_http_requests_total`	Counter	`method`, `endpoint`, `status_code`	HTTP request count (paths normalized)
`flux_http_request_duration_seconds`	Histogram	`method`, `endpoint`	HTTP request latency

Distributed Tracing¶

When enabled, Flux creates spans for workflow executions, task executions, and HTTP requests. Trace context is automatically propagated between the server and workers using W3C traceparent/tracestate headers.

Span Hierarchy¶

[Server] HTTP POST /workflows/{name}/run/async
  -- (trace context propagated via SSE event) --
  +-- [Worker] flux.workflow.execute
        +-- [Worker] flux.task.execute {task_name}
        +-- [Worker] flux.task.execute {task_name}

Span Attributes¶

All custom attributes use the flux.* namespace:

flux.workflow.name — Workflow name
flux.execution.id — Execution ID
flux.task.name — Task name
flux.worker.name — Worker name

Log Correlation¶

When observability is enabled, an OTel log handler is added to the root flux logger. Log records emitted inside an active span automatically include otelTraceID and otelSpanID attributes, allowing you to correlate logs with traces.

Docker Compose Setup¶

The easiest way to run Flux with full observability is using the Docker Compose observability profile:

docker compose -f docker-compose.yml \
  -f docker/profiles/observability.yml \
  up -d

This starts:

Service	URL	Purpose
Flux Server	`http://localhost:8000`	Workflow engine with `/metrics`
Prometheus	`http://localhost:9090`	Metrics storage and queries
Grafana	`http://localhost:3000`	Dashboards (admin/admin)
Jaeger	`http://localhost:16686`	Trace visualization
OTel Collector	`localhost:4317`	Receives OTLP data

Example Prometheus Queries¶

# Workflow execution rate (per minute)
rate(flux_workflow_executions_total[5m]) * 60

# Average workflow duration
rate(flux_workflow_execution_duration_seconds_sum[5m]) / rate(flux_workflow_execution_duration_seconds_count[5m])

# Execution queue depth
flux_execution_queue_depth

# Task failure rate
rate(flux_task_executions_total{status="failed"}[5m])

# HTTP request latency (p95)
histogram_quantile(0.95, rate(flux_http_request_duration_seconds_bucket[5m]))

# Connected workers
flux_workers_active

# Execution queue depth
flux_execution_queue_depth

Grafana Setup¶

Open Grafana at http://localhost:3000 (login: admin/admin)
Add Prometheus data source: http://prometheus:9090
Create dashboards using the queries above

Jaeger Setup¶

Open Jaeger at http://localhost:16686
Select service flux from the dropdown
Search for traces to see distributed execution flow