
Monitoring the Stack

Effective monitoring of Customer-Managed Prefect covers four areas: PostgreSQL, Redis, the Kubernetes cluster, and the Prefect application services themselves.

PostgreSQL

Monitor the following metrics on each of the three databases (Nebula, Orion, Events):

  • CPU Utilization: Sustained high CPU can indicate missing indexes, inefficient queries, or insufficient sizing
  • Memory Utilization: Low available memory can lead to increased disk I/O and degraded query performance
  • Active Connections: Exhausting the connection limit will cause new requests to fail; track against max_connections
  • PG Locks: Lock contention can cause queries to queue and time out, particularly during high-concurrency flow runs

For recommended PostgreSQL configuration to support performance and observability, see Database Flags.
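
As a sketch of how the connection metric might feed an alert, the following assumes you export the active connection count (e.g., from pg_stat_activity) and the max_connections setting for each database; the 80% threshold and the function name are illustrative choices, not Prefect recommendations:

```python
# Hypothetical helper: flag a database whose connection usage is near its limit.
# `active` would come from pg_stat_activity and `max_connections` from the
# server setting of the same name; the 0.80 threshold is an arbitrary example.
def connection_pressure(active: int, max_connections: int, threshold: float = 0.80) -> bool:
    """Return True when active connections reach the threshold fraction of max_connections."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections >= threshold

# Example: 85 active connections against the PostgreSQL default limit of 100.
print(connection_pressure(85, 100))  # True: investigate before new requests start failing
```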

Redis

Monitor the following metrics on each Redis instance (Cache, Work, Streams):

  • CPU Utilization: High CPU on the Streams instance may indicate a backlog of unprocessed events
  • Memory Utilization: Redis will evict keys or error when memory is exhausted; size each instance appropriately
  • Cache Hit Ratio: A low hit ratio on the Cache instance means more requests are falling through to PostgreSQL, increasing database load
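
The hit ratio can be derived from the keyspace_hits and keyspace_misses counters in the stats section of the Redis INFO command. A minimal sketch (the 0.90 floor in the comment is an illustrative threshold, not a Prefect recommendation):

```python
# Sketch: compute the cache hit ratio from Redis INFO counters.
# keyspace_hits and keyspace_misses are standard fields in INFO's stats
# section; how low is "too low" depends on your workload.
def cache_hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Fraction of key lookups served from the cache; 0.0 when there is no traffic."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0

ratio = cache_hit_ratio(keyspace_hits=9_000, keyspace_misses=1_000)
print(f"{ratio:.2f}")  # 0.90
```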

Kubernetes Cluster

Monitor at the cluster and node level:

  • Node CPU Utilization: Sustained saturation will cause pod throttling and degraded service response times
  • Node Memory Utilization: Memory pressure leads to pod evictions, which can interrupt running services
  • Control Plane Metrics: API server latency and etcd health are leading indicators of cluster instability

Prefect Application Services

Prometheus Metrics

Each Prefect web service exposes a Prometheus-compatible /metrics endpoint. For example, the auth service:

auth:4300/metrics

Scrape these endpoints with your Prometheus instance to track request rates, error rates, and latency per service. See Prefect Services for the full list of web services and their roles.
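
A scrape job for these endpoints might look like the following sketch. Only the auth:4300 endpoint shown above comes from this guide; the job name is arbitrary, and the remaining targets would follow the same host:port pattern from the Prefect Services list:

```yaml
# Illustrative Prometheus scrape job for the Prefect web services.
scrape_configs:
  - job_name: prefect-services   # arbitrary example name
    metrics_path: /metrics
    static_configs:
      - targets:
          - auth:4300
          # add the other Prefect web services here
```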

Service Logs

Logs are the most direct signal of service health. Collect and monitor logs from all running Prefect pods:

kubectl logs -n prefect <pod-name>

Key things to watch for:

  • Error-level log entries — unexpected exceptions or repeated failures in any service
  • Authentication errors — failures in the auth service will cascade to all other services
  • Database connection errors — indicate connectivity or pool exhaustion issues with PostgreSQL or Redis
  • Slow query warnings — sustained slow queries point to database sizing or indexing issues
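
The checklist above can be sketched as a simple log filter. The match patterns here are illustrative substrings, not Prefect's actual log format; adapt them to what your services emit:

```python
# Sketch: flag log lines worth alerting on, per the checklist above.
# These substrings are hypothetical examples, not Prefect's log format.
WATCH_PATTERNS = ("ERROR", "authentication failed", "connection refused", "slow query")

def flag_lines(log_lines):
    """Return the lines matching any watch pattern, case-insensitively."""
    return [line for line in log_lines
            if any(p.lower() in line.lower() for p in WATCH_PATTERNS)]

sample = [
    "INFO  request completed in 12ms",
    "ERROR unexpected exception in event pipeline",
    "WARN  slow query: 2.3s on flow run lookup",
]
for line in flag_lines(sample):
    print(line)
```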

To increase log verbosity for a service during investigation, set the following environment variables on the relevant Kubernetes deployment:

- name: PREFECT_CLOUD_LOGGING_LEVEL
  value: 'DEBUG'
- name: PREFECT_CLOUD_DISABLE_LOGGING_CONFIG
  value: 'true'