# Monitoring the Stack
Effective monitoring of Customer-Managed Prefect covers four areas: PostgreSQL, Redis, the Kubernetes cluster, and the Prefect application services themselves.
## PostgreSQL
Monitor the following metrics on each of the three databases (Nebula, Orion, Events):
| Metric | Why It Matters |
|---|---|
| CPU Utilization | Sustained high CPU can indicate missing indexes, inefficient queries, or insufficient sizing |
| Memory Utilization | Low available memory can lead to increased disk I/O and degraded query performance |
| Active Connections | Exhausting the connection limit will cause new requests to fail; track against max_connections |
| PG Locks | Lock contention can cause queries to queue and time out, particularly during high-concurrency flow runs |
For recommended PostgreSQL configuration to support performance and observability, see Database Flags.
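If you scrape the databases with postgres_exporter, connection saturation can be alerted on before requests start failing. A sketch of a Prometheus alert rule, assuming the exporter's default metric names (`pg_stat_activity_count`, `pg_settings_max_connections`); adjust the threshold and label matching to your setup:

```yaml
# Hypothetical alert: fire when a database approaches max_connections.
# Metric names assume postgres_exporter defaults.
groups:
  - name: prefect-postgres
    rules:
      - alert: PostgresConnectionsNearLimit
        expr: >
          sum by (instance) (pg_stat_activity_count)
          / on (instance) pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connections above 80% of max_connections on {{ $labels.instance }}"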
## Redis
Monitor the following metrics on each Redis instance (Cache, Work, Streams):
| Metric | Why It Matters |
|---|---|
| CPU Utilization | High CPU on the Streams instance may indicate a backlog of unprocessed events |
| Memory Utilization | Redis will evict keys or error when memory is exhausted; size each instance appropriately |
| Cache Hit Ratio | A low hit ratio on the Cache instance means more requests are falling through to PostgreSQL, increasing database load |
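Redis does not report the hit ratio directly; it is derived from the `keyspace_hits` and `keyspace_misses` counters in `INFO stats`. A sketch of the calculation, with sample counter values inlined (in practice, replace the `printf` with `redis-cli -h <cache-host> INFO stats` against the Cache instance):

```shell
# Derive the cache hit ratio from INFO stats counters.
# Sample values are inlined below for illustration.
printf 'keyspace_hits:9500\r\nkeyspace_misses:500\r\n' |
awk -F: '/keyspace_hits/   {h=$2}
         /keyspace_misses/ {m=$2}
         END {printf "hit ratio: %.1f%%\n", 100 * h / (h + m)}'
# prints "hit ratio: 95.0%"
```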
## Kubernetes Cluster
Monitor at the cluster and node level:
| Metric | Why It Matters |
|---|---|
| Node CPU Utilization | Sustained saturation will cause pod throttling and degraded service response times |
| Node Memory Utilization | Memory pressure leads to pod evictions, which can interrupt running services |
| Control Plane Metrics | API server latency and etcd health are leading indicators of cluster instability |
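As one concrete example, node memory pressure can be alerted on before evictions begin. A sketch of a Prometheus rule, assuming node_exporter metric names as shipped by a standard kube-prometheus-stack install:

```yaml
# Hypothetical alert: node working memory above 90% for 10 minutes.
- alert: NodeMemoryPressure
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} memory above 90%; pod evictions likely"
```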
## Prefect Application Services
### Prometheus Metrics
Each Prefect web service exposes a Prometheus-compatible `/metrics` endpoint. For example:

```
auth:4300/metrics
```
Scrape these endpoints with your Prometheus instance to track request rates, error rates, and latency per service. See Prefect Services for the full list of web services and their roles.
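A minimal static scrape job for these endpoints might look like the following sketch (only the `auth` target is taken from this page; extend the target list to match your deployed services and ports):

```yaml
# Hypothetical Prometheus scrape config for the Prefect web services.
scrape_configs:
  - job_name: prefect-services
    metrics_path: /metrics
    static_configs:
      - targets:
          - auth:4300
          # add the remaining web services and their ports here
```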
### Service Logs
Logs are the most direct signal of service health. Collect and monitor logs from all running Prefect pods:
```shell
kubectl logs -n prefect <pod-name>
```
Key things to watch for:
- Error-level log entries — unexpected exceptions or repeated failures in any service
- Authentication errors — failures in the `auth` service will cascade to all other services
- Database connection errors — indicate connectivity or pool exhaustion issues with PostgreSQL or Redis
- Slow query warnings — sustained slow queries point to database sizing or indexing issues
To increase log verbosity for a service during investigation, set the following environment variables on the relevant Kubernetes deployment:

```yaml
- name: PREFECT_CLOUD_LOGGING_LEVEL
  value: 'DEBUG'
- name: PREFECT_CLOUD_DISABLE_LOGGING_CONFIG
  value: 'true'
```
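In manifest form, these variables sit under the service container's `env` list. A sketch of where they land, using the `auth` service as an example (the deployment, container, and namespace names here are assumptions; match them to your install):

```yaml
# Hypothetical deployment fragment enabling debug logging on the auth service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth
  namespace: prefect
spec:
  template:
    spec:
      containers:
        - name: auth
          env:
            - name: PREFECT_CLOUD_LOGGING_LEVEL
              value: 'DEBUG'
            - name: PREFECT_CLOUD_DISABLE_LOGGING_CONFIG
              value: 'true'
```

Remember to revert these after the investigation; debug logging is verbose and increases log storage volume.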