# Monitoring the Stack
Effective monitoring of Customer-Managed Prefect covers four areas: PostgreSQL, Redis, the Kubernetes cluster, and the Prefect application services themselves.
## PostgreSQL
Monitor the following metrics on each of the three databases (Nebula, Orion, Events):
| Metric | Why It Matters |
|---|---|
| CPU Utilization | Sustained high CPU can indicate missing indexes, inefficient queries, or insufficient sizing |
| Memory Utilization | Low available memory can lead to increased disk I/O and degraded query performance |
| Active Connections | Exhausting the connection limit will cause new requests to fail; track against max_connections |
| PG Locks | Lock contention can cause queries to queue and time out, particularly during high-concurrency flow runs |
For recommended PostgreSQL configuration to support performance and observability, see Database Flags.
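If you scrape the databases with postgres_exporter, connection saturation can be alerted on before requests start failing. A sketch of a Prometheus alert rule, assuming the exporter's default metric names (`pg_stat_activity_count`, `pg_settings_max_connections`); adjust the threshold and label matching to your setup:

```yaml
# Hypothetical alert: fire when a database approaches max_connections.
# Metric names assume postgres_exporter defaults.
groups:
  - name: prefect-postgres
    rules:
      - alert: PostgresConnectionsNearLimit
        expr: >
          sum by (instance) (pg_stat_activity_count)
          / on (instance) pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connections above 80% of max_connections on {{ $labels.instance }}"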
## Redis
Monitor the following metrics on each Redis instance (Cache, Work, Streams):
| Metric | Why It Matters |
|---|---|
| CPU Utilization | High CPU on the Streams instance may indicate a backlog of unprocessed events |
| Memory Utilization | Redis will evict keys or error when memory is exhausted; size each instance appropriately |
| Cache Hit Ratio | A low hit ratio on the Cache instance means more requests are falling through to PostgreSQL, increasing database load |
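Redis does not report the hit ratio directly; it is derived from the `keyspace_hits` and `keyspace_misses` counters in `INFO stats`. A sketch of the calculation, with sample counter values inlined (in practice, replace the `printf` with `redis-cli -h <cache-host> INFO stats` against the Cache instance):

```shell
# Derive the cache hit ratio from INFO stats counters.
# Sample values are inlined below for illustration.
printf 'keyspace_hits:9500\r\nkeyspace_misses:500\r\n' |
awk -F: '/keyspace_hits/   {h=$2}
         /keyspace_misses/ {m=$2}
         END {printf "hit ratio: %.1f%%\n", 100 * h / (h + m)}'
# prints "hit ratio: 95.0%"
```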
## Kubernetes Cluster
Monitor at the cluster and node level:
| Metric | Why It Matters |
|---|---|
| Node CPU Utilization | Sustained saturation will cause pod throttling and degraded service response times |
| Node Memory Utilization | Memory pressure leads to pod evictions, which can interrupt running services |
| Control Plane Metrics | API server latency and etcd health are leading indicators of cluster instability |
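As one concrete example, node memory pressure can be alerted on before evictions begin. A sketch of a Prometheus rule, assuming node_exporter metric names as shipped by a standard kube-prometheus-stack install:

```yaml
# Hypothetical alert: node working memory above 90% for 10 minutes.
- alert: NodeMemoryPressure
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} memory above 90%; pod evictions likely"
```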
## Prefect Application Services
### Prometheus Metrics
Each Prefect web service exposes a Prometheus-compatible `/metrics` endpoint. For example:

```
auth:4300/metrics
```
Scrape these endpoints with your Prometheus instance to track request rates, error rates, and latency per service. See Prefect Services for the full list of web services and their roles.
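A minimal static scrape job for these endpoints might look like the following sketch (only the `auth` target is taken from this page; extend the target list to match your deployed services and ports):

```yaml
# Hypothetical Prometheus scrape config for the Prefect web services.
scrape_configs:
  - job_name: prefect-services
    metrics_path: /metrics
    static_configs:
      - targets:
          - auth:4300
          # add the remaining web services and their ports here
```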
### Service Logs
Logs are the most direct signal of service health. Collect and monitor logs from all running Prefect pods:
```shell
kubectl logs -n prefect <pod-name>
```
Key things to watch for:
- Error-level log entries — unexpected exceptions or repeated failures in any service
- Authentication errors — failures in the `auth` service will cascade to all other services
- Database connection errors — indicate connectivity or pool exhaustion issues with PostgreSQL or Redis
- Slow query warnings — sustained slow queries point to database sizing or indexing issues
To increase log verbosity for a service during investigation, set the following environment variables on the relevant Kubernetes deployment:

```yaml
- name: PREFECT_CLOUD_LOGGING_LEVEL
  value: 'DEBUG'
- name: PREFECT_CLOUD_DISABLE_LOGGING_CONFIG
  value: 'true'
```
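In manifest form, these variables sit under the service container's `env` list. A sketch of where they land, using the `auth` service as an example (the deployment, container, and namespace names here are assumptions; match them to your install):

```yaml
# Hypothetical deployment fragment enabling debug logging on the auth service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth
  namespace: prefect
spec:
  template:
    spec:
      containers:
        - name: auth
          env:
            - name: PREFECT_CLOUD_LOGGING_LEVEL
              value: 'DEBUG'
            - name: PREFECT_CLOUD_DISABLE_LOGGING_CONFIG
              value: 'true'
```

Remember to revert these after the investigation; debug logging is verbose and increases log storage volume.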