OpenShift Monitoring: Metrics, Alerts, and SLOs on OCP
Overview
OpenShift monitoring is not optional infrastructure wallpaper. It is how platform teams prove API server latency, etcd health, and workload SLOs before executives or regulators ask uncomfortable questions. OCP ships the cluster monitoring stack powered by the Prometheus Operator: Prometheus instances scrape platform and (when enabled) user-workload targets, Alertmanager routes notifications, and Thanos Querier provides a single query endpoint across those Prometheus instances. Long-term storage is not part of the default stack and has to be added through remote_write or an external system.
Out of the box, openshift-monitoring namespaces run with tight RBAC — application developers do not see node-exporter metrics on infra nodes unless you enable user workload monitoring or deploy a dedicated observability tenant. Understanding that boundary prevents shadow Prometheus instances sprouting in every project, duplicating scrape configs and burning CPU.
This article explains default stack architecture, enabling user-workload metrics, building Grafana dashboards that matter, alert routing discipline, and how monitoring data underpins disaster recovery runbooks when clusters fail.
Without OpenShift monitoring discipline, on-call engineers grep logs during incidents while executives ask why SLAs broke without warning. Metrics are not vanity — they are contractual evidence and capacity signals.
Platform Monitoring Stack and Prometheus Operator
The Cluster Monitoring Operator manages Prometheus, Alertmanager, kube-state-metrics, node-exporter, and telemetry collectors in openshift-monitoring. ServiceMonitors and PodMonitors — when permitted — define scrape targets declaratively. Platform alerts cover etcd members, API server error rates, machine config pool degradation, and cluster operator availability.
Prometheus retention defaults suit troubleshooting, not years of compliance storage. Integrate Thanos sidecars or remote_write to corporate Mimir, Cortex, or cloud monitoring for long retention and global query views. Verify remote_write credentials via openshift-monitoring secrets and network egress policies before enabling.
oc adm top and the Kubernetes Metrics API expose current resource usage to oc and kubectl; they do not replace Prometheus for SLO burn rates. Teach on-call engineers which signals are platform-owned (kube-apiserver request latency) vs tenant-owned (HTTP 5xx from app routes).
Platform Prometheus RBAC denies tenant access by design — do not disable it for convenience. Instead enable user workload monitoring or federate metrics to a corporate tenant with proper isolation.
User Workload Monitoring and Tenant Metrics
Enable user workload monitoring when application teams need ServiceMonitor resources in their namespaces scraped into a dedicated Prometheus tenant. This keeps tenant metrics isolated from platform Prometheus RBAC while still using supported operators. Document the RBAC that grants user workload monitoring roles, such as the monitoring-edit cluster role, to the namespace admins who deploy monitors.
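As a minimal sketch of what enabling looks like, the snippet below builds the ConfigMap that switches user workload monitoring on. It assumes the documented OpenShift 4 convention of a cluster-monitoring-config ConfigMap in openshift-monitoring with enableUserWorkload: true under the config.yaml key; the function name is hypothetical. It emits JSON, which oc apply accepts alongside YAML.

```python
import json


def user_workload_configmap() -> dict:
    """Build the ConfigMap that enables user workload monitoring.

    Assumption: the cluster has no existing cluster-monitoring-config.
    If it does, merge this key into the existing config.yaml instead of
    replacing the ConfigMap, or other settings will be lost.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cluster-monitoring-config",
            "namespace": "openshift-monitoring",
        },
        # config.yaml is a YAML document stored as a plain string.
        "data": {"config.yaml": "enableUserWorkload: true\n"},
    }


if __name__ == "__main__":
    # Pipe the output to `oc apply -f -` on a test cluster first.
    print(json.dumps(user_workload_configmap(), indent=2))
```

Check for an existing ConfigMap before applying anything like this: retention or remote_write settings may already live under the same key.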
Application instrumentation should expose RED metrics (rate, errors, duration) on /metrics endpoints compatible with the Prometheus text format. When choosing between sidecar and embedded exporters, prefer embedded for simplicity. OpenTelemetry collectors can receive OTLP and export Prometheus metrics when libraries support OTLP natively.
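To make the RED idea concrete, here is a standard-library sketch that records requests and renders them in the Prometheus text exposition format. The metric names, label names, and bucket bounds are illustrative assumptions, not a required convention; a real service would normally use a maintained client library such as prometheus_client rather than hand-rolling the format.

```python
from collections import defaultdict

# Hypothetical latency bucket upper bounds, in seconds.
BUCKETS = (0.1, 0.5, 1.0, float("inf"))


class RedMetrics:
    """Tracks request rate, errors, and duration for one service."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)  # (method, status) -> count
        self.bucket_counts = [0] * len(BUCKETS)
        self.duration_sum = 0.0
        self.duration_count = 0

    def observe(self, method: str, status: str, seconds: float) -> None:
        self.requests[(method, status)] += 1
        self.duration_sum += seconds
        self.duration_count += 1
        # Histogram buckets are cumulative: count every bound >= seconds.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[i] += 1

    def render(self) -> str:
        lines = [
            "# HELP http_requests_total Total HTTP requests.",
            "# TYPE http_requests_total counter",
        ]
        for (method, status), count in sorted(self.requests.items()):
            lines.append(
                f'http_requests_total{{method="{method}",status="{status}"}} {count}'
            )
        lines += [
            "# HELP http_request_duration_seconds Request latency.",
            "# TYPE http_request_duration_seconds histogram",
        ]
        for bound, count in zip(BUCKETS, self.bucket_counts):
            le = "+Inf" if bound == float("inf") else repr(bound)
            lines.append(f'http_request_duration_seconds_bucket{{le="{le}"}} {count}')
        lines.append(f"http_request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"http_request_duration_seconds_count {self.duration_count}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    metrics = RedMetrics()
    metrics.observe("GET", "200", 0.05)
    metrics.observe("GET", "500", 0.7)
    print(metrics.render(), end="")
```

Rate and error ratio then come from rate() over http_requests_total in PromQL, and latency percentiles from histogram_quantile() over the bucket series.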
Avoid running duplicate Prometheus instances in every namespace. Centralize where possible with federation or remote_write to a managed observability cluster. The cost of fifty tiny Prometheus instances on a medium OCP cluster often exceeds that of one well-sized shared tenant.
ServiceMonitor label selectors must match the labels on the Service exactly (a PodMonitor selects pod labels instead). A mismatch is a common misconfiguration that produces empty Prometheus targets and false confidence from health endpoints that nobody scrapes.
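The matching rule is easy to check offline. The sketch below mirrors Kubernetes matchLabels semantics, where every key and value in the selector must be present on the selected object; the label names and values are hypothetical.

```python
def match_labels(selector: dict, labels: dict) -> bool:
    """Return True when every selector key/value appears in labels.

    Mirrors matchLabels: the object may carry extra labels, but a single
    differing or missing value means no match. An empty selector
    matches every object.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def missing_or_wrong(selector: dict, labels: dict) -> dict:
    """List selector entries the object fails, with what it has instead."""
    return {
        key: labels.get(key)
        for key, value in selector.items()
        if labels.get(key) != value
    }


if __name__ == "__main__":
    monitor_selector = {"app": "checkout", "tier": "backend"}  # hypothetical
    object_labels = {"app": "checkout", "tier": "back-end", "team": "shop"}
    print(match_labels(monitor_selector, object_labels))
    print(missing_or_wrong(monitor_selector, object_labels))
```

Running a check like this against the labels returned by oc get with --show-labels catches a one-character mismatch before anyone trusts an empty target list.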
Grafana Dashboards and Visualization on OpenShift
Older OpenShift 4 releases bundled a read-only Grafana for platform dashboards, while recent releases surface those dashboards in the web console; many enterprises integrate corporate Grafana with OAuth SSO. Import community dashboards for Kubernetes and etcd, then customize thresholds to your environment, because default panels assume generic hardware. Build golden dashboards per service tier: latency percentiles, saturation, errors, and dependency health.
Dashboards as code — Jsonnet, Grafana operator CRDs, or ConfigMaps synced via GitOps — prevent UI-only changes that vanish after pod restart. Version dashboards beside application repos so PR review covers observability changes alongside code.
Correlate logs with metrics via trace IDs when using OpenShift Logging with Loki or Elasticsearch. Jumping from Alertmanager notification to Grafana panel to Kibana or Grafana Explore log view should be one click in the runbook, not three bookmarks.
OpenShift console observability tabs help developers who lack Grafana access — platform teams still own golden dashboards for SRE review. Do not let console metrics replace long-retention analytics for capacity planning.
Alertmanager, Routing, and On-Call Discipline
Alert fatigue kills OpenShift monitoring programs. Every alert should be actionable, owned, and tied to a runbook. Use inhibition rules so node NotReady suppresses hundreds of pod alerts. Route platform alerts to SRE paging; application alerts to product teams via Alertmanager receivers matched on namespace or label.
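The inhibition and routing logic can be sketched in a few lines to show what the Alertmanager configuration is meant to achieve. This is a simplified model, not Alertmanager's implementation: alerts are plain label dictionaries, and the alert names, the scope label, and the receiver names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InhibitRule:
    source: dict  # labels a firing source alert must carry
    target: dict  # labels of the alerts to mute
    equal: tuple  # label names that must agree between source and target


def matches(matchers: dict, labels: dict) -> bool:
    return all(labels.get(key) == value for key, value in matchers.items())


def apply_inhibition(alerts: list, rule: InhibitRule) -> list:
    """Drop target alerts that share the `equal` labels with a source alert."""
    sources = [alert for alert in alerts if matches(rule.source, alert)]
    kept = []
    for alert in alerts:
        muted = matches(rule.target, alert) and any(
            src is not alert
            and all(src.get(name) == alert.get(name) for name in rule.equal)
            for src in sources
        )
        if not muted:
            kept.append(alert)
    return kept


def receiver_for(alert: dict) -> str:
    """Route platform namespaces to SRE paging, everything else to its team."""
    namespace = alert.get("namespace", "")
    if namespace.startswith("openshift-"):
        return "sre-pager"
    return f"team-{namespace}" if namespace else "unrouted"


if __name__ == "__main__":
    rule = InhibitRule(
        source={"alertname": "KubeNodeNotReady"},
        target={"scope": "pod"},
        equal=("node",),
    )
    alerts = [
        {"alertname": "KubeNodeNotReady", "node": "worker-1",
         "namespace": "openshift-monitoring"},
        {"alertname": "PodNotReady", "scope": "pod", "node": "worker-1",
         "namespace": "shop"},
        {"alertname": "PodNotReady", "scope": "pod", "node": "worker-2",
         "namespace": "shop"},
    ]
    for alert in apply_inhibition(alerts, rule):
        print(receiver_for(alert), alert["alertname"], alert["node"])
```

The equal list is the part that matters: without it, one unhealthy node would silence pod alerts on every node.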
Tune PrometheusRule resources bundled with operators, because they sometimes ship noisy defaults. Review Watchdog and TargetDown alerts quarterly. Synthetic probes in the style of blackbox-exporter validate routes and DNS from outside the cluster; internal health checks miss ingress misconfiguration.
SLO-based alerting via error budgets (multi-window, multi-burn-rate) reduces false positives compared to static thresholds on CPU. Implement gradually — start with availability SLOs on critical routes, expand to latency once baselines exist.
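A short worked example shows why the multi-window form cuts false positives. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO), and a page fires only when both a long and a short window exceed the same factor. The factors and window pairs below follow the commonly cited scheme for a 30-day SLO period (14.4 over 1h and 5m, 6 over 6h and 30m); treat them as assumptions to tune, and the function names as hypothetical.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period

# (long window, short window, burn-rate factor) pairs that page.
PAGE_WINDOWS = (("1h", "5m", 14.4), ("6h", "30m", 6.0))


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo)


def budget_consumed(factor: float, window_hours: float) -> float:
    """Fraction of the period's error budget burned at `factor` for a window."""
    return factor * window_hours / SLO_PERIOD_HOURS


def should_page(error_ratios: dict, slo: float) -> bool:
    """error_ratios maps a window name such as '1h' to its error ratio."""
    return any(
        burn_rate(error_ratios[long_w], slo) > factor
        and burn_rate(error_ratios[short_w], slo) > factor
        for long_w, short_w, factor in PAGE_WINDOWS
    )


if __name__ == "__main__":
    slo = 0.999  # 99.9% availability
    ongoing = {"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.02}
    recovered = {"1h": 0.02, "5m": 0.0005, "6h": 0.004, "30m": 0.0005}
    print(should_page(ongoing, slo))
    print(should_page(recovered, slo))
    print(budget_consumed(14.4, 1))
```

The short window is what stops the page once the incident is over: in the recovered case the 1h ratio is still high, but the 5m ratio has dropped, so nobody is woken for a problem that already ended.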
Run alert review meetings monthly — delete alerts that never page or always page. OpenShift monitoring maturity is measured by signal-to-noise ratio, not dashboard count.
Monitoring for Capacity, Security, and Disaster Recovery
Capacity planning uses Prometheus trends: node CPU/memory allocation ratios, etcd db size growth, persistent volume usage, and ingress connection counts. Forecast before quarterly hardware or cloud commit renewals. Chargeback reports from metrics reduce opaque namespace sprawl.
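A trend forecast of this kind is a least-squares line through recent samples, extended forward. The sketch below uses made-up etcd database sizes and an assumed 90-day horizon; PromQL's predict_linear() does a similar fit server-side, so this is mainly useful for reports built from exported data.

```python
def fit_line(samples: list) -> tuple:
    """Least-squares fit over (day, value) pairs; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    return slope, mean_y - slope * mean_x


def forecast(samples: list, horizon_days: float) -> float:
    """Projected value `horizon_days` after the most recent sample."""
    slope, intercept = fit_line(samples)
    last_day = max(x for x, _ in samples)
    return slope * (last_day + horizon_days) + intercept


def days_until(samples: list, limit: float):
    """Days after the last sample until the trend reaches `limit`.

    Returns None when the trend is flat or shrinking.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_day = max(x for x, _ in samples)
    return (limit - intercept) / slope - last_day


if __name__ == "__main__":
    # Hypothetical etcd db size in GiB, sampled every ten days.
    etcd_gib = [(0, 2.0), (10, 2.5), (20, 3.0), (30, 3.5)]
    print(forecast(etcd_gib, 90))
    print(days_until(etcd_gib, 8.0))
```

A straight line is a deliberate simplification. Seasonal or step-change growth needs a longer sample window or a different model before the number goes into a commit renewal.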
Security monitoring layers Falco or ACS (Advanced Cluster Security) events atop metrics — unexpected shell in container, crypto miner CPU spikes. Export audit logs to SIEM; correlate with Prometheus anomalies.
During disaster recovery, monitoring tells you whether the restored cluster matches healthy baselines — API latency, etcd leader elections, operator versions. Our disaster recovery article ties RPO/RTO validation to these signals. OpenShift monitoring is the nervous system; treat outages to observability as severity-1 because you are flying blind.
Keep monitoring stack recovery order in DR runbooks — Prometheus before workload scale-up, or alerts fire on cascading failures mistaken for root cause. Test observability failover in game days, not only application failover.
Day-2 OpenShift Monitoring Operations
Cluster Monitoring Operator upgrades ride CVO bumps, so verify that custom PrometheusRule CRs survive API version changes. Back up dashboard ConfigMaps and rule YAML in Git before platform upgrades.
Thanos or remote_write failures can fill local Prometheus disks, so alert on storage usage in the openshift-monitoring and openshift-user-workload-monitoring namespaces. Cardinality explosions from unbounded label values can crash Prometheus; drop or rewrite such labels with metric relabeling (for example metricRelabelings on a ServiceMonitor endpoint).
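Finding the offending label is usually the first step. The sketch below counts distinct values per label across a set of series and flags labels above a threshold; the series, label names, and limit are hypothetical, and in practice the input would come from an export of the series for one metric.

```python
from collections import defaultdict


def label_cardinality(series: list) -> dict:
    """Count distinct values seen for each label name across all series."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def high_cardinality(series: list, limit: int) -> list:
    """Label names with more than `limit` distinct values, worst first."""
    counts = label_cardinality(series)
    offenders = [name for name, count in counts.items() if count > limit]
    return sorted(offenders, key=lambda name: (-counts[name], name))


if __name__ == "__main__":
    # Hypothetical series for one metric; user_id is the unbounded label.
    series = [
        {"method": "GET", "status": "200", "user_id": "u-1001"},
        {"method": "GET", "status": "200", "user_id": "u-1002"},
        {"method": "POST", "status": "500", "user_id": "u-1003"},
    ]
    print(label_cardinality(series))
    print(high_cardinality(series, limit=2))
```

Labels that grow with users, requests, or sessions are the usual candidates to drop at scrape time; bounded labels such as method or status are not.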
OpenShift monitoring succeeds when platform and application teams share metric naming conventions — document RED and USE methodologies in internal standards so dashboards compose across services.
User workload metrics federation to corporate Grafana requires OAuth or token proxy configuration — test dashboard access for developers without cluster-admin before announcing self-service monitoring.
Logs, Traces, and the Full Observability Stack
Cluster Logging Operator or LokiStack forwards application logs — correlate trace IDs from OpenTelemetry SDKs with log lines for faster root cause analysis.
Distributed tracing via Tempo or Jaeger on OCP adds request path visibility that Routes alone cannot provide. Sampling rates balance cost against forensic value.
OpenShift monitoring plus logging plus tracing forms the three pillars — metrics alert, logs explain, traces pinpoint. Invest in all three for tier-one services.
Synthetic probes that hit public routes from outside the cluster catch DNS and certificate issues invisible to in-cluster kubelet probes. Run them from multiple regions if the user base is geo-distributed.
Capacity dashboards should project etcd and PVC growth 90 days forward — leadership approves hardware or cloud commits with data, not gut feel. OpenShift monitoring maturity separates reactive firefighting from planned scaling.
