
Why the Four Golden Signals Are Not Enough

Many engineering organizations mistake comprehensive telemetry for operational maturity. It is common to find environments with extensive dashboards, infrastructure alerts, and high-volume metrics that still fail to provide clarity during an incident. When production degrades, teams frequently face a critical gap: they can see that system resource utilization is fluctuating, but they cannot immediately determine whether users are experiencing a broken product.

This issue stems from a fundamental misalignment: measuring system internals rather than user outcomes. While infrastructure visibility is necessary for debugging, it does not correlate directly with the user experience. To bridge this gap, teams must shift from passive monitoring to structured reliability engineering.

Back to Basics: The Four Golden Signals

An effective observability strategy does not require hundreds of fragmented dashboard panels. Instead, it should focus on the core signals that define service health from the user's perspective. Google’s Site Reliability Engineering (SRE) framework outlines four critical metrics that capture this state:

  • Latency: The time required to service a request. This must be tracked using percentiles (such as p95 and p99) rather than averages, which routinely obscure severe outliers experienced by a subset of users.
  • Traffic: A measure of the demand placed on the system. Depending on the architecture, this is typically quantified by requests per second, concurrent users, or queue throughput.
  • Errors: The rate of failed requests. This includes explicit failures (such as HTTP 5xx status codes) and implicit failures (such as an application returning an HTTP 200 but delivering empty or corrupted payloads).
  • Saturation: A measure of system utilization relative to its maximum capacity. While CPU and memory are standard indicators, saturation often manifests earlier in downstream constraints like database connection pool exhaustion, queue lag, or thread pool depletion.
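
To make the point about percentiles concrete, here is a minimal Python sketch that compares the mean with p95 and p99 on a small, made-up latency sample. The function name `latency_summary` and the sample values are assumptions for illustration only; a real system would normally read these figures from its metrics backend rather than compute them from raw lists.

```python
import statistics


def latency_summary(latencies_ms):
    """Return the mean, p95 and p99 of request latencies given in milliseconds."""
    # With n=100, statistics.quantiles returns the 99 cut points p1..p99,
    # so index 94 is p95 and index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical sample: 97 fast requests and 3 very slow ones.
sample = [100.0] * 97 + [5000.0] * 3
summary = latency_summary(sample)
print(summary)
```

For this sample the mean is 247 ms and p95 is 100 ms, both of which look healthy, while p99 is 5000 ms. The 3% of users waiting five seconds only show up in the tail percentile.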

Why Golden Signals Beat Infrastructure Metrics

Infrastructure metrics such as CPU utilization, disk I/O, and memory consumption remain essential for root-cause analysis, but they are poor indicators of immediate user impact. Customers do not experience high CPU utilization; they experience slow or failed requests, missing functionality, and outright downtime.

A service can run at 95% CPU utilization while serving requests flawlessly, just as a system at 5% CPU utilization can be completely broken due to a misconfigured upstream dependency or an application-level deadlock. By shifting the primary focus to the Four Golden Signals, teams monitor the symptoms of degradation rather than the implementation details of the underlying infrastructure.

Why Visibility Is Not Enforcement

Tracking the Four Golden Signals and building alert thresholds is only an intermediate step. Without a formal framework to translate these metrics into operational decisions, they remain purely informational. Operational maturity requires the introduction of three distinct concepts:

  • SLI (Service Level Indicator): A quantifiable metric demonstrating how well a service is performing. For example: the percentage of valid HTTP requests completed successfully within 300ms.
  • SLO (Service Level Objective): The target reliability goal defined for an SLI over a specific rolling time window. For example: maintaining a 99.9% successful request rate over 30 days.
  • Error Budget: The total allowable room for unreliability within a given SLO window (e.g., a 99.9% SLO allows for a 0.1% error budget).
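
The arithmetic behind these three definitions is small enough to sketch in Python. The function names and the request counts below are hypothetical; the 99.9% target and 30-day window are the examples used above. This is an illustration of the relationships, not a prescribed implementation.

```python
def sli(good_requests: int, total_requests: int) -> float:
    """Fraction of valid requests that met the success criterion."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Share of the window's error budget still unspent (negative if overspent).

    Assumes slo_target is below 1.0, so that some failures are allowed.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_bad = (1.0 - slo_target) * total_requests
    actual_bad = total_requests - good_requests
    return 1.0 - actual_bad / allowed_bad


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Full-outage minutes the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


# Hypothetical 30-day window: 10,000,000 valid requests, 4,000 of them bad.
current_sli = sli(9_996_000, 10_000_000)
remaining = error_budget_remaining(9_996_000, 10_000_000, slo_target=0.999)
downtime = allowed_downtime_minutes(0.999)
```

In this example the SLI is 99.96%, the 0.1% budget allows roughly 10,000 bad requests, and about 60% of that budget is still unspent. Expressed as time, a 99.9% objective over 30 days tolerates about 43.2 minutes of full outage.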

An error budget provides the necessary leverage to balance product velocity with system stability. When an error budget is exhausted, it serves as a clear policy indicator that engineering priorities must pivot from feature development to technical debt reduction, architectural stabilization, and performance remediation.

Real-World Operational Nuances

Implementing these concepts in production requires moving past rigid textbook definitions. Two common anti-patterns frequently disrupt SLO initiatives:

1. Treating Error Budgets as Hard Deployment Blocks

While theoretical SRE models suggest freezing all deployments the moment an error budget is depleted, strict automated delivery locks are rarely practical. Security vulnerabilities must be patched, compliance fixes must ship, and the very changes needed to restore stability often require rolling out new code.

Instead of a hard stop on delivery pipelines, the error budget should act as a prioritization governance mechanism, ensuring that reliability engineering receives dedicated capacity before a critical degradation occurs.
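
One way to picture the budget as a prioritization mechanism rather than a gate is a simple mapping from remaining budget to reserved engineering capacity. The sketch below is purely illustrative: the function name, the tiers, and every percentage in it are assumptions, and each team would have to negotiate its own values.

```python
# Hypothetical tiers: (budget_remaining must exceed this, capacity share reserved).
CAPACITY_TIERS = [
    (0.50, 0.10),  # healthy budget: a small standing allocation
    (0.25, 0.25),  # budget shrinking: reserve a quarter of the sprint
    (0.00, 0.50),  # budget nearly gone: half the sprint goes to reliability
]
EXHAUSTED_SHARE = 0.80  # assumed share once the budget is fully spent


def reliability_capacity_share(budget_remaining: float) -> float:
    """Fraction of sprint capacity reserved for reliability work.

    budget_remaining is the unspent share of the error budget
    (1.0 = untouched, 0.0 or below = exhausted).
    """
    for floor, share in CAPACITY_TIERS:
        if budget_remaining > floor:
            return share
    return EXHAUSTED_SHARE


for remaining in (0.9, 0.4, 0.1, 0.0):
    print(remaining, reliability_capacity_share(remaining))
```

Nothing here blocks a deployment. Security patches and stabilization fixes still ship, while the share of planned work devoted to reliability rises as the budget shrinks.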

2. The Single Burn-Rate Trap

Configuring a single burn-rate threshold to manage SLO alerts introduces an impossible operational trade-off. High thresholds accurately identify catastrophic failures but miss slow, compounding issues. Conversely, low thresholds capture subtle degradation but generate enough noisy alerts to cause alert fatigue.

Production environments require multi-window, multi-burn-rate alerting strategies to handle different incident profiles effectively:

| Alert Severity | Burn Rate | Budget Consumption | Required Action |
| --- | --- | --- | --- |
| Critical | 14.4 | Consuming 2% of budget in 1 hour | Immediate page via incident management system |
| Warning | 6.0 | Consuming 5% of budget in 6 hours | Asynchronous team notification |
| Ticket | 1.0 | Consuming budget gradually over days | Scheduled backlog item for next sprint |
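
The burn rates above follow from two small formulas: burn rate is the observed error rate divided by the error budget, and the budget consumed is the burn rate multiplied by the alert window's share of the SLO period. The Python sketch below works through them under stated assumptions: a 99.9% SLO over 30 days, a 3-day long window for the ticket tier, and short confirmation windows (5m, 30m, 6h) that are not part of the table and are added here only to illustrate the multi-window idea. All names are hypothetical.

```python
SLO_TARGET = 0.999          # assumed objective
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day window (720 hours)

# (severity, burn-rate threshold, long window, assumed short window)
RULES = [
    ("critical", 14.4, "1h", "5m"),
    ("warning", 6.0, "6h", "30m"),
    ("ticket", 1.0, "3d", "6h"),
]


def burn_rate(error_rate: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Fraction of the whole error budget spent at this burn rate over the window."""
    return rate * window_hours / period_hours


def evaluate(error_rates: dict, slo_target: float = SLO_TARGET):
    """Return the most severe rule whose long AND short windows both exceed
    its threshold, or None if no rule fires."""
    for severity, threshold, long_w, short_w in RULES:
        long_burn = burn_rate(error_rates[long_w], slo_target)
        short_burn = burn_rate(error_rates[short_w], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            return severity
    return None


# Hypothetical error rates per lookback window during a sharp outage.
outage = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.005, "3d": 0.002}
# Hypothetical slow burn that has already recovered in the last 5 minutes.
slow_burn = {"5m": 0.0005, "30m": 0.008, "1h": 0.008, "6h": 0.008, "3d": 0.003}

print(evaluate(outage), evaluate(slow_burn))
```

Requiring both windows to exceed the threshold means an alert fires only while the problem is still ongoing, which is one common way to reduce pages for incidents that have already recovered.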

A Practical Dashboard Architecture

To prevent information overload, teams should decouple detection from diagnosis by implementing a two-tier dashboard strategy.

Tier 1: Reliability Dashboards

These high-level views are restricted to critical business metrics: SLIs, SLO compliance, remaining error budgets, and active incident statuses. Their purpose is to answer a single question: "Are users currently experiencing a problem?"

Tier 2: Diagnostic Dashboards

These service-specific views contain granular application metrics, infrastructure telemetry, distributed traces, and resource utilization. Their purpose is to answer the subsequent question: "Why is the problem happening?"

The Bottom Line

The Four Golden Signals are an excellent baseline for system visibility, but visibility alone does not guarantee reliability. The Golden Signals identify what is happening; SLIs, SLOs, and error budgets determine whether it matters to the business. Reliability improves when data directly informs engineering priorities—dashboards offer visibility, but objectives enforce accountability.
