Observability vs. Monitoring: Why Your "All Green" Dashboard is Lying to You

It’s 03:00. PagerDuty screams. You open Grafana. CPU is at 40%. RAM is fine. Disk I/O is steady. The dashboard is a sea of comforting green validation. Yet the support tickets are piling up: "Checkout is broken." "The site is crawling."

This is the classic failure of monitoring. You are watching the infrastructure, but you aren't seeing the transaction. In 2025, knowing your server is "up" is the bare-minimum participation trophy of engineering. It tells you nothing about whether you're actually operational.

I've spent the last decade debugging high-traffic systems across Europe. The difference between a chaotic fire-fighting session and a calm fix usually comes down to one thing: moving from monitoring (checking the health of components) to observability (understanding the internal state of the system from its external outputs).

The "Known Unknowns" vs. "Unknown Unknowns"

Let’s cut the marketing fluff. Monitoring is for problems you can predict. You know disk space runs out, so you set an alert for disk_usage > 90%. These are known unknowns.
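
In Prometheus terms, that check is a few lines of YAML. A sketch, assuming node_exporter is being scraped and that a 10% free-space floor is the threshold you actually care about:

groups:
- name: known_unknowns
  rules:
  - alert: DiskAlmostFull
    # Fires when a filesystem has been under 10% free space for 10 minutes.
    # Adjust the fstype filter and threshold to your own layout.
    expr: |
      node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"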

Observability is for the things you never thought to ask. Why did latency spike to 500 ms for users in Bergen on Safari for iOS, and only when the recommendation engine's cache missed? You can't write a Nagios check for that.

The Three Pillars in Practice (Not Theory)

You’ve heard of Logs, Metrics, and Traces. But how you implement them determines whether you get observability or just an expensive storage bill.

1. Metrics: The "What"

Metrics are cheap. They aggregate well. But averages lie. A 200ms average response time looks great, even if 5% of your users are hitting 10-second timeouts.

Pro Tip: Stop alerting on averages. Alert on the 95th and 99th percentiles. In Prometheus, use histograms. If your p99 latency on CoolVDS isn't under 50ms for static content, check your nginx config, not the hardware. Our NVMe arrays don't bottleneck.
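
As a sketch in PromQL, assuming your application exports a histogram named http_request_duration_seconds with a service label (both names are assumptions; use whatever your instrumentation actually emits), p99 over the last five minutes looks like this:

# p99 request latency per service over the last 5 minutes.
# Assumes a histogram metric called http_request_duration_seconds with a "service" label.
histogram_quantile(
  0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)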

2. Logs: The "Context"

Logs are heavy. In a microservices environment, a raw text log is useless without correlation. If you aren't injecting a TraceID into your Nginx or HAProxy logs, you are flying blind.

Here is how we configure Nginx to support OpenTelemetry trace propagation, essential for correlating a slow request with a specific backend error:

http {
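    # Assumes an OpenTelemetry NGINX module (e.g. otel_ngx_module from
    # opentelemetry-cpp-contrib) is loaded via load_module; it exposes the
    # $opentelemetry_trace_id and $opentelemetry_span_id variables used below.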
    log_format trace_fmt '$remote_addr - $remote_user [$time_local] "$request" '
                         '$status $body_bytes_sent "$http_referer" '
                         '"$http_user_agent" "$http_x_forwarded_for" '
                         'trace_id=$opentelemetry_trace_id span_id=$opentelemetry_span_id';

    access_log /var/log/nginx/access.log trace_fmt;
}

3. Traces: The "Where"

Tracing follows the request through the stack. In 2025, OpenTelemetry (OTel) is the undisputed standard. Proprietary agents are dead. If you are building on CoolVDS, you run the OTel Collector as a binary agent.

Below is a production-ready otel-collector-config.yaml snippet for shipping traces to a backend like Jaeger or Tempo, so the expensive indexing and storage work happens there instead of burning CPU on the host serving your traffic:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  otlp:
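    # Endpoint and insecure TLS assume the backend is reachable on a private network;
    # use a real hostname and proper TLS if this traffic crosses the public internet.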
    endpoint: "tempo.monitoring.svc.cluster.local:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
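
To actually run this as a host agent, wrap the collector in a systemd unit. A minimal sketch, assuming the contrib distribution is installed as /usr/local/bin/otelcol-contrib and the config lives at /etc/otelcol/config.yaml (both paths are assumptions; match your install):

[Unit]
Description=OpenTelemetry Collector (host agent)
After=network-online.target
Wants=network-online.target

[Service]
# Binary name and config path are assumptions; adjust to your installation.
ExecStart=/usr/local/bin/otelcol-contrib --config /etc/otelcol/config.yaml
Restart=on-failure
# Assumes a dedicated 'otel' system user exists.
User=otel
# Keep the agent from starving the workload it is supposed to observe.
MemoryMax=1G

[Install]
WantedBy=multi-user.target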

Why Infrastructure Choice Dictates Observability

You cannot observe what you cannot touch. This is where the "Managed Cloud" abstraction fails advanced teams. If you are on a restrictive PaaS, you often cannot install eBPF probes or access kernel-level metrics.

On a CoolVDS KVM instance, you have full kernel authority. This allows you to use tools like bpftrace to see exactly which system calls are stalling your database.

For example, if MySQL is stalling, a standard monitor says "High I/O." An observability approach using eBPF asks the kernel:

# bpftrace -e 'tracepoint:syscalls:sys_enter_fsync { @[comm] = count(); }'
Attaching 1 probe...
^C
@["mysqld"]: 421

This command reveals that mysqld triggered 421 fsyncs in a few seconds. Now you know it's a configuration issue (likely innodb_flush_log_at_trx_commit), not a noisy neighbor stealing your IOPS. On shared hosting, you'd just be guessing.
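
If the fsync storm is indeed the culprit and you can live with losing up to roughly one second of committed transactions after a crash, the usual fix is one line in my.cnf. A sketch, not a recommendation for every workload:

[mysqld]
# 1 = write and fsync the redo log at every commit (default, safest, heaviest fsync load)
# 2 = write at every commit, fsync roughly once per second
# 0 = write and fsync roughly once per second (fastest, least durable)
innodb_flush_log_at_trx_commit = 2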

The Norwegian Context: Latency and Law

Observability data is sensitive. It contains IP addresses, user IDs, and query parameters. Under GDPR and the scrutiny of Datatilsynet, shipping that log data to a US-based SaaS observability platform is a legal minefield (the Schrems II fallout is still a headache in 2025).

Hosting your observability stack (Prometheus/Loki/Grafana) on CoolVDS instances in Oslo gives you two things:

  • Data Sovereignty: Your logs never leave Norwegian jurisdiction.
  • Latency and cost: Pushing gigabytes of trace data over the public internet to a SaaS backend adds both. Keeping it local on our high-speed internal network costs you nothing in egress.
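
Getting that stack onto a single instance is not a big project either. A minimal Docker Compose sketch, using stock images and default ports, with no persistent volumes yet, so treat it as a starting point rather than production:

services:
  prometheus:
    image: prom/prometheus:latest      # pin a specific tag in production
    volumes:
      # Assumes a prometheus.yml scrape config sits next to this file.
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  loki:
    image: grafana/loki:latest         # pin a specific tag in production
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana:latest      # pin a specific tag in production
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=change-me   # change this
    ports:
      - "3000:3000"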

Implementation Guide: The "Golden Signals" Setup

If you are setting up a fresh stack today, don't overcomplicate it. Start with the "Golden Signals" defined by Google SREs: Latency, Traffic, Errors, and Saturation.

Here is a Prometheus alert rule for High Error Rates that actually makes sense. It triggers only if the error rate exceeds 1% of total traffic for 5 minutes, avoiding pager fatigue from blips:

groups:
- name: golden_signals
  rules:
  - alert: HighErrorRate
    expr: |
      sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum by (instance) (rate(http_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "5xx errors are > 1% of traffic."

The Verdict

Monitoring confirms your server is on. Observability confirms your business is working. The transition requires more than just installing a tool; it requires access to the metal and the freedom to configure the kernel to your needs.

Don't let a "black box" hosting provider hide the root cause of your downtime. Get a VPS where you can see everything.

Ready to see what's actually happening inside your application? Deploy a KVM instance on CoolVDS today and get root access to your reality.