Observability vs. Monitoring: Why Your Green Dashboards Are Lying to You

It was 3:42 AM on a Tuesday. My pager screamed. The dashboard, a beautiful sprawling grid of Grafana panels, was solid green. CPU usage? Nominal. Memory? 60%. Disk I/O? Within limits. Yet, the checkout service on our client's Magento cluster was timing out for 40% of users in Oslo. We were flying blind with a dashboard that said everything was fine.

This is the failure of Monitoring. It answers the question: "Is the system healthy based on metrics I already decided were important?"

We fixed it by looking at distributed traces, which revealed a third-party payment gateway API call hanging due to a TLS handshake timeout. That is Observability. It answers: "Why is the system behaving this way?"

In 2025, if you are still relying solely on static thresholds, your infrastructure is a ticking time bomb. Let's break down the transition from monitoring to observability, the tech stack you need (OpenTelemetry, Prometheus, Loki), and why your underlying hardware determines whether your observability stack helps you or kills your performance.

The Technical Distinction: Known vs. Unknown

Monitoring is for known unknowns. You know disk space can run out, so you set an alert at 90%. You know CPU can spike, so you watch load averages.
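
That static-threshold world looks like this in practice. Here is a minimal Prometheus alerting rule for the 90% disk example, assuming standard node_exporter metrics (adjust mountpoints and thresholds to your own fleet):

# disk_alerts.yml - the classic "known unknown" threshold alert
groups:
  - name: disk
    rules:
      - alert: RootDiskAlmostFull
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem over 90% used on {{ $labels.instance }}"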

Observability is for unknown unknowns. You didn't know that deploying version 2.4.1 would cause a race condition in the Redis connection pool only when traffic from Trondheim hits a specific load balancer node. You can't write an alert for that. You need data granularity that allows you to ask arbitrary questions.

The Three Pillars in 2025

By now, the "Three Pillars" concept is standard, but the tools have matured significantly.

  • Metrics: Aggregated data. "What is the error rate?" (Tool: Prometheus/VictoriaMetrics)
  • Logs: Discrete events. "What did the error say?" (Tool: Loki/Elasticsearch)
  • Traces: Request lifecycle. "Where did the error happen?" (Tool: Jaeger/Tempo)
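
Taking the first pillar as an example, "what is the error rate?" is a one-liner in PromQL. This sketch assumes your services expose a counter like http_requests_total with a status label; swap in your own metric and label names:

# 5xx ratio per service over the last five minutes
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))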

Implementation: The OpenTelemetry Standard

Gone are the days of proprietary agents. In 2025, if you aren't using OpenTelemetry (OTel), you are locking yourself into a vendor trap. OTel provides a single, vendor-neutral set of APIs, SDKs, and collector components for gathering distributed traces, metrics, and logs.

Here is how you actually instrument a Python service for OTel. Notice we aren't just logging strings; we are creating spans to track execution time.

# app.py - Instrumentation logic
# Requires: opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure the provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Point to your local collector (often running as a sidecar)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

def process_payment():
    # Placeholder for your real payment-gateway call
    raise TimeoutError("TLS handshake to payment gateway timed out")

with tracer.start_as_current_span("process_order_oslo") as span:
    span.set_attribute("geo.region", "NO-Oslo")
    span.set_attribute("customer.tier", "enterprise")
    try:
        # Simulate logic
        process_payment()
    except Exception as e:
        span.record_exception(e)
        span.set_status(trace.Status(trace.StatusCode.ERROR))

This code doesn't just say "error." It tags the span with the region and customer tier. Later, in Grafana, you can filter by geo.region="NO-Oslo" and see exactly which span failed.
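
If those traces land in Tempo, the same filter is a one-line TraceQL query. A sketch, assuming the attribute names set in the snippet above:

{ span.geo.region = "NO-Oslo" && status = error }

That narrows a sea of traces down to exactly the failed Oslo spans.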

The Infrastructure Tax: High Cardinality Kills I/O

Here is the part most tutorials skip. Observability generates massive amounts of data. Tracing every request (or even sampling 10%) creates a write-heavy workload on your storage backend.
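
Sampling is how you keep that volume under control. With the Python SDK from earlier, a 10% head-sampling rate is a one-line change on the provider; treat the ratio as a knob to tune, not a recommendation:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of new traces; child spans follow their parent's sampling decision
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
)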

If you run Elasticsearch or Loki on cheap spinning rust (HDDs) or a shared-resource VPS, your observability stack will crash right when you need it most: during a high-traffic incident. I've seen it happen. The app generates logs faster than the disk can write them, and the resulting backpressure crashes the application containers.

Pro Tip: Never colocate your observability storage with your application database on the same physical disk if you can avoid it. If you must, ensure you have dedicated IOPS.

At CoolVDS, we standardized on NVMe storage specifically for this reason. When you are pushing ingestion rates of 50MB/s into a Loki stream, standard SATA SSDs often hit queue depth limits. You need the parallelism of NVMe.
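
Disk speed is only half of it: Loki itself rate-limits tenants long before the hardware gives out, so sustained high-volume ingest also means raising its limits. An illustrative fragment (defaults and sensible values depend on your Loki version and retention budget):

# loki.yaml (fragment) - per-tenant ingestion limits
limits_config:
  ingestion_rate_mb: 50
  ingestion_burst_size_mb: 100
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 20MB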

Benchmarking Your Disk for Observability

Before deploying a heavy ELK or LGTM (Loki-Grafana-Tempo-Mimir) stack, run this fio test to ensure your VPS can handle the random write patterns of log indexing:

# Simulating database/log ingestion random write patterns
# --direct=1 bypasses the page cache; --iodepth keeps enough requests in flight
# to expose the queue-depth difference between SATA and NVMe
fio --name=random_write_test \
  --ioengine=libaio --rw=randwrite --bs=4k --numjobs=4 --iodepth=32 \
  --size=4G --runtime=60 --time_based \
  --group_reporting --direct=1

On a standard budget VPS, you might see 2,000 IOPS. On a CoolVDS instance, we typically sustain 20,000+ IOPS on our NVMe tiers. That difference is the buffer between a functioning dashboard and a crashed monitoring server.

The Norwegian Context: GDPR & Data Sovereignty

Observability data is dangerous. It often inadvertently contains PII (IP addresses, user IDs, email snippets in stack traces). Under the GDPR and Datatilsynet's strict interpretations here in Norway, shipping these logs to a US cloud provider is a compliance nightmare (the Schrems II implications are still very real in 2025).

Hosting your observability stack on Norwegian soil, or at least within the EEA, is not just a performance choice—it's a legal one. By using a local provider like CoolVDS, you ensure that your trace data, which maps your entire business logic, never leaves the jurisdiction.

Configuring Vector for Local Processing

We use Vector (a high-performance Rust-based data pipeline) to scrub PII before it hits the disk. Here is a configuration snippet to redact Norwegian National ID numbers (fødselsnummer) before storage:

[transforms.scrub_pii]
  type = "remap"
  inputs = ["nginx_logs"]
  source = '''
  # Redact 11-digit Norwegian ID numbers using a regex
  .message = replace(string!(.message), r'\b\d{11}\b', "[REDACTED_FNR]")

  # Parse JSON if applicable
  if is_json(.message) {
    . = parse_json!(.message)
  }
  '''

[sinks.local_loki]
  type = "loki"
  inputs = ["scrub_pii"]
  endpoint = "http://10.0.0.5:3100"
  encoding.codec = "json"
  labels.host = "${HOSTNAME}"
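
Vector also lets you prove the scrubber works before it ever sees production traffic: unit tests live in the same config and run with `vector test /etc/vector/vector.toml`. A sketch against the transform above (the sample message is made up):

[[tests]]
  name = "fnr_is_redacted"

  [[tests.inputs]]
    insert_at = "scrub_pii"
    type = "log"
    [tests.inputs.log_fields]
      message = "checkout failed for customer 12345678901"

  [[tests.outputs]]
    extract_from = "scrub_pii"

    [[tests.outputs.conditions]]
      type = "vrl"
      source = 'contains(string!(.message), "[REDACTED_FNR]")'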

Structured Logging: The Foundation

You cannot observe unstructured text. If your Nginx logs are just lines of text, you are failing. Configure Nginx to output JSON. This allows tools like Vector or Logstash to parse latency fields instantly.

http {
    log_format json_analytics escape=json
    '{'
        '"msec": "$msec", ' # Request time in seconds with milliseconds resolution
        '"connection": "$connection", '
        '"connection_requests": "$connection_requests", '
        '"pid": "$pid", '
        '"request_id": "$request_id", '
        '"request_length": "$request_length", '
        '"remote_addr": "$remote_addr", '
        '"remote_user": "$remote_user", '
        '"remote_port": "$remote_port", '
        '"time_local": "$time_local", '
        '"time_iso8601": "$time_iso8601", '
        '"request": "$request", '
        '"request_uri": "$request_uri", '
        '"args": "$args", '
        '"status": "$status", '
        '"body_bytes_sent": "$body_bytes_sent", '
        '"bytes_sent": "$bytes_sent", '
        '"http_referer": "$http_referer", '
        '"http_user_agent": "$http_user_agent", '
        '"http_x_forwarded_for": "$http_x_forwarded_for", '
        '"http_host": "$http_host", '
        '"server_name": "$server_name", '
        '"request_time": "$request_time", '
        '"upstream": "$upstream_addr", '
        '"upstream_connect_time": "$upstream_connect_time", '
        '"upstream_header_time": "$upstream_header_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"upstream_response_length": "$upstream_response_length", '
        '"upstream_cache_status": "$upstream_cache_status", '
        '"ssl_protocol": "$ssl_protocol", '
        '"ssl_cipher": "$ssl_cipher", '
        '"scheme": "$scheme", '
        '"request_method": "$request_method"'
    '}';

    access_log /var/log/nginx/access_json.log json_analytics;
}

With this config, you can instantly query: "Show me the 99th percentile latency for POST requests to /api/cart originating from IP addresses in Bergen."
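
With those JSON fields flowing into Loki, that question becomes a LogQL query. A sketch, assuming a {job="nginx"} stream label; narrowing to Bergen would additionally require IP-to-geo enrichment upstream (for example in Vector):

quantile_over_time(0.99,
  {job="nginx"}
    | json
    | request_method = "POST"
    | request_uri =~ "/api/cart.*"
    | unwrap request_time [5m]
)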

The Reality Check

Building a true observability stack is not about installing a plugin. It is an architectural commitment. It requires:

  1. Code Changes: Developers must instrument code with spans and attributes.
  2. Cultural Changes: Moving from "Is it up?" to "Is it fast and correct?"
  3. Infrastructure Changes: Moving from cheap shared hosting to dedicated, high-IOPS environments.

Latency matters. If your observability ingestion point is in Frankfurt but your servers are in Oslo, you are adding 20-30ms of network overhead to every trace export. Keeping your monitoring infrastructure local—connected via the NIX (Norwegian Internet Exchange)—reduces jitter and packet loss.
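
In practice that means running an OpenTelemetry Collector next to your workloads and pointing it at a backend in the same datacenter. A minimal collector pipeline sketch (the tempo.internal hostname is a placeholder for your local trace backend):

# otel-collector.yaml - receive locally, batch, export to a nearby backend
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo.internal:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]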

CoolVDS isn't magic. It's just raw, unthrottled KVM resources on enterprise hardware. But when you are trying to debug a microservice race condition at 3 AM, having hardware that doesn't steal CPU cycles or choke on disk writes is the difference between a 5-minute fix and an all-night outage.

Stop staring at green lights. Start querying your data. Deploy a high-performance observability node on CoolVDS today and finally see what your code is actually doing.