Observability is Not Monitoring: Why Your Green Dashboard is Lying to You

It’s 03:14. You’re in Oslo. The rain is hitting the window, and PagerDuty just screamed at you. You open Grafana. Everything is green. CPU is at 40%, memory is fine, disk space is ample. Yet your customers in Bergen are getting 504 Gateway Timeouts. This is the classic DevOps nightmare: you have monitoring, but you don't have observability.

Monitoring is asking the system: "Are you okay?" The system answers: "Yes."
Observability is asking the system: "What are you doing right now?" and getting a stack trace involving a locked database row caused by a rogue cron job.

In this post, we aren't discussing "digital landscapes." We are fixing broken stacks. We will look at how to move from passive checking to active introspection using tools available right now in 2024, like OpenTelemetry and eBPF, and why the underlying hardware (specifically the VPS isolation layer) matters more than your dashboard colors.

The "War Story": The Phantom Latency

Last winter, I was debugging a Magento cluster hosted on a budget provider. The monitoring dashboards showed healthy load averages. Yet, checkout times spiked to 8 seconds randomly. We wasted days tweaking PHP-FPM settings.

The culprit? Steal time.

The provider was overselling CPU cycles. Our "monitoring" checked our guest OS metrics, which looked fine. "Observability"—specifically checking the steal time via iostat and correlating it with application traces—revealed that the hypervisor was choking our I/O requests.

Pro Tip: Always check %steal on your VPS. If it's consistently above 1-2%, migrate. On CoolVDS, we use strict KVM isolation. You get the cycles you pay for, preventing this specific class of "phantom" failure.
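
Checking takes ten seconds and needs nothing beyond sysstat and procps, which most distros already ship:

# CPU breakdown, one sample per second for five seconds; watch the %steal column
iostat -c 1 5

# Same story from vmstat; steal time is the last column, "st"
vmstat 1 5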

Step 1: Structured Logging (Stop Grepping Text Files)

If you are still SSH-ing into servers to tail -f /var/log/nginx/error.log, stop. You cannot correlate events across microservices with text files.

You need JSON logs. Here is the exact Nginx configuration I use to feed logs into Loki or Elasticsearch. It captures the request time and upstream response time, which are critical for pinpointing if the slowness is Nginx or the backend app.

http {
    log_format json_analytics escape=json
    '{'
        '"time_local": "$time_local", '
        '"remote_addr": "$remote_addr", '
        '"request_uri": "$request_uri", '
        '"status": "$status", '
        '"request_time": "$request_time", '
        '"upstream_response_time": "$upstream_response_time", '
        '"trace_id": "$http_x_b3_traceid" '
    '}';

    access_log /var/log/nginx/access_json.log json_analytics;
}
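
Before pointing Loki or Elasticsearch at that file, make sure the output actually parses. A quick sanity check, assuming the paths above and that jq is installed:

# Validate the config, reload, then confirm the newest log line is valid JSON
nginx -t && nginx -s reload
tail -n 1 /var/log/nginx/access_json.log | jq .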

Note the $http_x_b3_traceid. This brings us to the core of 2024 observability: Distributed Tracing.

Step 2: The OpenTelemetry Revolution

By mid-2024, OpenTelemetry (OTel) has effectively won the protocol war. It unifies metrics, logs, and traces. Instead of running a Jaeger agent, a Prometheus exporter, and Fluent Bit separately, you run a single OTel Collector.

Here is a battle-tested docker-compose snippet to get a local observability stack running on a CoolVDS instance. This setup uses the OTel collector to push data to Prometheus (metrics) and Tempo (traces).

version: "3.9"
services:
  # The brain of operations
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.100.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
      - "8888:8888" # Metrics

  # Metrics storage. The image ships a default prometheus.yml that only scrapes
  # itself; mount your own config if it should also scrape the collector.
  prometheus:
    image: prom/prometheus:v2.51.2
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
    ports:
      - "9090:9090"

  # Tracing storage
  tempo:
    image: grafana/tempo:2.4.1
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes:
      # Mount your own tempo.yaml so the command above can find it
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200" # Tempo query API
      - "4317"      # OTLP gRPC, reached by the collector as tempo:4317

This stack is lightweight enough to run on a standard 4GB CoolVDS instance without eating into your application's resources.
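
The compose file above expects an ./otel-config.yaml sitting next to it, and that file is where receivers, processors, and exporters get wired into pipelines. Here is a minimal sketch, written as a shell heredoc so you can paste it straight onto the box; the Tempo endpoint and the Prometheus scrape port (8889) are assumptions that should match whatever you actually run:

cat > otel-config.yaml <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  # Traces are forwarded to Tempo over the internal Docker network
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Metrics are exposed on :8889 for Prometheus to scrape
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
EOF

Add a scrape job for otel-collector:8889 to your Prometheus config and the metrics side of the pipeline is complete.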

Step 3: Instrumenting the Application

Infrastructure metrics (CPU, RAM) are just noise if you can't link them to business logic. You need to know that this specific SQL query caused that CPU spike.

In Python (FastAPI/Flask), auto-instrumentation in 2024 is robust. You don't need to rewrite your code. You just wrap the execution.

# Install the OTel distro, then auto-detect and install instrumentation
# packages for the libraries your app already uses
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Point the SDK at the local collector (OTLP over gRPC on 4317)
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_SERVICE_NAME="checkout-service-norway"

# Run your app with the wrapper
opentelemetry-instrument python main.py

Now, every HTTP request generates a span. If a request to the database takes 500ms, you see it in the trace. No more guessing.
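
To confirm spans are actually arriving rather than silently dropping, the collector's own telemetry on port 8888 is the fastest check (ports assume the compose stack from Step 2):

# The accepted/sent span counters should be climbing while you send traffic
curl -s http://localhost:8888/metrics | grep -E 'otelcol_(receiver_accepted|exporter_sent)_spans'

# Tempo's readiness endpoint should answer once it has started
curl -s http://localhost:3200/ready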

The Hardware Reality: Where "Cloud" Fails

You can have the best Grafana dashboards in Europe, but if your underlying storage IOPS fluctuate, your observability data will be full of anomalies you cannot explain.

In Norway, data sovereignty is critical (thanks, GDPR and Datatilsynet). But performance is the other side of that coin. When you host on shared platforms with "burstable" instances, your 99th percentile latency (p99) becomes garbage because another tenant is compiling Rust crates next door.

CoolVDS uses NVMe storage exclusively. Why does this matter for observability?

  • Write Speed: High-cardinality logs (like debug traces) require massive write throughput. Spinning disks will drop log lines under load.
  • Latency Consistency: To alert on a 50ms latency spike, your baseline must be stable. If your disk variance is +/- 100ms, your alerts are useless. A quick way to put numbers on both is sketched below.
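
If you want numbers rather than a feeling, fio reports both throughput and latency percentiles. A minimal sketch, assuming fio is installed and roughly 256MB of scratch space is free:

# 4k random writes, direct I/O, queue depth 1: the pattern that hurts most on shared storage
fio --name=vds-latency --filename=/tmp/fio-test --size=256M \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 \
    --direct=1 --runtime=30 --time_based --group_reporting

# Remove the scratch file afterwards
rm -f /tmp/fio-test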

Check Your I/O Wait

Run this command on your current server. If %iowait is consistently visible during traffic spikes, your storage is the bottleneck, not your code.

iostat -x 1 10

Look at the await column. On a CoolVDS NVMe instance, this should be sub-millisecond. On crowded legacy VPS hosts, I've seen this hit 20ms+.

Conclusion: Stop Guessing

Monitoring is for uptime. Observability is for understanding. To survive in the 2024 DevOps environment, you need to implement structured logging, embrace OpenTelemetry, and ensure your infrastructure isn't the root cause of your noise.

You cannot observe a system that is fundamentally unstable due to noisy neighbors. Start with a solid foundation. Deploy a KVM-isolated, NVMe-backed instance in our Oslo datacenter. Experience latency to NIX (Norwegian Internet Exchange) that is practically a rounding error.

Ready to see what your code is actually doing? Spin up a CoolVDS instance today and install the OTel collector in under 5 minutes.