Observability vs Monitoring: Why Your Green Dashboard is Lying to You

It is 3:00 AM on a Tuesday. Your phone screams. PagerDuty alert: "High Latency - API Gateway." You stumble to your laptop, eyes barely open, and log into your monitoring dashboard. Everything is green. CPU is at 40%, RAM is fine, disk space is ample. Yet, Twitter is exploding with angry Norwegian customers unable to check out.

This is the failure of Monitoring. Monitoring tells you that the server is up. Observability tells you why the database query for the user session table is hanging for 500ms.

With the explosion of microservices and the recent release of Kubernetes 1.10, the old ways of "checking if the port is open" (I'm looking at you, Nagios) are dead. As we navigate the post-GDPR world—literally just four days into the new regulation—where your data lives and how you debug it matters more than ever.

The Philosophical Split: Knowns vs. Unknowns

Let's cut through the buzzwords. In the DevOps community here in Oslo, we differentiate them like this:

  • Monitoring tracks the known unknowns. You know disk space can run out, so you set a threshold at 90%. You know the CPU can spike, so you alert at load average 4.0. It is a dashboard of preconceived failure modes (see the example alert rule below).
  • Observability allows you to ask questions about unknown unknowns. It is a property of the system that allows you to debug a problem you never predicted, purely by inspecting its outputs (logs, metrics, and traces).
Pro Tip: If you have to SSH into a server to run htop or tail -f /var/log/syslog to diagnose an issue, your system is not observable. You are flying blind with a flashlight.
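
To make that "known unknown" concrete, here is a minimal sketch of a Prometheus alerting rule for the 90% disk threshold. It assumes you run node_exporter 0.16 (which added the _bytes suffix to filesystem metrics) and that the rule file is wired into prometheus.yml under rule_files:

groups:
  - name: known-unknowns
    rules:
      - alert: DiskAlmostFull
        # Fires when less than 10% of a filesystem has been free for 5 minutes.
        # Metric names assume node_exporter 0.16; older versions drop the _bytes suffix.
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 90% on {{ $labels.instance }}"

Useful, but notice the limitation: you had to predict this failure in advance. Observability is what saves you when you did not.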

The Three Pillars in 2018

To achieve observability, we need to correlate three specific data sources. If you are hosting on CoolVDS, you have the raw I/O power to handle this ingestion, but you need to configure it right.

1. Metrics (The "What")

For this, Prometheus is the undisputed king right now, especially since version 2.0 (released late last year) drastically improved storage efficiency with its new time-series database. Unlike push-based setups (StatsD, or Zabbix active checks), Prometheus pulls: it scrapes an HTTP /metrics endpoint that your service exposes.

Here is a standard prometheus.yml scrape config for a Go application running on a local VDS instance:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'payment_service'
    static_configs:
      # 9090 is Prometheus' own port; point this at the port where your
      # Go service actually exposes metrics (8080 here is just an example).
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scheme: 'http'

2. Logging (The "Why")

Grepping text files is slow. You need structured logging. If you are running Nginx as a reverse proxy, the default log format is useless for programmatic analysis. You need to output JSON so your ELK (Elasticsearch, Logstash, Kibana) stack can parse it instantly.

Change your /etc/nginx/nginx.conf to include this:

http {
    # escape=json requires nginx 1.11.8 or newer
    log_format json_combined escape=json
      '{ "time_local": "$time_local", '
      '"remote_addr": "$remote_addr", '
      '"remote_user": "$remote_user", '
      '"request": "$request", '
      '"status": "$status", '
      '"body_bytes_sent": "$body_bytes_sent", '
      '"request_time": "$request_time", '
      '"http_referrer": "$http_referer", '
      '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access.json json_combined;
}

Now, when you ship this to Elasticsearch, you can visualize the request_time distribution. You might find that while your average latency is 50ms (green dashboard), your p99 latency is 4 seconds (angry customers). That is observability.
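
How does that JSON file actually reach Elasticsearch? One lightweight option is Filebeat with JSON decoding enabled, skipping Logstash entirely. The snippet below is a sketch for the Filebeat 6.x config format; the log path and the Elasticsearch address are assumptions you will need to adapt.

filebeat.prospectors:
  - type: log
    paths:
      - /var/log/nginx/access.json
    # Decode each line as JSON and lift the fields to the top level of the event
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["localhost:9200"]

From there, a Kibana visualization on the request_time field gives you the percentile view that the average was hiding.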

3. Tracing (The "Where")

This is the hardest part. Tools like Jaeger (now a CNCF-hosted project) or Zipkin are essential if you are breaking your monolith into microservices. They propagate a trace context (a correlation ID plus span IDs) in the headers of every request, letting you visualize the waterfall of time spent in each service.
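
You do not need a full tracing pipeline to start experimenting. As a sketch, the Jaeger all-in-one image bundles agent, collector and UI in a single container, and a service block like this could be dropped into the docker-compose file shown later (the version tag is only an example; pin whatever release is current):

  jaeger:
    image: jaegertracing/all-in-one:1.4
    ports:
      - "16686:16686"     # web UI for browsing traces
      - "6831:6831/udp"   # agent port your instrumented services send spans to

Point a Jaeger client library (they exist for Go, Java, Python and Node) at the agent port and the waterfall view appears in the UI.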

The Infrastructure Reality: NVMe or Bust

Here is the painful truth about running an Observability stack (ELK + Prometheus): It eats I/O for breakfast.

Elasticsearch is essentially a database that indexes every single word. If you try to run an ELK stack on a budget VPS with standard SATA SSDs (or heaven forbid, HDD), your iowait will spike, and your monitoring tool itself will crash. I have seen this happen during Black Friday sales—the logs from the traffic spike killed the logging server before the traffic killed the web server.

This is where infrastructure choice becomes architectural strategy. At CoolVDS, we standardized on NVMe storage for all instances. We didn't do this just for marketing; we did it because modern workloads like Elasticsearch 6.2 require high random read/write speeds that SATA interfaces physically cannot provide.

Metric                  | Standard VPS (SATA SSD) | CoolVDS (NVMe)
Random Read IOPS        | ~5,000                  | ~15,000+
ELK Re-indexing Time    | 45 minutes              | 8 minutes
Latency Spike Risk      | High (Noisy Neighbor)   | Low (Dedicated Lanes)

GDPR and Data Sovereignty in Norway

We are now four days into the GDPR era (May 25, 2018). If you are logging IP addresses and user agents (which the GDPR treats as personal data), sending that data to a SaaS monitoring platform hosted in the US is now a legal headache involving Privacy Shield verification and Data Processing Agreements.

The safest technical approach for Norwegian businesses is to keep the observability data within the EEA, preferably in Norway. Hosting your own Grafana and Prometheus instance on a VPS in Norway isn't just about latency (though shipping metrics to a collector sitting next to your app is about as fast as it gets); it's about compliance. You keep the logs on your encrypted disks, under your control.

Implementation: A Docker Compose Example

Ready to test this? Assuming you have Docker 18.03 and docker-compose installed, here is a quick docker-compose.yml snippet to get a Grafana and Prometheus stack running on your CoolVDS instance. This setup keeps your monitoring data close to your application logic.

version: '3'
services:
  prometheus:
    image: prom/prometheus:v2.2.1
    volumes:
      # Mount the scrape config from earlier and persist the TSDB across restarts
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:5.1.3
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    environment:
      # Change this before exposing port 3000 to the internet
      - GF_SECURITY_ADMIN_PASSWORD=secret_password

volumes:
  prometheus_data:

Deploy it with docker-compose up -d; on a fresh instance the whole stack is live in about 45 seconds.
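
One optional refinement: Grafana 5 can provision its data sources from a YAML file instead of you clicking through the UI. A minimal sketch, assuming you mount the file below into the grafana container under /etc/grafana/provisioning/datasources/:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # The compose service name resolves on the internal Docker network
    url: http://prometheus:9090
    isDefault: true

Restart the grafana container and the data source is already in place when you log in.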

Conclusion

Green checks on a dashboard are vanity metrics. Real engineering is about knowing exactly which SQL query caused the timeout. By shifting to Observability, you stop guessing and start fixing.

However, remember that observability generates massive amounts of write-heavy data. Don't let your logging stack become the bottleneck. Ensure your underlying infrastructure has the IOPS to handle the truth.

Ready to take off the blindfold? Spin up a high-performance NVMe instance on CoolVDS today and see what your application is really doing.