Observability is Not Just "More Monitoring"
It’s 3:00 AM. Your phone buzzes. PagerDuty is screaming. You open your Grafana dashboard. All the lights are green. CPU is at 40%, RAM is steady, disk I/O is nominal. Yet your support inbox is flooding with messages from Norwegian users who can't complete a purchase.
This is the failure of Monitoring. Monitoring answers the question: "Is the system healthy?" based on thresholds you defined three years ago. It handles "known unknowns."
Observability, on the other hand, answers: "Why is the system behaving this way?" regardless of what you predicted. It handles the "unknown unknowns." In the high-stakes environment of Nordic e-commerce and SaaS, relying solely on basic metrics is negligence.
The Anatomy of a Lie: When HTTP 200 is a Failure
I recall a specific incident while deploying a microservices architecture for a fintech client in Oslo. We had strict SLAs. Our Nginx monitoring reported 100% uptime and sub-100ms response times. But the application logic was silently failing due to a race condition in the database layer that only triggered under specific high-concurrency writes.
Monitoring saw HTTP 200 OK because the API gateway successfully returned a generic "Please try again" JSON payload. The infrastructure was fine. The business was bleeding money.
We only caught it because we had implemented distributed tracing via OpenTelemetry. We saw a span duration spike in the payment-service that didn't correlate with CPU load. It was a thread lock.
The "LGTM" Stack: A 2024 Standard
While the ELK stack (Elasticsearch, Logstash, Kibana) was the king of the 2010s, it is heavy, resource-hungry, and expensive to scale. In 2024, the pragmatic choice for serious DevOps teams is the LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus). It decouples storage from compute effectively.
Here is how you actually set this up. Don't just install packages; configure them for high cardinality.
1. The Collector Configuration
You need an OpenTelemetry Collector to sit between your apps and your backend. This allows you to sanitize data (critical for GDPR compliance) before it hits the disk.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  # Scrubbing PII for GDPR compliance before storage
  attributes/gdpr:
    actions:
      - key: user.email
        action: hash

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes/gdpr]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Note the `attributes/gdpr` processor. If you are logging raw user data in Norway without anonymization, Datatilsynet (The Norwegian Data Protection Authority) will eventually have a very expensive chat with you.
Infrastructure Matters: The I/O Bottleneck
Here is the trade-off nobody talks about: Observability generates massive amounts of write-heavy data.
If you enable full tracing on a high-traffic application, you are writing gigabytes of logs and traces per hour. On a budget VPS with standard SSDs (or worse, spinning rust), your `iowait` will skyrocket. The observability tool itself becomes the cause of your outage. This is the Heisenberg Uncertainty Principle of DevOps: measuring the system crashes the system.
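Before you throw hardware at the problem, cut the volume at the source by sampling traces in the collector. Here is a minimal sketch that extends the collector config above, assuming you run the OpenTelemetry Collector Contrib distribution (which ships the `probabilistic_sampler` processor); the 15% rate is an arbitrary placeholder, not a recommendation:

```yaml
processors:
  # Keep roughly 15% of traces and drop the rest before they ever hit disk.
  probabilistic_sampler:
    sampling_percentage: 15

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch, attributes/gdpr]
      exporters: [otlp]
```

Head-based sampling like this is crude (tail-based sampling keeps the interesting slow or failing traces), but it is the cheapest way to stop tracing from eating your I/O budget.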
Pro Tip: Never run your observability stack on the same disk controller as your database. If you can't separate the hardware, ensure you have high-throughput NVMe storage. This is why we standardized on NVMe for all CoolVDS instances—monitoring shouldn't kill your production.
Self-Hosting vs. SaaS (Schrems II & Cost)
In 2024, sending your telemetry data to Datadog or New Relic comes with two problems. First, the cost scales linearly with traffic. Second, data residency: under Schrems II, shipping log data containing IP addresses or user identifiers to US-controlled clouds is legally risky for European companies.
Self-hosting Grafana and Loki in Norway gives you two advantages:
- Legal Safety: Data stays within the jurisdiction.
- Latency: Sending traces to a US endpoint adds 100ms+ overhead to the request loop if you are using synchronous blocking calls (don't do that, but legacy apps happen). Sending it to a local instance in Oslo takes <2ms.
Deploying Loki with Docker Compose
Here is a battle-tested snippet for getting Loki and Promtail up. Pair it with sane retention limits in loki-config.yaml (excerpt further below) to prevent filling your disk:
```yaml
version: "3"

services:
  loki:
    image: grafana/loki:2.9.0
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    ports:
      - "3100:3100"
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yaml
    command: -config.file=/etc/promtail/config.yaml
    restart: unless-stopped

volumes:
  loki-data:
```
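The retention itself lives in loki-config.yaml, not in the Compose file. This is a minimal, retention-only sketch assuming the default single-node filesystem store; 14 days is an arbitrary starting point, and you should verify the keys against the Loki 2.9 configuration reference:

```yaml
# loki-config.yaml (retention-related settings only)
limits_config:
  retention_period: 336h          # keep 14 days of logs

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem        # matches the single-node filesystem setup
  retention_enabled: true         # actually delete chunks older than retention_period
  retention_delete_delay: 2h      # grace period before deletion
```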
Comparison: Traditional vs. Observable
| Feature | Traditional Monitoring | Modern Observability |
|---|---|---|
| Core Question | Is it working? | Why is it broken? |
| Data Source | Aggregates (Averages) | High-cardinality Events |
| Granularity | Server / Host | Request / User ID |
| Infrastructure Needs | Low (SNMP, Ping) | High (NVMe, RAM) |
Implementation Strategy
Don't try to boil the ocean. Start by instrumenting your most critical API endpoints. Use the "RED" method (sketched as Prometheus recording rules after this list):
- Rate (Requests per second)
- Errors (The number of those requests that are failing)
- Duration (The amount of time those requests take)
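To make that concrete, here is a sketch of the three RED signals expressed as Prometheus recording rules. The metric names (http_server_requests_total, http_server_request_duration_seconds_bucket) and the service label are placeholders; substitute whatever your instrumentation actually emits:

```yaml
groups:
  - name: red-method
    interval: 30s
    rules:
      # Rate: requests per second, per service
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_server_requests_total[5m]))
      # Errors: share of requests returning 5xx
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_server_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_server_requests_total[5m]))
      # Duration: 95th percentile latency
      - record: service:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (service, le) (rate(http_server_request_duration_seconds_bucket[5m])))
```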
Once you have metrics, add Tracing to the slow endpoints. Finally, correlate Logs to those traces using TraceIDs.
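Wiring logs to traces is mostly a Grafana provisioning exercise. Here is a sketch of a Loki datasource definition that turns a trace_id=<id> token in your log lines into a clickable link to Tempo; the regex, the tempo datasource uid, and the log format are assumptions you will need to adapt:

```yaml
# grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          # Assumes log lines contain "trace_id=<hex id>"
          matcherRegex: "trace_id=(\\w+)"
          url: "$${__value.raw}"    # "$$" escapes env-var interpolation in provisioning files
          datasourceUid: tempo      # must match the uid of your Tempo datasource
```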
The Verdict
Observability is an investment in your sleep schedule. It allows you to debug production without logging into the server. But it demands respect for the underlying hardware. You cannot run a heavy Grafana/Loki stack on oversold, noisy-neighbor hosting environments.
If you are building for the Nordic market, you need the low latency of local peering and the raw I/O throughput to handle ingestion spikes without choking your actual application. We built CoolVDS to handle exactly these kinds of workloads—where performance guarantees aren't just marketing copy, but a technical necessity.
Ready to stop guessing? Deploy your own observability stack on a CoolVDS NVMe instance today. Spin it up in under 60 seconds and see what your application is really doing.