The Lie of 99.9%: Implementing brutally honest infrastructure monitoring at scale
Silence is not golden. In systems administration, silence is usually terrifying. It means your alert pipeline is broken, your pager is out of battery, or your monitoring agent has been OOM-killed by the very application it was supposed to watch.
I have spent the last decade debugging distributed systems across Europe. I’ve seen "green" dashboards on screens while the customer support lines were melting down because the checkout API was timing out every 45 seconds. The difference between "monitoring" and "observability" isn't just marketing semantics—it's the difference between knowing the server is on and knowing the server is working.
If you are deploying critical infrastructure in 2025, a simple ping check is negligence. You need granularity. You need to understand the behavior of the Linux kernel under load. And specifically for the Nordic market, you need to know exactly what your latency looks like through NIX (Norwegian Internet Exchange) versus transit providers.
Here is how we build monitoring stacks that don't lie, referencing the architecture we recommend on CoolVDS high-performance NVMe instances.
The Storage Bottleneck: Why TSDBs Die on Cheap VPS
Time Series Databases (TSDBs) like Prometheus or VictoriaMetrics are write-heavy. They ingest thousands of data points per second, compress them, and flush them to disk. On a standard HDD or a crowded SATA SSD VPS, your monitoring system will become the bottleneck.
I once debugged a Prometheus instance that had 5-minute gaps in its data. The CPU wasn't high. The RAM was fine. The issue was I/O wait: the budget provider's underlying storage was choking on the write-ahead log (WAL) flushes. If you can't trust the timestamp of your metric, you have nothing.
This is why we strictly provision NVMe storage on CoolVDS. When you are scraping 500 endpoints every 15 seconds, you need high IOPS and low latency.
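If you want to put a number on "high IOPS" before blaming anything else, run a short synthetic write test on the volume that will hold the TSDB. This is a minimal sketch with fio, assuming it is installed and that /var/lib/prometheus is where your data will live; adjust the path, and the small file size keeps it from trashing the disk:

# 4k random writes roughly mimic WAL traffic; --direct=1 bypasses the page cache
fio --name=tsdb-wal-test --filename=/var/lib/prometheus/fio-test \
    --rw=randwrite --bs=4k --size=512M --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=30 --time_based --group_reporting
rm /var/lib/prometheus/fio-test

On NVMe you should see write IOPS in the tens of thousands with sub-millisecond latencies; if the result is in the hundreds, the gaps in your graphs will be the disk's fault, not Prometheus's.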
Diagnosing I/O Wait
Before you blame your application, check if the disk is stealing your CPU cycles. Run this:
iostat -xz 1
Look at the %iowait value in the avg-cpu section. If it is consistently above 5-10% on your monitoring node, your storage is too slow for your ingestion rate. You can also check per-process I/O usage:
iotop -oPa
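To watch the same signal continuously instead of ad hoc, node_exporter already exposes it. A hedged PromQL expression you can paste into the Prometheus expression browser, or turn into an alert:

# Fraction of CPU time spent in iowait per instance, averaged over 5 minutes
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))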
The Stack: Prometheus, Grafana, and Node Exporter
We stick to the industry standard. It’s open-source, it’s portable, and it works. However, the default configuration of `node_exporter` is noisy. It collects metrics you will never use, bloating your TSDB and increasing disk usage.
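To see the noise for yourself, count what a stock node_exporter exposes and switch off the collectors you never chart (in the stack below, these flags belong under the node-exporter command: list). The collector names here are examples; verify them against `node_exporter --help` for your version:

# Count exposed samples on a default install (often well over a thousand lines)
curl -s http://localhost:9100/metrics | grep -vc '^#'

# Example: disable collectors you do not graph
node_exporter --no-collector.hwmon --no-collector.thermal_zone --no-collector.nfs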
Here is a deployment snippet for a production-ready monitoring stack using docker-compose. Note the memory limit we place on the Prometheus container, to prevent the observer from killing the observed.
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.54.1
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitor-net
    deploy:
      resources:
        limits:
          memory: 2G

  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitor-net

networks:
  monitor-net:
    driver: bridge

volumes:
  prometheus_data:
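A quick bring-up check, assuming the file above is saved as docker-compose.yml next to your prometheus.yml:

docker compose up -d

# Prometheus should report ready, and your scrape targets should show health "up"
curl -s http://localhost:9090/-/ready
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'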
Optimizing Scrape Configs
A common mistake is scraping everything every 10 seconds. That frequency is wasted on disk-usage metrics, which change slowly, but essential for catching short-lived CPU spikes. Split your scrape jobs.
Here is a refined prometheus.yml that differentiates between high-frequency and low-frequency targets:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'critical_services'
    scrape_interval: 10s
    static_configs:
      - targets: ['app-server-01:9100', 'db-primary:9100']
    relabel_configs:
      # Attach a 'site' label instead of overwriting 'instance',
      # which would make the series from the two targets collide.
      - source_labels: [__address__]
        regex: '.*'
        target_label: site
        replacement: 'norway-prod-01'

  - job_name: 'auxiliary_metrics'
    scrape_interval: 1m
    static_configs:
      - targets: ['backup-server:9100']
    metric_relabel_configs:
      # Drop Go runtime internals unless you are chasing a leak
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
Pro Tip: Use `metric_relabel_configs` to drop raw Go or Python runtime metrics (`go_*`, `python_*`) unless you are specifically debugging memory leaks. They account for 40% of stored series in default setups and provide zero value to the average sysadmin.
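To check how much of your own TSDB those runtime series occupy, a standard cardinality query in the expression browser will tell you; this is a sketch:

# Top 10 metric names by number of stored series
topk(10, count by (__name__) ({__name__=~".+"}))

If go_gc_duration_seconds and friends dominate that list, the drop rule above is paying for itself.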
Latency and The "Schrems II" Reality
In Norway and the wider EEA, where your data lives matters legally (GDPR/Schrems II), but where your monitoring lives matters technically. If your monitoring server is in Frankfurt and your infrastructure is in Oslo, you are adding 20-30ms of round-trip latency, plus jitter, to every "internal" check.
We host our infrastructure in Oslo. This allows for sub-millisecond latency monitoring for local services. When configuring your alerts, you must account for this. An alert for `probe_duration_seconds > 0.5` is reasonable within a local datacenter (LAN), but if you are monitoring across the internet, you will get paged at 3 AM for a routing blip.
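One way to encode that without 3 AM pages is to require the condition to hold for a few minutes before it fires. A sketch of a Prometheus alerting rule, assuming the probes come from a blackbox_exporter and that a 5-minute hold-off suits your SLO:

groups:
  - name: latency
    rules:
      - alert: ProbeLatencyHigh
        expr: probe_duration_seconds > 0.5
        for: 5m   # one slow probe is a routing blip; five minutes of it is a problem
        labels:
          severity: warning
        annotations:
          summary: "Probe latency above 500ms on {{ $labels.instance }}"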
To test the raw TCP connection latency without the overhead of HTTP, use nc or socat:
nc -zv 10.0.0.5 80
For a detailed breakdown of where the latency is introduced (DNS vs Connect vs Transfer), I use this formatted curl command:
curl -w "DNS: %{time_namelookup} Connect: %{time_connect} TTFB: %{time_starttransfer} Total: %{time_total}\n" -o /dev/null -s https://coolvds.com
Alerting: Signal vs. Noise
The fastest way to burn out a DevOps engineer is "Alert Fatigue." If I get an email saying "CPU is high" and I can't do anything about it, I will create an Outlook rule to delete that email. Alerts must be actionable.
We use Alertmanager to group notifications. Do not send an email for every failed container. Group them by cluster or environment.
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-ops'

receivers:
  - name: 'slack-ops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#ops-alerts'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
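Before trusting that routing tree with production pages, let amtool (shipped alongside Alertmanager) validate it and show where a given alert would land. The label values below are made up for illustration:

# Validate the configuration
amtool check-config alertmanager.yml

# Dry-run the routing tree for a hypothetical alert
amtool config routes test --config.file=alertmanager.yml alertname=HighCPU cluster=oslo-prod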
The Kernel Never Lies
Finally, standard metrics often miss the nuance of virtualization. If you are on a VPS, you are sharing a physical CPU. The metric node_cpu_seconds_total{mode="steal"} is your best friend. It measures the time your VM wanted to run, but the hypervisor said "wait."
On CoolVDS, we maintain strict KVM isolation policies to keep steal time at near zero. On oversold platforms, I have seen steal time hit 20%, causing applications to stutter without reporting high CPU usage inside the guest.
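You can automate that check instead of waiting for the stutter. A sketch of a PromQL expression that flags sustained steal; the 5% threshold is my assumption, not a universal rule:

# Fraction of CPU time stolen by the hypervisor, averaged over 10 minutes
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[10m])) > 0.05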
Check it manually if your system feels sluggish:
top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* st.*/\1/"
If that number is consistently above zero, your provider is the problem, not your code.
Conclusion
Building a monitoring stack is about building trust in your own systems. You need fast storage (NVMe), data sovereignty (local Norwegian data centers), and a configuration that respects your attention span. Don't settle for opaque "managed monitoring" that hides the details.
If you are ready to build a stack that can handle thousands of write operations per second without sweating, deploy a CoolVDS NVMe instance today and see what actual zero-steal performance looks like.