Stop Guessing: A Battle-Hardened Guide to Infrastructure Monitoring at Scale

I have a rule in my engineering teams: If it is not monitored, it does not exist in production.

We have all been there. It is 03:14. PagerDuty is screaming. The CEO is asking why the checkout page is throwing 502 Bad Gateway errors. You are grepping through /var/log/nginx/error.log hoping for a miracle, but the logs are silent because the disk filled up an hour ago and took the logging daemon down with it.

Observability is not about buying Datadog and burning $5,000 a month on custom metrics. It is about understanding the granular behavior of your kernel, your I/O subsystems, and your network stack before the user notices a slowdown. In the Norwegian hosting market, where latency to NIX (Norwegian Internet Exchange) is measured in single-digit milliseconds, sloppy monitoring is a competitive disadvantage.

This is how we build a monitoring stack that actually works, using the tools available to us in mid-2023.

The Architecture: Pull vs. Push

For most infrastructure setups running on VPS or bare metal, the Prometheus (Pull) model beats the Push model. Why? Because you want your monitoring system to know when a target is down. If a server crashes hard, it stops pushing metrics. Silence is ambiguous. In a pull model, if Prometheus can't scrape the target, you get an immediate up == 0 state.
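
A minimal sketch of what that buys you (the rule name and thresholds here are illustrative, not from any canonical config): a single alerting rule on the built-in up metric turns that silence into a page.

groups:
  - name: availability
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Scrape target {{ $labels.instance }} in job {{ $labels.job }} is unreachable"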

We are going to deploy a standard stack:

  • Node Exporter: For kernel-level metrics.
  • Prometheus: Time-series database.
  • Grafana: Visualization.
  • Alertmanager: Routing notifications.

Pro Tip: Never run your monitoring stack on the same infrastructure you are monitoring. If your primary cluster in Oslo goes dark, your monitoring needs to be alive to tell you why. We typically deploy a dedicated CoolVDS instance in a secondary zone or region to watch the primary fleet.

Step 1: The "Meat" - Configuring Node Exporter Correctly

Most tutorials tell you to just run the Docker container and forget it. That is amateur hour. The default Node Exporter settings expose too much garbage and miss the critical high-resolution data we need for debugging I/O stalls.

Do not use the default collectors for a high-load production server. Specifically, the wifi, zfs (unless you use it), and btrfs collectors can be expensive.

Here is a production-ready systemd unit file for node_exporter (v1.5.0) running on an Ubuntu 22.04 LTS instance:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.disable-defaults \
    --collector.cpu \
    --collector.meminfo \
    --collector.filesystem \
    --collector.netdev \
    --collector.diskstats \
    --collector.loadavg \
    --collector.time \
    --collector.uname \
    --collector.filefd \
    --web.listen-address=:9100

[Install]
WantedBy=multi-user.target

Notice --collector.diskstats. This is non-negotiable. On shared clouds, you often see "CPU Steal" spiking. On a premium platform like CoolVDS, we utilize KVM virtualization which provides stricter isolation, but you still need to watch your I/O Wait times. If node_disk_io_time_seconds_total spikes while CPU usage is low, your application is bottlenecked by storage, not code.
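
Two queries make that distinction visible on one Grafana panel (a sketch; adjust the device label, e.g. vda or nvme0n1, to your environment):

# Approximate device utilization: fraction of each second the disk spent busy (0-1)
rate(node_disk_io_time_seconds_total{device="vda"}[5m])

# CPU time stuck waiting on I/O, averaged per instance
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))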

Step 2: Prometheus Configuration & Service Discovery

Static configs are fine for 5 servers. For 50, you need dynamic discovery. However, for the sake of this guide, let's look at the scrape config. The most important parameter here is scrape_interval.

If you set this to 1 minute, you are blind. Micro-bursts of traffic often last 10-20 seconds. Set your critical infrastructure scrape interval to 15 seconds maximum.

global:
  scrape_interval: 15s 
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-nodes'
    static_configs:
      - targets: ['10.10.0.5:9100', '10.10.0.6:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: '10\.10\.0\.5:9100'
        target_label: 'instance_name'
        replacement: 'db-master-oslo'

By relabeling, you make your alerts readable. "Instance 10.10.0.5 is down" means nothing to me at 3 AM. "db-master-oslo is down" triggers an immediate adrenaline response.
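
And when the fleet outgrows static_configs, file-based service discovery is the lowest-friction upgrade. A sketch (the target path and label values are illustrative); Prometheus picks up changes to these files without a restart:

  - job_name: 'coolvds-nodes-dynamic'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 1m

/etc/prometheus/targets/web.json:

[
  { "targets": ["10.10.0.7:9100"], "labels": { "instance_name": "web-01-oslo" } }
]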

Step 3: Visualizing the "Noisy Neighbor" Effect

One of the reasons we migrated our core workloads to CoolVDS was the consistency of the NVMe storage. In previous years, hosting providers oversold their storage arrays. You would write a file, and the latency would jump from 1ms to 200ms because someone else on the host node was rebuilding a RAID array.

To detect this, use this PromQL query in Grafana to measure disk read latency:

rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m])

If this value consistently exceeds 0.01 (10ms) on an NVMe drive, you have a problem. On CoolVDS instances, we consistently measure this below 2ms, even during peak operational hours in the Oslo data center.
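
The write-side equivalent belongs on the same panel, since flushes and fsyncs are usually what a database feels first (same diskstats metric family, a sketch):

rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m])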

Comparison: What your monitoring sees

Metric                 | Budget VPS (Oversold)   | CoolVDS (KVM/NVMe)
CPU Steal              | > 5% (High Variance)    | ~ 0% (Near Metal)
Disk Latency (Write)   | Spikes to 50ms+         | Steady < 2ms
Network Jitter         | Unpredictable           | Low (Direct peering)

Step 4: Alerting Logic (Avoiding Fatigue)

The fastest way to destroy a DevOps culture is to page people for things they don't need to fix immediately. Disk usage is the classic example. Alerting at "90% full" is stupid. If I have a 1TB drive, 10% free is 100GB. That could last a year.

Instead, alert on the rate of change (Burn Rate). We want to know: "Based on current writing speed, will the disk fill up in the next 4 hours?"

- alert: DiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk is filling up fast on {{ $labels.instance }}"

This predict_linear function is powerful. It fits a simple linear regression to the last hour of samples and projects the trend four hours ahead. This is the difference between proactive engineering and reactive panic.
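
The last mile is routing. A minimal Alertmanager sketch (receiver names and webhook URLs are placeholders, not a real integration) that pages only on severity: critical and groups related alerts so one dying host does not fire fifty notifications:

route:
  receiver: 'team-notifications'
  group_by: ['alertname', 'instance_name']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: 'oncall-pager'

receivers:
  - name: 'team-notifications'
    webhook_configs:
      - url: 'https://chat.example.com/hooks/placeholder'
  - name: 'oncall-pager'
    webhook_configs:
      - url: 'https://pager.example.com/hooks/placeholder'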

GDPR and Data Residency in 2023

We cannot talk about infrastructure in Europe without addressing the elephant in the room: Datatilsynet and GDPR. Post-Schrems II, sending monitoring data containing PII (IP addresses in logs, user IDs in traces) to US-based SaaS clouds is legally risky.

Hosting your own Prometheus stack on CoolVDS servers located in Norway or the EEA solves this. Your metrics stay within the jurisdiction. You own the data. There is no third-party processor agreement needed for your own internal monitoring servers.

Deployment Automation

Finally, do not deploy this manually. Here is a snippet of an Ansible playbook to deploy the node exporter to your fleet. This assumes you are using an inventory file.

---
- hosts: all
  become: true
  tasks:
    # The systemd unit above runs as User/Group node_exporter, so create it first
    - name: Create node_exporter system user
      user:
        name: node_exporter
        system: yes
        shell: /usr/sbin/nologin
        create_home: no

    - name: Download Node Exporter
      get_url:
        url: "https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz"
        dest: "/tmp/node_exporter.tar.gz"

    - name: Extract Node Exporter
      unarchive:
        src: "/tmp/node_exporter.tar.gz"
        dest: "/usr/local/bin/"
        remote_src: yes
        extra_opts: ["--strip-components=1"]

    - name: Copy Systemd Service
      copy:
        src: node_exporter.service
        dest: /etc/systemd/system/node_exporter.service
      notify: restart_node_exporter

  handlers:
    - name: restart_node_exporter
      systemd:
        name: node_exporter
        state: restarted
        enabled: yes
        daemon_reload: yes
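
Assuming the playbook and the node_exporter.service unit file sit next to your inventory (filenames here are illustrative), a rollout is one command:

ansible-playbook -i inventory.ini node_exporter.yml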

Conclusion

Building a monitoring system is an investment in your sleep schedule. By utilizing the Prometheus stack on reliable infrastructure, you gain visibility that goes beyond simple "up/down" checks. You see the texture of your traffic and the health of your hardware.

But software is only half the equation. You cannot tune a server that is suffering from physical resource contention. That is why we recommend testing your stack on CoolVDS. The KVM isolation and NVMe backing give you a baseline of performance that makes monitoring boring—which is exactly how it should be.

Ready to see what low latency actually looks like? Spin up a CoolVDS instance in Oslo today and start graphing your way to stability.