Stop Guessing, Start Measuring: A Battle-Hardened Guide to Self-Hosted APM in 2020

You Can't Optimize What You Can't See

It’s 3:00 AM on a Tuesday. PagerDuty is screaming. The Norwegian e-commerce site you manage is timing out, and the marketing team just launched a campaign targeting Oslo commuters. The database is locked up. The developers say, "It works on my MacBook." You are looking at top and seeing nothing but chaos. If this sounds familiar, your observability strategy is broken.

In the high-stakes world of systems administration, relying on gut feelings or sporadic log checks is negligence. We need hard data. In 2020, with the Schrems II ruling effectively nuking the Privacy Shield, sending your performance metrics (which often contain IP addresses or user metadata) to US-based SaaS platforms is a legal minefield. The solution? Build your own stack. Keep it local. Keep it fast.

Here is how to deploy a production-grade Application Performance Monitoring (APM) stack using Prometheus and Grafana on a clean Linux environment, and why the underlying hardware—specifically the NVMe storage on your VPS—matters more than your code.

The Architecture: Pull, Don't Push

We are going to use the industry standard for 2020: Prometheus for scraping metrics and Grafana for visualization. Unlike legacy push-based systems (like Graphite), Prometheus pulls data from your applications. That decoupling means a monitoring outage can never take your application down with it: if the Prometheus server dies, the app just keeps humming, unaware it's not being watched.

Step 1: The Foundation (Node Exporter)

First, we need to extract metrics from the kernel. We use node_exporter. Never run exporters as root; we'll create a dedicated, shell-less system user for it and run it as a systemd service on an Ubuntu 20.04 LTS instance.

useradd --no-create-home --shell /bin/false node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar xvf node_exporter-1.0.1.linux-amd64.tar.gz
cp node_exporter-1.0.1.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

Now, create the service definition. Precision is key here.

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Reload the daemon and start the service. You know the drill: systemctl daemon-reload && systemctl enable --now node_exporter. Verify it's actually exposing metrics with curl -s http://localhost:9100/metrics | head.

The Silent Killer: I/O Wait and CPU Steal

Before we configure the scraper, we need to talk about why your dashboards might lie to you. In a virtualized environment, you are fighting for resources.

Pro Tip: If you see high %st (steal time) in top, your hosting provider is overselling their CPU cores. You can tune your SQL queries all day, but if the hypervisor isn't giving you cycles, you will lag. This is why CoolVDS uses KVM (Kernel-based Virtual Machine) with strict resource guarantees. We don't play the "burst" game with your production apps.
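You don't need a full dashboard to spot this. Here is a quick sketch that samples /proc/stat directly (field 9 of the aggregate cpu line is cumulative steal time in jiffies; assumes a Linux guest with a reasonably modern kernel):

```shell
# Sample the aggregate CPU counters twice, one second apart.
# Field 9 of the "cpu" line in /proc/stat is cumulative steal time.
read_steal() { awk '/^cpu /{print $9}' /proc/stat; }
read_total() { awk '/^cpu /{t=0; for (i=2; i<=NF; i++) t+=$i; print t}' /proc/stat; }

s1=$(read_steal); t1=$(read_total)
sleep 1
s2=$(read_steal); t2=$(read_total)

# Rough percentage of this interval the hypervisor withheld from us.
# (+1 in the divisor guards against a zero-length interval.)
steal_pct=$(( 100 * (s2 - s1) / (t2 - t1 + 1) ))
echo "CPU steal over last second: ${steal_pct}%"
```

Anything consistently above a few percent is a conversation with your hosting provider, not a tuning exercise.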

Furthermore, standard SSDs are no longer enough for database-heavy workloads. If you are running MySQL 8.0 or PostgreSQL 12, the bottleneck is almost always disk latency. NVMe storage isn't a luxury in 2020; it's a requirement. On CoolVDS instances, we expose NVMe directly to the KVM guest, dropping I/O latency from milliseconds to microseconds.
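fio is the right tool for a proper benchmark, but for a quick sanity check you can probe synchronous write latency with nothing but dd (the path and counts here are arbitrary; point it at the filesystem your database actually lives on, since /tmp may be tmpfs on some systems):

```shell
# 256 synchronous 4 KiB writes; oflag=dsync forces each write to reach
# stable storage before the next begins, so total time ~ 256 x per-write latency.
probe=/tmp/latency_probe
result=$(dd if=/dev/zero of="$probe" bs=4k count=256 oflag=dsync 2>&1 | tail -n 1)
rm -f "$probe"
echo "$result"
```

On local NVMe this finishes almost instantly; on a contended SATA SSD reached over a crowded storage network, the same loop can take seconds.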

Step 2: Configuring Prometheus

Install Prometheus on a dedicated management node (or a separate CoolVDS instance to keep it isolated). Don't monitor your server from the same server—that's like a doctor trying to perform surgery on themselves.

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node_exporter_clients'
    static_configs:
      - targets: ['10.0.0.2:9100', '10.0.0.3:9100'] # Internal IPs for security!

Notice I used internal IPs? If you are hosting in Norway, keep traffic on the private network. It’s faster, unmetered, and doesn't expose your metrics port (9100) to the public internet.

Step 3: Visualizing with Grafana 7

Grafana 7.0 (released earlier this year) brought massive improvements to the UI. Connect it to your Prometheus data source. But don't just stare at CPU graphs. They are vanity metrics.

The "USE" Method

Brendan Gregg’s USE method is your bible. For every resource (CPU, Disk, RAM), check:

  • Utilization: How busy is it? (e.g., 90% CPU utilization).
  • Saturation: Is work queuing up? (e.g., Load Average > CPU count).
  • Errors: Are hardware errors occurring?
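Mapped onto node_exporter metrics, a starting set of PromQL expressions for the three axes might look like this (thresholds are illustrative; tune them for your fleet):

```promql
# Utilization: percentage of CPU time not spent idle, per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Saturation: 1-minute load average exceeding the core count
node_load1 > count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: NIC receive errors creeping up
rate(node_network_receive_errs_total[5m]) > 0
```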

Here is a specific PromQL query to find disks that are getting hammered (Saturation):

rate(node_disk_io_time_seconds_total[1m])

If this value approaches 1.0 (100%), your disk is the bottleneck. On legacy VPS providers using spinning rust or SATA SSDs over crowded networks, you will see this spike during backups or heavy traffic. On CoolVDS NVMe infrastructure, this line stays flat. The NVMe protocol simply allows far deeper command queues than AHCI (up to 64K queues of 64K commands each, versus a single queue of 32).
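To make that query actionable rather than decorative, wire it into a Prometheus alerting rule (the file path and thresholds below are examples; this assumes you have Alertmanager configured):

```yaml
# /etc/prometheus/rules/disk_saturation.yml (example path)
groups:
  - name: disk_alerts
    rules:
      - alert: DiskSaturated
        expr: rate(node_disk_io_time_seconds_total[1m]) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.device }} on {{ $labels.instance }} is >90% busy"
```

Reference the file from prometheus.yml under rule_files: and reload Prometheus to activate it.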

Application-Level Metrics: Nginx

System metrics are fine, but is Nginx actually serving requests? Enable the stub_status module. In 2020, this is the lightest way to get request data without parsing heavy access logs.

# Inside /etc/nginx/conf.d/status.conf
server {
    listen 127.0.0.1:8080;
    location /stub_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}

Then run the nginx-prometheus-exporter sidecar to translate this into Prometheus metrics. One caveat: stub_status only exposes connection and request counters, not status codes, so you can alert instantly on throughput collapses and connection saturation, but per-status (5xx) alerting requires log parsing or the commercial NGINX Plus API.
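A minimal systemd unit for that sidecar might look like the following (the binary path and reuse of the node_exporter user are assumptions, mirroring the setup from Step 1):

```ini
# /etc/systemd/system/nginx_exporter.service
[Unit]
Description=Nginx Prometheus Exporter
Wants=network-online.target
After=network-online.target nginx.service

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/nginx-prometheus-exporter \
    -nginx.scrape-uri http://127.0.0.1:8080/stub_status

[Install]
WantedBy=multi-user.target
```

The exporter listens on port 9113 by default; scrape it from Prometheus over the private network, just like node_exporter.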

Data Sovereignty and Latency

Why go through all this trouble instead of installing a New Relic agent? One word: Datatilsynet (the Norwegian Data Protection Authority). With the current legal climate around data transfers to the US, hosting your monitoring stack on a VPS in Norway keeps you on the right side of GDPR. Your metrics never leave the NIX (Norwegian Internet Exchange) infrastructure.

Plus, the speed of light is constant. If your users are in Oslo and your monitoring server is in Virginia, you are reacting to incidents 100ms slower than reality. By hosting on CoolVDS, you get single-digit millisecond latency to the major Nordic ISPs.

Conclusion

Performance isn't magic; it's engineering. It requires visibility, fast I/O, and isolation from noisy neighbors. By building this stack, you take control of your infrastructure's destiny.

Don't let a slow disk or a stolen CPU cycle be the reason your checkout page hangs. Deploy a CoolVDS instance today—provisioned in under 60 seconds—and see what your application is actually doing.