The Silence Before the 504 Gateway Time-out
There is a specific kind of dread that hits a System Administrator at 03:14 AM. It isn’t the alert itself. It’s the silence that follows when you try to SSH into the server and the cursor just blinks. The load average didn't just spike; the machine locked up so hard that your remote metrics agent couldn't even transmit the final death rattle.
If you are relying on external, SaaS-based monitoring solutions hosted in US-EAST-1 while your users are in Oslo, you are fighting a losing battle against physics and compliance. Latency matters. Data residency matters.
In 2024, the standard for observability isn't just "is it up?" It is about understanding why it is slow. We are going to build a production-grade Application Performance Monitoring (APM) stack using OpenTelemetry, Prometheus, and Grafana. We will host this strictly within Norwegian borders to satisfy Datatilsynet requirements and ensure millisecond-level granularity.
The Compliance Trap: Schrems II and Your Data
Before we touch the config files, let’s address the elephant in the server room. If your APM tool ingests user IPs, request headers, or database queries, you are processing PII (Personally Identifiable Information). Sending this data to a US-owned cloud provider subjects it to the CLOUD Act, which creates a headache under the GDPR (see the CJEU's Schrems II ruling).
The pragmatic solution? Self-host your observability pipeline.
By keeping your metrics and logs on a Norwegian VPS, you eliminate the cross-border data transfer risk. However, self-hosting APM is resource-intensive. Prometheus is essentially a time-series database that devours disk I/O during compaction cycles. If you run this on a budget VPS with shared HDD storage or CPU stealing, your monitoring stack will crash exactly when your application load spikes. This is a classic "noisy neighbor" problem.
Architectural Note: At CoolVDS, we specifically configure our KVM instances with direct NVMe pass-through and dedicated CPU time to prevent "monitoring lag." You cannot debug a high-load event if your debugger is suffering from I/O wait.
The Stack: OpenTelemetry (OTel) is the New Standard
Gone are the days of proprietary agents for every language. By mid-2024, OpenTelemetry has become the de facto standard for collecting traces, metrics, and logs.
We will set up:
- OTel Collector: To receive data from your app.
- Prometheus: To store metrics.
- Grafana: To visualize the chaos.
Step 1: Infrastructure Preparation
Start with a clean instance running Ubuntu 24.04 LTS (Noble Numbat). Ensure you have at least 4 GB of RAM; the Collector's batching buffers and Prometheus' in-memory head blocks are hungry.
# Update and install dependencies
# (docker-compose-v2 is Ubuntu's package for the "docker compose" plugin)
sudo apt-get update && sudo apt-get install -y docker.io docker-compose-v2

# Tune network buffers for high-ingestion workloads
sudo sysctl -w net.core.rmem_max=26214400
sudo sysctl -w net.core.wmem_max=26214400
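Note that sysctl -w only lasts until the next reboot. A minimal sketch to make the buffer sizes permanent (the drop-in file name here is arbitrary):

# Persist the buffer sizes across reboots
sudo tee /etc/sysctl.d/99-otel-ingest.conf > /dev/null <<'EOF'
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
EOF
sudo sysctl --system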
Step 2: The Collector Configuration
The OpenTelemetry Collector sits between your application and your backend (Prometheus). It allows you to filter, batch, and scrub sensitive data before storage. Create a file named otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "coolvds_app"
    send_timestamps: true
    metric_expiration: 180m

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
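Before wiring this into Compose, sanity-check the file. A quick sketch using the collector's validate subcommand, which recent collector builds ship (treat its availability in your exact version as an assumption):

# Dry-run the config without starting the pipeline
docker run --rm \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector:0.100.0 \
  validate --config=/etc/otel-collector-config.yaml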
Step 3: Deploying via Docker Compose
We use Docker for portability, but in a high-throughput environment, you might run the binaries directly on the host to avoid the Docker networking overhead. For this guide, we prioritize ease of deployment.
version: "3.9"

services:
  # The Collector
  otel-collector:
    image: otel/opentelemetry-collector:0.100.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus exporter

  # The Storage
  prometheus:
    image: prom/prometheus:v2.51.2
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"

  # The Visualization
  grafana:
    image: grafana/grafana:10.4.2
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=CoolVDS_Secure_Pass!
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  prometheus_data:
  grafana_data:

You also need a basic prometheus.yml to scrape the collector:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
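With all three files in the same directory, bring the stack up and confirm that metrics are actually flowing. A minimal smoke test (the ports are the ones mapped in the Compose file above):

# Start everything in the background and check container health
docker compose up -d
docker compose ps

# The collector's Prometheus exporter should answer on 8889
curl -s http://localhost:8889/metrics | head

# Prometheus should report the otel-collector target as healthy
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'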
Instrumentation: Don't Just Guess, Measure
Infrastructure metrics (CPU, RAM) are useful, but they don't tell you business health. You need to instrument your code. If you are running a Python application (common for backend APIs), use the OTel SDK to auto-instrument without changing code.
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Pull in instrumentation libraries for the frameworks your app actually uses
opentelemetry-bootstrap -a install

export OTEL_SERVICE_NAME="checkout-service"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
opentelemetry-instrument python main.py

This captures HTTP latency, database query times, and exceptions automatically. When a user reports "the site is slow," you can trace that specific request ID to a slow SQL `JOIN` operation.
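From Grafana, or straight from the Prometheus API, you can then pull out p95 request latency per route. A sketch of such a query; the metric name below is an assumption (it depends on your SDK's semantic-convention version and on the coolvds_app namespace set in the exporter), so check what your collector actually exports:

# p95 HTTP latency per route over the last 5 minutes (metric name is an assumption)
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(coolvds_app_http_server_duration_milliseconds_bucket[5m])) by (le, http_route))'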
The Storage Bottleneck: Why Hardware Matters
Here is where many DevOps engineers fail. They deploy this stack on a cheap VPS with network-attached storage (NAS) or standard SSDs with low IOPS limits.
Prometheus writes data to disk in blocks. As your retention grows (e.g., keeping data for 30 days to spot trends), the background "compaction" process merges smaller blocks into larger ones, and it is extremely I/O intensive.
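You can watch what compaction costs on your own instance, because Prometheus exposes its TSDB internals as metrics. A quick check against the stack from Step 3 (assumes Prometheus on localhost:9090):

# Average compaction duration over the last hour; a climbing value points at disk pressure
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_compaction_duration_seconds_sum[1h]) / rate(prometheus_tsdb_compaction_duration_seconds_count[1h])'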
| Resource | Shared Hosting / Budget VPS | CoolVDS Architecture |
|---|---|---|
| Disk I/O | Throttled, noisy neighbors cause write delays. | High-Performance NVMe. Direct throughput. |
| CPU Steal | High. Compaction jobs get paused. | Dedicated KVM resources. |
| Network | Often routed via central Europe (latency). | Optimized peering in Oslo (NIX). |
If your disk latency spikes during compaction, Prometheus can stall: ingestion backs up, queries time out, and you are effectively blind during the maintenance window. We designed CoolVDS instances with high-frequency NVMe specifically to handle the write-heavy patterns of time-series databases.
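Don't take any provider's word for it, ours included: benchmark the disk before trusting it with a TSDB. A rough fio sketch that mimics Prometheus' small random writes (it creates a 1 GB scratch file in the current directory, then removes it):

sudo apt-get install -y fio
# 4k random writes for 60 seconds with a final fsync; watch IOPS and the clat percentiles
fio --name=tsdb-write-test --filename=./fio-testfile --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based --end_fsync=1
rm -f ./fio-testfile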
Final Configuration for Production
To ensure your stack survives a reboot, add restart: unless-stopped to each service in the Compose file (or write a systemd unit if you run the binaries directly), keep container logs from eating the disk, and, most importantly, configure your firewall.
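For the log side, the simplest lever is Docker's own logging driver. A minimal sketch, assuming the default json-file driver and no existing /etc/docker/daemon.json you would need to merge with:

# Cap container logs at 3 x 50 MB per container
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "3" }
}
EOF
sudo systemctl restart docker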
Security Warning: Do not expose ports 9090 (Prometheus) or 4317 (OTLP) to the public internet unless absolutely necessary. Put them behind a reverse proxy like Nginx or a WireGuard VPN.
# Simple UFW setup: admin access from a single Norway-based IP only
sudo ufw default deny incoming
sudo ufw allow from 192.168.1.50 to any port 22   # Replace with your actual admin IP
sudo ufw allow from 10.0.0.0/8 to any port 9090   # Internal network
sudo ufw enable
Conclusion
Observability is not a luxury; it is the insurance policy for your infrastructure. By building a self-hosted stack on robust Norwegian infrastructure, you gain three things: compliance with local data laws, elimination of SaaS vendor lock-in, and the raw performance required to debug real-time issues.
Don't let slow I/O kill your monitoring just when you need it most. Deploy a test instance on CoolVDS today and see what legitimate NVMe performance does for your Prometheus ingestion rates.