
The CTO’s Guide to Disaster Recovery: Surviving the Unthinkable in a GDPR World

When the Lights Go Out in Oslo: A Pragmatic Approach to Disaster Recovery

There is a terrifying silence that fills a Slack channel when a senior engineer types: "The primary database is gone, and the replica is empty." I have been in that room. It wasn't a fire or a flood; it was a logic error in a deployment script that cascaded through a synchronous replication setup. The company lost four days of transactional data because their "Disaster Recovery" plan was just a nightly cron job that hadn't run successfully in two weeks.

In the Nordic market, where the Datatilsynet (Norwegian Data Protection Authority) wields significant power and GDPR fines can decimate margins, treating Disaster Recovery (DR) as an afterthought is professional negligence. It is not enough to have data; you must be able to restore it before your Recovery Time Objective (RTO) bleeds your budget dry.

This is not a theoretical essay. This is a blueprint for building resilience using tools available right now, in late 2024, focusing on sovereignty, speed, and strict consistency.

The Legal Reality: Schrems II and Data Sovereignty

Before we touch a single config file, we must address the legal architecture. Since the Schrems II ruling, relying on US-owned hyperscalers for disaster recovery has become a compliance minefield. If your primary data resides in Norway, your failover cannot simply be an S3 bucket in Virginia.

For Norwegian enterprises, the safest path is sovereign infrastructure. You need a provider where the legal entity, the hardware, and the data centers are bound by EEA law. This is where the choice of infrastructure partner becomes strategic. Using a provider like CoolVDS, which operates strictly within European jurisdiction with local NVMe storage, eliminates the legal headache of cross-border data transfers during a crisis.

The Architecture of Resilience: 3-2-1-1-0

The old 3-2-1 backup rule (3 copies, 2 media types, 1 offsite) is obsolete in the age of sophisticated ransomware. In 2024, we adhere to 3-2-1-1-0:

  • 3 copies of data.
  • 2 different media types (e.g., Block Storage vs. Object Storage).
  • 1 offsite location (e.g., a secondary CoolVDS datacenter location).
  • 1 copy that is Immutable or Air-gapped.
  • 0 errors after automated recovery verification (a minimal verification sketch follows this list).
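
That last point is the one most teams skip. Verification can be automated with the same tooling used for the backups themselves; a minimal sketch, assuming the restic repository configured in the next section, is a scheduled integrity pass that re-reads a sample of the stored data:

# Download and verify a random 10% of the repository's pack files against their checksums
restic check --read-data-subset=10%

Anything other than a zero exit code here should page someone, not land in a log nobody reads.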

Implementing Immutability with Restic and MinIO

Ransomware targets backups first. To counter this, we use restic with a backend that supports Object Lock (WORM, Write Once Read Many). If you are running self-hosted object storage on a separate CoolVDS instance with MinIO, you can enforce this at the bucket level.
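
With MinIO, object locking has to be enabled when the bucket is created; it cannot be switched on for an existing bucket. A sketch using the MinIO client (mc), assuming an mc alias named backup-gw and an illustrative bucket name and retention window:

# Create the bucket with locking enabled, then set a default 30-day COMPLIANCE retention
mc mb --with-lock backup-gw/restic-backups
mc retention set --default COMPLIANCE "30d" backup-gw/restic-backups

COMPLIANCE mode cannot be lifted early, even by an administrator, which is exactly the property you want against an attacker who has already obtained root on the source server.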

Here is a production-ready script pattern for backing up configuration files and web content with client-side encryption:

#!/bin/bash
# /usr/local/bin/secure_backup.sh
set -euo pipefail  # abort on the first error so a failed backup never falls through to pruning

export RESTIC_REPOSITORY="s3:https://backup-gw.coolvds-internal.net/bucket-name"
export RESTIC_PASSWORD_FILE="/root/.restic_pwd"
export AWS_ACCESS_KEY_ID="XXXX"
export AWS_SECRET_ACCESS_KEY="YYYY"

# Initialize if not exists
restic snapshots > /dev/null 2>&1 || restic init

# Backup with tagging for retention policies
restic backup /etc /var/www/html \
  --tag scheduled \
  --exclude-file=/etc/restic/excludes.txt \
  --iexclude '*.log' \
  --iexclude '*.tmp'

# Prune old snapshots, keeping the last 7 daily, 4 weekly and 12 monthly
restic forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --prune

This setup is lightweight and encrypts data before it leaves your server. Because CoolVDS offers unmetered internal bandwidth between instances in the same region, this traffic doesn't impact your public transfer quotas.
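
The opening anecdote, a backup job that had silently failed for two weeks, is the failure mode to engineer against: schedule the script and make a non-zero exit page someone. A minimal cron sketch (the alerting hook is a placeholder for whatever notification script you use):

# /etc/cron.d/secure-backup: run nightly at 02:15 and alert on failure
15 2 * * * root /usr/local/bin/secure_backup.sh || /usr/local/bin/notify_oncall.sh "restic backup failed"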

Database Consistency: The WAL Method

Filesystem snapshots are risky for running databases. A snapshot taken while MySQL or PostgreSQL is flushing to disk can result in corruption. For PostgreSQL, the gold standard in 2024 is still continuous archiving via the Write-Ahead Log (WAL).

Instead of a massive nightly dump, we stream changes in near real-time to a recovery server. This lowers your Recovery Point Objective (RPO) from 24 hours to mere seconds.

Primary Server Config (postgresql.conf):

wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/nfs_backup/archivedir/%f && cp %p /mnt/nfs_backup/archivedir/%f'
archive_timeout = 60  # force a WAL segment switch (and archive) at least every 60 seconds

Pro Tip: Do not mount the backup NFS share directly on the database server if you can avoid it. Instead, push WAL files with rsync or a dedicated backup tool such as pgBackRest, which can ship them to S3-compatible storage. If the DB server is compromised, you don't want the attacker to have direct write access to your backup volume.
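
Archiving is only half of the contract; you also need the replay side. On PostgreSQL 12 and newer, point-in-time recovery is driven by restore_command plus a recovery.signal file. A minimal sketch, assuming a base backup has already been restored into the data directory and using the archive location from the config above (the data directory path is Debian-style and illustrative):

# postgresql.conf on the recovery server
restore_command = 'cp /mnt/nfs_backup/archivedir/%f %p'
recovery_target_time = '2024-11-20 03:00:00+01'   # optional, illustrative point-in-time target

# Create recovery.signal so PostgreSQL enters targeted recovery on the next start
touch /var/lib/postgresql/16/main/recovery.signal
systemctl start postgresql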

The "Restore" Bottleneck: IOPS Matter

Here is the hard truth nobody tells you about restoration: It is I/O intensive.

When you are restoring a 500 GB database from a dump, your disk is writing as fast as physics allows. On budget VPS providers using shared HDDs or throttled SSDs (often capped at 300-500 IOPS), a restore can take 14+ hours. During those 14 hours, your business is offline.

This is where the underlying hardware of your hosting provider becomes a critical feature of your DR plan. We utilize CoolVDS specifically because they expose NVMe storage directly to the KVM instance. We aren't fighting for IOPS with 500 other noisy neighbors.

Storage Type                 | Seq. Write Speed | Est. 500 GB Restore Time
Standard SATA SSD (Shared)   | ~150 MB/s        | ~55 Minutes (Best Case)
CoolVDS NVMe                 | ~2,500 MB/s      | < 4 Minutes

That difference, four minutes versus the better part of an hour, is often the difference between a minor hiccup and a breached SLA.
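
Do not take throughput figures on faith, including the ones in the table above; measure the sustained sequential write rate of the exact volume you would restore onto. A quick sketch with fio (the directory and test size are illustrative, point it at scratch space):

# Measure sustained sequential write throughput on the restore target
fio --name=restore-sim --directory=/srv/scratch \
    --rw=write --bs=1M --size=4G --numjobs=1 \
    --ioengine=libaio --direct=1 --group_reporting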

Failover Routing with High Availability

Restoring data is step one. Redirecting traffic is step two. If your primary IP is down, you need a mechanism to swing traffic to the DR site. In a Norwegian context, minimizing latency to NIX (Norwegian Internet Exchange) is vital for user experience.

We recommend a Floating IP setup managed via Keepalived. This allows two CoolVDS instances to share a single public IP address. If the master fails, the backup assumes the IP within a few seconds.

Sample keepalived.conf for the MASTER:

# Define the health check referenced below; if Nginx is not running,
# keepalived lowers this node's priority so the backup can take over
vrrp_script check_nginx {
    script "/usr/bin/pgrep nginx"
    interval 2
    fall 2
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass SecretPassword123   # VRRP uses only the first 8 characters
    }
    virtual_ipaddress {
        192.168.1.10/24   # the shared (floating) IP
    }
    track_script {
        check_nginx
    }
}

This configuration ensures that if Nginx dies on the primary node, the health check drops its priority and the IP floats to the secondary node within a few seconds. Because CoolVDS supports private networking (VLANs), this heartbeat traffic remains isolated and secure.

The Final Test: Chaos Engineering

A disaster recovery plan that hasn't been tested is just a wish. You must simulate failure. Shut down the primary interface. Corrupt a test table. See if your monitoring system (Zabbix, Prometheus) screams at you.
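
For the keepalived setup above, a concrete drill looks like this: stop the web server on the master, confirm the virtual IP moves to the backup, and note how long clients were actually affected. A rough sketch (interface name and VIP are the ones used in the example config):

# On the MASTER: simulate the failure the health check is watching for
systemctl stop nginx

# On the BACKUP: confirm it has claimed the virtual IP and entered MASTER state
ip addr show dev eth0 | grep 192.168.1.10
journalctl -u keepalived --since "5 minutes ago" | grep -i "entering master state"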

Your infrastructure must be predictable. When you hit "Enter" on that restore script, you need to know exactly how the underlying hypervisor behaves. This predictability is why we lean on KVM virtualization rather than container-based VPS solutions for our core database nodes. The isolation guarantees that our resources are ours, ensuring that recovery times are consistent every single time.

Next Steps

Review your current backup logs. If you see warnings or if you haven't restored a file in the last 30 days, you are vulnerable. Start small: spin up a secondary instance on CoolVDS, configure an automated Restic sync, and measure your restore time. Resilience is not expensive; downtime is.
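
Measuring restore time can be as blunt as timing a full restore of the newest snapshot into a scratch directory on that secondary instance (the target path is illustrative):

# Time a full restore of the latest snapshot; compare the result against your RTO
time restic restore latest --target /srv/restore-test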