Database Sharding Strategies: Surviving the 10TB Threshold without Losing Sanity

Let's be brutally honest: nobody wants to shard their database. It is the architectural equivalent of a root canal. You delay it, you optimize your queries, you buy bigger hardware, and you pray that vertical scaling buys you another six months. But eventually, physics wins. I remember a specific incident in 2023 with a fintech client based in Oslo; their primary write master was pinned at 95% CPU utilization during tax return season, processing transactions that simply couldn't wait. We threw hardware at it—faster NVMe, more cores—but the lock contention on the transactions table was the bottleneck, not the raw IOPS. That is the moment you realize: your monolithic architecture is dead. If you are reading this, you are likely staring at a similar precipice. You don't need buzzwords about scalability; you need a strategy to slice your data without corrupting it.

The "Vertical Wall" and Why Hardware Still Matters

Before we rip your database apart, we have to acknowledge the hardware reality. In the Norwegian hosting market, where latency to NIX (Norwegian Internet Exchange) is sub-millisecond, the quality of your underlying infrastructure often dictates when you need to shard. Many developers jump to sharding too early because their hosting provider creates artificial bottlenecks via "noisy neighbor" syndrome. If you are running on shared containers where CPU steal time is rampant, your database isn't too big—your server is just choking. This is why we use KVM virtualization at CoolVDS. By isolating the kernel and guaranteeing CPU cycles, we often see clients push a single PostgreSQL instance to 4TB or 5TB of active data on our NVMe storage before sharding becomes strictly necessary. However, once you cross the threshold where a single write master cannot handle the ingest rate, or your recovery time objective (RTO) for restoring backups exceeds what your legal obligations under Datatilsynet allow, sharding is mandatory.

Strategy 1: Application-Level Sharding (The "Do It Yourself" Approach)

This is the oldest trick in the book, and while it introduces complexity into your codebase, it offers the ultimate control. You essentially route queries to specific database servers based on a sharding key—usually a user_id or tenant_id. In a GDPR-heavy environment like Norway, this is actually a massive advantage. You can physically isolate Norwegian user data on servers located in Oslo (like CoolVDS instances) while shunting non-EU traffic to other jurisdictions if necessary, satisfying strict data residency requirements purely via routing logic. The downside? You lose cross-shard ACID transactions. Your application logic must handle the routing.

Pro Tip: Never use a sequential integer as a shard key if you want uniform distribution. You will create "hot shards" where all new users land on the same server, melting it down. Always hash the key.

Here is a Python example of a deterministic shard router using a CRC32 hash, which is fast and distributes reasonably well for this purpose:

import zlib

def get_db_shard(user_id, total_shards):
    # Deterministic routing based on user_id (CRC32 is fast and stable across processes)
    shard_index = zlib.crc32(str(user_id).encode('utf-8')) % total_shards

    # Map index to actual connection strings
    # total_shards must match the number of entries here, or routing breaks
    shard_map = {
        0: "postgres://db_node_01.coolvds.internal:5432/app",
        1: "postgres://db_node_02.coolvds.internal:5432/app",
        2: "postgres://db_node_03.coolvds.internal:5432/app",
        3: "postgres://db_node_04.coolvds.internal:5432/app"
    }

    return shard_map.get(shard_index)

Strategy 2: Middleware Sharding with ProxySQL (MySQL/MariaDB)

If you are in the MySQL ecosystem, touching the application code might be too risky, especially with legacy monoliths. Enter ProxySQL. It sits between your app and the database layer, analyzing SQL traffic and routing it transparently. This allows you to split reads and writes, or implement sharding without the application knowing the topology changed. For a high-traffic Magento or WooCommerce site hosted in Norway, this is often the standard path.

A basic proxysql.cnf setup for query routing looks like this. Note how we define hostgroups for sharding:

mysql_servers = (
    {
        address = "10.0.0.101" # Shard 1
        port = 3306
        hostgroup = 10
        max_connections = 200
    },
    {
        address = "10.0.0.102" # Shard 2
        port = 3306
        hostgroup = 20
        max_connections = 200
    }
)

mysql_query_rules = (
    {
        rule_id = 1
        active = 1
        match_pattern = "^SELECT .* FROM users WHERE id % 2 = 0"
        destination_hostgroup = 10
        apply = 1
    },
    {
        rule_id = 2
        active = 1
        match_pattern = "^SELECT .* FROM users WHERE id % 2 = 1"
        destination_hostgroup = 20
        apply = 1
    }
)

The regex above is simplistic (modulo sharding), but it illustrates the power. The application just sends SQL; ProxySQL handles the dirty work. However, notice the network overhead. Every query makes an extra hop. On a high-latency network, this kills performance. This is why deploying your ProxySQL layer and your Database nodes on the same low-latency CoolVDS private network is critical. We offer private VRACKs (Virtual Racks) that keep this internal traffic off the public internet, reducing latency to near-zero.
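
If you would rather adjust routing at runtime than restart with a new proxysql.cnf, the same rules can be inserted through ProxySQL's admin interface (port 6032 by default). A minimal sketch, mirroring the hostgroups defined above:

-- Connect to the admin interface, e.g. mysql -h 127.0.0.1 -P 6032 -u admin -p
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
    (1, 1, '^SELECT .* FROM users WHERE id % 2 = 0', 10, 1),
    (2, 1, '^SELECT .* FROM users WHERE id % 2 = 1', 20, 1);

-- Apply to the running config and persist across restarts
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;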

Strategy 3: Native Sharding with PostgreSQL & Citus

For the PostgreSQL purists—and in 2025, that should be most of you—Citus is the extension that turns Postgres into a distributed database. It is far superior to manual sharding because it pushes the query logic down to the worker nodes. You execute a query on the coordinator, and it parallelizes the execution across the cluster. For analytics or multi-tenant SaaS apps, this is the gold standard.

Configuring a Citus Worker Node

When setting up a worker node on a CoolVDS instance, you need to tune the postgresql.conf specifically for high throughput. Standard configs are too conservative.

# /etc/postgresql/17/main/postgresql.conf

# Memory - Assuming 64GB RAM Instance
shared_buffers = 16GB
effective_cache_size = 48GB
work_mem = 64MB
maintenance_work_mem = 2GB

# Citus specific (the extension must be preloaded)
shared_preload_libraries = 'citus'
citus.max_worker_nodes_tracked = 2048
citus.shard_replication_factor = 2 # High Availability

# Checkpoints (Crucial for heavy writes)
min_wal_size = 2GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9

Once the extension is loaded, distributing a table is a single command. This transforms a standard table into a sharded one across your fleet:

-- Connect to the coordinator node (the extension must also exist on each worker)
CREATE EXTENSION IF NOT EXISTS citus;

-- Register the worker nodes
SELECT citus_add_node('10.0.0.101', 5432);
SELECT citus_add_node('10.0.0.102', 5432);

-- Convert the table
SELECT create_distributed_table('events', 'device_id');
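
To the application, the distributed table behaves like any other table. A query such as the sketch below (against the events table from above) is accepted by the coordinator, fanned out to the shards in parallel, and the partial results are merged before being returned:

-- Runs on the coordinator; Citus pushes the per-shard aggregation down to the workers
SELECT device_id, count(*) AS event_count
FROM events
GROUP BY device_id
ORDER BY event_count DESC
LIMIT 10;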

The Infrastructure Reality Check

Sharding software is only as reliable as the underlying hardware. I have seen perfect Citus clusters fail because one node was on a host with a "noisy neighbor" mining crypto, causing IOPS starvation. When one shard lags, the whole query lags. Distributed systems are governed by the slowest node.

This is where the choice of VPS provider becomes an architectural decision, not just a billing one. At CoolVDS, we don't oversubscribe our NVMe storage arrays. When you request a slice, you get the dedicated throughput required to maintain consistent latency across all shards. For a sharded setup, network stability is paramount. A flap in the network triggers re-elections and failovers that can stall your application for seconds. Our data centers in Oslo are engineered with redundant paths specifically to prevent split-brain scenarios in clustered databases.

Monitoring the Monster

Once you shard, you can no longer just `tail -f` a log file. You need aggregated metrics. Tools like PMM (Percona Monitoring and Management) or a tightly configured Prometheus/Grafana stack are mandatory. You need to watch for data skew: one shard accumulating significantly more data than the others (e.g., the "Justin Bieber" problem, where one user has millions of followers). Monitoring disk usage per shard is your early warning system.

Bash script to check disk usage across remote shards:

#!/bin/bash
# Quick health check for shard disk usage
# Assumes key-based SSH access and that the PostgreSQL data directory
# sits on its own mount (otherwise the grep matches nothing)
SHARDS=("10.0.0.101" "10.0.0.102" "10.0.0.103")

for IP in "${SHARDS[@]}"; do
    echo "Checking Shard: $IP"
    ssh "user@$IP" "df -h | grep '/var/lib/postgresql'"
done
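
The df check catches skew at the OS level. If you are running Citus, you can also interrogate the cluster from the coordinator itself; here is a sketch using the run_command_on_workers() helper that ships with Citus (result columns can differ slightly between versions):

-- Run on the coordinator: ask each registered worker for its database size
SELECT nodename, nodeport, result AS db_size
FROM run_command_on_workers($cmd$
    SELECT pg_size_pretty(pg_database_size(current_database()))
$cmd$);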

Conclusion: Don't Shard Until You Must

Sharding solves the capacity problem but creates a complexity problem. Before you go down this road, ensure you have maximized your vertical scaling options. Have you tuned your indices? Have you optimized your innodb_buffer_pool_size? Have you migrated to high-performance NVMe infrastructure?

If you have done all that and your I/O wait is still killing you, then welcome to the world of sharding. It's a rugged landscape, but with the right architecture and solid underlying metal, it's manageable. Just make sure your foundation is solid.

Is your database choking on success? Deploy a high-performance, KVM-backed CoolVDS instance in Oslo today and see how far vertical scaling can actually take you.