Kubernetes Networking Deep Dive: Solving Latency & CNI Chaos in 2024

Let’s be honest: Kubernetes networking is usually where the dream of "cloud-native" dies a slow, packet-dropping death. You deploy your microservices, everything looks green in ArgoCD, but the frontend is throwing 504s because a pod on Node A can't talk to Node B fast enough.

I’ve spent the last six months debugging a high-traffic fintech platform hosted in Oslo. The application was solid, but the network overlay was eating 30% of our CPU cycles. Most VPS providers won't tell you this, but if your underlying virtualization layer steals cycles or throttles I/O, your fancy service mesh doesn't stand a chance.

This is a technical deep dive into optimizing Kubernetes networking for performance and stability, specifically within the Nordic infrastructure context of late 2024.

1. The CNI Wars: IPTables is Dead (Long Live eBPF)

If you are still running Flannel or basic Calico in IPTables mode in production in October 2024, you are voluntarily bottlenecking your cluster. IPTables wasn’t designed for the churn of dynamic container IPs. It’s an O(N) list: with 5,000 services, every packet lookup means traversing a massive ruleset.

We switched to Cilium with eBPF (extended Berkeley Packet Filter). Cilium’s eBPF datapath replaces those iptables chains with in-kernel hash-table lookups, so service resolution stays O(1) no matter how many services you run, and packets skip large parts of the standard Linux networking path.

Implementing Cilium for High Throughput

To get this working correctly, you need a kernel newer than 5.10 (Ubuntu 24.04 on CoolVDS ships with 6.8, which is perfect). Here is the helm configuration we use to bypass IPTables entirely using kube-proxy replacement:

helm install cilium cilium/cilium --version 1.16.1 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set bpf.masquerade=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=${API_SERVER_PORT} \
  --set loadBalancer.mode=dsr  # Direct Server Return for lower latency

Pro Tip: Enabling loadBalancer.mode=dsr (Direct Server Return) is a massive win for latency. The backend pod replies directly to the client, bypassing the load balancer node on the return trip. However, this requires your underlying network to allow asymmetric routing. We enable this by default on CoolVDS private networks to support advanced clustering topologies.
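
Once the agents are running, it is worth confirming that the eBPF datapath really took over. A quick sanity check from the agent pod (treat the exact status output as an assumption; its format shifts between Cilium releases):

# kube-proxy should no longer be managing service traffic
kubectl -n kube-system get ds kube-proxy
# Error from server (NotFound): daemonsets.apps "kube-proxy" not found

# Ask the Cilium agent whether it is replacing kube-proxy
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i kubeproxyreplacement
# KubeProxyReplacement:   True   (output abbreviated)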

2. The "ndots:5" DNS Trap

This is the single most common misconfiguration I see in Norwegian dev teams migrating to K8s. By default, Kubernetes sets ndots:5 in every pod’s /etc/resolv.conf. This means that if your code tries to resolve google.com, it first tries:

  1. google.com.namespace.svc.cluster.local
  2. google.com.svc.cluster.local
  3. google.com.cluster.local
  4. ...and so on.

Every one of those candidates has to fail before the real query ever leaves the cluster: one wasted lookup per search domain, doubled if the client asks for both A and AAAA records. Multiply that by 1,000 RPS, and CoreDNS implodes. If you are hosting a latency-sensitive app (like real-time bidding or high-frequency trading) near NIX (the Norwegian Internet Exchange), this internal delay ruins your proximity advantage.
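
You can see the trap for yourself by reading resolv.conf inside any running pod (the output below assumes the default namespace and a stock kubeadm-style cluster; your search domains and nameserver IP will differ):

kubectl exec -it <any-pod> -- cat /etc/resolv.conf
# search default.svc.cluster.local svc.cluster.local cluster.local
# nameserver 10.96.0.10
# options ndots:5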

The Fix

You can force the ndots configuration in your Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-processor
spec:
  selector:
    matchLabels:
      app: payment-processor
  template:
    metadata:
      labels:
        app: payment-processor
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
      containers:
      - name: app
        image: registry.coolvds.com/payment:v2.4

Setting this to "2" dramatically reduces internal DNS pressure: any name with at least two dots (api.example.com, for instance) goes straight to the upstream resolver instead of being dragged through the cluster search domains first.
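
After the rollout, confirm the override actually landed (assuming the image ships a shell and cat):

kubectl exec deploy/payment-processor -- cat /etc/resolv.conf | grep ndots
# options ndots:2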

3. Tuning the Node Kernel for 10Gbps+

Kubernetes nodes are just Linux servers. If the host limits are low, the containers hit a wall. When running on CoolVDS NVMe instances, we have massive I/O throughput available, but the default Linux network stack is tuned for 2010 hardware, not 2024 speeds.

We apply a specific sysctl profile to all our worker nodes via a DaemonSet (or cloud-init):

# /etc/sysctl.d/99-k8s-network.conf

# Increase the read/write buffer sizes for high-speed networks
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# Increase the number of incoming connections
net.core.somaxconn = 8192

# Allow more connections to be tracked (crucial for NAT/Masquerade)
net.netfilter.nf_conntrack_max = 1048576

# Reuse TIME_WAIT sockets for new outbound connections
# (mainly helps nodes that proxy or SNAT a lot of traffic; still use with care)
net.ipv4.tcp_tw_reuse = 1

If you hit the nf_conntrack_max limit, your node starts dropping packets silently. It’s a nightmare to debug. Watch node_nf_conntrack_entries against node_nf_conntrack_entries_limit in Prometheus (node_exporter exposes both) so you can see the table filling up before packets start disappearing.
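
For reference, here is a minimal sketch of the DaemonSet approach mentioned above: a privileged pod on the host network that writes the same values with sysctl -w. The name and image are placeholders; adapt them, or bake the file into your cloud-init instead.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuner
  template:
    metadata:
      labels:
        app: sysctl-tuner
    spec:
      hostNetwork: true               # apply net.* sysctls in the host network namespace
      tolerations:
      - operator: Exists              # run on every node, tainted or not
      containers:
      - name: sysctl
        image: busybox:1.36
        securityContext:
          privileged: true            # required to write node-level sysctls
        command:
        - sh
        - -c
        - |
          sysctl -w net.core.rmem_max=16777216
          sysctl -w net.core.wmem_max=16777216
          sysctl -w net.core.somaxconn=8192
          sysctl -w net.netfilter.nf_conntrack_max=1048576
          sysctl -w net.ipv4.tcp_tw_reuse=1
          while true; do sleep 3600; done   # keep the pod alive so values are reapplied after reboots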

4. Gateway API: Replacing Ingress

As of late 2024, the standard Ingress resource is functionally legacy. The Gateway API (v1.1) provides a much more robust way to handle traffic splitting and header manipulation without messy Nginx annotations.

Here is how we split traffic for a canary deployment—a requirement for many compliance-heavy Norwegian clients who need to test changes on a small user subset before a full rollout:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-routing
  namespace: backend
spec:
  parentRefs:
  - name: external-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v2/orders
    backendRefs:
    - name: orders-v1
      port: 8080
      weight: 90
    - name: orders-v2-canary
      port: 8080
      weight: 10
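
For completeness, the external-gateway referenced in parentRefs looks roughly like this. It is a sketch: the gatewayClassName depends on which controller you run (Cilium’s Gateway API support registers a class named cilium; other controllers install their own class names):

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
  namespace: backend
spec:
  gatewayClassName: cilium           # assumption: swap in the class your controller provides
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same                   # only HTTPRoutes from this namespace may attach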

The Hardware Foundation Matters

You can tune sysctl all day, but if your VPS is running on noisy hardware with