I have spent the last 72 hours staring at tcpdump output scrolling across a terminal, trying to figure out why 3% of HTTP requests between my frontend and backend pods were vanishing into the ether. Everyone tells you Kubernetes is the future of infrastructure. They talk about self-healing and auto-scaling. They rarely talk about the absolute brutality of debugging a distributed software-defined network when it decides to act up.
If you are running Kubernetes 1.2 in production today, you know that networking is the single most complex abstraction layer to master. It is not just about opening ports anymore. It is about how packets traverse namespaces, bridges, and virtual ethernet devices, all while being mangled by thousands of iptables rules. Let's rip open the hood and look at how this machine actually moves data, and how to keep it from melting down.
The Fundamental Constraint: The Flat Network
Kubernetes imposes a strict requirement: every Pod must be able to communicate with every other Pod without Network Address Translation (NAT). It sounds simple, but achieving this across multiple physical hosts requires an overlay network or a complex routing setup. In a typical setup using Flannel (which is what most of us are using right now), we rely on VXLAN encapsulation.
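With Flannel, each node leases a slice of the cluster-wide pod range (a /24 by default) out of etcd and records it locally. A quick sanity check on any node, assuming Flannel's default subnet file location:
# The pod subnet and MTU this node leased from etcd
cat /run/flannel/subnet.env

# Routes to the other nodes' pod subnets should point at the overlay device
ip route | grep flannel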
Pro Tip: VXLAN encapsulation adds overhead. Your CPU has to wrap every frame in an outer Ethernet/IP/UDP/VXLAN header (roughly 50 extra bytes per packet) before sending it out, and the receiving host has to unwrap it. On cheap, over-sold VPS hosting, that extra per-packet work kills your throughput.
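You can see that overhead directly on the node. A minimal check, assuming the vxlan backend created the usual flannel.1 device (the name may differ on your setup):
# VXLAN details for the overlay device, including its reduced MTU
ip -d link show flannel.1

# Compare against the physical interface: the overlay MTU sits ~50 bytes lower
ip link show eth0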
The Cost of Encapsulation
When Pod A (10.244.1.2) sends a packet to Pod B (10.244.2.2) on a different node, the OS encapsulates that frame. If you are seeing high sys CPU usage, this is often the culprit. Here is a typical Flannel configuration we use to ensure the backend uses the correct interface:
# /etc/systemd/system/flanneld.service (relevant excerpt)
[Service]
ExecStart=/usr/bin/flanneld \
  -etcd-endpoints=http://10.10.0.1:2379 \
  -iface=eth0 \
  -ip-masq
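To see the encapsulation on the wire, capture on both sides of the overlay. This is a sketch assuming the vxlan backend on its default UDP port 8472; adjust the port and interface names to whatever your config actually uses:
# On the underlay, pod traffic appears as UDP between the node IPs
sudo tcpdump -ni eth0 udp port 8472 -c 20

# On the overlay device, the same traffic appears decapsulated, pod IP to pod IP
sudo tcpdump -ni flannel.1 -c 20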
If your -iface is flapping or your underlying provider has high jitter on the physical network, your overlay network collapses. This is why we deploy our K8s clusters on CoolVDS NVMe instances. The KVM virtualization provides consistent CPU scheduling, which is critical for processing these encapsulated packets without inducing latency spikes.
Service Discovery: The `iptables` Labyrinth
With the release of Kubernetes 1.2 in March, the community made a massive shift. The kube-proxy component now defaults to iptables mode instead of the old userspace proxy. This is a massive performance win, but it makes debugging a nightmare.
In userspace mode, kube-proxy actually proxied the traffic, copying every packet through a userspace process. In iptables mode, it just writes rules, and the Linux kernel handles the forwarding. That removes the userspace hop and is much faster, but rule matching is still a linear walk through the chains, and have you looked at your ruleset lately?
Run this on your node:
sudo iptables-save | grep KUBE-SVC
You will see a chain for every Service. When a packet hits a Service IP (ClusterIP), iptables uses the statistic module to randomly load balance traffic to backend Pods. It looks something like this:
-A KUBE-SVC-SOMESERVICE -m comment --comment "default/my-service:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-POD1
-A KUBE-SVC-SOMESERVICE -m comment --comment "default/my-service:" -j KUBE-SEP-POD2
If you have 1,000 services, this list gets long. The kernel is fast, but it's not magic. We recently debugged a cluster where a developer created 500 Services for a dev environment, and network latency visibly increased. We had to tune the net.netfilter.nf_conntrack_max sysctl to prevent dropped packets.
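Before it gets that far, it is worth watching the rule count and the conntrack table directly on a node. A rough health check, assuming the standard nf_conntrack proc paths:
# How many rules has kube-proxy written for Services and endpoints?
sudo iptables-save | grep -c '^-A KUBE'

# Conntrack utilization: when count approaches max, new connections get dropped
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# The kernel logs this when it starts dropping
dmesg | grep 'nf_conntrack: table full'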
Configuration for High Traffic
To keep the kernel from dropping connections under load, especially with the state tracking required by these rules, you must tune your sysctl settings. Do not leave these at defaults.
# /etc/sysctl.conf
net.netfilter.nf_conntrack_max = 131072
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65000
# Essential for avoiding "Neighbor table overflow" in large clusters
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
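None of this takes effect until you reload it. The boring but necessary step, plus a spot check that the kernel actually accepted the values:
# Apply /etc/sysctl.conf without rebooting
sudo sysctl -p

# Verify the values the kernel is actually running with
sysctl net.netfilter.nf_conntrack_max net.ipv4.neigh.default.gc_thresh3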
Ingress: Letting the World In
We are currently using the NGINX Ingress Controller (still in Beta, but stable enough). It acts as a reverse proxy sitting at the edge of the cluster. The trickiest part here is handling the "hairpin" traffic if you are running on bare metal or a provider that doesn't offer a native LoadBalancer integration.
Since CoolVDS gives us direct root access and raw networking, we usually set up an external HAProxy or just use NodePort combined with DNS round-robin for smaller setups. However, for serious production, we bind the Ingress controller to the host network namespace to bypass the overlay entirely for incoming traffic:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ingress
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - image: gcr.io/google_containers/nginx-ingress-controller:0.8.3
    name: nginx-ingress
    # The controller will not start without a default backend Service to send
    # unmatched requests to; create default-http-backend in kube-system first.
    args:
    - /nginx-ingress-controller
    - --default-backend-service=kube-system/default-http-backend
    ports:
    - containerPort: 80
      hostPort: 80
    - containerPort: 443
      hostPort: 443
Using hostNetwork: true improves performance significantly by skipping the bridge and iptables DNAT for ingress traffic, but it requires port management discipline.
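Concretely, the controller now owns 80 and 443 directly on the host of every node it lands on, so check nothing else is squatting there before scheduling it:
# Anything already listening on 80 or 443 on this node?
sudo ss -tlnp | grep -E ':(80|443) '

# Once the pod is up, nginx should appear bound in the host network namespace
sudo ss -tlnp | grep nginx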
Local Latency and Data Sovereignty
We operate primarily out of Norway. The physical distance between your nodes matters. If you span a cluster across data centers with high latency, etcd will start timing out (the default heartbeat interval is 100ms and the election timeout is 1s, which leaves little room for jitter).
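Before blaming Kubernetes, measure the round trip between the nodes and ask etcd how it feels. A sketch, assuming etcd 2.x (what 1.2-era clusters ship with) and the endpoint from the Flannel unit above; replace 10.10.0.2 with one of your other nodes:
# Sustained spikes here translate directly into etcd leader elections
ping -c 20 10.10.0.2

# etcd's own view of the cluster (etcdctl v2 syntax)
etcdctl --endpoints http://10.10.0.1:2379 cluster-health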
For Norwegian businesses, there is also the compliance angle. With the increasing scrutiny from Datatilsynet, keeping data traffic within national borders is becoming a hard requirement. CoolVDS data centers in Oslo peer directly at NIX (Norwegian Internet Exchange). This means when your Norwegian users hit your Kubernetes Ingress, the latency is practically non-existent, often sub-5ms.
Why Underlying Hardware Matters
You can optimize your iptables and tune your sysctls all day, but if the hypervisor underneath you is stealing CPU cycles (Steal Time), your network throughput will suffer. Kubernetes networking is CPU-intensive due to the packet encapsulation and rule processing.
Many VPS providers oversell their cores. You might think you have 4 vCPUs, but you are fighting for time slots with 20 other noisy neighbors. This manifests as network jitter. CoolVDS guarantees KVM resources. When we run benchmarks using iperf3 between two CoolVDS instances, the variance is minimal. That stability is what keeps the kubelet reporting "Ready" instead of flapping to "NotReady" under load.
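The benchmark itself is nothing exotic; what matters is running it long enough to expose variance and watching steal time while it runs. A sketch, with 10.10.0.2 standing in for the second instance:
# On the receiving instance
iperf3 -s

# On the sending instance: 60 seconds, reporting every 5 so jitter is visible
iperf3 -c 10.10.0.2 -t 60 -i 5

# On either node while the test runs: the 'st' column is CPU stolen by the hypervisor
vmstat 5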
Final Thoughts
Kubernetes 1.2 has brought us massive stability improvements, but the network complexity is the price we pay. Monitor your conntrack tables, watch your CPU steal time, and don't assume the network is reliable.
If you are tired of debugging network flakes caused by noisy neighbors, move your cluster to a platform that respects raw performance. Spin up a CoolVDS NVMe instance today and see what stable latency looks like.