on a high note. The Kubernetes cluster was alive. But as any engineer knows, a working system is often just the prelude to the next, more interesting problem. While the cluster was technically functional, its architecture had a hidden Achilles' heel: a single point of failure for all incoming traffic.
My mission was clear: eliminate it. The tool for the job was MetalLB, and the task seemed simple. I was wrong. What followed was a multi-day investigation down a deep and winding networking rabbit hole. This is the case file for that investigation—a detective story that starts with misleading TLS errors, leads to phantom network blocks, and ends with the unmasking of a culprit buried in the very foundation of my virtualization tools. Join me as we solve the case of the unreachable cluster.
… 192.168.1.x) and removed the nodeSelector to allow any node to become the leader for the VIP.
… LoadBalancer to ClusterIP and back again. This successfully assigned a new, correct VIP to Traefik.
… SSL_ERROR_SYSCALL, while my Traefik logs showed a cryptic local error: tls: bad record MAC.
The bad record MAC error strongly suggested a "man-in-the-middle" device corrupting TLS packets. We suspected an overly aggressive firewall on my router or VM host.
The bad record MAC error still occurred when connecting from another computer on my local network, proving the traffic corruption was happening closer to the cluster. Then, pinging the VIP revealed an even deeper issue: Destination Host Unreachable. We couldn't even reach the IP at a basic network level.
… CiliumNetworkPolicy to allow the traffic.
… arp -a output showing a successful ARP entry for the VIP and logs showing MetalLB was announcing the service. ARP wasn't the problem after all. The node was reachable at Layer 2.
… tcpdump on the leader node. The packet capture revealed the definitive truth:
… eth1 (my main LAN).
… eth0.
… ip route table of my VM, which confirmed the diagnosis. My node's default route was incorrectly pointing to the eth0 interface, a private management network created by Vagrant.
default via 192.168.121.1 dev eth0
sudo ip route del default && sudo ip route add default via 192.168.1.254 dev eth1
… Vagrantfile, using a shell provisioner set to run: "always" to enforce the correct default route on every single boot.
… ClusterIssuer.
cert-manager was now able to contact Let's Encrypt, issue a valid certificate, and Traefik began serving it correctly.
… cert-manager). Confirm what the application thinks is happening. Verify Kubernetes service-specific config (like externalTrafficPolicy or whether a LoadBalancerIP is sticky).
… ping for reachability and traceroute for path.
If ping fails with "Destination Host Unreachable," always check Layer 2.
arp -a is Critical for L2 Diagnostics: When basic ping fails, arp -a confirms if the IP is even resolving to a MAC address. A successful ARP reply shifts the focus immediately to routing or higher-layer issues.
ip route Maps Your World: If ARP works but traffic still fails, inspect the host's routing table. Misconfigured default gateways, competing routes, or incorrect interface assignments (eth0 vs. eth1) are common culprits for asymmetric routing.
tcpdump is the Ultimate Truth Serum (L2-L7): When logs mislead and pings don't tell the whole story, tcpdump reveals precisely what packets are arriving, leaving, and being dropped. It was the definitive tool that exposed the asymmetric routing.
… eth0) creating a competing default route was the hidden antagonist in our debugging journey.
… LoadBalancer IP will change automatically. Don't assume your router isn't interfering. Don't assume a ping failure means ARP is broken. Test and get concrete data for every assumption.
You’re in the final interview for a Senior Platform Engineer role. You’re at the whiteboard. The CTO leans forward and asks, "How would you design our observability strategy from scratch?"
You know the tools. You've done this before. Confidently, you dive straight in: "I'd use Prometheus for metrics, Loki for logs because it's cost-effective, OpenTelemetry for tracing..."
The CTO cuts you off. "I don't need a list of tools. I asked for a strategy."
At this moment, you realise that something is missin…