Pod-to-pod latency spikes in a high-traffic Nest.js Kubernetes setup usually come from DNS delays, connection churn, CNI/network overhead, or inefficient server configuration. Improving server performance, upgrading networking, and tuning infrastructure helps stabilize latency at scale.
To fix latency spikes at 1B+ daily requests (roughly 11,600 requests per second on average), optimize both Nest.js performance and Kubernetes networking:
1) Nest.js Improvements
- Switch to Fastify for faster request handling.
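- Enable HTTP keep-alive on outbound clients so requests reuse connections instead of repeating TCP/TLS handshakes.
- Run Nest.js in cluster mode when a pod is allocated more than one CPU core, so every core is used.
- Use gRPC/HTTP2 with connection pooling for internal communication.
- Add circuit breakers and retries with backoff to prevent cascading slowdowns, and cap the retry count so retries do not amplify load.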
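
As a rough illustration of the circuit breaker and retry point above, here is a minimal sketch in plain TypeScript with no Nest.js dependency. The class name, the thresholds, and the injected clock are all assumptions made for the example; a production service would more likely use an established resilience library and add random jitter to the backoff.

```typescript
// Sketch only: names, thresholds, and the injectable clock are illustrative assumptions.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  // True if a request may be sent. After the cool-down, an open breaker
  // moves to half-open and lets a trial request through.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // Opens the breaker once the failure threshold is reached,
  // or immediately if the half-open trial request failed.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

// Exponential backoff capped at capMs (no jitter, to keep the sketch deterministic).
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Wraps a downstream call with a bounded number of retries behind the breaker.
async function callWithResilience<T>(
  breaker: CircuitBreaker,
  fn: () => Promise<T>,
  maxAttempts: number,
  baseMs = 50,
  capMs = 1000,
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (!breaker.allowRequest()) {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = await fn();
      breaker.recordSuccess();
      return result;
    } catch (err) {
      breaker.recordFailure();
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs, capMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A caller would typically keep one breaker per downstream service and wrap each outbound request, for example `callWithResilience(breaker, () => client.getOrder(id), 3)`, where `client.getOrder` is a hypothetical downstream call.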
2) Kubernetes Networking Fixes
- Enable NodeLocal DNSCache so lookups are answered on the node, which removes most DNS lookup spikes.
- Use an eBPF-based CNI such as Cilium for lower jitter.
- Switch kube-proxy to IPVS mode, or use Cilium's kube-proxy replacement.
- Keep traffic within the same zone or node where possible to avoid cross-zone latency.
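
To make the same-zone point more concrete, the sketch below builds a Service manifest as a plain TypeScript object. It assumes a cluster recent enough to support the `service.kubernetes.io/topology-mode: Auto` annotation (Topology Aware Routing); the service name, namespace, port, and `app` label are hypothetical. Setting `internalTrafficPolicy: Local` is shown as an opt-in because it drops traffic on nodes that have no local endpoint.

```typescript
// Sketch only: service name, namespace, label, and port are illustrative assumptions.
interface ServiceManifest {
  apiVersion: "v1";
  kind: "Service";
  metadata: {
    name: string;
    namespace: string;
    annotations: Record<string, string>;
  };
  spec: {
    selector: Record<string, string>;
    ports: { name: string; port: number; targetPort: number }[];
    internalTrafficPolicy?: "Cluster" | "Local";
  };
}

function buildZoneAwareService(
  name: string,
  namespace: string,
  port: number,
  nodeLocalOnly = false,
): ServiceManifest {
  const manifest: ServiceManifest = {
    apiVersion: "v1",
    kind: "Service",
    metadata: {
      name,
      namespace,
      // Asks Kubernetes to prefer endpoints in the caller's zone.
      annotations: { "service.kubernetes.io/topology-mode": "Auto" },
    },
    spec: {
      selector: { app: name },
      ports: [{ name: "http", port, targetPort: port }],
    },
  };
  if (nodeLocalOnly) {
    // Only route to endpoints on the same node; traffic is dropped if there are none.
    manifest.spec.internalTrafficPolicy = "Local";
  }
  return manifest;
}

// JSON manifests can be applied with kubectl just like YAML ones.
const exampleManifestJson = JSON.stringify(buildZoneAwareService("orders", "default", 3000), null, 2);
```

Zone-aware routing only helps when each zone has enough endpoints to absorb its own traffic, so it is worth checking the replica spread before relying on it.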
3) Infrastructure Tuning
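- Raise conntrack limits (net.netfilter.nf_conntrack_max) and socket buffer sizes (net.core.rmem_max, net.core.wmem_max).
- Scale with the HPA on P95 latency, exposed as a custom or external metric, not just CPU.
- Monitor DNS latency, handshake time, and connection reuse in APM.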
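
To show how latency-based scaling works out numerically, here is a small sketch of a nearest-rank P95 calculation together with the replica formula the HPA documents: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The function names and sample values are assumptions for illustration. The real controller also applies min/max replica bounds and stabilization windows, and it needs the latency exposed through a custom or external metrics adapter.

```typescript
// Sketch only: function names and sample numbers are illustrative assumptions.

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Mirrors the documented HPA formula, without min/max bounds or stabilization.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number,
  targetMetric: number,
  tolerance = 0.1,
): number {
  const ratio = currentMetric / targetMetric;
  if (Math.abs(ratio - 1) <= tolerance) {
    return currentReplicas; // close enough to target: no scaling
  }
  return Math.ceil(currentReplicas * ratio);
}

// Example: latency samples of 1..100 ms give a P95 of 95 ms.
const latenciesMs = Array.from({ length: 100 }, (_, i) => i + 1);
const p95 = percentile(latenciesMs, 95);

// With 10 replicas, a measured P95 of 300 ms, and a 200 ms target, the formula asks for 15.
const replicas = desiredReplicas(10, 300, 200);
```

Because a P95 reacts more slowly and more noisily than CPU, a latency target is usually paired with a CPU metric rather than used alone.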