We recently ran into a tricky issue during a Kubernetes deployment. We had an AWS Network Load Balancer (NLB) set up to handle traffic for an LDAP service, but we kept seeing intermittent traffic drops. Direct connections to the LDAP server worked fine, but some queries sent through the NLB would time out, which was causing reliability problems.
Digging Deeper
We started by using the nc -zv ldap.acme.local 636 command to check if we could connect to the LDAP server. Most of the time it worked, but sometimes it would time out.
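A single nc run only tells you about one attempt, so an intermittent problem is easier to see if the check is repeated. Here is a minimal sketch of that idea in Python, using only the standard library. The function names, attempt count, delay, and timeout are our own illustrative choices, not part of the original troubleshooting.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


def probe_repeatedly(host: str, port: int, attempts: int = 20,
                     delay: float = 0.5, timeout: float = 3.0) -> list[bool]:
    """Probe host:port several times and return one result per attempt."""
    results = []
    for _ in range(attempts):
        results.append(tcp_probe(host, port, timeout))
        time.sleep(delay)  # hypothetical pacing between attempts
    return results
```

Calling probe_repeatedly("ldap.acme.local", 636) and counting the False entries gives a rough failure rate for the path through the load balancer. Like nc -z, this only checks that the TCP connection opens; it does not send an LDAP query.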
After some digging, we found that the problem always happened in one specific subnet within an Availability Zone (AZ), with the IP range 10.20.30.XX. We decided to bypass the NLB and connect directly to the node port where the LDAP traffic was going. Connections succeeded on every node except one. This node was part of a different node group with a special label (test: true noschedule), which meant it wasn’t supposed to handle normal traffic.
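The node-by-node check described above can be sketched roughly as follows: try the node port on each node directly, then group the failing addresses by subnet to see whether they cluster in one place. The helper names are hypothetical, and the /24 prefix is an assumption based on the 10.20.30.XX range; the node IPs and node port have to come from your own cluster.

```python
import ipaddress
import socket
from collections import defaultdict


def can_connect(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def sweep_node_port(node_ips: list[str], node_port: int,
                    timeout: float = 3.0) -> dict[str, bool]:
    """Probe the same node port on every node, bypassing the load balancer."""
    return {ip: can_connect(ip, node_port, timeout) for ip in node_ips}


def failures_by_subnet(results: dict[str, bool],
                       prefix: int = 24) -> dict[str, list[str]]:
    """Group the IPs whose probe failed by their enclosing subnet.

    The default /24 prefix is an assumption; use your real subnet size.
    """
    grouped = defaultdict(list)
    for ip, ok in results.items():
        if not ok:
            net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            grouped[str(net)].append(ip)
    return dict(grouped)
```

Passing the output of sweep_node_port into failures_by_subnet makes it obvious when every failure sits in a single subnet, which is the pattern that pointed us at one node.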
Unraveling the Root Cause
Looking closer, we realized the NLB’s health checks were looking at the wrong port! This meant the node that wasn’t supposed to handle traffic was being marked as healthy by the NLB, even though it couldn’t actually deal with LDAP requests. So, the NLB was sending traffic to this node, and that’s why we were getting timeouts.
Basically, we had two problems:
- Traffic was being sent to a node that shouldn’t have been getting it.
- The health checks were using the wrong port.
Solution Implementation
Here’s how we fixed it:
- Fixed the Health Checks: We changed the health check settings on the NLB so they were checking the right port – the one where the LDAP service was actually running. This made the problematic node show up as unhealthy, so the NLB stopped sending traffic there.
- Labeled the Target Nodes: We added an annotation to our Kubernetes service configuration that selects target nodes by label:
service.beta.kubernetes.io/aws-load-balancer-target-node-labels: eks.amazonaws.com/nodegroup=eks-cluster-default-node
This made sure that only nodes in the right node group (eks-cluster-default-node) were included in the NLB’s target group, so the node with the special label was left out.
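To show where that annotation lives, here is a rough sketch that builds a Service manifest as a Python dictionary and prints it as JSON, which kubectl apply -f also accepts. Only the annotation, the node group name, and port 636 come from this post. The service name, the selector, and the target port are hypothetical, and the annotations that make the Service an NLB in the first place are left out because they depend on which load balancer controller you run.

```python
import json


def build_ldap_service(node_group: str) -> dict:
    """Build a Service manifest that limits NLB targets to one node group."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": "ldap",  # hypothetical service name
            "annotations": {
                # Annotation from the post: only nodes carrying this label
                # are registered in the NLB target group.
                "service.beta.kubernetes.io/aws-load-balancer-target-node-labels":
                    f"eks.amazonaws.com/nodegroup={node_group}",
            },
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": "ldap"},  # hypothetical pod selector
            "ports": [
                # Port 636 matches the nc check; targetPort is assumed equal.
                {"name": "ldaps", "protocol": "TCP", "port": 636, "targetPort": 636},
            ],
        },
    }


print(json.dumps(build_ldap_service("eks-cluster-default-node"), indent=2))
```

Keeping the node group name as a parameter makes it harder to end up with a stale value if node groups are renamed later.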
What We Learned
We learned that every node in the cluster is part of the node port setup, even if it has special labels. These labels only control where pods are placed, not which nodes can receive traffic.
The externalTrafficPolicy setting on the Kubernetes service was important too. It was set to Cluster at first, which meant all nodes could get traffic. We changed it to Local, so only nodes actually running the LDAP pods would look healthy. This made sure traffic only went to the right places.
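The difference between the two settings can be illustrated with a deliberately simplified model. It only encodes the behaviour described above: with Cluster every node is a valid target, while with Local only nodes that run at least one LDAP pod pass the health check. The function and node names are hypothetical, and this is a sketch of the idea rather than of how Kubernetes implements it.

```python
def nodes_passing_health_check(pods_per_node: dict[str, int],
                               policy: str) -> list[str]:
    """Return the nodes a load balancer would treat as healthy targets.

    pods_per_node maps a node name to the number of LDAP pods running on it.
    """
    if policy == "Cluster":
        # Any node can accept the traffic and forward it to a pod elsewhere.
        return sorted(pods_per_node)
    if policy == "Local":
        # Only nodes with a local pod report healthy, so only they get traffic.
        return sorted(node for node, count in pods_per_node.items() if count > 0)
    raise ValueError(f"unknown externalTrafficPolicy: {policy!r}")
```

With a node that runs no LDAP pods in the input, the Cluster result still includes it and the Local result does not, which is the change we relied on.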
Key Takeaways
When you’re working with Kubernetes and AWS Network Load Balancers, a few things can trip you up:
- If you’re getting timeouts when going through the load balancer but direct connections work, it might be a routing problem.
- If the problem only happens in certain subnets or Availability Zones, it could mean something’s not set up right with the nodes or the load balancer.
- Make sure your health checks are looking at the right ports!
- Remember that special labels on nodes only affect where pods go, not which nodes can get traffic.
- The externalTrafficPolicy setting is important for controlling how traffic is spread out.
By fixing the health checks, adding the right labels, and understanding how externalTrafficPolicy works, we got rid of our traffic routing problems and made our deployment stable and reliable.
About
With more than 20 years of experience in the technology sector, Mario has dedicated the last decade to offering consultancy, support, and strategic guidance to clients across various cloud platforms.
Mario Tolic