Fixing Tanzu Kubernetes Pod to External Services Connectivity Issues with NSX-T

Last month I got a call from a customer who was pulling their hair out over a networking issue. They had just deployed VMware Tanzu Kubernetes Grid on their vSphere with Tanzu environment. Everything looked good in the dashboards and all the pods were running, but the applications inside those pods could not reach external databases running on traditional VMs in the same datacenter.

The frustrating part was that some pods could reach external services perfectly fine, while others would simply time out, with no clear pattern. Let me walk you through how we figured this out and fixed it.

The Initial Problem

Here is what the customer setup looked like:

  • vSphere 8.0 with Tanzu enabled
  • NSX-T 4.1.2 for networking
  • Three Tanzu Kubernetes clusters running different microservices applications
  • External PostgreSQL database running on traditional VMs (non-Kubernetes)
  • External API services running on another set of VMs

The symptom was simple but annoying: when pods tried to connect to the PostgreSQL database at 192.168.50.25, the connection sometimes succeeded and sometimes did not. The application logs showed connection timeouts:

Error: could not connect to server: Connection timed out
Is the server running on host "192.168.50.25" and accepting TCP/IP connections on port 5432?

The weird part was that if you did a kubectl exec into the pod and ran ping 192.168.50.25, it worked fine. But the actual database connection on port 5432 would fail.

Initial Troubleshooting Steps

The first thing I did was check whether this was a DNS issue. I asked them to try connecting with the IP address directly instead of the hostname. Same problem, so DNS was not the culprit.

Next, I checked if the pods could reach other external services. I had them create a test pod and try different connections:

kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod, test different connections
ping 192.168.50.25
# This worked fine

curl -v telnet://192.168.50.25:5432
# This would timeout

curl -v telnet://192.168.50.30:8080
# This worked (different VM, different service)

So ping worked, but TCP connections to specific ports were failing. That told me this was likely a firewall issue, not a routing problem.

Checking NSX-T Distributed Firewall

Since they were using NSX-T, my next thought was to check the Distributed Firewall rules. I logged into NSX Manager and went to Security > Distributed Firewall.

What I found was interesting. They had a rule that allowed traffic from the "Tanzu-Workload-Network" security group to the "Database-Servers" security group. On paper this should have worked, but when I looked more closely at the security groups, I noticed something odd.

The "Tanzu-Workload-Network" security group was defined based on a specific NSX segment. But here is the thing about Tanzu Kubernetes pods. They do not sit directly on NSX segments. They use overlay networking within Kubernetes, and NSX sees them through SNAT (Source NAT) translation.

Understanding the Root Cause

Let me explain what was actually happening. When a pod in Tanzu Kubernetes tries to reach an external service:

  1. The pod sends traffic to its default gateway (the Kubernetes service network)
  2. Traffic goes through the Tanzu Kubernetes cluster's load balancer
  3. NSX-T performs SNAT to translate the pod IP to the Tier-0 gateway IP
  4. The traffic then goes to the destination VM

The problem was in step 3. The NSX-T firewall rules were checking the SOURCE IP of the traffic. After SNAT, the source IP was no longer from the "Tanzu-Workload-Network" segment. It was coming from the Tier-0 gateway IP pool.

This is why some connections worked and some did not. It depended on which Tier-0 gateway IP happened to be used during SNAT, and whether that IP was accidentally covered by other, broader firewall rules.
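
If you want to see this behavior for yourself, one simple check is to capture traffic on the destination VM while a pod attempts a connection. This is a quick sketch that assumes you have shell access to the database VM and tcpdump installed; the interface name (ens192 here) is an example and will differ in your environment.

# On the database VM, watch incoming SYN packets on the PostgreSQL port
sudo tcpdump -ni ens192 'tcp port 5432 and tcp[tcpflags] & tcp-syn != 0'

# In another terminal, trigger a connection attempt from a pod
kubectl run snat-test --image=nicolaka/netshoot -it --rm -- \
  curl -v --max-time 5 telnet://192.168.50.25:5432

The source address shown by tcpdump is what the Distributed Firewall actually evaluates. If it is the Tier-0/SNAT address rather than a pod IP, you are looking at exactly the situation described above.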

Solution Part 1: Fix the NSX-T Firewall Rules

Once we understood the problem, the fix became clear. We needed to modify the firewall rules to account for the SNAT translation.

Step 1: Identify the Tier-0 Gateway IP Pool

First, we needed to find out which IP range NSX-T was using for SNAT when Tanzu traffic goes out.

  1. Log into NSX Manager
  2. Go to Networking > Tier-0 Gateways
  3. Click on your Tier-0 gateway (in their case it was called "T0-Gateway-01")
  4. Go to Service Interfaces section
  5. Note down the IP addresses configured there

In their environment, the Tier-0 gateway was using 192.168.10.1 for the external interface.
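
If you prefer the API over clicking through the UI, the NSX Policy API exposes the same information. The snippet below is only a sketch: the manager address and credentials are placeholders, jq is optional, and object paths can vary slightly between NSX versions.

# Hypothetical NSX Manager address and credentials
NSX_MGR=https://nsx.example.local
NSX_PASS='changeme'

# List Tier-0 gateways and their IDs
curl -sk -u "admin:$NSX_PASS" "$NSX_MGR/policy/api/v1/infra/tier-0s" | jq '.results[].id'

# Find the locale-services ID under the Tier-0 (often a single entry)
curl -sk -u "admin:$NSX_PASS" \
  "$NSX_MGR/policy/api/v1/infra/tier-0s/T0-Gateway-01/locale-services" | jq '.results[].id'

# List the interfaces and their subnets, substituting the locale-services ID from above
curl -sk -u "admin:$NSX_PASS" \
  "$NSX_MGR/policy/api/v1/infra/tier-0s/T0-Gateway-01/locale-services/<locale-services-id>/interfaces" \
  | jq '.results[] | {display_name, subnets}'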

Step 2: Create a New Security Group for Tanzu Traffic

Instead of using the segment-based security group, we created a new one specifically for Tanzu traffic after SNAT:

  1. Go to Inventory > Groups
  2. Click "Add Group"
  3. Name: "Tanzu-K8s-External-Traffic"
  4. Under Membership Criteria, select "IP Address"
  5. Add the IP address: 192.168.10.1
  6. Save
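
If you manage NSX configuration as code, the same group can be created with a single Policy API call, reusing the NSX_MGR and NSX_PASS placeholders from the earlier snippet. This is a sketch; the final path segment becomes the group's ID, and the group lands in the default domain.

# Create (or update) an IP-based group via the NSX Policy API
curl -sk -u "admin:$NSX_PASS" -X PATCH \
  "$NSX_MGR/policy/api/v1/infra/domains/default/groups/Tanzu-K8s-External-Traffic" \
  -H 'Content-Type: application/json' \
  -d '{
        "display_name": "Tanzu-K8s-External-Traffic",
        "expression": [
          {
            "resource_type": "IPAddressExpression",
            "ip_addresses": ["192.168.10.1"]
          }
        ]
      }'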

Step 3: Update the Distributed Firewall Rules

Now we updated the firewall rule:

  1. Go to Security > Distributed Firewall
  2. Find the rule that allows access to Database Servers
  3. Edit the rule
  4. In the "Source" field, add the new "Tanzu-K8s-External-Traffic" group we just created
  5. Keep the original "Tanzu-Workload-Network" group as well (for direct VM-to-VM traffic if any)
  6. Make sure the rule is set to "Allow"
  7. Publish the changes

After publishing these changes, we tested again from the pod:

kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

curl -v telnet://192.168.50.25:5432
# Now it worked!

Success! But we were not done yet.

Solution Part 2: Fix Tanzu Network Policies

While testing, we found another issue. Some namespaces in the Tanzu cluster had NetworkPolicy objects that blocked egress traffic by default. That is actually a good security practice, but the allow rules that should accompany it had never been added.

We checked the existing network policies:

kubectl get networkpolicies --all-namespaces

In the "production" namespace, they had a very restrictive policy:

kubectl get networkpolicy -n production default-deny-egress -o yaml

The output showed:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress: []

This policy was blocking ALL egress traffic from pods in the production namespace. We needed to add specific rules to allow traffic to the database and external APIs.

Create a New NetworkPolicy to Allow Database Access

We created a new policy file called allow-database-access.yaml:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-database-access
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 192.168.50.0/24
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: UDP
      port: 53

Let me explain what this policy does:

  • It applies to pods with label app: backend-api in the production namespace
  • It allows egress traffic to the 192.168.50.0/24 subnet (where the database lives) on port 5432
  • It also allows DNS traffic (UDP port 53) to anywhere, because pods need to resolve domain names

Apply this policy:

kubectl apply -f allow-database-access.yaml

Verify it was created:

kubectl get networkpolicy -n production

Now test from a pod with the app: backend-api label:

kubectl run test-backend --image=nicolaka/netshoot -n production --labels="app=backend-api" -it --rm -- /bin/bash

# Inside the pod
curl -v telnet://192.168.50.25:5432
# Should work now

# Try from a pod without the label
kubectl run test-other --image=nicolaka/netshoot -n production -it --rm -- /bin/bash
curl -v telnet://192.168.50.25:5432
# This should still be blocked (as intended)
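
One refinement worth mentioning: instead of opening DNS to 0.0.0.0/0, you can scope it to the cluster DNS pods. The sketch below assumes CoreDNS runs in kube-system and that the namespace carries the standard kubernetes.io/metadata.name label (present on recent Kubernetes versions). If you apply something like this, you can drop the broad DNS rule from the earlier policy.

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-to-kube-system
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF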

Solution Part 3: Configure NSX-T Container Network Interface (CNI)

While we were fixing things, I also noticed that their NSX container networking configuration was not optimal. vSphere with Tanzu integrates with NSX-T through the NSX Container Plugin (NCP), and a few of its settings can cause connectivity issues if they are not configured properly.

Check the Current CNI Configuration

SSH into one of the Tanzu Kubernetes control plane nodes (you will need to enable SSH in the cluster configuration first).

Check the NSX CNI configuration:

cat /etc/nsx-ujo/ncp.ini

Look for these specific settings:

[nsx_v3]
policy_nsxapi = True
single_tier_topology = True

[coe]
cluster = your-cluster-name
enable_snat = True

In their case, enable_snat was set to True, which is correct. But I have seen cases where this gets set to False, and that causes all sorts of connectivity issues.
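
A quick way to confirm the setting without scrolling through the whole file (same ncp.ini path as above):

# Show the SNAT and policy API settings from the NCP configuration
grep -E 'enable_snat|policy_nsxapi' /etc/nsx-ujo/ncp.ini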

If you need to change this setting, you cannot do it directly on the control plane node. You need to modify it through the Tanzu Kubernetes cluster spec.

Get your cluster configuration:

kubectl get tanzukubernetescluster -n your-namespace

Edit the cluster:

kubectl edit tanzukubernetescluster your-cluster-name -n your-namespace

Look for the network section and ensure it looks like this:

spec:
  topology:
    controlPlane:
      ...
    workers:
      ...
  settings:
    network:
      cni:
        name: antrea
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/12
      pods:
        cidrBlocks:
        - 10.244.0.0/16

Save and exit. The cluster will reconcile the changes automatically.
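
You can watch the reconciliation happen rather than guessing when it is done. A small sketch, using the same placeholder names as the commands above:

# From the Supervisor context, watch the cluster object reconcile
kubectl get tanzukubernetescluster your-cluster-name -n your-namespace -w

# Once it settles, switch to the workload cluster context and confirm the nodes are Ready
kubectl get nodes -o wide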

Solution Part 4: Troubleshooting with NSX Intelligence

After making all these changes, we wanted to verify that traffic was flowing correctly. NSX-T has a great feature called NSX Intelligence that helped us visualize the traffic flows.

Enable NSX Intelligence (if not already enabled)

  1. Log into NSX Manager
  2. Go to System > NSX Intelligence
  3. Click "Enable NSX Intelligence"
  4. Wait for it to be enabled (takes about 5-10 minutes)

View Traffic Flows

  1. Go to Plan & Troubleshoot > NSX Intelligence
  2. In the search box, enter the source IP (the Tier-0 gateway IP: 192.168.10.1)
  3. Click on "Flows"
  4. You should see traffic flows from the Tier-0 IP to your database server IP
  5. Click on any flow to see detailed information

This visualization helped us confirm that traffic was now flowing properly through NSX-T from the Tanzu pods to the external database.

Additional Troubleshooting Commands

Here are some useful commands we used during troubleshooting that might help you:

Check NSX-T Container Plugin Logs

SSH to the Tanzu Kubernetes control plane node and check NCP logs:

tail -f /var/log/nsx-ujo/ncp.log

Look for any errors related to connectivity or SNAT.
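
To cut through the noise, filtering the log for errors and SNAT-related messages is usually enough (exact message formats vary between NCP versions):

# Show recent errors and anything mentioning SNAT or firewall in the NCP log
grep -iE 'error|snat|firewall' /var/log/nsx-ujo/ncp.log | tail -n 50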

Check Pod Network Configuration

From inside a pod, check its network configuration:

kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod
ip addr show
ip route show
iptables -L -t nat

The ip route show command will show you the default gateway the pod is using. This should point to the NSX-T virtual network.

Test Connectivity from Different Points

Test from the Tanzu Kubernetes node itself (not from inside a pod):

# SSH to TKG node
curl -v telnet://192.168.50.25:5432

If this works but pod-to-database does not work, then the issue is definitely in the Kubernetes networking layer (NetworkPolicy or CNI configuration).

If even the node cannot reach the database, then the issue is in NSX-T routing or firewall.
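
To make these spot checks repeatable, a small loop with netcat (included in the netshoot image) covers every endpoint in one pass. The endpoint list below just reuses the IPs and ports from this post; extend it with your own services.

# Run inside a netshoot pod: test TCP reachability of each endpoint
for endpoint in 192.168.50.25:5432 192.168.50.30:8080; do
  host=${endpoint%%:*}
  port=${endpoint##*:}
  if nc -z -w 3 "$host" "$port"; then
    echo "OK      $host:$port"
  else
    echo "BLOCKED $host:$port"
  fi
done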

Verification and Testing

After all these fixes, we created a comprehensive test to make sure everything was working:

Deploy a Test Application

We deployed a simple test pod, based on the PostgreSQL client image, that connects to the database:

apiVersion: v1
kind: Pod
metadata:
  name: db-test-app
  namespace: production
  labels:
    app: backend-api
spec:
  containers:
  - name: postgres-client
    image: postgres:14
    command:
    - sleep
    - "3600"
    env:
    - name: PGHOST
      value: "192.168.50.25"
    - name: PGPORT
      value: "5432"
    - name: PGUSER
      value: "appuser"
    - name: PGPASSWORD
      value: "yourpassword"
    - name: PGDATABASE
      value: "production_db"

Apply it:

kubectl apply -f db-test-app.yaml

Test the connection:

kubectl exec -it db-test-app -n production -- psql -c "SELECT version();"

If this returns the PostgreSQL version information, then the connectivity is working perfectly.
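
Since the postgres:14 image also ships pg_isready, there is an even lighter check that does not need credentials:

# Reachability check against the database without authenticating
kubectl exec -it db-test-app -n production -- pg_isready -h 192.168.50.25 -p 5432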

Lessons Learned

After two days of troubleshooting this issue, here are the key things I learned:

  1. NSX-T SNAT changes the source IP: When creating firewall rules for Tanzu workloads accessing external services, remember that the source IP will be the Tier-0 gateway IP after SNAT, not the pod IP or node IP.
  2. NetworkPolicies and NSX-T DFW work together: Both layers need to allow the traffic. Even if NSX-T allows it, a restrictive NetworkPolicy in Kubernetes can block it, and vice versa.
  3. Test from multiple points: When troubleshooting, test from the pod, from the node, and from a regular VM. This helps you isolate where the problem is.
  4. NSX Intelligence is your friend: Use NSX Intelligence to visualize traffic flows. It saves hours of guessing where traffic is getting blocked.
  5. Document your IP ranges: Keep a clear document of what IP ranges are used for what purpose. In our case, knowing the Tier-0 gateway IPs was crucial for fixing the firewall rules.
  6. Start with less restrictive policies and tighten them: When first deploying Tanzu with NSX-T, start with more permissive firewall rules to get connectivity working, then gradually tighten them for security. Trying to get everything perfect from day one often leads to connectivity issues that are hard to troubleshoot.

Common Mistakes to Avoid

Based on this experience and similar issues I have seen with other customers, here are common mistakes people make:

  • Using security groups based on VM segments for Tanzu traffic: This does not work because pods are not VMs on segments. They are containers with overlay networking.
  • Forgetting about DNS: If you create a very restrictive NetworkPolicy, do not forget to allow DNS (UDP port 53). Otherwise pods cannot resolve any domain names.
  • Not checking both ingress and egress: Sometimes the problem is not that your pod cannot send traffic out, but that the response cannot come back in. Check both directions.
  • Assuming that if ping works, everything works: ping uses ICMP, which is a different protocol from TCP. Just because ping works does not mean your application traffic will. Always test the actual port your application uses.
  • Not using labels consistently: NetworkPolicies use label selectors. If your pods do not have the right labels, the policies will not apply to them correctly.
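
Two quick commands catch most of the label and selector mismatches mentioned in the last point (substitute your own namespace and policy names):

# See which labels your pods actually carry
kubectl get pods -n production --show-labels

# Confirm which pod selector and rules a policy actually applies
kubectl describe networkpolicy allow-database-access -n production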

Final Configuration Summary

For anyone facing similar issues, here is a summary of what a working configuration should look like:

NSX-T Side:

  • Security group that includes the Tier-0 gateway IP(s) used for SNAT
  • Distributed Firewall rule allowing traffic from that security group to your external services
  • NSX Intelligence enabled for troubleshooting

Tanzu Kubernetes Side:

  • NetworkPolicy allowing egress to the specific IP ranges and ports you need
  • NetworkPolicy allowing DNS (UDP 53) for name resolution
  • Proper pod labels so NetworkPolicies apply correctly
  • NSX Container Plugin (NCP) configured with enable_snat = True in ncp.ini

Testing:

  • Test with actual application ports, not just ping
  • Test from inside pods, not just from nodes
  • Use NSX Intelligence to verify traffic flows
  • Check logs on both NSX-T and Tanzu sides

Conclusion

Networking in Tanzu Kubernetes with NSX-T can be complex because you have multiple layers of networking and security working together. When things go wrong, the key is to understand how traffic flows through the entire stack, from the pod to NSX-T to the destination.

The most important thing to remember is that NSX-T performs SNAT for Tanzu traffic going to external destinations, so your firewall rules need to account for the post-SNAT IP addresses, not the pod IPs.

I hope this helps anyone struggling with similar connectivity issues between Tanzu Kubernetes pods and external services through NSX-T. If you are still facing issues after trying these steps, double-check your NSX-T routing configuration and make sure the Tier-0 gateway is properly configured with external connectivity.
