Last month I got a call from a customer who was pulling their hair out over a networking issue. They had just deployed VMware Tanzu Kubernetes Grid on their vSphere with Tanzu environment. Everything looked good in the dashboards and all pods were running, but the applications inside those pods could not reach external databases running on traditional VMs in the same datacenter.
The frustrating part was that some pods could reach external services perfectly fine, while others would just time out. There was no clear pattern. Let me tell you how we figured this out and fixed it.
The Initial Problem
Here is what the customer setup looked like:
- vSphere 8.0 with Tanzu enabled
- NSX-T 4.1.2 for networking
- Three Tanzu Kubernetes clusters running different microservices applications
- External PostgreSQL database running on traditional VMs (non-Kubernetes)
- External API services running on another set of VMs
The symptom was simple but annoying. When pods tried to connect to the PostgreSQL database at IP 192.168.50.25, sometimes it worked, sometimes it did not. The application logs showed connection timeouts:
Error: could not connect to server: Connection timed out
Is the server running on host "192.168.50.25" and accepting TCP/IP connections on port 5432?
The weird part was that if you did a kubectl exec into the pod and ran ping 192.168.50.25, it worked fine. But the actual database connection on port 5432 would fail.
Initial Troubleshooting Steps
The first thing I did was check whether this was a DNS issue. I asked them to try connecting with the IP address directly instead of the hostname. Same problem, so DNS was not the culprit.
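If you want to script that comparison instead of testing by hand, here is a minimal sketch using the same netshoot image we use later in this post. The hostname db01.example.local is a placeholder for whatever DNS name the application actually uses:

kubectl run dns-check --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod:
nslookup db01.example.local           # placeholder hostname - does the name resolve at all?
curl -v telnet://192.168.50.25:5432   # bypass DNS and test the TCP port directly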
Next, I checked if the pods could reach other external services. I had them create a test pod and try different connections:
kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod, test different connections
ping 192.168.50.25                    # This worked fine
curl -v telnet://192.168.50.25:5432   # This would timeout
curl -v telnet://192.168.50.30:8080   # This worked (different VM, different service)
So ping worked, but TCP connections to specific ports were failing. That told me this was likely a firewall issue, not routing.
Checking NSX-T Distributed Firewall
Since they were using NSX-T, my next thought was to check the Distributed Firewall rules. I logged into NSX Manager and went to Security > Distributed Firewall.
What I found was interesting. They had a rule that allowed traffic from "Tanzu-Workload-Network" to "Database-Servers" security group. On paper, this should have worked. But when I looked closer at the security groups, I noticed something odd.
The "Tanzu-Workload-Network" security group was defined based on a specific NSX segment. But here is the thing about Tanzu Kubernetes pods. They do not sit directly on NSX segments. They use overlay networking within Kubernetes, and NSX sees them through SNAT (Source NAT) translation.
Understanding the Root Cause
Let me explain what was actually happening. When a pod in Tanzu Kubernetes tries to reach an external service:
- The pod sends traffic to its default gateway on the cluster's pod overlay network
- Traffic goes through the Tanzu Kubernetes cluster's load balancer
- NSX-T performs SNAT to translate the pod IP to the Tier-0 gateway IP
- The traffic then goes to the destination VM
The problem was in step 3. The NSX-T firewall rules were checking the SOURCE IP of the traffic. After SNAT, the source IP was no longer from the "Tanzu-Workload-Network" segment. It was coming from the Tier-0 gateway IP pool.
This is why some connections worked and some did not. It depended on which Tier-0 gateway IP got assigned during SNAT, and whether that IP was accidentally covered by other broader firewall rules.
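One quick way to confirm this behaviour yourself is to watch the traffic arrive on the database VM while you connect from a pod. This is just a sketch, assuming you have shell access to the database VM and tcpdump installed there; the interface name eth0 is a placeholder:

# On the database VM (placeholder interface name), watch incoming connection attempts to PostgreSQL
sudo tcpdump -ni eth0 'tcp port 5432 and tcp[tcpflags] & tcp-syn != 0'

# From a pod in the Tanzu cluster, attempt a connection as before
kubectl run snat-check --image=nicolaka/netshoot -it --rm -- /bin/bash
curl -v telnet://192.168.50.25:5432

# The source address tcpdump prints is the post-SNAT egress IP, not the pod IP -
# and that is the address the NSX-T firewall rules actually evaluate.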
Solution Part 1: Fix the NSX-T Firewall Rules
Once we understood the problem, the fix became clear. We needed to modify the firewall rules to account for the SNAT translation.
Step 1: Identify the Tier-0 Gateway IP Pool
First, we needed to find out which IP range NSX-T was using for SNAT when Tanzu traffic goes out.
- Log into NSX Manager
- Go to Networking > Tier-0 Gateways
- Click on your Tier-0 gateway (in their case it was called "T0-Gateway-01")
- Go to Service Interfaces section
- Note down the IP addresses configured there
In their environment, the Tier-0 gateway was using 192.168.10.1 for the external interface.
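If you prefer the API over clicking through the UI, the NSX Policy API exposes the same information. A rough sketch, assuming basic-auth admin credentials; nsx-manager.example.local and PASSWORD are placeholders, the path uses the gateway's ID (which may differ from its display name), and the locale-services ID (often "default") can vary per environment:

# List Tier-0 gateways
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/tier-0s

# List the interfaces (and their IP addresses) on a specific Tier-0
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/tier-0s/T0-Gateway-01/locale-services/default/interfaces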
Step 2: Create a New Security Group for Tanzu Traffic
Instead of using the segment-based security group, we created a new one specifically for Tanzu traffic after SNAT:
- Go to Inventory > Groups
- Click "Add Group"
- Name: "Tanzu-K8s-External-Traffic"
- Under Membership Criteria, select "IP Address"
- Add the IP address: 192.168.10.1
- Save
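The same group can also be created through the Policy API if you prefer to keep this in a script. A minimal sketch, assuming the group lives in the default domain and reusing the group name as its ID; credentials and the manager hostname are placeholders:

curl -ks -u 'admin:PASSWORD' -X PATCH \
  -H 'Content-Type: application/json' \
  https://nsx-manager.example.local/policy/api/v1/infra/domains/default/groups/Tanzu-K8s-External-Traffic \
  -d '{
        "display_name": "Tanzu-K8s-External-Traffic",
        "expression": [
          {
            "resource_type": "IPAddressExpression",
            "ip_addresses": ["192.168.10.1"]
          }
        ]
      }'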
Step 3: Update the Distributed Firewall Rules
Now we updated the firewall rule:
- Go to Security > Distributed Firewall
- Find the rule that allows access to Database Servers
- Edit the rule
- In the "Source" field, add the new "Tanzu-K8s-External-Traffic" group we just created
- Keep the original "Tanzu-Workload-Network" group as well (for direct VM-to-VM traffic if any)
- Make sure the rule is set to "Allow"
- Publish the changes
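To double-check what actually got published, you can read the distributed firewall policies back through the same Policy API. A sketch, assuming the rule lives in the default domain; the security-policy ID shown here is a placeholder for whatever your environment uses:

# List DFW security policies and their rules in the default domain
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/domains/default/security-policies

# Read one policy (placeholder ID) and confirm the new group shows up in the rule's sources
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/domains/default/security-policies/Database-Access-Policy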
After publishing these changes, we tested again from the pod:
kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod
curl -v telnet://192.168.50.25:5432   # Now it worked!
Success! But we were not done yet.
Solution Part 2: Fix Tanzu Network Policies
While testing, we found another issue. Some namespaces in the Tanzu cluster had NetworkPolicy objects that were blocking egress traffic by default. This is actually a good security practice, but it was not configured properly.
We checked the existing network policies:
kubectl get networkpolicies --all-namespaces
In the "production" namespace, they had a very restrictive policy:
kubectl get networkpolicy -n production default-deny-egress -o yaml
The output showed:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress: []
This policy was blocking ALL egress traffic from pods in the production namespace. We needed to add specific rules to allow traffic to the database and external APIs.
Create a New NetworkPolicy to Allow Database Access
We created a new policy file called allow-database-access.yaml:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-database-access
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 192.168.50.0/24
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: UDP
      port: 53
Let me explain what this policy does:
- It applies to pods with the label app: backend-api in the production namespace
- It allows egress traffic to the 192.168.50.0/24 subnet (where the database lives) on port 5432
- It also allows DNS traffic (UDP port 53) to anywhere, because pods need to resolve domain names
Apply this policy:
kubectl apply -f allow-database-access.yaml
Verify it was created:
kubectl get networkpolicy -n production
Now test from a pod with the app: backend-api label:
kubectl run test-backend --image=nicolaka/netshoot -n production --labels="app=backend-api" -it --rm -- /bin/bash

# Inside the pod
curl -v telnet://192.168.50.25:5432   # Should work now

# Try from a pod without the label
kubectl run test-other --image=nicolaka/netshoot -n production -it --rm -- /bin/bash
curl -v telnet://192.168.50.25:5432   # This should still be blocked (as intended)
Solution Part 3: Configure NSX-T Container Network Interface (CNI)
While we were fixing things, I also noticed their NSX-T CNI configuration was not optimal. By default, vSphere with Tanzu uses NSX-T CNI, but there are some settings that can cause connectivity issues if not configured properly.
Check the Current CNI Configuration
SSH into one of the Tanzu Kubernetes control plane nodes (you will need to enable SSH in the cluster configuration first).
Check the NSX CNI configuration:
cat /etc/nsx-ujo/ncp.ini
Look for these specific settings:
[nsx_v3]
policy_nsxapi = True
single_tier_topology = True

[coe]
cluster = your-cluster-name
enable_snat = True
In their case, enable_snat was set to True, which is correct. But I have seen cases where this gets set to False, and that causes all sorts of connectivity issues.
If you need to change this setting, you cannot do it directly on the control plane node. You need to modify it through the Tanzu Kubernetes cluster spec.
Get your cluster configuration:
kubectl get tanzukubernetescluster -n your-namespace
Edit the cluster:
kubectl edit tanzukubernetescluster your-cluster-name -n your-namespace
Look for the network section and ensure it looks like this:
spec:
  topology:
    controlPlane:
      ...
    workers:
      ...
  settings:
    network:
      cni:
        name: antrea
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/12
      pods:
        cidrBlocks:
        - 10.244.0.0/16
Save and exit. The cluster will reconcile the changes automatically.
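If you want to watch the reconciliation happen rather than just trusting it, you can keep an eye on the cluster object with standard kubectl options:

# Watch the cluster object until it reports ready again
kubectl get tanzukubernetescluster your-cluster-name -n your-namespace -w

# Or inspect the detailed conditions and node status
kubectl describe tanzukubernetescluster your-cluster-name -n your-namespace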
Solution Part 4: Troubleshooting with NSX Intelligence
After making all these changes, we wanted to verify that traffic was flowing correctly. NSX-T has a great feature called NSX Intelligence that helped us visualize the traffic flows.
Enable NSX Intelligence (if not already enabled)
- Log into NSX Manager
- Go to System > NSX Intelligence
- Click "Enable NSX Intelligence"
- Wait for it to be enabled (takes about 5-10 minutes)
View Traffic Flows
- Go to Plan & Troubleshoot > NSX Intelligence
- In the search box, enter the source IP (the Tier-0 gateway IP: 192.168.10.1)
- Click on "Flows"
- You should see traffic flows from the Tier-0 IP to your database server IP
- Click on any flow to see detailed information
This visualization helped us confirm that traffic was now flowing properly through NSX-T from the Tanzu pods to the external database.
Additional Troubleshooting Commands
Here are some useful commands we used during troubleshooting that might help you:
Check NSX-T Container Plugin Logs
SSH to the Tanzu Kubernetes control plane node and check NCP logs:
tail -f /var/log/nsx-ujo/ncp.log
Look for any errors related to connectivity or SNAT.
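Since ncp.log can be very chatty, a couple of quick filters over the same log path help narrow things down:

# Show only recent errors
grep -i error /var/log/nsx-ujo/ncp.log | tail -n 50

# Look specifically at SNAT-related events
grep -i snat /var/log/nsx-ujo/ncp.log | tail -n 50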
Check Pod Network Configuration
From inside a pod, check its network configuration:
kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod
ip addr show
ip route show
iptables -L -t nat
The ip route show command will show you the default gateway the pod is using. This should point to the NSX-T virtual network.
Test Connectivity from Different Points
Test from the Tanzu Kubernetes node itself (not from inside a pod):
# SSH to the TKG node
curl -v telnet://192.168.50.25:5432
If this works but pod-to-database does not work, then the issue is definitely in the Kubernetes networking layer (NetworkPolicy or CNI configuration).
If even the node cannot reach the database, then the issue is in NSX-T routing or firewall.
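To make this kind of comparison repeatable, we ended up looping over the endpoints we cared about. A minimal sketch you can run from any of those vantage points (pod, node, or plain VM), assuming nc is available there; the endpoint list is just an example:

# Test a list of host:port pairs with a 5 second timeout each
for endpoint in 192.168.50.25:5432 192.168.50.30:8080; do
  host=${endpoint%%:*}
  port=${endpoint##*:}
  if nc -vz -w 5 "$host" "$port" >/dev/null 2>&1; then
    echo "OK      $endpoint"
  else
    echo "BLOCKED $endpoint"
  fi
done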
Verification and Testing
After all these fixes, we created a comprehensive test to make sure everything was working:
Deploy a Test Application
We deployed a simple Python application that connects to the PostgreSQL database:
apiVersion: v1
kind: Pod
metadata:
  name: db-test-app
  namespace: production
  labels:
    app: backend-api
spec:
  containers:
  - name: postgres-client
    image: postgres:14
    command:
    - sleep
    - "3600"
    env:
    - name: PGHOST
      value: "192.168.50.25"
    - name: PGPORT
      value: "5432"
    - name: PGUSER
      value: "appuser"
    - name: PGPASSWORD
      value: "yourpassword"
    - name: PGDATABASE
      value: "production_db"
Apply it:
kubectl apply -f db-test-app.yaml
Test the connection:
kubectl exec -it db-test-app -n production -- psql -c "SELECT version();"
If this returns the PostgreSQL version information, then the connectivity is working perfectly.
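Because the original symptom was intermittent, we also ran the check in a loop for a while to make sure it was not just working by luck. A small sketch using pg_isready, which ships in the postgres:14 image and picks up the PGHOST/PGPORT environment variables already set on the pod:

# Run 20 consecutive connection checks, one every 5 seconds
for i in $(seq 1 20); do
  kubectl exec db-test-app -n production -- pg_isready && echo "attempt $i ok"
  sleep 5
done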
Lessons Learned
After spending two days troubleshooting this issue, here are the key things I learned:
- NSX-T SNAT changes the source IP: When creating firewall rules for Tanzu workloads accessing external services, remember that the source IP will be the Tier-0 gateway IP after SNAT, not the pod IP or node IP.
- NetworkPolicies and NSX-T DFW work together: Both layers need to allow the traffic. Even if NSX-T allows it, a restrictive NetworkPolicy in Kubernetes can block it, and vice versa.
- Test from multiple points: When troubleshooting, test from the pod, from the node, and from a regular VM. This helps you isolate where the problem is.
- NSX Intelligence is your friend: Use NSX Intelligence to visualize traffic flows. It saves hours of guessing where traffic is getting blocked.
- Document your IP ranges: Keep a clear document of what IP ranges are used for what purpose. In our case, knowing the Tier-0 gateway IPs was crucial for fixing the firewall rules.
- Start with less restrictive policies and tighten them: When first deploying Tanzu with NSX-T, start with more permissive firewall rules to get connectivity working, then gradually tighten them for security. Trying to get everything perfect from day one often leads to connectivity issues that are hard to troubleshoot.
Common Mistakes to Avoid
Based on this experience and similar issues I have seen with other customers, here are common mistakes people make:
- Using security groups based on VM segments for Tanzu traffic: This does not work because pods are not VMs on segments. They are containers with overlay networking.
- Forgetting about DNS: If you create a very restrictive NetworkPolicy, do not forget to allow DNS (UDP port 53). Otherwise pods cannot resolve any domain names.
- Not checking both ingress and egress: Sometimes the problem is not that your pod cannot send traffic out, but that the response cannot come back in. Check both directions.
- Assuming ping works means everything works: ICMP (ping) uses a different protocol than TCP. Just because ping works does not mean your application traffic will work. Always test the actual port your application uses.
- Not using labels consistently: NetworkPolicies use label selectors. If your pods do not have the right labels, the policies will not apply to them correctly.
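A quick way to catch the label mistake from that last point is to compare the labels your pods actually carry against the selectors in the policy:

# Show the labels on every pod in the namespace
kubectl get pods -n production --show-labels

# Show which pods a specific policy selects (check the PodSelector line)
kubectl describe networkpolicy allow-database-access -n production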
Final Configuration Summary
For anyone facing similar issues, here is a summary of what a working configuration should look like:
NSX-T Side:
- Security group that includes the Tier-0 gateway IP(s) used for SNAT
- Distributed Firewall rule allowing traffic from that security group to your external services
- NSX Intelligence enabled for troubleshooting
Tanzu Kubernetes Side:
- NetworkPolicy allowing egress to the specific IP ranges and ports you need
- NetworkPolicy allowing DNS (UDP 53) for name resolution
- Proper pod labels so NetworkPolicies apply correctly
- NSX CNI with enable_snat = True in the configuration
Testing:
- Test with actual application ports, not just ping
- Test from inside pods, not just from nodes
- Use NSX Intelligence to verify traffic flows
- Check logs on both NSX-T and Tanzu sides
Conclusion
Networking in Tanzu Kubernetes with NSX-T can be complex because you have multiple layers of networking and security working together. When things go wrong, the key is to understand how traffic flows through the entire stack, from the pod to NSX-T to the destination.
The most important thing to remember is that NSX-T performs SNAT for Tanzu traffic going to external destinations, so your firewall rules need to account for the post-SNAT IP addresses, not the pod IPs.
I hope this helps anyone struggling with similar connectivity issues between Tanzu Kubernetes pods and external services through NSX-T. If you are still facing issues after trying these steps, double-check your NSX-T routing configuration and make sure the Tier-0 gateway is properly configured with external connectivity.
