Last month I got a call from a customer who was pulling their hair out over a networking issue. They had just deployed VMware Tanzu Kubernetes Grid on their vSphere with Tanzu environment. Everything looked good in the dashboards and all pods were running, but the applications inside those pods could not reach external databases running on traditional VMs in the same datacenter.
The frustrating part was that some pods could reach external services perfectly fine, while others would just time out. There was no clear pattern. Let me tell you how we figured this out and fixed it.
The Initial Problem
Here is what the customer setup looked like:
- vSphere 8.0 with Tanzu enabled
- NSX-T 4.1.2 for networking
- Three Tanzu Kubernetes clusters running different microservices applications
- External PostgreSQL database running on traditional VMs (non-Kubernetes)
- External API services running on another set of VMs
The symptom was simple but annoying. When pods tried to connect to the PostgreSQL database at IP 192.168.50.25, sometimes it worked, sometimes it did not. The application logs showed connection timeouts:
Error: could not connect to server: Connection timed out
Is the server running on host "192.168.50.25" and accepting TCP/IP connections on port 5432?
The weird part was that if you did a kubectl exec into the pod and ran ping 192.168.50.25, it worked fine. But the actual database connection on port 5432 would fail.
Initial Troubleshooting Steps
The first thing I did was check whether this was a DNS issue. I asked them to try connecting with the IP address directly instead of the hostname. Same problem, so DNS was not the culprit.
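If you want to script that comparison instead of testing by hand, here is a minimal sketch using the same netshoot image we use later in this post. The hostname db01.example.local is a placeholder for whatever DNS name the application actually uses:

kubectl run dns-check --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod:
nslookup db01.example.local           # placeholder hostname - does the name resolve at all?
curl -v telnet://192.168.50.25:5432   # bypass DNS and test the TCP port directly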
Next, I checked if the pods could reach other external services. I had them create a test pod and try different connections:
kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod, test different connections
ping 192.168.50.25                    # This worked fine
curl -v telnet://192.168.50.25:5432   # This would timeout
curl -v telnet://192.168.50.30:8080   # This worked (different VM, different service)
So ping worked, but TCP connections to specific ports were failing. That told me this was likely a firewall issue, not routing.
Checking NSX-T Distributed Firewall
Since they were using NSX-T, my next thought was to check the Distributed Firewall rules. I logged into NSX Manager and went to Security > Distributed Firewall.
What I found was interesting. They had a rule that allowed traffic from "Tanzu-Workload-Network" to "Database-Servers" security group. On paper, this should have worked. But when I looked closer at the security groups, I noticed something odd.
The "Tanzu-Workload-Network" security group was defined based on a specific NSX segment. But here is the thing about Tanzu Kubernetes pods. They do not sit directly on NSX segments. They use overlay networking within Kubernetes, and NSX sees them through SNAT (Source NAT) translation.
Understanding the Root Cause
Let me explain what was actually happening. When a pod in Tanzu Kubernetes tries to reach an external service:
- The pod sends traffic to its default gateway on the cluster's pod overlay network
- Traffic goes through the Tanzu Kubernetes cluster's load balancer
- NSX-T performs SNAT to translate the pod IP to the Tier-0 gateway IP
- The traffic then goes to the destination VM
The problem was in step 3. The NSX-T firewall rules were checking the SOURCE IP of the traffic. After SNAT, the source IP was no longer from the "Tanzu-Workload-Network" segment. It was coming from the Tier-0 gateway IP pool.
This is why some connections worked and some did not. It depended on which Tier-0 gateway IP got assigned during SNAT, and whether that IP was accidentally covered by other broader firewall rules.
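One quick way to confirm this behaviour yourself is to watch the traffic arrive on the database VM while you connect from a pod. This is just a sketch, assuming you have shell access to the database VM and tcpdump installed there; the interface name eth0 is a placeholder:

# On the database VM (placeholder interface name), watch incoming connection attempts to PostgreSQL
sudo tcpdump -ni eth0 'tcp port 5432 and tcp[tcpflags] & tcp-syn != 0'

# From a pod in the Tanzu cluster, attempt a connection as before
kubectl run snat-check --image=nicolaka/netshoot -it --rm -- /bin/bash
curl -v telnet://192.168.50.25:5432

# The source address tcpdump prints is the post-SNAT egress IP, not the pod IP -
# and that is the address the NSX-T firewall rules actually evaluate.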
Solution Part 1: Fix the NSX-T Firewall Rules
Once we understood the problem, the fix became clear. We needed to modify the firewall rules to account for the SNAT translation.
Step 1: Identify the Tier-0 Gateway IP Pool
First, we needed to find out which IP range NSX-T was using for SNAT when Tanzu traffic goes out.
- Log into NSX Manager
- Go to Networking > Tier-0 Gateways
- Click on your Tier-0 gateway (in their case it was called "T0-Gateway-01")
- Go to Service Interfaces section
- Note down the IP addresses configured there
In their environment, the Tier-0 gateway was using 192.168.10.1 for the external interface.
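If you prefer the API over clicking through the UI, the NSX Policy API exposes the same information. A rough sketch, assuming basic-auth admin credentials; nsx-manager.example.local and PASSWORD are placeholders, the path uses the gateway's ID (which may differ from its display name), and the locale-services ID (often "default") can vary per environment:

# List Tier-0 gateways
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/tier-0s

# List the interfaces (and their IP addresses) on a specific Tier-0
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/tier-0s/T0-Gateway-01/locale-services/default/interfaces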
Step 2: Create a New Security Group for Tanzu Traffic
Instead of using the segment-based security group, we created a new one specifically for Tanzu traffic after SNAT:
- Go to Inventory > Groups
- Click "Add Group"
- Name: "Tanzu-K8s-External-Traffic"
- Under Membership Criteria, select "IP Address"
- Add the IP address: 192.168.10.1
- Save
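The same group can also be created through the Policy API if you prefer to keep this in a script. A minimal sketch, assuming the group lives in the default domain and reusing the group name as its ID; credentials and the manager hostname are placeholders:

curl -ks -u 'admin:PASSWORD' -X PATCH \
  -H 'Content-Type: application/json' \
  https://nsx-manager.example.local/policy/api/v1/infra/domains/default/groups/Tanzu-K8s-External-Traffic \
  -d '{
        "display_name": "Tanzu-K8s-External-Traffic",
        "expression": [
          {
            "resource_type": "IPAddressExpression",
            "ip_addresses": ["192.168.10.1"]
          }
        ]
      }'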
Step 3: Update the Distributed Firewall Rules
Now we updated the firewall rule:
- Go to Security > Distributed Firewall
- Find the rule that allows access to Database Servers
- Edit the rule
- In the "Source" field, add the new "Tanzu-K8s-External-Traffic" group we just created
- Keep the original "Tanzu-Workload-Network" group as well (for direct VM-to-VM traffic if any)
- Make sure the rule is set to "Allow"
- Publish the changes
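To double-check what actually got published, you can read the distributed firewall policies back through the same Policy API. A sketch, assuming the rule lives in the default domain; the security-policy ID shown here is a placeholder for whatever your environment uses:

# List DFW security policies and their rules in the default domain
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/domains/default/security-policies

# Read one policy (placeholder ID) and confirm the new group shows up in the rule's sources
curl -ks -u 'admin:PASSWORD' \
  https://nsx-manager.example.local/policy/api/v1/infra/domains/default/security-policies/Database-Access-Policy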
After publishing these changes, we tested again from the pod:
kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod
curl -v telnet://192.168.50.25:5432   # Now it worked!
Success! But we were not done yet.
Solution Part 2: Fix Tanzu Network Policies
While testing, we found another issue. Some namespaces in the Tanzu cluster had NetworkPolicy objects that were blocking egress traffic by default. This is actually a good security practice, but it was not configured properly.
We checked the existing network policies:
kubectl get networkpolicies --all-namespaces
In the "production" namespace, they had a very restrictive policy:
kubectl get networkpolicy -n production default-deny-egress -o yaml
The output showed:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress: []
This policy was blocking ALL egress traffic from pods in the production namespace. We needed to add specific rules to allow traffic to the database and external APIs.
Create a New NetworkPolicy to Allow Database Access
We created a new policy file called allow-database-access.yaml:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-database-access
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 192.168.50.0/24
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: UDP
      port: 53
Let me explain what this policy does:
- It applies to pods with the label app: backend-api in the production namespace
- It allows egress traffic to the 192.168.50.0/24 subnet (where the database lives) on port 5432
- It also allows DNS traffic (UDP port 53) to anywhere, because pods need to resolve domain names
Apply this policy:
kubectl apply -f allow-database-access.yaml
Verify it was created:
kubectl get networkpolicy -n production
Now test from a pod with the app: backend-api label:
kubectl run test-backend --image=nicolaka/netshoot -n production --labels="app=backend-api" -it --rm -- /bin/bash

# Inside the pod
curl -v telnet://192.168.50.25:5432   # Should work now

# Try from a pod without the label
kubectl run test-other --image=nicolaka/netshoot -n production -it --rm -- /bin/bash
curl -v telnet://192.168.50.25:5432   # This should still be blocked (as intended)
Solution Part 3: Configure NSX-T Container Network Interface (CNI)
While we were fixing things, I also noticed their NSX-T CNI configuration was not optimal. By default, vSphere with Tanzu uses NSX-T CNI, but there are some settings that can cause connectivity issues if not configured properly.
Check the Current CNI Configuration
SSH into one of the Tanzu Kubernetes control plane nodes (you will need to enable SSH in the cluster configuration first).
Check the NSX CNI configuration:
cat /etc/nsx-ujo/ncp.ini
Look for these specific settings:
[nsx_v3]
policy_nsxapi = True
single_tier_topology = True

[coe]
cluster = your-cluster-name
enable_snat = True
In their case, enable_snat was set to True, which is correct. But I have seen cases where this gets set to False, and that causes all sorts of connectivity issues.
If you need to change this setting, you cannot do it directly on the control plane node. You need to modify it through the Tanzu Kubernetes cluster spec.
Get your cluster configuration:
kubectl get tanzukubernetescluster -n your-namespace
Edit the cluster:
kubectl edit tanzukubernetescluster your-cluster-name -n your-namespace
Look for the network section and ensure it looks like this:
spec:
  topology:
    controlPlane:
      ...
    workers:
      ...
  settings:
    network:
      cni:
        name: antrea
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/12
      pods:
        cidrBlocks:
        - 10.244.0.0/16
Save and exit. The cluster will reconcile the changes automatically.
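If you want to watch the reconciliation happen rather than just trusting it, you can keep an eye on the cluster object with standard kubectl options:

# Watch the cluster object until it reports ready again
kubectl get tanzukubernetescluster your-cluster-name -n your-namespace -w

# Or inspect the detailed conditions and node status
kubectl describe tanzukubernetescluster your-cluster-name -n your-namespace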
Solution Part 4: Troubleshooting with NSX Intelligence
After making all these changes, we wanted to verify that traffic was flowing correctly. NSX-T has a great feature called NSX Intelligence that helped us visualize the traffic flows.
Enable NSX Intelligence (if not already enabled)
- Log into NSX Manager
- Go to System > NSX Intelligence
- Click "Enable NSX Intelligence"
- Wait for it to be enabled (takes about 5-10 minutes)
View Traffic Flows
- Go to Plan & Troubleshoot > NSX Intelligence
- In the search box, enter the source IP (the Tier-0 gateway IP: 192.168.10.1)
- Click on "Flows"
- You should see traffic flows from the Tier-0 IP to your database server IP
- Click on any flow to see detailed information
This visualization helped us confirm that traffic was now flowing properly through NSX-T from the Tanzu pods to the external database.
Additional Troubleshooting Commands
Here are some useful commands we used during troubleshooting that might help you:
Check NSX-T Container Plugin Logs
SSH to the Tanzu Kubernetes control plane node and check NCP logs:
tail -f /var/log/nsx-ujo/ncp.log
Look for any errors related to connectivity or SNAT.
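Since ncp.log can be very chatty, a couple of quick filters over the same log path help narrow things down:

# Show only recent errors
grep -i error /var/log/nsx-ujo/ncp.log | tail -n 50

# Look specifically at SNAT-related events
grep -i snat /var/log/nsx-ujo/ncp.log | tail -n 50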
Check Pod Network Configuration
From inside a pod, check its network configuration:
kubectl run test-pod --image=nicolaka/netshoot -it --rm -- /bin/bash

# Inside the pod
ip addr show
ip route show
iptables -L -t nat
The ip route show command will show you the default gateway the pod is using. This should point to the NSX-T virtual network.
Test Connectivity from Different Points
Test from the Tanzu Kubernetes node itself (not from inside a pod):
# SSH to the TKG node
curl -v telnet://192.168.50.25:5432
If this works but pod-to-database does not work, then the issue is definitely in the Kubernetes networking layer (NetworkPolicy or CNI configuration).
If even the node cannot reach the database, then the issue is in NSX-T routing or firewall.
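To make this kind of comparison repeatable, we ended up looping over the endpoints we cared about. A minimal sketch you can run from any of those vantage points (pod, node, or plain VM), assuming nc is available there; the endpoint list is just an example:

# Test a list of host:port pairs with a 5 second timeout each
for endpoint in 192.168.50.25:5432 192.168.50.30:8080; do
  host=${endpoint%%:*}
  port=${endpoint##*:}
  if nc -vz -w 5 "$host" "$port" >/dev/null 2>&1; then
    echo "OK      $endpoint"
  else
    echo "BLOCKED $endpoint"
  fi
done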
Verification and Testing
After all these fixes, we created a comprehensive test to make sure everything was working:
Deploy a Test Application
We deployed a simple Python application that connects to the PostgreSQL database:
apiVersion: v1
kind: Pod
metadata:
  name: db-test-app
  namespace: production
  labels:
    app: backend-api
spec:
  containers:
  - name: postgres-client
    image: postgres:14
    command:
    - sleep
    - "3600"
    env:
    - name: PGHOST
      value: "192.168.50.25"
    - name: PGPORT
      value: "5432"
    - name: PGUSER
      value: "appuser"
    - name: PGPASSWORD
      value: "yourpassword"
    - name: PGDATABASE
      value: "production_db"
Apply it:
kubectl apply -f db-test-app.yaml
Test the connection:
kubectl exec -it db-test-app -n production -- psql -c "SELECT version();"
If this returns the PostgreSQL version information, then the connectivity is working perfectly.
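Because the original symptom was intermittent, we also ran the check in a loop for a while to make sure it was not just working by luck. A small sketch using pg_isready, which ships in the postgres:14 image and picks up the PGHOST/PGPORT environment variables already set on the pod:

# Run 20 consecutive connection checks, one every 5 seconds
for i in $(seq 1 20); do
  kubectl exec db-test-app -n production -- pg_isready && echo "attempt $i ok"
  sleep 5
done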
Lessons Learned
After spending two days troubleshooting this issue, here are the key things I learned:
- NSX-T SNAT changes the source IP: When creating firewall rules for Tanzu workloads accessing external services, remember that the source IP will be the Tier-0 gateway IP after SNAT, not the pod IP or node IP.
- NetworkPolicies and NSX-T DFW work together: Both layers need to allow the traffic. Even if NSX-T allows it, a restrictive NetworkPolicy in Kubernetes can block it, and vice versa.
- Test from multiple points: When troubleshooting, test from the pod, from the node, and from a regular VM. This helps you isolate where the problem is.
- NSX Intelligence is your friend: Use NSX Intelligence to visualize traffic flows. It saves hours of guessing where traffic is getting blocked.
- Document your IP ranges: Keep a clear document of what IP ranges are used for what purpose. In our case, knowing the Tier-0 gateway IPs was crucial for fixing the firewall rules.
- Start with less restrictive policies and tighten them: When first deploying Tanzu with NSX-T, start with more permissive firewall rules to get connectivity working, then gradually tighten them for security. Trying to get everything perfect from day one often leads to connectivity issues that are hard to troubleshoot.
Common Mistakes to Avoid
Based on this experience and similar issues I have seen with other customers, here are common mistakes people make:
- Using security groups based on VM segments for Tanzu traffic: This does not work because pods are not VMs on segments. They are containers with overlay networking.
- Forgetting about DNS: If you create a very restrictive NetworkPolicy, do not forget to allow DNS (UDP port 53). Otherwise pods cannot resolve any domain names.
- Not checking both ingress and egress: Sometimes the problem is not that your pod cannot send traffic out, but that the response cannot come back in. Check both directions.
- Assuming ping works means everything works: ICMP (ping) uses a different protocol than TCP. Just because ping works does not mean your application traffic will work. Always test the actual port your application uses.
- Not using labels consistently: NetworkPolicies use label selectors. If your pods do not have the right labels, the policies will not apply to them correctly.
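A quick way to catch the label mistake from that last point is to compare the labels your pods actually carry against the selectors in the policy:

# Show the labels on every pod in the namespace
kubectl get pods -n production --show-labels

# Show which pods a specific policy selects (check the PodSelector line)
kubectl describe networkpolicy allow-database-access -n production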
Final Configuration Summary
For anyone facing similar issues, here is a summary of what a working configuration should look like:
NSX-T Side:
- Security group that includes the Tier-0 gateway IP(s) used for SNAT
- Distributed Firewall rule allowing traffic from that security group to your external services
- NSX Intelligence enabled for troubleshooting
Tanzu Kubernetes Side:
- NetworkPolicy allowing egress to the specific IP ranges and ports you need
- NetworkPolicy allowing DNS (UDP 53) for name resolution
- Proper pod labels so NetworkPolicies apply correctly
- NSX CNI with enable_snat = True in the configuration
Testing:
- Test with actual application ports, not just ping
- Test from inside pods, not just from nodes
- Use NSX Intelligence to verify traffic flows
- Check logs on both NSX-T and Tanzu sides
Conclusion
Networking in Tanzu Kubernetes with NSX-T can be complex because you have multiple layers of networking and security working together. When things go wrong, the key is to understand how traffic flows through the entire stack, from the pod to NSX-T to the destination.
The most important thing to remember is that NSX-T performs SNAT for Tanzu traffic going to external destinations, so your firewall rules need to account for the post-SNAT IP addresses, not the pod IPs.
I hope this helps anyone struggling with similar connectivity issues between Tanzu Kubernetes pods and external services through NSX-T. If you are still facing issues after trying these steps, double-check your NSX-T routing configuration and make sure the Tier-0 gateway is properly configured with external connectivity.
