
Troubleshooting vSAN Storage Policy Migration Failures in VMware Cloud Foundation 9.0

Recently I had a customer who migrated their VMware infrastructure from vSphere 7.0 to VMware Cloud Foundation 9.0. After the migration, they wanted to update their vSAN storage policies to take advantage of the new vSAN ESA (Express Storage Architecture) features. However, when they tried to change the storage policy for their production VMs, the operation kept failing with a cryptic error message.

The error they were getting was:

"Cannot complete operation due to insufficient resources to satisfy current storage policy."

This was strange because they had plenty of disk space available. The vSAN cluster was only at 45% capacity, and all the hosts were healthy. Let me walk through how we troubleshot and fixed this issue.

Understanding the Problem

First, let's understand what was happening in their environment:

  • They had a 6-node vSAN cluster running VCF 9.0
  • vSAN was configured with the new ESA architecture
  • They were trying to migrate VMs from the old "vSAN Default Storage Policy" to a new policy called "Production-ESA-Policy" which had FTT=2 (Failures to Tolerate) with RAID-6 erasure coding
  • Some VMs would migrate successfully, but most would fail

After looking at the environment, I found the root cause. The issue was not about disk space at all. It was about how vSAN ESA lays out objects and handles erasure coding differently from the traditional OSA architecture (note that ESA has no disk groups; each host contributes its devices to a single storage pool).

Root Cause Analysis

In vSAN ESA, RAID-6 erasure coding with FTT=2 requires at least 6 hosts (4 data + 2 parity components) to satisfy the policy. But here is the tricky part that most people miss. When you are migrating VMs from one policy to another, vSAN needs to create the new object layout BEFORE it can delete the old one. This means that during the migration you temporarily need capacity for BOTH layouts at once.

In my customer's case:

  • Their old policy was using RAID-1 mirroring with FTT=1 (requires 2 copies of data)
  • The new policy was RAID-6 with FTT=2 (4 data + 2 parity components striped across a minimum of 6 hosts)
  • During migration, both layouts exist simultaneously until the migration completes

This temporary overlap of both layouts was inflating the capacity requirement and causing the "insufficient resources" error, even though they had plenty of raw disk space.
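To make this temporary overhead concrete, here is a back-of-the-envelope sketch in plain PowerShell (no PowerCLI needed; the function name is mine, not a real cmdlet). The multipliers are the standard vSAN ones: RAID-1 with FTT=1 stores two full replicas (2.0x raw), and RAID-6 with FTT=2 stores 4 data + 2 parity components (6/4 = 1.5x raw):

```powershell
# Raw vSAN capacity a VM consumes under each layout, given its provisioned size.
# RAID-1 / FTT=1 keeps two full replicas (2.0x);
# RAID-6 / FTT=2 is 4 data + 2 parity components (1.5x).
function Get-MigrationFootprintGB {
    param([double]$VmSizeGB)

    $oldLayout = $VmSizeGB * 2.0   # RAID-1, FTT=1
    $newLayout = $VmSizeGB * 1.5   # RAID-6, FTT=2
    [pscustomobject]@{
        VmSizeGB         = $VmSizeGB
        OldLayoutGB      = $oldLayout
        NewLayoutGB      = $newLayout
        PeakDuringMoveGB = $oldLayout + $newLayout  # both exist until the resync finishes
    }
}

Get-MigrationFootprintGB -VmSizeGB 2048
```

For a 2 TB VM, that is roughly 4 TB for the old layout plus 3 TB for the new one, so about 7 TB of raw capacity in flight until the old objects are cleaned up.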

Solution Part 1: Check vSAN Capacity Before Migration

Before attempting large-scale storage policy migrations, you should always check your vSAN slack space. Here is how to do it properly:

Step 1: Check vSAN Capacity using vSphere Client

  1. Log into vSphere Client
  2. Navigate to your vSAN cluster
  3. Go to Monitor tab > vSAN > Capacity
  4. Review the capacity breakdown, including any compression and deduplication savings
  5. More importantly, look at the space available for transient operations ("Slack Space"; newer vSAN releases present this as the "Operations reserve")

In this case, the customer had:

  • Total capacity: 48 TB
  • Used capacity: 21.6 TB (45%)
  • But slack space available for rebuild operations: Only 8.2 TB

The slack space is the actual usable space for new object creation during policy changes. This was the real bottleneck.
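If you prefer to pull these numbers with PowerCLI rather than the UI, a sketch like the following works against a live vCenter session (the cluster name is a placeholder; `Get-VsanSpaceUsage` is in the VMware.VimAutomation.Storage module, and the exact property names can vary by PowerCLI version, so check `Get-VsanSpaceUsage | Format-List` in yours):

```powershell
# Requires an active Connect-VIServer session.
$cluster = Get-Cluster -Name "YourClusterName"   # placeholder cluster name

# Overall vSAN datastore usage for the cluster
$usage = Get-VsanSpaceUsage -Cluster $cluster
Write-Host ("Capacity: {0:N1} TB, Free: {1:N1} TB" -f `
    ($usage.CapacityGB / 1024), ($usage.FreeSpaceGB / 1024))
```

This is not runnable without a vCenter connection; treat it as a starting point rather than a finished report.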

Step 2: Calculate Required Slack Space for Migration

For policy migrations, you need at least 1.5x to 2x the size of the VM you are migrating as free slack space. For example, if you are migrating a 2TB VM, you need at least 3-4TB of slack space available.
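This rule of thumb is easy to wrap in a small helper (plain PowerShell; the function name is mine, not a real cmdlet):

```powershell
# Returns $true when the available slack space can absorb the migration,
# using the conservative 2x rule of thumb described above.
function Test-SlackForMigration {
    param(
        [double]$VmSizeGB,
        [double]$SlackFreeGB,
        [double]$Factor = 2.0
    )
    $SlackFreeGB -ge ($VmSizeGB * $Factor)
}

# The customer's 8.2 TB of slack (8396.8 GB) against a single 2 TB VM
Test-SlackForMigration -VmSizeGB 2048 -SlackFreeGB 8396.8
```

A single 2 TB VM fits comfortably in 8.2 TB of slack, which is exactly why some of the customer's VMs migrated fine; it was kicking off many of them at once that blew past the limit.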

You can check the actual VM disk usage using PowerCLI:


# Connect to vCenter first (replace with your vCenter FQDN)
Connect-VIServer -Server vcenter.domain.com

# Sum the provisioned disk capacity for every VM
$vms = Get-VM
foreach ($vm in $vms) {
    $vmSize = ($vm | Get-HardDisk | Measure-Object -Property CapacityGB -Sum).Sum
    Write-Host "VM: $($vm.Name) - Total Disk Size: $vmSize GB"
}

Solution Part 2: Temporary Workaround

If you do not have enough slack space for all VMs at once, you need to do the migration in batches. Here is the approach we took:

Option 1: Migrate VMs in Small Batches

  1. Identify your smallest VMs first (less than 500GB)
  2. Migrate those VMs first to the new policy
  3. Wait for migration to complete (you can check progress in vSAN > Resyncing Objects)
  4. Once the first batch completes, the old objects are deleted and slack space is freed up
  5. Then migrate the next batch

We created a simple script to do this automatically:

 

# Apply the policy to the VM home object and all of its hard disks.
# (Set-VM does not accept a storage policy; Set-SpbmEntityConfiguration does.)
$cluster = Get-Cluster -Name "YourClusterName"
$vms = Get-VM -Location $cluster | Sort-Object -Property UsedSpaceGB
$newPolicy = Get-SpbmStoragePolicy -Name "Production-ESA-Policy"
$batchSize = 5

for ($i = 0; $i -lt $vms.Count; $i += $batchSize) {
    $batch = $vms[$i..([Math]::Min($i + $batchSize, $vms.Count) - 1)]

    Write-Host "Migrating batch starting at VM: $($batch[0].Name)"

    foreach ($vm in $batch) {
        # Retarget the VM home object and every hard disk to the new policy
        $vm, (Get-HardDisk -VM $vm) |
            Set-SpbmEntityConfiguration -StoragePolicy $newPolicy | Out-Null
        Write-Host "Started migration for $($vm.Name)"
    }

    # Wait for resyncing to complete before starting the next batch
    Write-Host "Waiting for resync to complete..."
    do {
        Start-Sleep -Seconds 60
        $resyncObjects = Get-VsanResyncingComponent -Cluster $cluster
    } while ($resyncObjects.Count -gt 0)

    Write-Host "Batch complete. Moving to next batch."
}

Option 2: Temporarily Add More Capacity

If you cannot wait for batch migrations, you can temporarily add capacity to the vSAN cluster:

  1. Add capacity devices to the hosts' vSAN storage pools, if the hosts have free slots (ESA has no disk groups; devices are claimed directly into the pool)
  2. Or add a new host to the cluster temporarily
  3. After all migrations complete, you can remove the temporary capacity

Solution Part 3: Long Term Fix Using vSAN Configuration

For the long term, we adjusted the vSAN configuration to handle this better in future migrations.

Configure vSAN Advanced Options for Better Migration Handling

There are some advanced vSAN settings that can help with policy migrations:

  1. Go to vSphere Client
  2. Select your vSAN cluster
  3. Configure > vSAN > Services > Performance Service
  4. Enable if not already enabled (this helps monitor resync progress better)

Then adjust the resync throttling:

  1. Go to Configure > vSAN > Services > Advanced Options
  2. Find the following parameters and adjust them:
    • VSAN.DomResyncThrottleRate - Default is 0 (unlimited). If migrations are impacting production, set to 80 (limits resync to 80% of backend bandwidth)
    • VSAN.DomOwnerForceWarmCache - Set to 1 to improve performance during migrations

Note: These settings should be adjusted based on your environment. If you have maintenance windows, leave throttling at 0 for fastest migration. If migrations must happen during production hours, throttle to 60-80%.
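For reference, host advanced options like these can also be inspected and set from PowerCLI instead of clicking through each host. This is a sketch only: the cluster name is a placeholder, and you should verify each option name with `Get-AdvancedSetting` in your environment before changing anything (the `-WhatIf` switch below keeps it read-only until removed):

```powershell
# Requires an active Connect-VIServer session.
$clusterName = "YourClusterName"   # placeholder

foreach ($esx in Get-Cluster -Name $clusterName | Get-VMHost) {
    # Inspect the current value first
    Get-AdvancedSetting -Entity $esx -Name "VSAN.DomOwnerForceWarmCache" |
        Select-Object Entity, Name, Value

    # Apply the value discussed above; remove -WhatIf to actually change it
    Get-AdvancedSetting -Entity $esx -Name "VSAN.DomOwnerForceWarmCache" |
        Set-AdvancedSetting -Value 1 -Confirm:$false -WhatIf
}
```

Because these are per-host settings, loop over every host in the cluster so the configuration stays consistent.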

Solution Part 4: Using vSphere Storage vMotion as Alternative

In some cases, if the direct storage policy change keeps failing, you can use Storage vMotion as a workaround:

  1. Create a temporary datastore that is large enough to hold the VM you are moving (NFS or VMFS both work; it only needs to exist for the duration of the move)
  2. Storage vMotion the VM to this temporary datastore
  3. This frees up the vSAN object
  4. Then Storage vMotion back to vSAN with the new storage policy

Example using PowerCLI:


$vm = Get-VM -Name "ProductionVM01"
$tempDatastore = Get-Datastore -Name "Temp-Datastore"
$vsanDatastore = Get-Datastore -Name "vsanDatastore"
$newPolicy = Get-SpbmStoragePolicy -Name "Production-ESA-Policy"

# Move to the temp datastore first (Move-VM blocks until the relocation finishes)
Move-VM -VM $vm -Datastore $tempDatastore

# Move back to vSAN, then apply the new policy to the VM home and its disks
# (Move-VM does not take a storage policy; apply it after the move completes)
Move-VM -VM $vm -Datastore $vsanDatastore
$vm = Get-VM -Name "ProductionVM01"
$vm, (Get-HardDisk -VM $vm) | Set-SpbmEntityConfiguration -StoragePolicy $newPolicy

This two-step approach avoids the double capacity requirement because the VM is completely removed from vSAN before being added back with the new policy.

Verification Steps

After migrating your VMs, verify everything is working correctly:

Step 1: Check VM Storage Policy Compliance

  1. Go to VMs and Templates view
  2. Right-click the VM > VM Policies > Check VM Storage Policy Compliance
  3. You should see "Compliant" status
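The same compliance check can be run in bulk with PowerCLI. A sketch, assuming an active vCenter session (the output property names can differ slightly between PowerCLI versions, so check `Get-SpbmEntityConfiguration | Format-List` in yours):

```powershell
# Requires an active Connect-VIServer session.
# List every VM whose storage policy status is not "compliant".
Get-VM | Get-SpbmEntityConfiguration |
    Where-Object { $_.Status -ne "compliant" } |
    Select-Object Entity, StoragePolicy, Status
```

An empty result after the migration is what you want to see.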

Step 2: Verify vSAN Object Health


$cluster = Get-Cluster -Name "YourClusterName"
Get-VsanHealthSummary -Cluster $cluster

All health checks should be green, especially the "Data" section.

Step 3: Check for Any Orphaned Objects

Sometimes failed migrations leave orphaned objects that consume space:

  1. Go to vSAN cluster > Monitor > vSAN > Capacity
  2. Look for "Orphaned Objects" section
  3. If you see any, you can delete them using:
    • RVC (Ruby vSphere Console)
    • Or PowerCLI: Remove-VsanOrphanedVMDKs -Cluster $cluster

Lessons Learned and Best Practices

After going through this with the customer, here are the key takeaways:

  1. Always check slack space, not just total capacity: vSAN capacity monitoring can be misleading. The total available space is not the same as slack space available for operations.
  2. Plan for extra capacity during migrations: When changing storage policies, especially moving from RAID-1 to RAID-6, both the old and the new object layout consume space until the resync completes.
  3. Migrate in batches during production hours: Do not try to migrate all VMs at once. Start with small VMs in small batches.
  4. Test with non-production VMs first: Always test your migration process on dev or test VMs before touching production.
  5. Monitor resync progress: Use vSAN Performance Service to monitor resync operations. This helps you understand how long migrations will take.
  6. Consider maintenance windows for large VMs: For very large VMs (multi-TB), schedule migrations during maintenance windows when you can remove resync throttling for faster completion.
  7. Document your vSAN configuration: Keep track of your storage pool or disk group layout, capacity, and policy settings. This makes troubleshooting much faster.

Additional Notes for VCF 9.0 Specific Considerations

If you are running VMware Cloud Foundation 9.0 specifically, there are a few additional things to be aware of:

  • vSAN ESA is the default: New VCF 9.0 deployments use vSAN ESA by default. Make sure you understand the differences from vSAN OSA (Original Storage Architecture).
  • Storage policies are managed through VCF: While you can change them in vSphere, it is better to use SDDC Manager for policy changes to keep everything in sync.
  • Lifecycle management considerations: When you update VCF components, storage policies may need to be revalidated. Plan accordingly.

Conclusion

Storage policy migrations in vSAN are not as straightforward as they might seem, especially when moving to more advanced erasure coding configurations. The key is understanding that vSAN needs temporary extra capacity during the migration process.

By following the steps above, you should be able to successfully migrate your VMs to new storage policies without hitting capacity errors. Remember to always test first, migrate in batches, and monitor the resync progress.

If you run into issues even after following these steps, check the vSAN health service for any underlying problems with your cluster configuration. Sometimes issues like network latency, disk performance problems, or host hardware issues can also cause migration failures that show up as capacity errors.

Hope this helps anyone facing similar issues with vSAN storage policy migrations in VCF 9.0!
