
Troubleshooting vSAN Storage Policy Migration Failures in VMware Cloud Foundation 9.0

Recently I had a customer who migrated their VMware infrastructure from vSphere 7.0 to VMware Cloud Foundation 9.0. After the migration, they wanted to update their vSAN storage policies to take advantage of the new vSAN ESA (Express Storage Architecture) features. However, when they tried to change the storage policy for their production VMs, the operation kept failing with a cryptic error message.

The error they were getting was:

"Cannot complete operation due to insufficient resources to satisfy current storage policy."

This was strange because they had plenty of disk space available. The vSAN cluster was only at 45% capacity, and all the hosts were healthy. Let me walk through how we troubleshot and fixed this issue.

Understanding the Problem

First, let's understand what was happening in their environment:

  • They had a 6-node vSAN cluster running VCF 9.0
  • vSAN was configured with the new ESA architecture
  • They were trying to migrate VMs from the old "vSAN Default Storage Policy" to a new policy called "Production-ESA-Policy" which had FTT=2 (Failures to Tolerate) with RAID-6 erasure coding
  • Some VMs would migrate successfully, but most would fail

After looking at the environment, I found the root cause. The issue was not about disk space at all. It was about how vSAN ESA lays out objects and handles erasure coding differently from the traditional OSA architecture (note that ESA has no disk groups; each host contributes its devices to a single storage pool).

Root Cause Analysis

In vSAN ESA, RAID-6 erasure coding with FTT=2 requires at least 6 hosts (4 data + 2 parity components) to satisfy the policy. But here is the tricky part that most people miss. When you are migrating VMs from one policy to another, vSAN needs to create the new object layout BEFORE it can delete the old one. This means that during the migration you temporarily need capacity for BOTH layouts at once.

In my customer's case:

  • Their old policy was using RAID-1 mirroring with FTT=1 (requires 2 copies of data)
  • The new policy was RAID-6 with FTT=2 (4 data + 2 parity components striped across a minimum of 6 hosts)
  • During migration, both layouts exist simultaneously until the migration completes

This temporary overlap of both layouts was inflating the capacity requirement and causing the "insufficient resources" error, even though they had plenty of raw disk space.
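To make this temporary overhead concrete, here is a back-of-the-envelope sketch in plain PowerShell (no PowerCLI needed; the function name is mine, not a real cmdlet). The multipliers are the standard vSAN ones: RAID-1 with FTT=1 stores two full replicas (2.0x raw), and RAID-6 with FTT=2 stores 4 data + 2 parity components (6/4 = 1.5x raw):

```powershell
# Raw vSAN capacity a VM consumes under each layout, given its provisioned size.
# RAID-1 / FTT=1 keeps two full replicas (2.0x);
# RAID-6 / FTT=2 is 4 data + 2 parity components (1.5x).
function Get-MigrationFootprintGB {
    param([double]$VmSizeGB)

    $oldLayout = $VmSizeGB * 2.0   # RAID-1, FTT=1
    $newLayout = $VmSizeGB * 1.5   # RAID-6, FTT=2
    [pscustomobject]@{
        VmSizeGB         = $VmSizeGB
        OldLayoutGB      = $oldLayout
        NewLayoutGB      = $newLayout
        PeakDuringMoveGB = $oldLayout + $newLayout  # both exist until the resync finishes
    }
}

Get-MigrationFootprintGB -VmSizeGB 2048
```

For a 2 TB VM, that is roughly 4 TB for the old layout plus 3 TB for the new one, so about 7 TB of raw capacity in flight until the old objects are cleaned up.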

Solution Part 1: Check vSAN Capacity Before Migration

Before attempting large-scale storage policy migrations, you should always check your vSAN slack space. Here is how to do it properly:

Step 1: Check vSAN Capacity using vSphere Client

  1. Log into vSphere Client
  2. Navigate to your vSAN cluster
  3. Go to Monitor tab > vSAN > Capacity
  4. Review the capacity breakdown, including any compression and deduplication savings
  5. More importantly, look at the space available for transient operations ("Slack Space"; newer vSAN releases present this as the "Operations reserve")

In this case, the customer had:

  • Total capacity: 48 TB
  • Used capacity: 21.6 TB (45%)
  • But slack space available for rebuild operations: Only 8.2 TB

The slack space is the actual usable space for new object creation during policy changes. This was the real bottleneck.
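If you prefer to pull these numbers with PowerCLI rather than the UI, a sketch like the following works against a live vCenter session (the cluster name is a placeholder; `Get-VsanSpaceUsage` is in the VMware.VimAutomation.Storage module, and the exact property names can vary by PowerCLI version, so check `Get-VsanSpaceUsage | Format-List` in yours):

```powershell
# Requires an active Connect-VIServer session.
$cluster = Get-Cluster -Name "YourClusterName"   # placeholder cluster name

# Overall vSAN datastore usage for the cluster
$usage = Get-VsanSpaceUsage -Cluster $cluster
Write-Host ("Capacity: {0:N1} TB, Free: {1:N1} TB" -f `
    ($usage.CapacityGB / 1024), ($usage.FreeSpaceGB / 1024))
```

This is not runnable without a vCenter connection; treat it as a starting point rather than a finished report.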

Step 2: Calculate Required Slack Space for Migration

For policy migrations, you need at least 1.5x to 2x the size of the VM you are migrating as free slack space. For example, if you are migrating a 2TB VM, you need at least 3-4TB of slack space available.
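This rule of thumb is easy to wrap in a small helper (plain PowerShell; the function name is mine, not a real cmdlet):

```powershell
# Returns $true when the available slack space can absorb the migration,
# using the conservative 2x rule of thumb described above.
function Test-SlackForMigration {
    param(
        [double]$VmSizeGB,
        [double]$SlackFreeGB,
        [double]$Factor = 2.0
    )
    $SlackFreeGB -ge ($VmSizeGB * $Factor)
}

# The customer's 8.2 TB of slack (8396.8 GB) against a single 2 TB VM
Test-SlackForMigration -VmSizeGB 2048 -SlackFreeGB 8396.8
```

A single 2 TB VM fits comfortably in 8.2 TB of slack, which is exactly why some of the customer's VMs migrated fine; it was kicking off many of them at once that blew past the limit.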

You can check the actual VM disk usage using PowerCLI:


# Connect to vCenter first (replace with your vCenter FQDN)
Connect-VIServer -Server vcenter.domain.com

# Sum the provisioned disk capacity for every VM
$vms = Get-VM
foreach ($vm in $vms) {
    $vmSize = ($vm | Get-HardDisk | Measure-Object -Property CapacityGB -Sum).Sum
    Write-Host "VM: $($vm.Name) - Total Disk Size: $vmSize GB"
}

Solution Part 2: Temporary Workaround

If you do not have enough slack space for all VMs at once, you need to do the migration in batches. Here is the approach we took:

Option 1: Migrate VMs in Small Batches

  1. Identify your smallest VMs first (less than 500GB)
  2. Migrate those VMs first to the new policy
  3. Wait for migration to complete (you can check progress in vSAN > Resyncing Objects)
  4. Once the first batch completes, the old objects are deleted and slack space is freed up
  5. Then migrate the next batch

We created a simple script to do this automatically:

 

# Apply the policy to the VM home object and all of its hard disks.
# (Set-VM does not accept a storage policy; Set-SpbmEntityConfiguration does.)
$cluster = Get-Cluster -Name "YourClusterName"
$vms = Get-VM -Location $cluster | Sort-Object -Property UsedSpaceGB
$newPolicy = Get-SpbmStoragePolicy -Name "Production-ESA-Policy"
$batchSize = 5

for ($i = 0; $i -lt $vms.Count; $i += $batchSize) {
    $batch = $vms[$i..([Math]::Min($i + $batchSize, $vms.Count) - 1)]

    Write-Host "Migrating batch starting at VM: $($batch[0].Name)"

    foreach ($vm in $batch) {
        # Retarget the VM home object and every hard disk to the new policy
        $vm, (Get-HardDisk -VM $vm) |
            Set-SpbmEntityConfiguration -StoragePolicy $newPolicy | Out-Null
        Write-Host "Started migration for $($vm.Name)"
    }

    # Wait for resyncing to complete before starting the next batch
    Write-Host "Waiting for resync to complete..."
    do {
        Start-Sleep -Seconds 60
        $resyncObjects = Get-VsanResyncingComponent -Cluster $cluster
    } while ($resyncObjects.Count -gt 0)

    Write-Host "Batch complete. Moving to next batch."
}

Option 2: Temporarily Add More Capacity

If you cannot wait for batch migrations, you can temporarily add capacity to the vSAN cluster:

  1. Add capacity devices to the hosts' vSAN storage pools, if the hosts have free slots (ESA has no disk groups; devices are claimed directly into the pool)
  2. Or add a new host to the cluster temporarily
  3. After all migrations complete, you can remove the temporary capacity

Solution Part 3: Long Term Fix Using vSAN Configuration

For the long term, we adjusted the vSAN configuration to handle this better in future migrations.

Configure vSAN Advanced Options for Better Migration Handling

There are some advanced vSAN settings that can help with policy migrations:

  1. Go to vSphere Client
  2. Select your vSAN cluster
  3. Configure > vSAN > Services > Performance Service
  4. Enable if not already enabled (this helps monitor resync progress better)

Then adjust the resync throttling:

  1. Go to Configure > vSAN > Services > Advanced Options
  2. Find the following parameters and adjust them:
    • VSAN.DomResyncThrottleRate - Default is 0 (unlimited). If migrations are impacting production, set to 80 (limits resync to 80% of backend bandwidth)
    • VSAN.DomOwnerForceWarmCache - Set to 1 to improve performance during migrations

Note: These settings should be adjusted based on your environment. If you have maintenance windows, leave throttling at 0 for fastest migration. If migrations must happen during production hours, throttle to 60-80%.
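For reference, host advanced options like these can also be inspected and set from PowerCLI instead of clicking through each host. This is a sketch only: the cluster name is a placeholder, and you should verify each option name with `Get-AdvancedSetting` in your environment before changing anything (the `-WhatIf` switch below keeps it read-only until removed):

```powershell
# Requires an active Connect-VIServer session.
$clusterName = "YourClusterName"   # placeholder

foreach ($esx in Get-Cluster -Name $clusterName | Get-VMHost) {
    # Inspect the current value first
    Get-AdvancedSetting -Entity $esx -Name "VSAN.DomOwnerForceWarmCache" |
        Select-Object Entity, Name, Value

    # Apply the value discussed above; remove -WhatIf to actually change it
    Get-AdvancedSetting -Entity $esx -Name "VSAN.DomOwnerForceWarmCache" |
        Set-AdvancedSetting -Value 1 -Confirm:$false -WhatIf
}
```

Because these are per-host settings, loop over every host in the cluster so the configuration stays consistent.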

Solution Part 4: Using vSphere Storage vMotion as Alternative

In some cases, if the direct storage policy change keeps failing, you can use Storage vMotion as a workaround:

  1. Create a temporary datastore that is large enough to hold the VM you are moving (NFS or VMFS both work; it only needs to exist for the duration of the move)
  2. Storage vMotion the VM to this temporary datastore
  3. This frees up the vSAN object
  4. Then Storage vMotion back to vSAN with the new storage policy

Example using PowerCLI:


$vm = Get-VM -Name "ProductionVM01"
$tempDatastore = Get-Datastore -Name "Temp-Datastore"
$vsanDatastore = Get-Datastore -Name "vsanDatastore"
$newPolicy = Get-SpbmStoragePolicy -Name "Production-ESA-Policy"

# Move to the temp datastore first (Move-VM blocks until the relocation finishes)
Move-VM -VM $vm -Datastore $tempDatastore

# Move back to vSAN, then apply the new policy to the VM home and its disks
# (Move-VM does not take a storage policy; apply it after the move completes)
Move-VM -VM $vm -Datastore $vsanDatastore
$vm = Get-VM -Name "ProductionVM01"
$vm, (Get-HardDisk -VM $vm) | Set-SpbmEntityConfiguration -StoragePolicy $newPolicy

This two-step approach avoids the double capacity requirement because the VM is completely removed from vSAN before being added back with the new policy.

Verification Steps

After migrating your VMs, verify everything is working correctly:

Step 1: Check VM Storage Policy Compliance

  1. Go to VMs and Templates view
  2. Right-click the VM > VM Policies > Check VM Storage Policy Compliance
  3. You should see "Compliant" status
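The same compliance check can be run in bulk with PowerCLI. A sketch, assuming an active vCenter session (the output property names can differ slightly between PowerCLI versions, so check `Get-SpbmEntityConfiguration | Format-List` in yours):

```powershell
# Requires an active Connect-VIServer session.
# List every VM whose storage policy status is not "compliant".
Get-VM | Get-SpbmEntityConfiguration |
    Where-Object { $_.Status -ne "compliant" } |
    Select-Object Entity, StoragePolicy, Status
```

An empty result after the migration is what you want to see.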

Step 2: Verify vSAN Object Health


$cluster = Get-Cluster -Name "YourClusterName"
Get-VsanHealthSummary -Cluster $cluster

All health checks should be green, especially the "Data" section.

Step 3: Check for Any Orphaned Objects

Sometimes failed migrations leave orphaned objects that consume space:

  1. Go to vSAN cluster > Monitor > vSAN > Capacity
  2. Look for "Orphaned Objects" section
  3. If you see any, you can delete them using:
    • RVC (Ruby vSphere Console)
    • Or PowerCLI: Remove-VsanOrphanedVMDKs -Cluster $cluster

Lessons Learned and Best Practices

After going through this with the customer, here are the key takeaways:

  1. Always check slack space, not just total capacity: vSAN capacity monitoring can be misleading. The total available space is not the same as slack space available for operations.
  2. Plan for extra capacity during migrations: When changing storage policies, especially moving from RAID-1 to RAID-6, both the old and the new object layout consume space until the resync completes.
  3. Migrate in batches during production hours: Do not try to migrate all VMs at once. Start with small VMs in small batches.
  4. Test with non-production VMs first: Always test your migration process on dev or test VMs before touching production.
  5. Monitor resync progress: Use vSAN Performance Service to monitor resync operations. This helps you understand how long migrations will take.
  6. Consider maintenance windows for large VMs: For very large VMs (multi-TB), schedule migrations during maintenance windows when you can remove resync throttling for faster completion.
  7. Document your vSAN configuration: Keep track of your storage pool or disk group layout, capacity, and policy settings. This makes troubleshooting much faster.

Additional Notes for VCF 9.0 Specific Considerations

If you are running VMware Cloud Foundation 9.0 specifically, there are a few additional things to be aware of:

  • vSAN ESA is the default: New VCF 9.0 deployments use vSAN ESA by default. Make sure you understand the differences from vSAN OSA (Original Storage Architecture).
  • Storage policies are managed through VCF: While you can change them in vSphere, it is better to use SDDC Manager for policy changes to keep everything in sync.
  • Lifecycle management considerations: When you update VCF components, storage policies may need to be revalidated. Plan accordingly.

Conclusion

Storage policy migrations in vSAN are not as straightforward as they might seem, especially when moving to more advanced erasure coding configurations. The key is understanding that vSAN needs temporary extra capacity during the migration process.

By following the steps above, you should be able to successfully migrate your VMs to new storage policies without hitting capacity errors. Remember to always test first, migrate in batches, and monitor the resync progress.

If you run into issues even after following these steps, check the vSAN health service for any underlying problems with your cluster configuration. Sometimes issues like network latency, disk performance problems, or host hardware issues can also cause migration failures that show up as capacity errors.

Hope this helps anyone facing similar issues with vSAN storage policy migrations in VCF 9.0!
