Joining VMware templates to a custom Organizational Unit with a customization specification

By default, a customization specification has a domain join function. The sad part is that it doesn’t allow selecting a custom organizational unit. You also can’t upload your own unattended XML and still keep the option of entering the desired VM name during template deployment. Therefore you’re stuck with the default CN=Computers container, or wherever this is redirected. In bigger environments this might be an issue, as you might need to join templates to different OUs depending on requirements.

One option is enabling autologin for the built-in Administrator once and using RunOnce commands to run NetDom.

netdom.exe join %COMPUTERNAME% /domain:my.domain.com /userd:NETBIOS\domainjoinserviceaccount /passwordd:PaS$W0rd /ou:"OU=my,OU=custom,OU=Organizational Unit,DC=my,DC=domain,DC=com" /reboot

This is old news and used to work fine until a few months ago, when I unexpectedly discovered that variable substitution was done before the computer name was changed. NetDom then used the name from the template (something random by default), causing netdom to fail (as it needs the actual local computer name).

After some head scratching, the simple workaround was to wrap it in PowerShell to hide the batch variable, so it doesn’t get substituted until the last moment. I might have used the native cmdlet, but it’d likely require a very complex one-liner to prepare a credential object.

powershell netdom.exe join $env:computername /domain:my.domain.com /userd:NETBIOS\domainjoinserviceaccount /passwordd:PaS$W0rd /ou:"OU=my,OU=custom,OU=Organizational Unit,DC=my,DC=domain,DC=com" /reboot
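For comparison, the native cmdlet route is manageable once it is split over a few lines instead of a one-liner. This is a minimal sketch, assuming PowerShell 3.0 or later in the guest and reusing the placeholder domain, account and OU from the netdom example. It has not been tested in a RunOnce context, and the password still ends up in plaintext in the answer file.

```powershell
# Single quotes keep PowerShell from treating $W0rd as a variable.
$password   = ConvertTo-SecureString 'PaS$W0rd' -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential ('NETBIOS\domainjoinserviceaccount', $password)

# Join the local computer to the domain, place it in the custom OU, then reboot.
Add-Computer -DomainName 'my.domain.com' `
    -OUPath 'OU=my,OU=custom,OU=Organizational Unit,DC=my,DC=domain,DC=com' `
    -Credential $credential `
    -Restart
```

Add-Computer always acts on the local machine, so no computer name is passed and there is nothing for early variable substitution to break.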

The main problem with this approach is that plaintext passwords are written to unattended.xml, which is not cleaned up after the process completes. Windows cleans up explicit unattended domain join credentials after specialization, but credentials in RunOnce commands get left behind.

My first try was to just delete the file in the next RunOnce command. However, unattended.xml still seems to be in use during command execution, so you can’t simply delete it. One option would be to leave a custom script in the template that registers unattended.xml in PendingFileRenameOperations to be deleted on restart. A simpler way is to apply a GPO that deletes the answer file.

Don’t leak your privileged credentials.
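The delete-on-restart idea could be sketched roughly as follows. Rather than editing PendingFileRenameOperations by hand, the script calls the Win32 MoveFileEx function with MOVEFILE_DELAY_UNTIL_REBOOT, which is the documented way of getting an entry into that registry value. The answer file path and the type name are assumptions for illustration, so check where the file actually lands in your template. The script needs to run elevated.

```powershell
# Assumed location of the leftover answer file; adjust for your template.
$answerFile = 'C:\Windows\Panther\unattend.xml'

# Expose MoveFileEx from kernel32.dll. The type name Win32.DelayedFileOps is arbitrary.
Add-Type -Namespace Win32 -Name DelayedFileOps -MemberDefinition @'
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
public static extern bool MoveFileEx(string lpExistingFileName, IntPtr lpNewFileName, uint dwFlags);
'@

$MOVEFILE_DELAY_UNTIL_REBOOT = 0x4

# A null destination combined with MOVEFILE_DELAY_UNTIL_REBOOT means "delete at next boot".
$ok = [Win32.DelayedFileOps]::MoveFileEx($answerFile, [IntPtr]::Zero, $MOVEFILE_DELAY_UNTIL_REBOOT)
if (-not $ok) {
    $code = [System.Runtime.InteropServices.Marshal]::GetLastWin32Error()
    Write-Warning "MoveFileEx failed with Win32 error $code"
}
```

The destination is declared as IntPtr so that a real NULL can be passed. PowerShell would otherwise convert $null to an empty string for a string parameter.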

Quirks in permission management with vCenter Content Libraries

First of all, Content Libraries are a pretty useful concept in larger environments. I mostly use them to automatically sync content between vCenters that are physically separated. They also save the user (usually a clueless sys/app admin) from browsing for files, replacing that with a flat list of items. Great, huh?

Now the bad parts. Read all the way through because some things have implications and workarounds below.

No default access

That is normal. The annoying thing is that you also don’t get a default role for regular users (as in content consumers, not managers). I’m going to save you the hassle. You need a custom role with these privileges:

  • Content Library – Download files
  • Content Library – Read storage
  • Content Library – View configuration settings
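A hedged PowerCLI sketch of creating such a role. The role name is made up, and the privilege IDs are an assumed mapping of the three display names above, so confirm them with Get-VIPrivilege in your own environment before relying on them.

```powershell
# Assumes an existing Connect-VIServer session.
$privilegeIds = @(
    'ContentLibrary.DownloadSession',   # Download files
    'ContentLibrary.ReadStorage',       # Read storage
    'ContentLibrary.GetConfiguration'   # View configuration settings
)

$privileges = Get-VIPrivilege -Id $privilegeIds

# 'Content Library Consumer' is a hypothetical role name.
New-VIRole -Name 'Content Library Consumer' -Privilege $privileges
```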

Global Permissions with required inheritance

Content Library permissions are Global Permissions only, and Content Libraries only inherit permissions. They do not have any explicit permissions of their own, so you are forced to use the “Propagate to children” flag.

Why is this a problem? Several things:

  • No privilege separation between libraries. You can’t have internal… “tenants” with separated content – it’s all shared. Yes – there are overarching products for that but I’m talking basic vCenter functionality.
  • If you have several vCenters (with ELM), all permissions propagate to all libraries in all vCenters.

And the thing I hate the most: it’s impossible to create a custom role without implicit “Read-only” privileges. Believe me, I’ve tried with different APIs. If you create a role, it always includes read-only privileges. Try it out and check the results in PowerCLI. There are some privileges that cannot be removed. According to GSS, it’s by design.
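One way to check this in PowerCLI is to list every role together with its System.* privileges. A minimal sketch, assuming an existing Connect-VIServer session; the column name ImplicitPrivileges is made up for the output.

```powershell
# List each role and the System.* privilege IDs it carries.
Get-VIRole |
    Select-Object Name, @{
        Name       = 'ImplicitPrivileges'
        Expression = { @($_.PrivilegeList | Where-Object { $_ -like 'System.*' }) -join ', ' }
    } |
    Sort-Object Name
```

Comparing a freshly created custom role against the built-in roles in this output shows which privileges were added implicitly.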

The implication is that it’s harder to delegate minimal permissions on objects. Everybody who needs library access will see every object in all vCenters due to the inherited implicit read-only, even if you haven’t delegated any permissions. That is pretty bad (confidentiality between delegated users) or just annoying (seeing possibly thousands of objects that have no relevance to the user), depending on your environment.

Luckily there’s a simple workaround – overwrite permissions with “No access” on vCenter level (on every vCenter, that is). This built-in role is the only one that does not include read-only. This works as long as your delegated permissions don’t require vCenter-level permissions, and I can’t see a reason why they would right now. As you probably have delegated permissions somewhere below, they will overwrite “No access” again and delegated access will work. Funny thing – if you clone the “No access” role, the new role gets read-only added…
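The workaround could be scripted along these lines. This is a sketch, assuming PowerCLI sessions to all relevant vCenters are already open. The principal is hypothetical, and the vCenter-level permission is set on the root folder, which is what Get-Folder -NoRecursion returns.

```powershell
# Hypothetical group of Content Library consumers.
$principal = 'DOMAIN\library-users'

foreach ($server in $global:DefaultVIServers) {
    # Root ("Datacenters") folder of this vCenter.
    $root = Get-Folder -NoRecursion -Server $server

    # Built-in "No access" role; its name in PowerCLI is NoAccess.
    $role = Get-VIRole -Name 'NoAccess' -Server $server

    New-VIPermission -Entity $root -Principal $principal -Role $role -Propagate $true -Server $server
}
```

Permissions delegated on objects further down the tree take precedence over this inherited entry, which is why the delegated access keeps working.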

ISO mount requires “Read-only” on Content Library datastore

This took some thinking and trying to figure out. Let’s say your library is stored on a VMFS datastore that is not visible to your users. Sounds reasonable, it’s backend stuff after all and users should have no business there directly.

Now, deploying templates from a library on this hidden datastore will work fine. However, when you want to mount ISOs, you get an empty list. If you add “Read-only” on this datastore, it’ll start working. Keep in mind that this role only exposes object metadata (the datastore shows up in every datastore list, with no actionable features); users can’t see contents or change/write anything.

I may update this if I find something else.

VMware EVC now exposes Spectre mitigation MSRs with latest patches

Edit: speak of the devil… new vCenter and vSphere patches just released: https://www.vmware.com/security/advisories/VMSA-2018-0004.html Headline revised to reflect update.

Edit 2: As this update requires shutting down and starting VMs (a full power cycle, a simple restart does not work), use this PowerCLI command to find VMs that don’t yet have the new features exposed:

Get-VM |? {$_.extensiondata.runtime.featurerequirement.key -notcontains 'cpuid.IBRS'  -or $_.extensiondata.runtime.featurerequirement.key -notcontains 'cpuid.IBPB'}

While you can apply VMware patches and BIOS microcode updates, guests will not see any mitigation options if EVC is enabled (as these options were not in the original CPU specification). It’s the same for KVM/QEMU CPU masking; however, Hyper-V allows exposing new flags (probably because it doesn’t have anything like EVC besides the “compatibility” flag).

I haven’t yet tested without EVC, but with everything patched up, guests with Broadwell EVC don’t see the required MSRs with ESXi 6.5.

Running UNMAP on snapshotted VMware hardware 11+ thin VMs may cause them to inflate to full size

Scenario:

  1. ESXi 6.X (6.5 in my case)
  2. VMware hardware 11+ (13)
  3. Thin VM
  4. UNMAP aware OS (Windows 2012+)

If you snapshot a VM and run UNMAP (for example, a retrim from the defrag utility), the VM may (not always) inflate to full size during snapshot commit. It also results in really long commit times.

I’ve seen it happen quite a few times. It’s really annoying if, for some historical reason, you have tons of free space on drives (for example, NTFS dedup was enabled long after deployment), and it may even cause datastores to fill up (needless to say, really bad). It also tends to happen during backup windows, which keep snapshots open for quite a while (usually at night, terrible). While I could disable automatic retrim (bad with lots of small file operations, as normal UNMAP isn’t very effective on them due to alignment issues) or UNMAP altogether (even worse), it’s an acceptable risk for now as long as you keep enough free space on the datastore to absorb inflation of the biggest VM. You can retrim after the snapshot commit and the VM drops back to normal size quickly (minutes).

I haven’t seen this reported anywhere else, but I guess I’ll do a reproducible PoC, contact VMware support and post an update.
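A rough way to check the “enough free space to absorb the biggest VM” rule with PowerCLI is sketched below. It assumes an existing Connect-VIServer session and treats provisioned minus used space as the worst-case growth. That overestimates for VMs that span several datastores, because both figures are per-VM totals.

```powershell
foreach ($ds in Get-Datastore) {
    # VM on this datastore with the largest gap between provisioned and used space.
    $worst = Get-VM -Datastore $ds |
        Select-Object Name, @{ Name = 'GrowthGB'; Expression = { $_.ProvisionedSpaceGB - $_.UsedSpaceGB } } |
        Sort-Object GrowthGB -Descending |
        Select-Object -First 1

    if ($worst -and $worst.GrowthGB -gt $ds.FreeSpaceGB) {
        Write-Warning ("{0}: {1:N0} GB free, but {2} could grow by {3:N0} GB" -f `
            $ds.Name, $ds.FreeSpaceGB, $worst.Name, $worst.GrowthGB)
    }
}
```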

vSphere 6.5 guest UNMAP may cause VM I/O latency spikes – fixed in update 1

I converted some VMs to thin and upgraded the VM hardware version to 13 to test out savings. The initial retrim caused a transient I/O slowdown in the VM, but the issue kept reappearing randomly: I/O latency just spikes to 400ms for minutes for no apparent reason. It also seems to affect other surrounding VMs, just not as badly. After several days, I converted the VMs back to thick and the issues disappeared.

I’m not sure where the problem is and I can’t look into it anymore. Might be a bug in vSphere. Might be the IBM v7000 G2 SAN that goes crazy. As I said, I cannot investigate it any further but I’ll update the post if I ever hear anything.

PS! Savings were great, on some systems nearly 100% from VMFS perspective. On some larger VMs with possible alignment issues, reclamation takes several days though. For example, a 9TB thick file server took 3 days to shrink to 5TB.

Update 2017.06.29:

Veeam’s (or Anton Gostev’s) newsletter mentioned a similar issue just as I came across this issue again in a new vSphere cluster. In the end, VMware support confirmed the issue, with the fix expected in 6.5 Update 1 at the end of July.

Update much later in November

I’ve been running Update 1 pretty much since its release date and UNMAP works great! No particular performance hit. Sure, it might be a bit slower during an UNMAP run, but it’s basically invisible for most workloads.

I’ve noticed that for some VMs, you don’t get space back immediately. On some more internally fragmented huge (multi-TB) VMs, particularly those with 4K clusters, space usage seems to reduce slowly over days or weeks. I’m not sure what’s going on, but perhaps ESXi is doing some kind of defrag operation in the VMDK…? And yeah, doing a defrag (you can do it manually from the command line in Windows 2012+) and then UNMAP helps too.
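The manual defrag followed by UNMAP can be run from PowerShell inside a Windows 2012+ guest. A minimal sketch, with drive letter D as a placeholder; it needs an elevated session.

```powershell
# Consolidate files and free space first...
Optimize-Volume -DriveLetter D -Defrag -Verbose

# ...then send TRIM/UNMAP for all free space (same idea as "defrag D: /L").
Optimize-Volume -DriveLetter D -ReTrim -Verbose
```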

vSphere 6.5 virtual NVMe does not support TRIM/UNMAP/Deallocate

Update 2018.10.15

It works more or less fine in 6.7. Known issues/notes so far:

  • Ugly warnings/errors in the Linux kernel log if Discard is blocked (snapshot create/commit) – harmless
  • The Linux NVMe controller has a default timeout of 30s. With VMTools, only the SCSI timeout gets increased to 180s, so you might want to manually increase the nvme module timeout just in case. “CRAZY FAST, CRAZY LOW LATENCY!!!” you scream? Well, fabrics and transport layers may still have hiccups, and tolerating transient issues might be better than being broken.
  • When increasing VMDK sizes, the Linux NVMe driver doesn’t notice the namespace resize. Newer kernels (4.9+ ?) have a way to trigger a rescan, older ones require a VM reboot
  • One VMFS6 locking issue that may or may not be related to vNVMe. Will update if I remember to (or get feedback from VMware).
  • It seems to be VERY slightly faster and have VERY slightly lower CPU overhead. It’s within the margin of error, in real life it’s basically the same as PVSCSI.
  • The nice thing is that it works with Windows 7 and Windows 2008 R2! Remember that they don’t support SCSI UNMAP. However, NVMe Discard seems to work. Deleting files reclaims space, (ironically) a manual defrag frees space, and sdelete zeroing also successfully reclaims space.

I was playing with guest TRIM/UNMAP the other day and looked at the shiny new virtual NVMe controller. While it would not help much in my workloads, cutting overhead never hurts. So I tried to do “defrag /L” in a VM and it returned that the device doesn’t support it.

So I looked up release notes. Virtual NVMe device: “Supports NVMe Specification v1.0e mandatory admin and I/O commands”.

The thing is that the part of NVMe that deals with Deallocate (ATA TRIM/SCSI UNMAP in NVMe-speak) is optional. So back to PVSCSI for space savings…