In defence of cumulative updates

Windows CUs get a lot of hate these days. Rightfully so, occasionally. But consider the times before CUs, which were arguably even worse.

Going back to the era before Windows 8, there was the service pack + hotfix model. Deploy an SP and get hotfixes for a few years. Deploy the next SP and the cycle starts again. But over time fewer and fewer SPs came out and the years between SP releases got longer. It got worse with the Vista+ releases: Vista SP2 came out in 2009, which left eight hotfix-only years until EOL. Windows 7 SP1 came out in early 2011, so we were about six years in before CUs began.

The ugly part: the vast majority of hotfixes were limited-release. This meant they never showed up on WU/WSUS. You just couldn’t find them. There was no general list of updates, and some of them couldn’t be downloaded at all. Some MS teams had their own lists of recommended updates. Better than nothing, but always out of date. And still, most updates went under the radar. At one point I found out that the Microsoft KB portal had a per-product RSS feed. It was a great, somewhat obscure and semi-hidden way to stay up to date; sadly it stopped working about two years ago, I think around the time CUs became the new black, though it has since come back in a respun form.

Before the Windows 7 convenience rollup of 2016, I think I had ~500 hotfixes in my image building workflow. Maybe a quarter of them were public ones. Sure, quite a few were for obscure features and problems, but I believe in proactive patching. The really bad part was patching already deployed systems. These hotfixes couldn’t be used in WSUS/SCCM, so custom scripting it was. But as WU detection is really slow from a script, and because of the sheer number of patches and the plumbing required to handle supersedence… it was infeasible to deploy more than maybe a dozen or two of the most critical ones.

And there were quite a few. I think folder redirection and offline files required 5 patches to different components to work properly. All of them had to be hunted down manually. These were dark times…

Over the years, some community projects started to mitigate the problem. MyDigitalLife’s WHDownloader worked best for me; its main maintainer Abbodi86 is a Windows servicing genius. I built an image building framework around it that I use to this day.

The Windows 8 era started with monthly optional rollups. And these were great! Just great! Oh, how much I miss them! Pretty much (or totally?) every optional hotfix was quickly rolled up into the monthly rollup. These were not cumulative, so you could skip buggy ones (there were a few…) and still deploy the next month’s. And they had proper, detailed release notes: every issue fixed, each with reasonably detailed symptoms, cause and resolution. Sure, you had to deploy quite a few updates each month, but not having to hunt down limited-release hotfixes was a breeze. However, this model was abruptly stopped at the end of 2014; I never saw an announcement about it.

Windows 10 came, and later in 2016 cumulative updates came to the downlevel OSes too. While not perfect, it’s a HUGE upgrade over what we had before Windows 8, although I believe the Windows 8 model was still superior. If you think the current situation is bad, you either didn’t know the pain or just didn’t know better.

vSphere 6.5 and 6.7 qfle3 driver is really unstable

Edit 2019.03.08

In the end, RSS and iSCSI were separate issues. RSS is to be fixed in vSphere 6.7U2 sometime this spring. Updated Marvell (wow, Broadcom -> QLogic -> Cavium -> Marvell, I’m not sure what to call it by now) drivers are on VMware’s support portal. I haven’t tested them yet as I don’t currently have any Marvell NICs to try them on.

Some details from my ServerFault answer to a similar issue: https://serverfault.com/a/950302/8334

Edit 2018.10.15

Three months have passed and the QLogic/Cavium drivers are still broken. I’ve gotten a few debug drivers (and others have as well) but there’s no solution. The initial suspicion about bad optics was a red herring (the optics really were bad, but that was unrelated). Currently there are 2 issues:

  • Hardware iSCSI offload will PSOD the system (in my case in 5-30 minutes, in other cases randomly)
  • NIC RSS configuration will randomly fail (once every few weeks), causing total loss of network connectivity, a PSOD, or an NMI from the BIOS/BMC (or a combination of the three).

So far I’ve had to swap everything to Intel NICs (being between a rock and a hard place). They have their own set of problems, but at least no PSODs or network losses. Beacon probing doesn’t seem to work with Intel X710 based cards (confirmed by HPE) – incoming packets just disappear in the NIC/driver. Compared to random PSODs, I can live with that.

Edit 2018.07.11

HPE support confirmed that the qfle3 bundle is dead in the water. Our VAR was astonished that the sales branch was completely unaware of the severe stability issues. I’ve edited the subject to reflect the findings.

Edit 2018.07.09

QLogic qfle3i (and the whole QLogic 57810 driver bundle) seems to be just fucked. qfle3i crashes no matter what, and even the basic NIC driver qfle3 crashes occasionally. So if you’re planning to switch from bnx2 to qfle3 as required by HPE, don’t! bnx2 is at least stable for now (a quick driver check is sketched below). The latest HPE images already contain this fix – however, it doesn’t fix these specific crashes. VMware support also confirmed that there’s an ongoing investigation into this known common issue and that it also affects vSphere 6.5. I’m suffering on HPE 534FLR-SFP+ adapters, but your OEM may have other names for the QLogic/Cavium/Broadcom 57810 chipset.
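
For reference, a quick PowerCLI sketch (assuming an existing vCenter connection) to see which driver each host’s physical NICs are currently bound to, i.e. whether you’re still on bnx2 or already on qfle3:

# List physical NICs per host together with the driver in use
Get-VMHost | Get-VMHostNetworkAdapter -Physical |
    Select-Object VMHost, Name, @{ N = 'Driver'; E = { $_.ExtensionData.Driver } }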

A few days ago I was setting up a new green-field VMware deployment. As a team effort, we were ironing out configuration bugs and oversights, but despite all the fixes, the hosts kept PSODing consistently. The stack trace showed crashes in the QLogic hardware iSCSI adapter driver qfle3i.

Firmware was updated and patches were installed, to no effect. After some digging and trial and error, one fiber cable turned out to be faulty, causing occasional packet loss on the SAN-to-switch path. TCP should handle that in theory, but the hardware adapters seem to be much pickier. Monitoring was not yet configured, so it was quite annoying to track down. And as the SAN was not properly accessible, there was no persistent storage for logs or dumps either.

So if you’re using hardware adapters and seeing PSODs, check for packet loss on your switches. I won’t engage support for this as I have no logs or dumps. But if you see “qfle3i_tear_down_conn” in a PSOD, look for Ethernet problems.

Installing Oracle Developer 6 / Oracle Forms 6i on 64-bit systems

A few years ago I really needed to get Oracle Forms Runtime 6i working on 64-bit Windows. The only setup I had was the Oracle Developer 6 installer given to us by the application vendor 15 years ago. What is legacy may never die (Game of Thrones pun intended).

The setup would throw an error at some point (I’ve forgotten the error message, but it was something useless), so I took a deep dive into the ORAINST setup architecture. You know, the one before Oracle Universal Installer, the one nobody remembers. Automated installation already worked on 32-bit systems, but I didn’t quite get why it failed on 64-bit ones.

In the end I found out that ORAINST called a few self-extracting archives deep in the setup folders. And these archives turned out to be created with shareware (!) versions of PKWARE PKZIP. Wow, Oracle – really? The actual problem is that the self-extracting module is 16-bit, and 64-bit Windows doesn’t have NTVDM. And modern PKZIP versions don’t use the same command-line parameters anymore…

I wanted to preserve the ORAINST flow as much as possible, so the workaround involved:

  • Extract the PKZIP archive and recompress it with the 7-Zip self-extractor module. This gives you 32-bit or 64-bit code depending on your target (a sketch follows after this list).
  • A small wrapper script to translate the PKZIP arguments to 7-Zip ones:
    start "" /wait "%~dp0d2q60-32b.exe" -o"%3" -y

    where d2q60-32b.exe is the self-extractor created with 7-Zip.

  • Put both in the same folder and run them through BAT2EXE.
  • Replace the 16-bit file with the output of BAT2EXE.
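
A hypothetical sketch of that first bullet, using the 7-Zip command-line tool and its extraction SFX module (paths and filenames are examples; adjust the switches to your 7-Zip version):

# The original PKZIP self-extractor still opens as a plain archive in 7-Zip
$sevenZip = 'C:\Program Files\7-Zip'
& "$sevenZip\7z.exe" x .\d2q60.exe -opayload
# Repack the payload and glue the 7-Zip SFX module in front of it
& "$sevenZip\7z.exe" a -t7z .\payload.7z .\payload\*
cmd /c "copy /b `"$sevenZip\7z.sfx`" + payload.7z d2q60-32b.exe"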

Now when you run the ORAINST setup, the following happens:

  • ORAINST calls the problematic file with PKZIP-specific parameters
  • The BAT2EXE bootstrapper extracts itself and its payload to a temporary folder
  • The bootstrapper calls the wrapper script with whatever arguments were passed to it
  • The script takes just the third argument (the target path) and passes it to the 7-Zip extractor
  • 7-Zip extracts the data (some help files and documentation) to the target location
  • All components clean up after themselves
  • ORAINST continues the setup
  • Profit!

There might have been additional files, but for the features I needed only one file had to be replaced: “win32\d2dh\6_0_5_6_0\doc\d2q60.exe”. The ORAINST and Oracle 8 generation is mostly forgotten, so it’s hard to say whether there were alternative install media or other considerations. This way, Forms 6 worked on at least Windows 8.1 64-bit, and maybe on early Windows 10 builds as well; I’ve forgotten over the years. ORAINST also required XP SP3 compatibility mode and a few scriptable tweaks, but those were quite trivial.

Before working this out I had noticed a few threads on various forums with the same issue. So if you find this article and you still need to use Oracle Forms 6: congrats.

Working around slow PST import in Exchange Online

If you’ve tried Exchange Online PST import then you probably know that it’s as slow as molasses in January and sucks in pretty much every way.

  • “PST file is imported to an Office 365 mailbox at a rate of at least 1 GB per hour” is pure fantasy: 0,5 GB per hour should be considered excellent throughput, and in test runs I achieved only ~0,3 GB/h. Running everything in one batch seems to import PSTs with very limited parallelism (almost serially).
  • Security & Compliance Center is just unusably slow.
  • I had to wait 5 days for the Mailbox Import Export role to propagate before Import would activate. The documented 24 hours? You wish.
  • Feedback
  • I’ll just stop here…

I had a dataset to import and I didn’t plan to wait a month, so I looked around a bit. The only hint, in a since-lost Google result, was that you should separate imports into separate batches. However, the GUI is so slow that doing that by hand is just infeasible. So I went poking around in the backend.

This blog post looked promising and quite helpful, but it was concerned with other limitations of the GUI import. Nevertheless, you should read it to understand the workflow.

PowerShell access exists and works quite well. There’s talk of a “New-o365MailboxImportRequest” cmdlet in older posts, but that’s just ancient history. New-MailboxImportRequest works fine; only the source syntax is different from the on-premises version.

Notes:

  • You MUST use generic Azure Blob Storage. The autoprovisioned one ONLY works with the GUI; if you try to access it via PowerShell, you just get a 403 or 404 error for whatever reason.
  • Generate one batch per PST.
  • Azure blob names are case sensitive. Keep that in mind when creating your mapping tables (example below).
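
For context, the mapping table is just a small CSV. A hypothetical example matching the mailbox and name columns used in the script below (semicolon-separated, as in my locale; the name value must match the blob name case exactly):

mailbox;name
first.last;first.last.pst
sales.shared;Sales.Shared.pst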

So in the end I ran something like this. The script had a lot of additional logic, but I’ve cut the parts unrelated to the problem at hand.

#base URL for PSTs, your blob storage
$azblobaccount = 'https://blablabla.blob.core.windows.net/blablabla'
#the one like '?sv=...'
$azblobkey = 'yourSASkey'
#I used mapping table just as in Microsoft instructions and adapted my script. My locale uses semicolon as separator
$o365mapping = Import-Csv -Path "C:\Dev\o365mapping.csv" -Encoding Default -Delimiter ';'
ForEach ($account in $o365mapping) {
	#In case you have some soft-deleted mailboxes or other name collisions, get real mailbox name
	$activename= (get-mailbox -identity $account.mailbox).name
	#Name = PST filename
	#CASE SENSITIVE!!!
	$pstfile = ($azblobaccount + '/' + $account.name)
	#Just to differentiate jobs
	$batch = $account.mailbox
	#targetrootfolder and baditemlimit are optional. Batchname might be optional but I left it in just in case
	new-mailboximportrequest -mailbox $activename -AzureBlobStorageAccountUri $pstfile -AzureSharedAccessSignatureToken $azblobkey -targetrootfolder '/' -baditemlimit 50 -batchname $batch
}

So how did it work? Quite well, actually. I had 68 PSTs to import (~350GB in total). Creating all the batches took roughly an hour as I hit command throttling, but as the already-created jobs were running in the meantime, it didn’t really matter.

 (get-mailboximportrequest|measure).count
68

Exchange Online seems to distribute batches across many servers, which helps parallel throughput enormously.

((Get-MailboxImportRequest|Get-MailboxImportRequestStatistics).targetserver|select -unique|measure).count
65

As Exchange Online is quite resource-restricted, expect some imports to be stalled at any given time.

Get-MailboxImportRequest|Get-MailboxImportRequestStatistics|group statusdetail|ft count,name -auto

Count Name
----- ----
   43 CopyingMessages
   13 Completed
    8 StalledDueToTarget_Processor
    1 StalledDueToTarget_MdbAvailability
    2 StalledDueToTarget_DiskLatency
    1 CreatingFolderHierarchy

And now the numbers:

((Get-MailboxImportRequest|Get-MailboxImportRequestStatistics).BytesTransferredPerMinute|%{$_.tostring().split('(')[1].split(' ')[0].replace(',','')}|measure -sum).sum / 1GB
1,41345038358122

That’s 1,4 GB per minute. That’s like… a hundred times faster. I checked it at a random point after the import had been running for a while and some smaller PSTs were already complete. Keep in mind that large PSTs run relatively slower and may still take a while to complete; when processing the last and largest PSTs, throughput dropped to ~0,3 GB/min, but that’s still a lot faster than the GUI. Throughput scales with the number of parallel batches, so more jobs would probably result in even better aggregate throughput.

PowerShell oneliners to check Spectre/Meltdown mitigations

Microsoft’s script (https://gallery.technet.microsoft.com/scriptcenter/Speculation-Control-e36f0050) is somewhat inconvenient to use. While it is a fully functional module, it’s sometimes easier to just paste code into a PowerShell window for a quick check, or to do a Zabbix check with a one-liner. So I adapted Microsoft’s script to be more compact.

  • Results (without the additional details Microsoft’s script gives); a small interpretation sketch follows the one-liners below
    • -1 unsupported by the kernel (not patched or unsupported OS)
    • 0 disabled (go find out why; for example, Meltdown mitigation is always disabled on AMD)
    • 1 enabled
  • Should work on pretty much any PowerShell version; Windows 2003 with WMF 2.0 gave the proper result (-1)
  • Works without admin privileges (I presume the original did as well, never checked); needs Full Language mode
  • The two one-liners are almost identical; the only differences are the variable names (left as they were in the IDE while writing/testing) and the NtQuerySystemInformation information class
  • Should fit within a Zabbix item key if you put 256 chars (the strings are 466 chars before escaping) into a helper macro
  • Corners were cut (explicit casts, shortened variables), but I don’t fully understand P/Invoke and Win32 type marshalling, so there may still be clutter left to remove to reduce the size
  • By varying the parameters, you can query any data Microsoft’s script can query; just take a look at the original script’s source

Spectre

[IntPtr]$a=[System.Runtime.InteropServices.Marshal]::AllocHGlobal(4);If(!((Add-Type -Me "[DllImport(`"ntdll.dll`")]`npublic static extern int NtQuerySystemInformation(uint systemInformationClass,IntPtr systemInformation,uint systemInformationLength,IntPtr returnLength);" -name a -Pas)::NtQuerySystemInformation(201,$a,4,[IntPtr][System.Runtime.InteropServices.Marshal]::AllocHGlobal(4)))){[System.Runtime.InteropServices.Marshal]::ReadInt32($a) -band 0x01}Else{-1}

Meltdown

[IntPtr]$b=[System.Runtime.InteropServices.Marshal]::AllocHGlobal(4);If(!((Add-Type -Me "[DllImport(`"ntdll.dll`")]`npublic static extern int NtQuerySystemInformation(uint systemInformationClass,IntPtr systemInformation,uint systemInformationLength,IntPtr returnLength);" -name b -Pas)::NtQuerySystemInformation(196,$b,4,[IntPtr][System.Runtime.InteropServices.Marshal]::AllocHGlobal(4)))){[System.Runtime.InteropServices.Marshal]::ReadInt32($b) -band 0x01}Else{-1}
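
If you want to wrap either one-liner in a script of your own, a trivial (hypothetical) translation of the return values:

# $status holds whatever one of the one-liners above returned (-1, 0 or 1)
$status = -1   # placeholder; substitute the Spectre or Meltdown one-liner here
switch ($status) {
    -1 { 'No kernel support (unpatched or unsupported OS)' }
     0 { 'Mitigation supported but currently disabled' }
     1 { 'Mitigation enabled' }
}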

VMware EVC now exposes Spectre mitigation MSRs with latest patches

Edit: speak of the devil… new vCenter and vSphere patches just released: https://www.vmware.com/security/advisories/VMSA-2018-0004.html Headline revised to reflect update.

Edit 2: As this update requires shutting down and starting VMs (a full power cycle; a simple guest restart does not work), use this PowerCLI command to find VMs that don’t yet have the new features exposed:

Get-VM |? {$_.extensiondata.runtime.featurerequirement.key -notcontains 'cpuid.IBRS'  -or $_.extensiondata.runtime.featurerequirement.key -notcontains 'cpuid.IBPB'}
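
A hypothetical remediation sketch (assuming an existing vCenter connection): a graceful guest shutdown followed by a fresh power-on, since a plain restart is not enough. Coordinate with VM owners first.

# Find VMs that still lack the new CPU features, shut the guests down, then power them back on
$affected = Get-VM | Where-Object {
    $_.ExtensionData.Runtime.FeatureRequirement.Key -notcontains 'cpuid.IBRS' -or
    $_.ExtensionData.Runtime.FeatureRequirement.Key -notcontains 'cpuid.IBPB'
}
$affected | Stop-VMGuest -Confirm:$false      # graceful shutdown, requires VMware Tools
# ...wait for the guests to power off, then power them back on by name
Get-VM -Name $affected.Name | Where-Object { $_.PowerState -eq 'PoweredOff' } | Start-VM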

While you can apply the VMware patches and BIOS microcode updates, guests will not see any mitigation options if EVC is enabled (as these options were not part of the original CPU specification for the baseline). It’s the same for KVM/QEMU CPU masking; Hyper-V, however, allows exposing the new flags (probably because it doesn’t have anything like EVC besides the “compatibility” flag).

I haven’t yet tested without EVC, but with everything patched up, guests in a Broadwell EVC cluster don’t see the required MSRs on ESXi 6.5.

Running UNMAP on snapshotted VMware hardware 11+ thin VMs may cause them to inflate to full size

Scenario:

  1. ESXi 6.X (6.5 in my case)
  2. VM hardware version 11+ (13 in my case)
  3. Thin-provisioned VM
  4. UNMAP aware OS (Windows 2012+)

If you snapshot the VM and run UNMAP (for example a retrim from the defrag utility), the VM may (though not always) inflate to its full provisioned size during snapshot commit. It also results in really long commit times.

I’ve seen it happen quite a few times. It’s really annoying if, for some historical reason, you have tons of free space inside guest drives (for example when NTFS dedup was enabled long after deployment), and it may even cause datastores to fill up (needless to say, really bad). It also tends to happen during backup windows that keep snapshots open for quite a while (usually at night, which is terrible). While I could disable automatic retrim (bad with lots of small file operations, as normal UNMAP isn’t very effective on them due to alignment issues) or UNMAP entirely (even worse), it’s an acceptable risk for now as long as you keep enough free space on the datastore to absorb the inflation of the biggest VM. You can retrim after the snapshot commit and the disk drops back to its normal size quickly (within minutes); see the sketch below.
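
For reference, a manual retrim from inside a Windows guest is a single cmdlet (the drive letter is just an example):

# Re-send UNMAP/TRIM for the whole volume after the snapshot has been committed
Optimize-Volume -DriveLetter D -ReTrim -Verbose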

I haven’t seen this reported anywhere else, but I guess I’ll build a reproducible PoC, contact VMware support and post an update.

SCOM management packs in Zabbix – a year later

I discussed this about a year ago, but in the end I didn’t publish anything. I actually did get the “Windows Server Operating System” MP to be pretty much feature-complete (little to no OS metadata; health checks only) and it pretty much blows away any Zabbix built-in template and any other I’ve seen. There are a few additional bits that I found useful. It works fine on Windows 2012+ and… more or less fine on 2008 and 2008 R2. Some items are missing due to different performance counters (physical disk and networking, if I remember correctly), but I really haven’t bothered to edit it. All items and triggers use macros, so it’s easy to override checks.

The main issue remains the 256-character item limit. I did make some progress in packing extra PowerShell into this small limit (so previous posts may not be up to date), and the templates still don’t require any changes to the agent or any local scripts. Another issue is that I can’t reference items from other (linked) templates in triggers, and as you can’t add the same item in another template, this makes some templates REALLY annoying.

The 30-second command timeout also remains an issue, so you can’t actively defrag/chkdsk/unmap/trim or do very expensive checks. A command timeout behind a proxy seems to make the proxy reissue the command every few minutes, causing performance issues as the commands never complete and just repeat indefinitely. I did leave those checks in, but disabled. File system health is checked just from the dirty flag, and fragmentation information is read from the last-run data in the registry. It occasionally triggers false positives after VMware snapshots, but works reasonably well. I did figure out how to change disk optimization from weekly to daily in PowerShell, but it’s waaaaay too big to fit into an item for all OS versions. I did consider building the item command from multiple macros, but the change would have little value. For reference (2012+ only):

$v=[environment]::OSVersion.Version;If($v.major -gt 6 -or ($v.major -eq 6 -and $v.minor -ge 2)){$s='ScheduledDefrag';[xml]$t=Get-ScheduledTask $s|export-scheduledtask;$t.Task.Settings.MaintenanceSettings.Period='P1D';register-scheduledtask -TaskN $s -TaskP '\Microsoft\Windows\Defrag' -X $t.outerxml -F}
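
For readability, an expanded (and hypothetical) equivalent of that one-liner:

# Switch the built-in ScheduledDefrag task from weekly to daily maintenance (2012+ only)
$v = [environment]::OSVersion.Version
if ($v.Major -gt 6 -or ($v.Major -eq 6 -and $v.Minor -ge 2)) {
    $taskName = 'ScheduledDefrag'
    # Export the current task definition, bump the maintenance period, then re-register it
    [xml]$xml = Get-ScheduledTask $taskName | Export-ScheduledTask
    $xml.Task.Settings.MaintenanceSettings.Period = 'P1D'
    Register-ScheduledTask -TaskName $taskName -TaskPath '\Microsoft\Windows\Defrag' -Xml $xml.OuterXml -Force
}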

I did some work on the AD DS and File Server MPs, but it’s really time-consuming and they remain incomplete (they have helped to catch a few incidents, though). I did mostly complete the Exchange template, but it’s mostly telemetry (as in the original MP) and alerting mostly works by querying the health monitor; still, it has helped to diagnose issues and catch incidents early.

I’ll try to clean them up and release somehow… someday.

PS! I still think that Zabbix sucks but it’s one of the best among free stuff. 🙂

Workaround for NTFS deduplication error 0x8007000E “Not enough storage is available to complete this operation”

This can pop up when starting an optimization job, even when you have plenty of RAM and even if you give tons of memory to the job. The error message is misleading: “storage” here means memory.

The workaround is to simply increase the page file. I came across this issue on a Server Core 2016 machine that had 24GB of RAM for a 16TB volume. The analysis job caused commit to grow to almost 90% (without releasing it in time), so the optimization job could not allocate any memory. I didn’t go in depth (RAMMap etc.) though. After increasing the page file from the automatic ~2GB to 16GB, the jobs work just fine; a sketch follows below.
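
A hedged sketch of the page file change on Server Core (WMI/CIM based; path and sizes are examples, and a reboot is needed for it to take effect):

# Turn off automatic page file management
Get-CimInstance Win32_ComputerSystem |
    Set-CimInstance -Property @{ AutomaticManagedPagefile = $false }
# Create a page file setting if none exists yet, then size it to a fixed 16GB
$pf = Get-CimInstance Win32_PageFileSetting
if (-not $pf) { $pf = New-CimInstance -ClassName Win32_PageFileSetting -Property @{ Name = 'C:\pagefile.sys' } }
$pf | Set-CimInstance -Property @{ InitialSize = 16384; MaximumSize = 16384 }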

Keep in mind that commit does not mean that memory or the page file is actually used. It just means the application has been promised that this memory will be available when it is actually touched. Unused commit is backed by the page file first, so it’s basically free performance-wise, apart from the increased disk space use.
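
If you want a quick look at how close you are to the commit limit before and after resizing, a hypothetical check using the standard memory counters:

# Current commit charge versus the commit limit (RAM + page files), in GB
Get-Counter '\Memory\Committed Bytes','\Memory\Commit Limit' |
    ForEach-Object { $_.CounterSamples | Select-Object Path, @{ N = 'GB'; E = { [math]::Round($_.CookedValue / 1GB, 1) } } }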

Online P2V of domain controllers

Don’t do it, or do it in DSRM. Until, for various reasons, you just… can’t: unacceptable downtime, Exchange/SBS, Windows 2003 (you can’t stop AD services), etc. It doesn’t matter why; sometimes you just have to do the P2V online.

It’s (probably) not supported or recommended, but if you really need to, then (skipping the obvious steps):

  1. Stop replication some time before finalizing the conversion:
    repadmin /options %COMPUTERNAME% +DISABLE_OUTBOUND_REPL
    repadmin /options %COMPUTERNAME% +DISABLE_INBOUND_REPL
  2. Disconnect the target VM’s network and boot it into DSRM.
  3. Set the “Database restored from backup” flag in the registry – just in case! A hedged sketch follows after this list.
    https://technet.microsoft.com/nl-nl/library/dd363545(v=ws.10).aspx
  4. Boot normally
  5. Re-enable replication:
    repadmin /options %COMPUTERNAME% -DISABLE_OUTBOUND_REPL
    repadmin /options %COMPUTERNAME% -DISABLE_INBOUND_REPL
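
A hedged sketch for step 3, done while still in DSRM. The value name comes from the linked article; verify it for your OS version before relying on it (on systems without PowerShell, reg.exe with the same path works too):

# Mark the AD database as restored from backup so a new invocation ID is generated on next boot
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters' `
    -Name 'Database restored from backup' -PropertyType DWord -Value 1 -Force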


Again, neither supported nor recommended, but it has worked for me.