Discovering multi-instance performance counters in Zabbix

I’m not a fan of Zabbix, but you can’t always choose your tools. I’m no expert on Zabbix either, so feel free to improve my solution.

The original problem was that most Zabbix templates available online for Windows are plain rubbish. Pretty much everything monitored is hardcoded (N volumes to check for free space, N SQL Server instances to check, etc.). Needless to say, this is ugly and doesn’t work well in more complex scenarios (think mount points or volumes without a drive letter…). The agent’s built-in discovery is also quite limited.

My first instinct was to use Performance Counters, but the agent doesn’t know how to discover counter instances, once again requiring hardcoding. Someone actually patched the agent to allow that, but it has never been included in the official agent.

Low Level Discovery is your way out, but the usual approach relies on local scripts. I used it with local scripts for a while, but keeping them in sync and in place was quite annoying. Another option is UserParameter in the agent configuration. It has fewer limitations, but it requires custom configuration on the client, and I’d like to keep the agent basically stateless. I did use this implementation as inspiration though.

So one day I tried to squeeze it into the 255 characters allowed for a run command. And I got it to work.


  • It’s trimmed every way possible to shave off as many characters as I could.
  • 255 characters is actually very little, and you need to be really conservative…
  • …because you need to escape special characters three times: first escape strings in PowerShell, then escape special characters so the PowerShell command runs directly in CMD, and finally escape some characters for the Zabbix run command.
  • Double quotes are the main problem. I think this is the best solution, as I can’t use single quotes for JSON values.
  • If the counter doesn’t exist or there are no instances, it returns NULL.
  • You should be reasonably proficient in PowerShell and Zabbix to use this.
  • Should work with reasonably modern Zabbix servers and agents (2.2+).
  • I’ve only used it on Server 2012 R2, but it should also work on 2008 R2 (not 2008) and 2012. Let me know how it works for you.

Update 2.09.2016
I’ve updated the script to shave off a few more characters. I’ll update the rest of the post when I have some time.

So let’s figure this out. The original PowerShell script:

'{"data":['+(((Get-Counter -L 'PhysicalDisk'2>$null).PathsWithInstances|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}|?{$_ -ne '_Total'}|Select -U|%{"{`"{#PCI}`":`"$_`"}"}) -join ',')+']}'

Phew, that’s hard to read, even for me. But remember, characters matter. I’ll explain it in parts.


'{"data":['+

That’s just the JSON header for LLD. I found it easier, and cheaper in characters, to hardcode the JSON scaffolding rather than format the data with JSON cmdlets.
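For comparison, this is the payload shape we’re hand-building. A Python sketch with made-up instance names shows what a proper JSON serializer would produce:

```python
import json

# Hypothetical discovered counter instances
instances = ["0 C:", "1 D:"]

payload = {"data": [{"{#PCI}": i} for i in instances]}
print(json.dumps(payload))
# {"data": [{"{#PCI}": "0 C:"}, {"{#PCI}": "1 D:"}]}
```

In the one-liner there’s no room for this, hence the hardcoded header and footer.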

(Get-Counter -L 'PhysicalDisk'2>$null).PathsWithInstances

As the name suggests, this retrieves the instances of the PhysicalDisk counter set. You need it to keep track of IO queues, for example. Replace it with the counter you need. This actually retrieves all instances for all counters, but we’ll clean that up later.
Sending errors to null allows discovering counters that might not exist on all servers (think IIS or SQL Server); otherwise you’d get an error (Zabbix reads back both StdOut and StdErr), but now it just returns NULL (i.e. nothing was discovered).
You can use the * wildcard. For SQL Server, this is a must.


%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}

First I check whether there was anything in the pipeline. Without this, you’d get a pipeline error if there was no counter or no instances. Then I cut the instance name out of the full counter path.

Actually, you can leave out the cutting part. On multi-instance SQL Server hosts (when you used a wildcard for the counter name) you actually have to keep the full name (both counter and counter instance), as the counter name contains the SQL Server instance name. For example:


I usually prefer to keep only the instance names, but it’s optional. Let’s go on…

?{$_ -ne '_Total'}

This is optional and can be omitted. Most counters have a “_Total” aggregate instance that may or may not be useful depending on the counter. For example, with PhysicalDisk it’s more or less useless, as you need per-instance counters for anything useful. On the other hand, Processor Information can be used to get both total and per-CPU/core/NUMA-node metrics.

Select -U

Remember that we’re actually working with all counters for all instances? This cleans them up, keeping a single entry per instance.


%{"{`"{#PCI}`":`"$_`"}"}

Builds a JSON entry for each discovered instance. {#PCI} is the macro name for prototypes. PCI is an arbitrary name (Performance Counter Instances). You can change it or trim it down to a single character: {#I}.

-join ','

Concatenates all instance JSON entries into one string.


+']}'

The JSON footer, nothing fancy, hardcoded.

Now the escaping. First PowerShell to CMD:

  • " –> """
  • | –> ^|
  • > –> ^>
  • prefix with "powershell -c"

The result should run without errors in CMD and return the instances as JSON:

powershell -c '{"""data""":['+(((Get-Counter -L 'PhysicalDisk'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}^|?{$_ -ne '_Total'}^|Select -U^|%{"""{`"""{#I}`""":`"""$_`"""}"""}) -join ',')+']}'

Escaping for Zabbix

  • " –> \"
  • Add [" to the start
  • Add "] to the end

["powershell -c '{\"\"\"data\"\"\":['+(((Get-Counter -L 'PhysicalDisk'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}^|?{$_ -ne '_Total'}^|Select -U^|%{\"\"\"{`\"\"\"{#PCI}`\"\"\":`\"\"\"$_`\"\"\"}\"\"\"}) -join ',')+']}'"]

But oh no, it’s now 268 characters! You need to cut something out. Luckily you now have some examples for that. Here are some more Zabbix-formatted examples:

["powershell -c '{\"\"\"data\"\"\":['+(((Get-Counter -L 'Processor Information'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}^|Select -U^|%{\"\"\"{`\"\"\"{#I}`\"\"\":`\"\"\"$_`\"\"\"}\"\"\"}) -join ',')+']}'"]

["powershell -c '{\"\"\"data\"\"\":['+(((Get-Counter -L 'MSSQL*Databases'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1]}}^|Select -U^|%{\"\"\"{`\"\"\"{#I}`\"\"\":`\"\"\"$_`\"\"\"}\"\"\"}) -join ',')+']}'"]
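Since the escaping steps are purely mechanical, they can be scripted. A Python sketch that applies the CMD and Zabbix passes to an already-valid PowerShell one-liner (helper names are mine):

```python
def escape_for_cmd(ps_oneliner):
    """PowerShell one-liner -> a command line CMD passes through unmangled."""
    s = ps_oneliner.replace('"', '"""')   # triple every double quote
    s = s.replace('|', '^|')              # escape pipes for CMD
    s = s.replace('>', '^>')              # escape redirects for CMD
    return 'powershell -c ' + s

def escape_for_zabbix(cmd_line):
    """CMD command line -> bracketed argument for a Zabbix run command."""
    return '["' + cmd_line.replace('"', '\\"') + '"]'

cmd = escape_for_cmd('$x="hi";$x|Write-Output')
print(escape_for_zabbix(cmd))
# ["powershell -c $x=\"\"\"hi\"\"\";$x^|Write-Output"]
```

Handy for regenerating keys after tweaking the one-liner, instead of escaping by hand.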

Now for the item prototypes, if you cut the instance down to just the counter instance name:

  • Name: IO Read Latency {#PCI}
  • Key: perf_counter["\PhysicalDisk({#PCI})\Avg. Disk sec/Read",60]

If you didn’t trim the name and kept the counter name:

  • Name: IO Read Latency {#PCI}
  • Key: perf_counter["\{#PCI}\Avg. Disk sec/Read",60]

Keep in mind that the name will now be something like “IO Read Latency PhysicalDisk\0 C:”.

Again, if you have any improvements, especially to cut character count – let me know.

Superseding dependencies in System Center Configuration Manager‏

Yeah, it’s difficult, error-prone and somewhat buggy.

Imagine the following scenario:

  • Library application X v1
  • Main application Y, depends on X
  • You need to upgrade X from v1 to v2 by first uninstalling the old version and then installing the new one

The only way I’ve seen this work is deleting the dependency, deploying the X upgrade semi-manually and then setting the dependency to v2. Any other attempt will get you “Rule is in conflict with other rules” in deployment monitoring, as the agent will refuse to remove v1.

The second scenario:

  • Library application Z v1
  • Main application Q, depends on Z
  • You need to upgrade Z to v2, no uninstall is necessary

Support for this was explicitly added in 2012 R2 SP1, which made the scenario possible if v2 supersedes v1. In real life I’ve found it very unreliable. The worst I’ve seen was the agent gobbling up GBs of RAM, grinding systems to a halt: application detection got stuck in a loop and leaked memory, as some old applications were set to depend on v1 and some newer ones on v2. In better cases: good old “Rule is in conflict with other rules”.

There used to be a workaround: add both Z v1 and Z v2 to the same dependency group but clear the Install Automatically flag on v1. This seems to have stopped working in 1602 or 1606, as the client will stop dependency processing if v1 is not found. I only tested this very briefly on new builds, so don’t take my word for it. Might be a bug, or maybe the original behavior in 2012 R2 was the buggy one…

This makes you wish for something like MDT application bundles: group the Z versions into a bundle and create a dependency on the bundle.

Generally the application model is great, but it has a lot of annoying gotchas. Or maybe it’s just me…

Leave a note in the comments if you’ve found a better way.

Checking Estonian ID code correctness in PowerShell

This is based on an implementation in another language that I found on Google many years ago; I’ve forgotten the details and the exact source.
As usual, it’s not the most elegant version, but it works just fine and hasn’t been modified in years. For the formal validation algorithm, use Google. I haven’t seen an official public document for it, but there are a few implementation examples out there (PHP, Delphi, C#, JS, etc.).

I originally used it for automatically loading ID card certificates into Active Directory for smart card logon. I’ll build up to releasing that by going over the various pieces that make it work.


  • Wrap the function call in Try-Catch and If. Parameter validation throws an error, but the actual ID code validation returns true or false. I know it’s ugly, but it’s good enough for me.
  • It really only checks that the string contains exactly 11 digits and that the checksum is correct. There is no guarantee that a person with that code actually exists.
Function Validate-Isikukood {
	Param(
		# Parameter validation throws on anything that is not exactly 11 digits
		[Parameter(Mandatory=$True)][ValidatePattern('^\d{11}$')][string]$Isikukood
	)
	[char[]]$IsikukoodArray = $Isikukood.ToCharArray()
	$IDCheck1 = [convert]::ToInt32($IsikukoodArray[0],10) * 1 + [convert]::ToInt32($IsikukoodArray[1],10) * 2 + [convert]::ToInt32($IsikukoodArray[2],10) * 3 + [convert]::ToInt32($IsikukoodArray[3],10) * 4 + [convert]::ToInt32($IsikukoodArray[4],10) * 5 + [convert]::ToInt32($IsikukoodArray[5],10) * 6 + [convert]::ToInt32($IsikukoodArray[6],10) * 7 + [convert]::ToInt32($IsikukoodArray[7],10) * 8 + [convert]::ToInt32($IsikukoodArray[8],10) * 9 + [convert]::ToInt32($IsikukoodArray[9],10) * 1
	$IDCheckSum = $IDCheck1 % 11
	If ($IDCheckSum -eq 10) {
		$IDCheck2 = [convert]::ToInt32($IsikukoodArray[0],10) * 3 + [convert]::ToInt32($IsikukoodArray[1],10) * 4 + [convert]::ToInt32($IsikukoodArray[2],10) * 5 + [convert]::ToInt32($IsikukoodArray[3],10) * 6 + [convert]::ToInt32($IsikukoodArray[4],10) * 7 + [convert]::ToInt32($IsikukoodArray[5],10) * 8 + [convert]::ToInt32($IsikukoodArray[6],10) * 9 + [convert]::ToInt32($IsikukoodArray[7],10) * 1 + [convert]::ToInt32($IsikukoodArray[8],10) * 2 + [convert]::ToInt32($IsikukoodArray[9],10) * 3
		$IDCheckSum = $IDCheck2 % 11
		If (($IDCheckSum -eq 10) -and ([convert]::ToInt32($IsikukoodArray[10],10) -eq 0)) {
			Return $True
		} ElseIf (($IDCheckSum -ne 10) -and ([convert]::ToInt32($IsikukoodArray[10],10) -eq $IDCheckSum)) {
			Return $True
		} Else {
			Return $False
		}
	} ElseIf (($IDCheckSum -ne 10) -and ([convert]::ToInt32($IsikukoodArray[10],10) -eq $IDCheckSum)) {
		Return $True
	} Else {
		Return $False
	}
}
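For readers outside the PowerShell world, the same checksum algorithm ports easily. A Python sketch (function name and test value are mine; the 11th digit is the check digit):

```python
def validate_isikukood(code):
    """Estonian personal ID code checksum check, same algorithm as above."""
    if len(code) != 11 or not code.isdigit():
        raise ValueError("ID code must be exactly 11 digits")
    digits = [int(c) for c in code]
    # First-pass weights; if the checksum comes out as 10, a second pass
    # with shifted weights is used, and if that is also 10, the check digit is 0.
    weights1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1]
    weights2 = [3, 4, 5, 6, 7, 8, 9, 1, 2, 3]
    checksum = sum(d * w for d, w in zip(digits, weights1)) % 11
    if checksum == 10:
        checksum = sum(d * w for d, w in zip(digits, weights2)) % 11
        if checksum == 10:
            checksum = 0
    return digits[10] == checksum

print(validate_isikukood("37605030299"))  # True
```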

Calculating size of user’s mailbox and any delegated mailboxes

Outlook by default limits the OST file to 50GB (in modern versions), but some users may have tons of delegated mailboxes and run into this limit. This script finds users that have more than 50GB of delegated and personal mailboxes visible. You might not want to increase the OST limit for everyone…

A possible use case is a situation where you have delegated several large mailboxes to multiple users. As mailboxes grow and tickets start coming in, you want to find the problematic users proactively.

This really becomes an issue when you delegate mailboxes to groups. I’ll post a script that updates msExchDelegateListBL for group memberships in a few days, as Exchange doesn’t do that automatically. TL;DR: if you delegate a mailbox to a group, Outlook doesn’t autoload it. I have a script to remediate that.


  • This is a slow and ugly one-off, but as I only needed it once, it just works. As always, read the disclaimer on the left.
  • You need the Exchange Management Tools installed on your PC. It doesn’t work in a remote management PowerShell session, as the proper data types aren’t loaded. Install the management tools on your PC and run the Exchange Management Shell.
  • This script looks up only admin-delegated mailboxes. Any folders, mailboxes or public folders shared and loaded by users themselves are not included. This is the server-side view only.
$userlist = get-aduser -Filter *
foreach ($user in $userlist) {
	$usermailbox = get-mailbox $user.distinguishedname 2>$null
	If ($usermailbox) {
		$DelegationList = (get-aduser -Identity $user.distinguishedname -Properties msExchDelegateListBL).msExchDelegateListBL
		If ($DelegationList) {
			$usermailboxsize = (Get-MailboxStatistics -Identity $usermailbox | select @{label="TotalSizeBytes";expression={$_.TotalItemSize.Value.ToBytes()}}).TotalSizeBytes
			$SharedSize = ($DelegationList | %{get-mailbox -Identity $_ | Get-MailboxStatistics | select displayname,@{label="TotalSizeBytes";expression={$_.TotalItemSize.Value.ToBytes()}},totalitemsize} | measure -Sum TotalSizeBytes).Sum
			$TotalVisibleSize = (($usermailboxsize + $SharedSize) / 1GB)
			If ($TotalVisibleSize -gt 50) {
				Write-Host $user.Name
				Write-Host $TotalVisibleSize
			}
		}
	}
}

Powershell arrays are passed by reference, unlike basic variables‏

PowerShell is great in many ways yet very unintuitive in others.

Consider following example:

$a = 0
$b = $a
$b = 1
$a #0
$b #1

All seems good and logical? Now introduce arrays:

$a = ,1
$b = $a
$a #1
$b[0] = 2
$a #2!
$b #2

What? How did $a change? Surely this is an artifact of direct modification or something. Let’s try passing array to a function.

function b {param($c);$d=$c;$d[0]=2;$d}
$a = ,1
$a #1
b $a #2
$a #2!

Now that’s annoying if you’re passing the same array around in a script. No amount of scoping or other tinkering will fix it. A bit of MSDN and StackOverflow reveals that arrays are always, and I mean always, passed by reference, something inherited from .NET. There are a few not-so-pretty workarounds.

Use the .Clone() method. The caveat is that it only works one level deep, so if you use multidimensional arrays, you’re out of luck. Example:

function b {param($c);$d=$c.Clone();$d[0]=2;$d[1][0]=2;$d}
$a = 1,(,1)
$a #1,1
b $a #2,2
$a #1,2!

As you can see, the first level of the array works fine but the second does not.

Serialize and deserialize the array. It’s a really ugly workaround, but it’s guaranteed to work; take a look here. I haven’t tested it because cloning worked for my needs, but I have a feeling it’s much slower. That may or may not be an issue depending on your requirements. It might be a good idea to wrap it in a function for easy use.
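The same shallow-versus-deep distinction exists in other runtimes too. A Python sketch of the two workarounds’ semantics (copy.copy behaves like .Clone(), copy.deepcopy like serialize-deserialize):

```python
import copy

a = [1, [1]]

# Shallow copy: the analogue of .Clone(); only the first level is
# independent, nested arrays are still shared.
shallow = copy.copy(a)
shallow[0] = 2
shallow[1][0] = 2
print(a)   # [1, [2]] - the nested change leaked back

# Deep copy: the analogue of the serialize-deserialize workaround.
a = [1, [1]]
deep = copy.deepcopy(a)
deep[0] = 2
deep[1][0] = 2
print(a)   # [1, [1]] - fully independent
```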

Wishlist: a runtime flag or global variable to pass arrays by value.

Deleting System Center Configuration Manager application will not delete supersedence relationship‏

Imagine a scenario where you have the following applications in SCCM:

  • Application X v1.0
  • Application X v1.1 – supersedes v1.0
  • Application X v1.2 – supersedes v1.1, deployed to clients

At some point you might want to delete v1.0 from SCCM. Keep in mind that the supersedence data will not be updated for v1.1: it will still contain broken supersedence information that breaks deployment even for v1.2, as the client can no longer build the supersedence chain (v1.1 references a nonexistent application). You must manually remove the supersedence information from v1.1.

Observed in 1602 and 1606. I’ve thought about scripting this validation and remediation, but SCCM PowerShell is quite cryptic past very basic operations (WQL or sparsely documented .NET classes for most things). I guess a deep dive into the SCCM SDK is in order someday.

The funny thing is that this seems to be the only relation that is not enforced. And no, I haven’t contacted MS support about this, as it’s not important enough for me to burn a support ticket.

Sertifitseerimiskeskus OCSP is not RFC compliant

This issue appeared a few months ago when SK introduced OCSP for KLASS-SK 2010 CA. Previously there was no OCSP at all, only CRL.

The issue is that the OCSP responder answers “revoked” for expired certificates. You might think one should never use an expired certificate. True, but the world is not always so black and white. You might not really care for retired or archived systems or internal services. One might simply forget to renew a certificate, or the admin is on vacation, etc. People are imperfect and processes do fail. Previously you’d get a warning that the certificate had expired, but it was easy to click through that, no worries. Now you get hard-blocked.

The current revision is RFC 6960, which basically says that you may reply “revoked” only if the certificate actually is revoked or if it has never been issued. In any other case, the correct response is “good” or “unknown”. The obsoleted RFC 2560 makes basically the same statement.

SK support is aware of the issue, but their statement was that this will not be fixed. I guess this is a business decision (you must order a new certificate – $$$ – or use a self-signed/internal CA), as I know of no other major CA that behaves like this. I’m not a security guy, but I don’t think it’s really an issue if a certificate is used a few days past its expiration date because of a human mistake.

Windows 2016 deduplication data corruption with huge files – fixed in KB3216755

Update 27.01.2017: Corruption issue has been fixed in KB3216755 and newer!

Pre-emptive remark! Currently NTFS dedup supports files up to 1TB; anything beyond that may or (in this case) may not work. Windows Server 2016 is not generally available yet, so any bugs may be fixed by GA.

First, some background and info on 2012R2 dedup.

As is known, the WS2016 RTM bits are already out there (same version as Windows 10 1607). Deduplication is supposed to be much better this time. It wasn’t too bad in WS2012R2, but optimization was slow on big volumes or with lots of modified data.

The file size limit is still 1TB though. Why would you ever want to go higher? Image-based backup, such as Veeam: big VMs result in multi-TB files. Storage may be cheap, but it’s not that cheap. NTFS dedup will still let you reduce storage costs, even when using cheap solutions.

WS2012R2 could go much higher in practice if you followed some guidelines. The main limitation is not the dedup engine itself but NTFS, which limits how fragmented big files can become. Fragmentation increases the number of file extents; if you hit the File Record limit, the file can no longer grow. The deduplicated size on disk is often reported as zero bytes, but behind the scenes the reparse point seems to hold all the metadata needed to rebuild the file. It is also likely that modified data gets stored in the reparse point until optimization. I’m not 100% sure about that, but you can make some reasonable guesses from the article The Four Stages of NTFS File Growth, Part 2 and real-life behavior.

Anyway, the guidelines:

  • Format NTFS with Large File Records. This is IMHO critical for any NTFS volume containing big (dozens of GiB) files, especially when using NTFS compression (yeah, it actually is a good idea… in some rare scenarios) and/or 4KiB clusters. It allows the file record to grow bigger before hitting fragmentation limits.
  • Format NTFS with a 64KiB cluster size. Might increase performance, but mainly helps to reduce the amount of metadata per file, again keeping NTFS limitations at bay.
  • Write files out in one run and never modify them later. Optimizing huge files is slow anyway; reprocessing them is even slower and a good way to fragment your data.
  • Defragment the file system. This reduces the number of extents, allowing bigger files.
  • Apply the registry flag EnablePriorityOptimization. This makes the dedup engine aggressively defragment fragmented files.
  • Keep reasonable free space (20%), with thin provisioning if possible. Again, to keep fragmentation under control, as NTFS tends to fragment heavily when free disk space runs low. Thin provisioning lets you avoid wasting actual disk space.
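For reference, the first two bullets translate into a single format command. A sketch only: the drive letter is a placeholder, the command wipes the volume, and I haven’t verified this exact invocation on every OS build:

```powershell
# 64KiB clusters plus large file record segments (-UseLargeFRS); destructive!
Format-Volume -DriveLetter D -FileSystem NTFS -AllocationUnitSize 65536 -UseLargeFRS
```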

Some optional stuff

  • Do free-space consolidation in addition to the usual defrag. This reduces fragmentation of future data.
  • Use contig.exe from Sysinternals. It’s basically per-file defrag and may help if you do hit the limitations. Just make sure you have enough free space and that it’s more or less contiguous.

With care, you could easily get to 4TB files and beyond. I’ve tested with 6TB files, but processing gets sloooooooow (2 days at least) due to the single-threaded nature of dedup. If you do want to run it this way, pick a CPU with a high clock rate; additional cores won’t help you (unless you have a lot of volumes). If you look around forums (mainly Veeam’s), you’ll see people using 4TB files in production. Again, YMMV; it’s unsupported, but it works with some care.

There are also some forum threads with information from the product team. Supposedly WS2016 will skip processing data beyond 4TB, and WS2012R2 dedup will work fine with 1TB+ files if you are careful. But unofficial information is unofficial.

So WS2016 is supposed to be multi-threaded. And it is: processing is much faster. On my testbed it used 4 threads. I didn’t have server-grade hardware with sufficient storage at hand, so I threw together some surplus desktop hardware. Test environment:

  • Windows Server 2016 RTM patched to .82 (no changes to dedup components though)
  • HP desktop with a 4th-generation i5
  • 32GB RAM
  • 3*6TB HGST UltraStar – Storage Spaces striped (for performance): 16.4TB total

I wanted to generate semirandom data: not random in the crypto sense, but random enough while still compressible, to simulate a realistic workload. Dummy File Generator works great for that. It’s much faster than true random data and results in about a 1:40 dedup ratio. Could be less, but good enough. I used this great tool to create huge files in 1TB increments.

Aaand I get chunk store corruption with files over 1TB. It will process one-point-something TB, report an error in the event log and abort the file. Get-DedupMetadata recommends running scrubbing. If you do, it will report that some of your files have been corrupted.

Rinse and repeat. Well actually: wipe, test the disks, reinstall, test again. Still corruption. Well, that sucks. I didn’t actually try to read back the data.

I haven’t ruled out all the variables yet, but WS2012R2 works great on that very same system. WS2016 general availability is still over a month away, though, so it could get patched: even in an unsupported scenario, data corruption is still data corruption.

I’ll add more info when I have some.

Update 23.08.2016

Several runs of Memtest, a few hours of Prime95, checked disks again, reinstalled Windows. Still corruption.

After double-checking I discovered that WS2012R2 had actually run fine on a different but otherwise identical PC.

So I grabbed another similar system from the shelf and started over, this time using only inbox drivers. I also replaced the Crucial MX100 boot SSD; it has an awful reputation for stability and could be to blame here. It might be that during memory-intensive processing some RAM spilled to the pagefile and got corrupted.

Still, another test run just failed at 2.1TiB of 3TiB. I don’t have any other ideas to check. I think deduplication is currently simply broken for huge files.

Some event log bits.

Data Deduplication aborted a file range.
FileId: 0x40000000000B1
FilePath: D:\1-3
RangeOffset: 0x25000000000
RangeLength: 0x1000000000
ErrorCode: 0x80070017
ErrorMessage: Data error (cyclic redundancy check).
Details: Failed to notify optimization for range

Data Deduplication aborted a file.
FileId: 0x40000000000B1
FilePath: D:\1-3
FileSize: 0x30000000000
Flags: 0x0
TotalRanges: 48
SkippedRanges: 0
AbortedRanges: 0
CommittedRanges: 33
ErrorCode: 0x80070017
ErrorMessage: Data error (cyclic redundancy check).
Details: Failed to add range to session

The following was created at the start of the run, several hours earlier. Not sure if it’s relevant, but several similar emergency files were created.

Data Deduplication created emergency file GCReservedSpaceBitmap.tmp.
   Running the deduplication job.
   Volume name: D: (\\?\Volume{de37720c-fc77-4cdb-9a4d-1864a0e07953}\)

Scrubbing hasn’t started yet. I’ll report its results when it completes.

Update 26.08.2016

I forgot to export the event logs before wiping the system. I installed 2012R2 to be sure; it has processed 15TB+ of data so far and is currently processing a 6TB test file with 7TB in the queue. No problems so far. If all goes well, it’ll probably have processed a 10TB file by Monday. The main bottleneck is dedup data processing performance, only about 200MB/s at a 3.8GHz clock (an i7-4770 this time). The RAID can write new files at 600-650MB/s, but it’s really one or the other. Even when trying to make the dedup job high-performance (no throttling, high priority, realtime process, increased IO priority…), it doesn’t even attempt to compete with other IO, slowing down about 10 times. WS2016 does much better, but we know the story with that…

I’m not a fan of the new, less detailed release notes for updates. The new production preview patch (.103?) has updated NTFS components with vague notes about improved reliability. I’d like to know what the original issue was without updating blindly… But that’s the new normal; I’ll try it out next week.

Update 28.08.2016

WS2012R2 had successfully deduped a 9TB file and was midway through a 10TB one before I cancelled the process and installed WS2016 with .103. It’s running overnight; we’ll see the results in the morning.

Update 30.08.2016

Corrupted. I’ll put this on hold until further patches or more information.

Why did I start blogging again?

I had a blog for many years, but at the time it had little focus, so I discarded it as having little value to anybody.

In the years since the previous blog’s closure, I’ve every now and then had subjects worth sharing with the world: perhaps a solution to some problem, or even just documentation of some issue. So today I installed WordPress, and I actually have a few posts in mind. The focus is around my work (IT), so let’s hope I’ll have time to keep writing.

I try to write only on subjects that have had very little coverage (edge cases, implementation details, little-known problems etc) or are very hard to find on Google. No generic news or RTFM/UTFG tutorials.

Scripts are mostly unedited (some irrelevant details, logging functions, etc. are usually removed). A lazy sysadmin is a good sysadmin, so I usually don’t polish my scripts beyond “it works fine and is reasonably readable”. If you need something better, DIY or improve my examples. It’d be nice if you shared improvements though.