Porting System Center Operations Manager Management Pack to Zabbix

Figuring out performance counter discovery inspired me to investigate the possibility of porting the SCOM MP to Zabbix. I’ve spent a few days playing with the idea and the Windows Server MP, and I think that a fairly similar experience can be achieved. My objectives:

  • Minimal configuration on the target server – only allow server commands and increase command timeout.
  • Minimal dependencies on target server – PowerShell only
  • No scripts must be deployed on target server
  • Multi-instance items are auto-discovered
  • Functionally similar alerts and gathered data

After a few days of tinkering, it’s clear that there have to be compromises:

  • 255 character key limit implies a lot of compromises
  • Some counters have to be changed because of that (Processor vs Processor Information)
  • SCOM MP has some huge scripts with extended error checking and data collection that cannot be fully re-implemented due to key limitations
  • Due to that there will be compatibility and support issues
  • Scripts use different interfaces based on operating system version and edition (full or core) to work around bugs and issues. This cannot be faithfully emulated. Workarounds might be version-edition based templates or flipping discoveries on-off manually. I don’t think you can automatically flip discoveries based on other queries.
  • Edge cases might be missed naturally
  • Pretty much everything requires custom LLD script as agent discovery is useless
  • Item prototypes across discoveries have to be unique even though generated items are guaranteed to be unique. This again runs into problems with key length on some objects.
  • Some items have to be discovered multiple times due to subtle differences between interfaces. For example Network Adapter performance counters and MSFT_NetAdapter have different interface names due to some characters not being supported in perfmon (various brackets are changed, # gets changed to _). Another example is LogicalDisk perfmon that uses disk letter (where possible) or object manager name (for example boot volume). However volume metadata cannot be queried using object manager name so you must rediscover volume GUIDs.
  • Unit monitors and rules might use the same counters but Zabbix doesn’t allow duplicate items/keys. So far the best solution is to use “perf_counter[counter]” for unit monitors/triggers and averaged “perf_counter[counter,interval]” for rules. There might be fewer alerts as measurements are short but at least historical data collection is more accurate. It’s really one or the other way…
  • Many triggers need anti-flap measures as Zabbix has no global solution for that.
  • Key limit means that realistically only one macro or data value (or two in some cases) can be collected per LLD or item. Some metadata is lost, especially in Event Logs.
  • Event Log based rules may have to be split over multiple items as XPath queries are fairly long. They can be gathered under single trigger though.
  • I haven’t decided whether to go for discovery based Event Log processing or simple item based. Discovery based means that multiple items-triggers-alerts could be generated for distinct events, however I’m concerned that I’ll hit key length limitations and this would be abusing LLD functionality. Simple item based however is much simpler but you only get indication that something is wrong that requires more investigation.
  • As maximum agent timeout is 30 seconds, some long checks are likely to time out, such as defrag analyze or readonly chkdsk.
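
Since the 255 character limit drives most of the compromises above, it helps to measure a candidate key before pasting it into a template. Below is a minimal sketch; the function name Test-KeyLength and its parameters are made up for illustration, and the limit is simply the 255 characters mentioned in this post.

```powershell
# Hypothetical helper: reports how much of the key length budget a candidate key uses.
# The default limit of 255 is the figure quoted in this post, not read from Zabbix.
function Test-KeyLength {
    param(
        [Parameter(Mandatory)][string]$Key,
        [int]$Limit = 255
    )
    [pscustomobject]@{
        Length    = $Key.Length
        Remaining = $Limit - $Key.Length
        Fits      = $Key.Length -le $Limit
    }
}

# Single quotes keep PowerShell from touching the macro, quotes and backslashes.
Test-KeyLength -Key 'perf_counter["\PhysicalDisk({#PCI})\Avg. Disk sec/Read",60]'
```

A negative Remaining value tells you how many characters still have to be trimmed.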

I’m halfway done so I guess I’ll publish on GitHub or something when I have something useful. Some community cooperation would be nice for some cases. Some analysis/compromises might change as I find workarounds to problems.

Update 6.9.2016

Monitors and rules are pretty much all implemented. I’m still polishing the scripts to put as much logic as possible in LLD keys but I’ve worked around some issues. I guess I’ve also spotted some bugs in MP. Current notes list:

  • It turns out that I’m an idiot and some LLD discoveries do work out better with ConvertTo-JSON. I can avoid expensive double quotes this way (expensive as 1 double quote results in 6 characters in final LLD key string if square brackets are also involved), allowing more logic and more item macros to be returned if necessary. This implies PS/WMF 3.0 but I think that’s a reasonable compromise.
  • Some LLD queries get “Not supported” error on some servers for no apparent reason, must debug.
  • I’m working on applications. So far it’s a mess but I guess I’ll stick to 3 applications (Collection, Alert, Monitor) per category (Logical Disk, Operating System…)
  • I haven’t touched views to graphs much but some issues:
    • You can’t create horizontal graphs (for example add counter X for each LLD discovered item X to one graph) for LLD items without ugly server-side scripted workarounds.
    • Some views reference items that I’ve discovered under different LLD queries, and there is no reasonable way to add them to a single graph.
  • No overrides for most items for most triggers. I did a few for items that regularly hit thresholds in my environment but macros are really uncomfortable to use so I skipped over that.
  • Event Log based items check for events in last 24 hours. Anything more would take forever for alerts to clear. It’s quite simple to implement and works reasonably well.
  • Some Event Log rules in MP specify plain wrong event sources (eg quota events are from NTFS, not Disk). Some sources have different names but I can’t test them all as I have no samples.
  • Most event log rules can’t be tested as I have no samples to collect.
  • Checks are not consistent. Some return number of events, some full message from last event, some an attribute from last event. It depends on how I thought it’d work best.
  • I’ve added a few extra checks that MP itself doesn’t cover. For example
    • Agent ping to detect downtime.
    • .NET assemblies get updated (ngen update) daily as some scripts require libraries to be up-to-date and compiled for maximum performance to fit in timeout window.
    • Defrag analyze gets invoked daily. Surprisingly, it mostly gets done in 30 seconds, unless the volume is really badly fragmented. VSS dedicated volumes trigger an alert (I guess you can’t defrag VSS snapshot data) without a reasonable way to exclude them automatically, but you can always disable the problematic trigger on the host.
    • ChkDsk and Defrag (if over threshold, regardless of previous analyze result) get invoked daily as maximum update interval is 24 hours. So far it seems to work well. Items report errors because of timeout but as WMI keeps running on client, jobs actually complete. I’m not sure if ChkDsk sets dirty flag if read-only ChkDsk finds issues but I hope it does so another item can detect an issue.
  • Support for non-English locales is not an issue for me so I will likely not implement it. I’m currently using English strings for Perfmon; looking up registry keys for each item… maybe later.
  • I decided that there is little reason to distinguish between system volume and others when monitoring free disk space. An extra macro in LLD would do but catch-all seems like a better idea.
  • Currently I copied KB article contents to item descriptions. I guess it sounds like a copyright issue so I have to remove them again.
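
To illustrate the ConvertTo-JSON note above: building the LLD structure as objects lets the cmdlet do all the quoting, so the script itself contains no double quotes that need escaping later. This is only a sketch with made-up sample instance names, not one of the actual discovery keys, and it assumes PowerShell 3.0 or later.

```powershell
# Sample instance names; on a real server these would come from a discovery query.
$instances = '0 C:', '1 D:'

# One hashtable per instance, keyed by the LLD macro name.
$rows = @($instances | ForEach-Object { @{ '{#PCI}' = $_ } })

# -Compress drops whitespace; -Depth 3 is set explicitly so nothing gets flattened.
$json = @{ data = $rows } | ConvertTo-Json -Compress -Depth 3
$json
```

The result should have the usual LLD shape, a "data" array with one {#PCI} entry per instance.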

I also peeked around File Server MP. Checking firewall port rule seems like a good idea but a compact implementation looks next to impossible…

Discovering multi-instance performance counters in Zabbix

I’m not a fan of Zabbix but you can’t always select your tools. I’m no expert on Zabbix so feel free to improve my solution.

The original problem was that most Zabbix templates available online for Windows are plain rubbish. Pretty much everything monitored is hardcoded (N volumes to check for free space, N SQL Server instances to check etc). Needless to say, this is ugly and doesn’t work well with more complex scenarios (think mount points or volumes without disk letter…). Agent built-in discovery is also quite limited.

My first instinct was to use Performance Counters but agent doesn’t know how to discover counter instances, once again requiring hardcoding. Someone actually patched agent to allow that but it has never been included in official agent.

Low Level Discovery is your way out but it implies using local scripts. I used it with local scripts for a while but keeping them in sync and in place was quite annoying. Another option is to use UserParameter in agent configuration. There are fewer limitations but this requires custom configuration on the client and I’d like to keep the agent basically stateless. I did use this implementation as inspiration though.

So one day I tried to squeeze it into the 255 characters allowed for a run command. And I got it to work.

Notes:

  • It’s trimmed every way possible to reduce characters as best as I could.
  • 255 characters is actually very little and you need to be really conservative…
  • …because you need to escape special characters 3 times. First escape strings in PowerShell. Then escape special characters to execute PowerShell commands directly in CMD. And finally escape some characters for Zabbix run command.
  • Double quotes are the main problem. I think that this is the best solution as I can’t use single quotes for JSON values.
  • If counter doesn’t exist or there are no instances, returns NULL.
  • You should be reasonably proficient in PowerShell and Zabbix to use that
  • Should work with reasonably modern Zabbix server and agents (2.2+)
  • I only used it on Server 2012 R2 but it should work also on 2008 R2 (not 2008) and 2012. Let me know how it works for you.

Update 2.09.2016
I’ve updated the script to shave off a few more characters. I’ll update this post when I have some time.

So let’s figure this out. The original PowerShell script:

'{"data":['+(((Get-Counter -L 'PhysicalDisk'2>$null).PathsWithInstances|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}|?{$_ -ne '_Total'}|Select -U|%{"{`"{#PCI}`":`"$_`"}"}) -join ',')+']}'

Phew, that’s hard to read even for myself. But remember, characters matter. I’ll explain it in parts.

'{"data":['

That’s just the JSON header for LLD. I found it easier, and cheaper in characters, to hardcode some data rather than format data for the JSON cmdlets.

(Get-Counter -L 'PhysicalDisk'2>$null).PathsWithInstances

As you might think, this retrieves instances of PhysicalDisk. You need it to keep track of IO queues, for example. Replace it with the counter you need. This actually retrieves all instances for all counters, but we’ll clean this up later.
Sending errors to null allows to discover counters that might not exist on all servers (think IIS or SQL Server) – otherwise you’d get error (Zabbix reads back both StdOut and StdErr) but now it just returns NULL (eg nothing was discovered).
You can use * wildcard. For SQL Server, this is a must.

%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}

First I check if there was anything in the pipeline. Without this, you’d get a pipeline error if there was no counter or no instances. Then I cut out the name of the instance.
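
To see what the cutting does, here is the same chain applied step by step to a single path. The sample path is an assumed example of what PathsWithInstances contains; the variable names are just for illustration.

```powershell
# Sample path in the form found in PathsWithInstances (assumed example).
$path = '\PhysicalDisk(0 C:)\Avg. Disk sec/Read'

$object   = $path.Split('\')[1]      # PhysicalDisk(0 C:)
$trimmed  = $object.Trim(')')        # PhysicalDisk(0 C:
$instance = $trimmed.Split('(')[1]   # 0 C:
$instance
```

Note that the last Split keeps only the text between the first and second opening bracket, so this relies on instance names not containing round brackets of their own.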

Actually you can leave out the cutting part. In multi-instance SQL Server servers (when you used wildcard for counter name) you actually have to keep full name (both counter and counter instance) as counter name contains SQL Server instance name. For example:

%{If($_){$_.Split('\')[1]}}

I usually prefer to keep only instance names but it’s optional. Let’s go on…

?{$_ -ne '_Total'}

This is optional and can be omitted. Most counters have a “_Total” aggregated instance that may or may not be useful depending on the counter. For example with PhysicalDisk, it’s more or less useless as you’d need per-instance counters for anything useful. On the other hand, Processor Information can be used to get both total and per-CPU/core/NUMA-node metrics.

Select -U

Remember that we’re actually working with all counters for all instances? This cleans them up, keeping single entry for instance.

%{"{`"{#PCI}`":`"$_`"}"}

Builds JSON entry for each discovered instance. {#PCI} is macro name for prototypes. PCI is arbitrary name – Performance Counter Instances. You can change that or trim to just one character – {#I}.

-join ','

Concatenates all instance JSON entries into one string.

']}'

JSON footer, nothing fancy, hardcoded.
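
For reference, here are the same steps written out readably, without the character budget in mind. This is a sketch for understanding and testing only, far too long for a key. The function name ConvertTo-InstanceLld is made up, and it takes the paths as a parameter so it can be tried with sample data instead of a live Get-Counter call.

```powershell
# Hypothetical helper: turns counter paths into the LLD JSON string.
function ConvertTo-InstanceLld {
    param([string[]]$Paths)

    # Cut the instance name out of each non-empty path.
    $names = foreach ($p in $Paths) {
        if ($p) { $p.Split('\')[1].Trim(')').Split('(')[1] }
    }

    # Drop the aggregate instance and keep a single entry per instance.
    $names = @($names | Where-Object { $_ -and $_ -ne '_Total' } | Select-Object -Unique)

    # Build one JSON entry per instance, then wrap them in the hardcoded header and footer.
    $entries = @($names | ForEach-Object { "{`"{#PCI}`":`"$_`"}" })
    '{"data":[' + ($entries -join ',') + ']}'
}

# Sample paths standing in for (Get-Counter -ListSet 'PhysicalDisk').PathsWithInstances
$sample = '\PhysicalDisk(0 C:)\Avg. Disk sec/Read',
          '\PhysicalDisk(0 C:)\Avg. Disk sec/Write',
          '\PhysicalDisk(_Total)\Avg. Disk sec/Read'
ConvertTo-InstanceLld -Paths $sample
```

With the sample paths above only the "0 C:" instance should survive, as the duplicate and _Total are filtered out.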

Now the escaping. First PowerShell to CMD:

  • " -> """
  • | -> ^|
  • > -> ^>
  • Prefix the result with: powershell -c

The result should run without errors in CMD and return the instances as JSON.

powershell -c '{"""data""":['+(((Get-Counter -L 'PhysicalDisk'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}^|?{$_ -ne '_Total'}^|Select -U^|%{"""{`"""{#PCI}`""":`"""$_`"""}"""}) -join ',')+']}'

Escaping for Zabbix

  • " -> \"
  • Add system.run[" to the start
  • Add "] to the end

system.run["powershell -c '{\"\"\"data\"\"\":['+(((Get-Counter -L 'PhysicalDisk'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}^|?{$_ -ne '_Total'}^|Select -U^|%{\"\"\"{`\"\"\"{#PCI}`\"\"\":`\"\"\"$_`\"\"\"}\"\"\"}) -join ',')+']}'"]

But oh no, it’s now 268 characters! You need to cut something out. Luckily you now have some examples for that. Here’s some more Zabbix formatted examples:

system.run["powershell -c '{\"\"\"data\"\"\":['+(((Get-Counter -L 'Processor Information'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}^|Select -U^|%{\"\"\"{`\"\"\"{#I}`\"\"\":`\"\"\"$_`\"\"\"}\"\"\"}) -join ',')+']}'"]
system.run["powershell -c '{\"\"\"data\"\"\":['+(((Get-Counter -L 'MSSQL*Databases'2^>$null).PathsWithInstances^|%{If($_){$_.Split('\')[1]}}^|Select -U^|%{\"\"\"{`\"\"\"{#I}`\"\"\":`\"\"\"$_`\"\"\"}\"\"\"}) -join ',')+']}'"]
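
The two escaping passes described above are mechanical, so they can be scripted while you experiment with trimming. A minimal sketch follows; the function name ConvertTo-ZabbixRunKey is made up, and it applies the replacements blindly to every occurrence, which matches the rules listed above but has not been checked against scripts that use other CMD special characters.

```powershell
# Hypothetical helper: applies the PowerShell -> CMD -> Zabbix escaping rules from this post.
function ConvertTo-ZabbixRunKey {
    param([Parameter(Mandatory)][string]$Script)

    # Pass 1: PowerShell to CMD. Triple the double quotes, caret-escape | and >.
    $cmd = $Script.Replace('"', '"""').Replace('|', '^|').Replace('>', '^>')
    $cmd = 'powershell -c ' + $cmd

    # Pass 2: CMD to Zabbix. Backslash-escape double quotes and wrap in system.run["..."].
    'system.run["' + $cmd.Replace('"', '\"') + '"]'
}

# The original script from this post, kept literal in a single-quoted here-string.
$script = @'
'{"data":['+(((Get-Counter -L 'PhysicalDisk'2>$null).PathsWithInstances|%{If($_){$_.Split('\')[1].Trim(')').Split('(')[1]}}|?{$_ -ne '_Total'}|Select -U|%{"{`"{#PCI}`":`"$_`"}"}) -join ',')+']}'
'@

$key = ConvertTo-ZabbixRunKey -Script $script
$key          # the finished item key
$key.Length   # compare against the 255 character limit
```

Running it on each trimmed variant shows right away whether the change bought enough characters.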

Now for item prototypes, if you cut instance down to counter instance name.

  • Name: IO Read Latency {#PCI}
  • Key: perf_counter["\PhysicalDisk({#PCI})\Avg. Disk sec/Read",60]

If you didn’t trim name and kept counter name

  • Name: IO Read Latency {#PCI}
  • Key: perf_counter["\{#PCI}\Avg. Disk sec/Read",60]

Keep in mind that the name will now be something like “IO Read Latency PhysicalDisk(0 C:)”

Again, if you have any improvements, especially to cut character count – let me know.