Porting System Center Operations Manager Management Pack to Zabbix

Figuring out performance counter discovery inspired me to investigate the possibility of porting a SCOM MP to Zabbix. I’ve spent a few days playing with the idea and the Windows Server MP, and I think that a fairly similar experience can be achieved. My objectives:

  • Minimal configuration on the target server – only allow server commands and increase the command timeout
  • Minimal dependencies on the target server – PowerShell only
  • No scripts deployed on the target server
  • Multi-instance items are auto-discovered
  • Functionally similar alerts and gathered data

After a few days of tinkering, it is clear that there have to be compromises:

  • The 255-character key limit forces a lot of compromises
  • Some counters have to be changed because of that (Processor vs Processor Information)
  • The SCOM MP has some huge scripts with extended error checking and data collection that cannot be fully re-implemented due to key limitations
  • As a result, there will be compatibility and support issues
  • The MP scripts use different interfaces based on operating system version and edition (full or core) to work around bugs and issues. This cannot be faithfully emulated. Workarounds might be version-edition based templates or switching discoveries on and off manually. I don’t think you can automatically switch discoveries based on other queries.
  • Naturally, some edge cases might be missed
  • Pretty much everything requires a custom LLD script, as agent discovery is useless
  • Item prototypes have to be unique across discoveries, even though the generated items are guaranteed to be unique. This again runs into problems with key length on some objects.
  • Some items have to be discovered multiple times due to subtle differences between interfaces. For example, Network Adapter performance counters and MSFT_NetAdapter have different interface names because some characters are not supported in perfmon (various brackets are changed, # gets changed to _). Another example is the LogicalDisk perfmon object, which uses the disk letter (where possible) or the object manager name (for example for the boot volume). However, volume metadata cannot be queried using the object manager name, so you must rediscover volume GUIDs.
  • Unit monitors and rules might use the same counters, but Zabbix doesn’t allow duplicate items/keys. So far the best solution is to use “perf_counter[counter]” for unit monitors/triggers and the averaged “perf_counter[counter,interval]” for rules. There might be fewer alerts as measurements are short, but at least historical data collection is more accurate. It really is one way or the other…
  • Many triggers need anti-flap measures as Zabbix has no global solution for that.
  • The key limit means that realistically only one macro or data value (or two in some cases) can be collected per LLD or item. Some metadata is lost, especially in Event Logs.
  • Event Log based rules may have to be split over multiple items, as XPath queries are fairly long. They can be gathered under a single trigger though.
  • I haven’t decided whether to go for discovery based or simple item based Event Log processing. Discovery based means that multiple items-triggers-alerts could be generated for distinct events; however, I’m concerned that I’ll hit key length limitations and that this would be abusing LLD functionality. Simple item based is much simpler, but you only get an indication that something is wrong and requires more investigation.
  • As the maximum agent timeout is 30 seconds, some long checks are likely to time out, such as defrag analyze or read-only chkdsk.
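
To make the adapter-name mismatch from the list above concrete, here is a minimal sketch. It is written in Python purely for illustration (the actual checks are PowerShell), and the substitution table is an assumption based on the commonly described perfmon behaviour: round brackets become square brackets, and #, / and \ become underscores. The helper name perfmon_instance_name is hypothetical.

```python
# Assumed character substitutions perfmon applies to instance names.
# Only "brackets are changed" and "# becomes _" come from the notes above;
# the "/" and "\" entries are an additional assumption.
_PERFMON_SUBSTITUTIONS = str.maketrans({
    "(": "[",
    ")": "]",
    "#": "_",
    "/": "_",
    "\\": "_",
})


def perfmon_instance_name(interface_description: str) -> str:
    """Map an adapter description to the name perfmon would likely show."""
    return interface_description.translate(_PERFMON_SUBSTITUTIONS)


if __name__ == "__main__":
    print(perfmon_instance_name("Intel(R) 82574L Gigabit Network Connection #2"))
```

Since the mapping is one-way (an underscore could have been #, / or \), going from a perfmon instance back to the adapter is not reliable, which is why the same adapters end up being discovered twice.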

I’m halfway done so I guess I’ll publish on GitHub or something when I have something useful. Some community cooperation would be nice for some cases. Some analysis/compromises might change as I find workarounds to problems.

Update 6.9.2016

Monitors and rules are pretty much all implemented. I’m still polishing the scripts to put as much logic as possible in LLD keys, but I’ve worked around some issues. I guess I’ve also spotted some bugs in the MP. Current list of notes:

  • It turns out that I’m an idiot and some LLD discoveries do work out better with ConvertTo-JSON. I can avoid expensive double quotes this way (expensive because 1 double quote results in 6 characters in the final LLD key string if square brackets are also involved), which allows more logic and more item macros to be returned if necessary. This implies PS/WMF 3.0, but I think that’s a reasonable compromise.
  • Some LLD queries get a “Not supported” error on some servers for no apparent reason; I still have to debug this.
  • I’m working on applications. So far it’s a mess but I guess I’ll stick to 3 applications (Collection, Alert, Monitor) per category (Logical Disk, Operating System…)
  • I haven’t done much about converting views to graphs, but some issues so far:
    • You can’t create horizontal graphs (for example add counter X for each LLD discovered item X to one graph) for LLD items without ugly server-side scripted workarounds.
    • Some views reference items that I’ve discovered under different LLD queries, and there is no reasonable way to add them to a single graph.
  • There are no overrides for most items and triggers. I did a few for items that regularly hit thresholds in my environment, but macros are really uncomfortable to use so I skipped the rest.
  • Event Log based items check for events in the last 24 hours. With anything longer, alerts would take forever to clear. It’s quite simple to implement and works reasonably well.
  • Some Event Log rules in the MP specify plain wrong event sources (e.g. quota events are from NTFS, not Disk). Some sources have different names, but I can’t test them all as I have no samples.
  • Most event log rules can’t be tested as I have no samples to collect.
  • Checks are not consistent. Some return number of events, some full message from last event, some an attribute from last event. It depends on how I thought it’d work best.
  • I’ve added a few extra checks that the MP itself doesn’t cover. For example:
    • Agent ping to detect downtime.
    • .NET assemblies get updated (ngen update) daily, as some scripts require libraries to be up-to-date and compiled for maximum performance to fit in the timeout window.
    • Defrag analyze gets invoked daily. Surprisingly, it mostly gets done within 30 seconds, unless the volume is really badly fragmented. VSS dedicated volumes trigger an alert (I guess you can’t defrag VSS snapshot data) with no reasonable way to exclude them automatically, but you can always disable the problematic trigger on the host.
    • ChkDsk and Defrag (if over threshold, regardless of the previous analyze result) get invoked daily, as the maximum update interval is 24 hours. So far it seems to work well. Items report errors because of the timeout, but as WMI keeps running on the client, the jobs actually complete. I’m not sure if ChkDsk sets the dirty flag when a read-only ChkDsk finds issues, but I hope it does so that another item can detect the issue.
  • Support for non-English locales is not an issue for me, so I will likely not implement it. I’m currently using English strings for Perfmon; looking up registry keys for each item… maybe later.
  • I decided that there is little reason to distinguish between the system volume and others when monitoring free disk space. An extra macro in LLD would do, but a catch-all seems like a better idea.
  • For now I have copied KB article contents to item descriptions. That sounds like a copyright issue, so I will have to remove them again.
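
As a rough illustration of the ConvertTo-JSON note above, this is the shape of discovery payload that a serializer produces for you: a {"data": [...]} envelope whose rows map {#MACRO} names to values, which is what Zabbix low-level discovery expected at the time of writing. The sketch is in Python for illustration only (the real discoveries are PowerShell embedded in the item key), and the function names and the VOLUME macro are hypothetical. The length check uses the 255-character key limit mentioned earlier.

```python
import json

KEY_LIMIT = 255  # item key length limit discussed in the post


def lld_payload(rows):
    """Wrap rows of values in the {"data": [...]} LLD envelope.

    Each row is a dict such as {"volume": "C:"}; names are upper-cased
    and wrapped as {#NAME} to form LLD macros.
    """
    data = [
        {"{#%s}" % name.upper(): value for name, value in row.items()}
        for row in rows
    ]
    # Compact separators keep the payload short.
    return json.dumps({"data": data}, separators=(",", ":"))


def fits_key_limit(key, limit=KEY_LIMIT):
    """Return True if an item key is within the length limit."""
    return len(key) <= limit


if __name__ == "__main__":
    print(lld_payload([{"volume": "C:"}, {"volume": "D:"}]))
    print(fits_key_limit("perf_counter[\\LogicalDisk(C:)\\% Free Space]"))
```

Letting a serializer emit the quotes means they never have to be typed, and escaped, inside the key itself, which is where the characters are saved.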

I also peeked around the File Server MP. Checking the firewall port rule seems like a good idea, but a compact implementation looks next to impossible…