One of our customers proposed a new check after running into a single process that ate up all the CPU time on a filer. Once you are on the filer's command line (in priv mode), the culprit is easy to identify by issuing the ps command.

To automate this kind of monitoring and raise an alarm as soon as a process gets out of control, we now offer check_netapp_processes.

Example

This is how one would check using the default thresholds:

$ ./check_netapp_process.pl -H filer -s filer-01 -u admin -p ******
NETAPP_PROCESS CRITICAL - 1064 processes checked, 2 critical and 0 warning
idle: cpu0: 102.0 (CRITICAL)
idle: cpu1: 102.0 (CRITICAL)
ontap_dead_bsd_thre: 1.0
worker_thread_38: 0.0
iswts_sockio: 0.0
# very long list deleted
SMBOff [...] | worker_thread_38=0.00%;20;50;0;100 iswts_sockio=0.00%;20;50;0;100 wafl_blog_early_kickout_worke=0.00%;20;50;0;100
# lots of perfdata deleted
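
The perfdata in the sample output above appears to follow the usual Nagios plugin layout label=value;warn;crit;min;max, which would put the default thresholds at 20 (warning) and 50 (critical). As a minimal sketch under that assumption, the following splits such a perfdata string into its parts. The helper name parse_perfdata is hypothetical and not part of the plugin, and it only handles labels without spaces.

```shell
# Hypothetical helper (not part of the plugin): read perfdata items from
# stdin and print "label value warn crit" for each one.
# Assumes the layout label=value;warn;crit;min;max and labels without spaces.
parse_perfdata() {
  tr ' ' '\n' |
    awk -F'[=;]' 'NF >= 4 { sub(/%$/, "", $2); print $1, $2, $3, $4 }'
}

# Two items taken from the sample output above
echo 'worker_thread_38=0.00%;20;50;0;100 iswts_sockio=0.00%;20;50;0;100' |
  parse_perfdata
```

For the two items shown, this prints each process name followed by its value and the thresholds 20 and 50.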

Filtering the processes

Since the list is quite long, filters can be set by means of --exclude and --include. For example, if you do not want to check the idle processes, you would configure the check like this:

$ ./check_netapp_process.pl -H filer ... -X ^idle:
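
To get a feeling for what such a pattern removes, here is a rough sketch that mimics the filter with grep. It assumes the pattern is treated as a regular expression matched against the process name. The helper exclude_processes is hypothetical and only illustrates the idea; it is not how the plugin is implemented.

```shell
# Hypothetical helper (not part of the plugin): drop every process name
# that matches the exclude pattern given as $1. Names are read from stdin,
# one per line. Assumes the pattern is an (extended) regular expression.
exclude_processes() {
  grep -Ev -- "$1"
}

# A few process names taken from the example output above
printf '%s\n' 'idle: cpu0' 'idle: cpu1' 'worker_thread_38' 'iswts_sockio' |
  exclude_processes '^idle:'
```

With these four names, only worker_thread_38 and iswts_sockio remain, since the anchor ^ restricts the match to names that start with idle:.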

For other tips on how to deal with the very long output of this check, have a look at the article Overly Long Outputs.

Availability

This check will be available for testing in the next unstable version (3.10.1_12), which we will release today. At the moment it supports only cDOT, not 7-Mode. If you would like to get this check for 7-Mode too, please send us the CLI commands used on 7-Mode to get the process list. You can try them out with check_netapp_anycli and send us the --in values. (For your reference, these are the commands we use to get the list on cDOT: set advanced -confirmations off;node run -node <node-name> -command ps )