A notable new feature in this release of check_netapp_pro is the --rm_ack_handler parameter, which lets users supply their own method for resetting remote service acknowledgements on checks that have failed.
This is not about detecting what has failed, but about managing the acknowledgements of what has failed. Do we want to silence notifications (acknowledge the service problem) while keeping the failed status, or do we want to resume notifications for failed components (remove the acknowledgement)? More importantly, we want to control this decision based on whether the cause of the failure has changed or is simply ongoing. We received unexpected reports that this existing functionality was failing on some systems, and the cause turned out to be environment variables that the monitoring daemons no longer set.
The acknowledgement mechanism used to be handled internally by check_netapp_pro, and its reliance on these environment variables left the software vulnerable to upgrades elsewhere. Fixing this code also gave us the opportunity to modernize the rm_ack functionality and, at the same time, to implement the new --rm_ack_handler parameter for added flexibility.
Usage: check_netapp_pro.pl -H $HOST -o volume --rm_ack=reason_change --rm_ack_handler=/usr/local/nagios/bin/rm_ack_handler.sh
In the example above, suppose check_netapp_pro originally detected that the usage of one volume had exceeded its critical boundary, and that this error was acknowledged, which stopped further notifications from going out for this filer. Next, the software notices that a different volume has also exceeded its warning or critical boundary. Because this is a different reason (a different volume in this case), --rm_ack kicks in. At this point the script specified by the new --rm_ack_handler parameter is called with the appropriate arguments to reset the service acknowledgement, and notifications start going out again.
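To make the handler step more concrete, here is a minimal sketch of what a script like rm_ack_handler.sh could look like on a Nagios system. It is an illustration only, not the handler shipped with check_netapp_pro. The argument order (host name, then service description) and the command file path are assumptions, so check which arguments your version actually passes. The one piece taken from Nagios itself is the REMOVE_SVC_ACKNOWLEDGEMENT external command.

```shell
#!/bin/sh
# Hypothetical rm_ack_handler.sh sketch (not the handler shipped with check_netapp_pro).
# ASSUMPTION: the plugin calls this script as: rm_ack_handler.sh <host_name> <service_description>

# ASSUMPTION: location of the Nagios external command file; it varies per installation.
COMMAND_FILE="${COMMAND_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Build the Nagios external command line that removes a service acknowledgement.
# $1 = host name, $2 = service description, $3 = epoch timestamp
format_rm_ack() {
    printf '[%s] REMOVE_SVC_ACKNOWLEDGEMENT;%s;%s\n' "$3" "$1" "$2"
}

# Submit the command to Nagios by appending it to the external command file.
remove_ack() {
    format_rm_ack "$1" "$2" "$(date +%s)" >> "$COMMAND_FILE"
}

# Only act when both expected arguments were supplied.
if [ "$#" -ge 2 ]; then
    remove_ack "$1" "$2"
fi
```

Because the handler is an external script, the same hook could instead call a web API or a different monitoring system's command interface, with no dependency on environment variables set by the monitoring daemon.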