Collectors have two major advantages: they put less computational load on the monitored device, and they allow greater flexibility in configuring service checks. But this approach also brings a new challenge.

Distributed Monitoring (op5) with 3 peer-nodes

If monitoring is set up with multiple peer nodes (“distributed monitoring”), it can happen that one peer node collects the data and a different one later executes the check. The store file on the node executing the check is then out of date, and the check refuses to use the stale data.

Peer2 ]# /opt/plugins/custom/check_netapp_pro.pl Head -H 10.135.1.40 --storedir /tmp/netapp_data -o power --max_age 60
Store file (10.135.1.40.head) is out of date! (770.0 min., --max_age is set to 60)
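
The freshness test behind this message can be illustrated with a minimal Python sketch. It assumes the age is derived from the store file's modification time; the article does not say how check_netapp_pro.pl actually determines it, and the function names store_age_minutes and is_out_of_date are hypothetical.

```python
import os
import time


def store_age_minutes(path, now=None):
    """Return the age of a store file in minutes.

    Assumption: age is measured from the file's mtime. The real plugin
    may use a timestamp stored inside the file instead.
    """
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 60.0


def is_out_of_date(path, max_age_minutes, now=None):
    """True if the store file is older than the allowed maximum age."""
    return store_age_minutes(path, now) > max_age_minutes
```

With a store file last written 770 minutes ago and a limit of 60 minutes, is_out_of_date returns True, which corresponds to the situation on Peer2 above.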

Possible Solutions

The following approaches can be used in such cases:

  • Saving the store file on a network drive accessible to all peer nodes. This is easily achieved with the --storedir switch.
  • Using a sync mechanism that keeps the store files up to date on every peer node.
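
As an illustration of the second approach, here is a minimal Python sketch of the receiving side of such a sync. It assumes the other peer's store directory is reachable as a local path (for example through a mount or a prior transfer); the function name sync_newer and both directory arguments are hypothetical, and in practice a tool such as rsync would typically do this job.

```python
import os
import shutil
import tempfile


def sync_newer(src_dir, dst_dir):
    """Copy every file from src_dir that is missing or older in dst_dir.

    Returns the sorted list of file names that were copied.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue
        if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src):
            continue  # local copy is already current
        # Copy to a temporary name first, then rename, so that a check
        # running at the same time never reads a half-copied store file.
        fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix="." + name + ".")
        os.close(fd)
        try:
            shutil.copy2(src, tmp)  # copy2 keeps the original mtime
            os.replace(tmp, dst)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
        copied.append(name)
    return copied
```

Keeping the original modification time matters here: if the copy received a fresh timestamp, stale data could look current to an age check based on mtime.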

For either approach, it helps to know how the store files are written. The collectors (get_netapp_*.pl) first read all data from the monitored NetApp, prepare it in memory, and only then write it to the store file. Before writing, they try to obtain an exclusive lock on the file via flock (where the operating system supports it, which is probably not the case with NFS).
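
The write sequence described above can be sketched in Python as follows. This is not the collectors' actual code (they are Perl scripts): the JSON format and the function name write_store are assumptions made purely for illustration. The sketch follows the same order of steps, though: prepare in memory, try to lock, then write.

```python
import fcntl
import json
import os


def write_store(path, data):
    """Serialize data in memory, then write it under an exclusive lock.

    Returns True if the lock was obtained, False if the filesystem
    refused it (the write is attempted either way).
    """
    # Step 1: prepare the complete payload in memory.
    # JSON is an assumed format, chosen only for this sketch.
    payload = json.dumps(data).encode("utf-8")

    # Open without truncating: truncating before the lock is held
    # would expose an empty file to a concurrent reader.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+b") as fh:
        # Step 2: try to get an exclusive lock.
        try:
            fcntl.flock(fh, fcntl.LOCK_EX)
            locked = True
        except OSError:
            locked = False  # e.g. a filesystem without flock support

        # Step 3: write the payload and cut off any older, longer content.
        fh.seek(0)
        fh.write(payload)
        fh.truncate()
        fh.flush()
    # Closing the file releases the lock.
    return locked
```

Because the data is fully prepared before the file is touched, the time during which the store file is locked or incomplete stays short, which is relevant for both the shared-drive and the sync approach.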