Real-time Monitoring with Nagios
By: Aaron Cieslicki
Here at Nagios we get asked whether our solutions provide real-time monitoring. The short answer is yes.
To many of us at Nagios, though, the more interesting question is: is real-time monitoring always better? Some people may be surprised to hear us answer no. Real-time monitoring is useful in many cases, but it is not the best technique for every monitoring use case.
In this article we look at how Nagios handles real-time monitoring, and at use cases where real-time monitoring provides no benefit at all and may even be a distraction and a detriment to monitoring teams and admins.
How Nagios Does Real-time Monitoring
SNMP traps are a classic example of real-time monitoring. For example, you might have a switch configured to send an alert to Nagios XI when a cable is plugged into an interface. When that plug goes in, the device sends the message to Nagios immediately. A monitored event happens, Nagios knows immediately, and Nagios notifies the right people … in real time.
Beyond networking devices and other SNMP-enabled equipment, you can use NCPA with Nagios XI to get real-time alerts from server infrastructure. NCPA has a passive configuration to do just that. A monitored event happens, NCPA sends an alert, and Nagios notifies the right people … again in real time.
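To make the "passive" idea concrete: the monitored side pushes a result to Nagios instead of waiting to be polled. The sketch below only formats one such result using the PROCESS_SERVICE_CHECK_RESULT external command line documented for Nagios Core. It is an illustration of the push pattern, not a description of how NCPA itself delivers data to Nagios XI, and the function name, host name, and service name are hypothetical.

```python
import time

# Standard Nagios plugin return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_result_line(host, service, state, output, timestamp=None):
    """Format one passive service check result as an external command line.

    Layout: [time] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Keep the result on a single line; the command is line-oriented.
    clean_output = output.replace("\n", " ")
    return (
        f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{STATES[state]};{clean_output}"
    )


if __name__ == "__main__":
    # Hypothetical host and service names, for illustration only.
    print(passive_result_line("web01", "Disk IO", "CRITICAL", "await 250 ms"))
```

The point of the sketch is the direction of travel: the event source decides when to send, so Nagios learns about the event as soon as it is reported rather than at the next scheduled check.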
Let’s not forget about the real-time monitoring capabilities of Nagios Log Server, which can alert and notify nearly instantaneously on log data it receives from servers, networking devices, applications, or anything else that logs data.
When Real-time Monitoring is Not Helpful
For this discussion, real-time monitoring means a monitoring system that takes immediate action upon receiving a piece of information, such as sending a notification about a received SNMP trap. Some people also use “real-time” to mean a continuous flow of performance data delivered at very short (perhaps one-second) intervals.
Whatever your definition, the question remains: is real-time monitoring the best way to monitor in every situation?
To see a use case where real-time monitoring is not a benefit, take a look at the graph below, which is copied from the interface of a virtual machine host. The graph shows disk I/O from the host’s internal metrics.
Clearly, there is a spike in the data.
However, the question now is: do we want to notify the monitoring team, or any other team, about this spike in the data? No. We wouldn’t want to wake an on-call technician at 3 a.m. about the situation represented in the graph above. This is a transitory spike in a performance metric. It’s not persistent. Lots of performance metrics spike momentarily and then return to normal.
Sending a notification the moment a performance spike like the one in the graph above occurs leads to a problem called notification fatigue (or, in the case of waking a sleeping on-call technician, actual physical fatigue). Teams get so overwhelmed by notifications that aren’t meaningful that they start to ignore them. That’s not good for the organization.
Additionally, definitions of “real-time” monitoring that include some sense of streams of data on very short intervals are problematic in a couple of ways. First, this example makes clear that what matters for many performance data notifications is not the absolute length of the interval between data points but the persistence of the issue over time. Second, sampling performance metrics too frequently can have a significant negative impact on the performance of devices being monitored.
This is exactly why Nagios has built-in check logic that can be configured to notify the team only about persistent issues. When disk I/O or CPU utilization or bandwidth rises above a certain problematic threshold and remains there for a certain amount of time, that is something we might want to wake a technician about.
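The persistence idea can be sketched in a few lines of Python. This is a simplified model, not Nagios's actual implementation: it counts consecutive samples above a threshold and signals a notification only once the count reaches a configured number of attempts. The name max_check_attempts is borrowed from the Nagios Core object directive of the same name; the class, the helper function, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PersistenceTracker:
    threshold: float         # a sample above this counts as a problem
    max_check_attempts: int  # consecutive problem samples needed to notify
    _failures: int = 0
    _notified: bool = False

    def observe(self, value: float) -> bool:
        """Return True only on the sample where the problem becomes persistent."""
        if value <= self.threshold:
            # Recovery: reset so a later problem can notify again.
            self._failures = 0
            self._notified = False
            return False
        self._failures += 1
        if self._failures >= self.max_check_attempts and not self._notified:
            self._notified = True
            return True
        return False


def notifications(samples, threshold, max_check_attempts):
    """Return the indexes of the samples that would trigger a notification."""
    tracker = PersistenceTracker(threshold, max_check_attempts)
    return [i for i, value in enumerate(samples) if tracker.observe(value)]


if __name__ == "__main__":
    spike = [20, 25, 95, 22, 21, 24]      # one momentary spike
    sustained = [20, 25, 95, 96, 97, 98]  # problem that persists
    print(notifications(spike, threshold=90, max_check_attempts=3))      # []
    print(notifications(sustained, threshold=90, max_check_attempts=3))  # [4]
```

With three attempts required, the momentary spike never produces a notification, while the sustained problem produces exactly one, on the third consecutive bad sample. Setting max_check_attempts to 1 in this model reproduces the "notify immediately" behavior that causes the 3 a.m. wake-up.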
What’s the moral of the story? Real-time monitoring is tremendously useful in the right situations. But it’s important to keep in mind that it’s not the only way to monitor, and especially in cases where we want to limit notifications to persistent problems, a “real-time” focus can actually lead to notification overload for teams.
Still have questions? Our Sales Team would be happy to answer them: firstname.lastname@example.org