Systems crash unexpectedly, users claim that “the internet is slow,” and managers ask for historical statistics, leaving you to wonder how to collect the data without hitting the refresh button and writing down numbers for half a day just to get a baseline report. If you’ve been part of a federal IT team for longer than 15 minutes, this is “situation normal.”
The answer to these challenges lies in effectively monitoring your environment. But saying “let’s monitor our network” presumes you know what you should be looking for, how to find it, and how to get it without affecting the system you’re monitoring. You’re also expected to know where to store the values, what thresholds indicate a problem situation, and how to let people know about a problem in a timely fashion.
Establishing the “What”
Here’s the bottom line: to build an effective monitoring solution, the true starting point is learning the underlying concepts. You have to know what monitoring is before you can set up what monitoring does.
Regardless of the software, protocol, or technique you use, a few fundamental aspects of a monitoring system exist across the board:
- Element: This is a single aspect of the device you’re monitoring, such as CPU load or free disk space.
- Acquisition: How do you get your information? Does your monitoring routine wait for the device to send you a status update (push), or does it proactively go out and poll the device (pull)?
- Frequency: How often do you receive information? Does the device send a “heartbeat” every few minutes? Does it send data only when there’s a problem?
- Data retention: Monitoring is data-intensive. At its simplest level, data retention determines whether statistics are 1) collected, evaluated, acted upon, and forgotten, or 2) kept in a datastore of some sort.
- Data aggregation: Over time, stored statistics are typically summarized into coarser values. For example, you might collect statistics every five minutes. After a week, those five-minute values are aggregated to an hourly average; after a month, those hourly values are further aggregated to a daily average.
- Threshold: The idea of fault monitoring is to collect a statistic and see whether it crosses a line of some kind—a threshold. It can be a simple line (is the server on or off?) or it can be more complex.
- Reset: Reset marks the point where a device is considered “back to normal.”
- Response: The response defines what happens when a threshold is breached. A response could be to send an email, play a sound file, or run a predefined script.
- Alert noise: Alert configuration can be as much an art as it is a science. On the one hand, you want to be alerted when an issue occurs. On the other hand, you don’t want to create alert rules capable of drowning you in noise and ultimately masking real issues. Machine learning shows promise in solving this problem.
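To make a few of these terms concrete, here is a minimal Python sketch of data aggregation and of a threshold paired with a reset. It is an illustration under stated assumptions, not the behavior of any particular monitoring product: the names `rollup` and `ThresholdAlert` are hypothetical, samples are assumed to be (timestamp in seconds, value) pairs, and the separate reset level is an assumed design choice so that a value hovering near the line does not trigger repeatedly.

```python
from statistics import mean


def rollup(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for ts, value in samples:
        # Align each sample to the start of the bucket it falls in.
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


class ThresholdAlert:
    """Trigger above `threshold`; reset once the value is at or below `reset`."""

    def __init__(self, threshold, reset):
        if reset > threshold:
            raise ValueError("reset level must not exceed the trigger threshold")
        self.threshold = threshold
        self.reset = reset
        self.active = False  # True while the element is in a problem state

    def evaluate(self, value):
        """Return "trigger", "reset", or None for a new sample."""
        if not self.active and value > self.threshold:
            self.active = True
            return "trigger"
        if self.active and value <= self.reset:
            self.active = False
            return "reset"
        return None


# Hypothetical CPU readings: two five-minute samples in each of two hours.
five_minute = [(0, 40.0), (300, 60.0), (3600, 95.0), (3900, 85.0)]
hourly = rollup(five_minute, 3600)

alert = ThresholdAlert(threshold=90, reset=80)
events = [alert.evaluate(v) for v in (85, 92, 95, 88, 79, 85)]
```

In this sketch the response (sending an email, running a script) would hang off the "trigger" and "reset" events. Because the alert stays active until the value drops to the reset level, the readings of 95 and 88 produce no additional events, which is one simple way to cut alert noise.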
Understanding the “How”
Now we know the terms necessary for a foundational understanding of monitoring—the “what.” The “how” is just as important.
There are various monitoring techniques, from classic pinging and using the Simple Network Management Protocol (SNMP) to vendor-specific methods. Additionally, some offerings use agents for monitoring while others use agentless technology. None of these is inherently right or wrong; what matters is choosing based on your own system and agency demands.
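As a rough illustration of agentless, pull-style acquisition, the Python sketch below checks whether a device answers on a TCP port. This is a stand-in for a classic ping, not a ping itself: ICMP usually requires elevated privileges, so a TCP connection attempt is swapped in here. The function names `tcp_reachable` and `poll_once` are hypothetical, as are the hosts in the usage lines.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def poll_once(targets, probe=tcp_reachable):
    """Probe every (host, port) target once and return {target: is_up}."""
    return {target: probe(*target) for target in targets}


# Usage with a stand-in probe, so the example runs without touching a network.
def fake_probe(host, port):
    return host == "up.example"


status = poll_once([("up.example", 22), ("down.example", 22)], probe=fake_probe)
```

A scheduler would call `poll_once` at whatever frequency you settle on, and passing the probe as a parameter leaves room to substitute an SNMP query or a vendor-specific check without changing the polling loop.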
At the end of the day, these are the most important things to consider when strengthening your monitoring process:
- Ease of deployment, configuration, and maintenance
- Availability of the data (to external systems and other modules within the solution) once it’s collected
- Intelligently filtering alert noise
Monitoring may not be the sexiest discipline for the federal IT pro, but it’s critical in ensuring systems are optimized and the mission is uninterrupted. Download the SolarWinds Monitoring 101 whitepaper to learn more about the philosophy, theory, and fundamental concepts involved in systems monitoring.