Various intelligence agencies have been accused of having the motto “In God We Trust, All Others We Monitor,” but the saying is just as applicable, if not more so, to technical services of all descriptions. Any time we provide a service, we need to make sure that service is operating as our customers expect, and that any issues threatening it are swiftly dealt with. We can only fix issues we can detect, so our monitoring software provides key information and alerts us when the unexpected occurs.
Levels of Monitoring
The top level of our monitoring is service monitoring. That means checking that websites, email servers, nameservers, and all the other services we provide to clients are up and accessible from the locations we expect them to be. These are our top-level alerts; an alarm at this level is a wake-me-up-in-the-middle-of-the-night emergency that needs to be resolved immediately.
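As a rough illustration of what an "is it up?" check involves, here is a minimal Python sketch that tests whether a service accepts TCP connections. The function names `is_port_open` and `check_services` are hypothetical and chosen for this example; a real service check would usually go further and confirm that the service answers correctly, not just that the port is open.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its port accepted a connection."""
    return {
        name: is_port_open(host, port)
        for name, (host, port) in services.items()
    }
```

Calling `check_services({"web": ("www.example.com", 443)})` would report whether that placeholder host accepts connections on the HTTPS port. Running the same check from several locations is what tells you a service is reachable from everywhere you expect it to be.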
The next level is security monitoring. Any publicly accessible server has a steady rain of malicious requests and traffic hitting it, almost all of which is bounced by the firewalls. Even so, passwords get stolen, users get phished, and new vulnerabilities appear. We have systems that report on suspicious logins or anomalous behaviors, and we follow up to make sure our users’ data is as safe and secure as we can possibly make it.
Finally, we have performance monitoring. This means looking at how fast webpages are being served, how quickly databases are answering requests, and so on. These metrics help us tell if our servers are in good health, and if our infrastructure is keeping up with our growing customer base. They also help us make longer-term decisions, like when we need to invest in new or better hardware, when older devices might need maintenance, or when we need to talk to a customer about upgrading their service levels.
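To make "how fast" concrete, response times are often summarized as percentiles rather than averages, since a handful of very slow requests can hide behind a healthy-looking mean. The sketch below uses the nearest-rank method; the function names and the choice of p50 and p95 are assumptions made for illustration, not a description of any particular tool.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize_latency(samples_ms: list[float]) -> dict[str, float]:
    """Summarize response times (in milliseconds) for a dashboard or report."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "max": max(samples_ms),
    }
```

A p95 that creeps upward over weeks while the p50 stays flat is the kind of trend that can signal hardware nearing its limits before customers notice.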
Internally, Deep Core Data uses a combination of Prometheus and Grafana for monitoring its systems. These tools are widely used in the cloud computing community and are powerful enough to meet any needs we’ve come across in our environment.
Prometheus is a time-series database. It reaches out to all the various systems around Deep Core’s environment and asks them for a status report. Those status reports contain metrics that vary based on what the system is, but they usually include values indicating whether the device or software is up and working properly, and often key performance attributes, such as how many requests it processed recently, how much memory is in use, or any of thousands of other numbers. Prometheus keeps a record of every report it’s gotten, and when it got that report.
Grafana is a web interface that lets us visualize the data from Prometheus. It draws graphs and diagrams and lets us set alerts for when certain values go out of a predetermined range. All over the Deep Core office there are monitors on the walls showing graphs of the current system status, powered by the Grafana system.
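The logic behind a range alert is simple enough to show directly. The sketch below is not Grafana's alerting API; `RangeRule`, `evaluate`, and the metric names in the usage note are hypothetical and only illustrate the idea of comparing the latest value of each metric against an allowed range.

```python
from dataclasses import dataclass


@dataclass
class RangeRule:
    metric: str   # name of the metric to watch
    low: float    # lowest acceptable value
    high: float   # highest acceptable value


def evaluate(rules: list[RangeRule], latest: dict[str, float]) -> list[str]:
    """Return one alert message per rule that is violated or has no data."""
    alerts = []
    for rule in rules:
        value = latest.get(rule.metric)
        if value is None:
            # A metric that stops reporting is itself worth an alert.
            alerts.append(f"{rule.metric}: no data")
        elif not rule.low <= value <= rule.high:
            alerts.append(
                f"{rule.metric}: {value} outside [{rule.low}, {rule.high}]"
            )
    return alerts
```

For example, a rule such as `RangeRule("memory_used_percent", 0, 90)` would raise an alert as soon as the latest reading climbs above 90. Production alerting usually also requires the condition to hold for some minimum duration, so that a momentary spike does not wake anyone up.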
We’re Always Watching
Providing high-quality service requires knowing when there’s a problem, and our monitoring tools allow us to do just that. They point out to us when we have a condition we need to address, and help us find and resolve issues before our customers notice them.
If you only find out your systems aren’t working when you go to use them, reach out to us. We’d like to help make your organization more effective.