Mantras to Help Monitoring Suck Less

2012-10-23

Look online for anything about monitoring and gathering metrics from an IT infrastructure and you will usually find plenty of people complaining that monitoring or metric gathering sucks. They'll only do it when forced, or they set it and forget it. This is a bad mindset, because both monitors and metrics are very useful for day-to-day operations and long-term planning. Below are a few mantras I use to keep me on what I consider the right track in terms of monitoring and metric gathering.

Collect All The Metrics

If a metric is easy enough to collect, and it should be, then it should be collected. I might not think I need the metric now, but it could come in handy in the future. An example I came across at work: our server room would have humidity issues during the winter, which is an easy fix. Just fill the humidifiers and make sure they are running. I noticed that the humidity alerts only happened when the dew point outside was below a certain level and the building was at a certain temperature. If I had been collecting those metrics, I could have saved myself some unneeded alerts by spacing out the humidifiers better according to the outside weather. Also gather metrics on tripped monitors; that way you know what alerts you the most and can fix it.
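As a rough sketch of that last point, counting how often each monitor trips takes only a few lines. The monitor names and the list of events here are made up for illustration; in practice they would come from whatever alert history your monitoring system keeps.

```python
from collections import Counter


def noisiest_monitors(tripped, top=3):
    """Return the monitors that tripped most often, most frequent first."""
    return Counter(tripped).most_common(top)


# Hypothetical history: one entry per time a monitor tripped.
tripped = [
    "humidity_low", "disk_full", "humidity_low",
    "humidity_low", "disk_full", "ping_loss",
]

print(noisiest_monitors(tripped, top=2))
# [('humidity_low', 3), ('disk_full', 2)]
```

Whatever lands at the top of that list is the first thing worth fixing or retuning.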

Monitor All The Things

Anything that you know, or can imagine, will interrupt customer services or degrade performance when it fails should be monitored. This isn't only the status of machines and services but also high-level monitoring. For example, monitor whether a user can reach your site, or whether they can complete a task like logging in or uploading a file.
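A high-level check can start out very simple. The sketch below only asks "does the site answer with a success status?" using Python's standard library. The function name, the timeout value, and the injectable `fetch` parameter are my own choices for illustration, not part of any particular monitoring tool; a login or upload check would need more than this.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=5.0, fetch=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status within the timeout.

    `fetch` defaults to urllib's urlopen; it is a parameter only so the
    check can be exercised without touching the network.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and the
        # HTTPError that urlopen raises for 4xx/5xx responses.
        return False


# Example usage (hypothetical URL):
#     if not site_is_up("https://www.example.com/"):
#         ...record the failure and decide whether anyone needs to know...
```

Run something like this from outside your own network if you can, since that is where your users are.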

Alert Me Only When Fucks Are Given

This doesn't mean only alert me during business hours. If I could build a system that only failed during business hours, I could make a fortune selling it to IT departments. It means only alert me when there is a failure that needs me to take action to solve it. For instance, if you have a database replica set that is set up to handle X failures, don't wake me up at 4am for a single node failure. I should be notified of the failure, but not alerted until it is getting into the range of 'Oh Shit, pretty soon shit will break'.
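One way to sketch the notify-versus-alert split for that replica set is below. The cut-off is an assumption on my part: it pages once the set has used up every failure it can tolerate, meaning the next one causes an outage. Where exactly 'pretty soon' starts is a judgment call for your own setup, and the function and label names are made up.

```python
def severity(failed_nodes, tolerated_failures):
    """Classify a replica set's state as 'ok', 'notify', or 'page'.

    tolerated_failures is the X from the text: how many node failures
    the set is built to survive.
    """
    if failed_nodes <= 0:
        return "ok"
    # Assumed threshold: page when no failure budget is left, so one
    # more failed node would break things.
    if failed_nodes >= tolerated_failures:
        return "page"
    # Something is broken but there is still headroom: email or ticket,
    # no 4am wake-up.
    return "notify"


print(severity(1, 2))  # notify
print(severity(2, 2))  # page
```

With this cut-off a set that can only survive one failure pages on the first failure, which seems right to me: at that point there is no headroom left.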

TLDR

Collect All The Metrics

Monitor All The Things

Alert Me Only When Fucks Are Given