Monitoring tools

We have a collection of tools that we use for various kinds of monitoring. Some are the classical standards (iostat, vmstat, …), some are somewhat more heavyweight (collectl, dstat), and some are the simple (not in a bad way) graphical tools (munin, ganglia, …).

We’ve found tools like Zabbix do a good job of killing some machines, as there are often memory leaks in these tools.

What we’ve not found, anywhere, is a good set of simple measurement tools that provide data in a form that is easy to include in something akin to a dashboard. No, I’m not talking about Ganglia or Munin or others. They aren’t dashboards.

Think of a screen that gives you an overall view of your system at a glance. A typical automobile has a speedometer, an odometer, a fuel gauge, various pressure sensors, … You might have analogues to these on the system dashboard.

What we’ve not found is anything akin to this dashboard in a graphical or text sense of the word. And the data coming from numerous sources … tends to be what the author wanted at that moment, not necessarily what made sense for us.
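To make "simple measurement tool" concrete, here is a minimal Python sketch of the idea: take one source of raw counters and reduce it to a flat, timestamped record that a display layer could consume. This is not one of our tools. The function names, the output field names, and the choice of the Linux /proc/meminfo text format as input are all assumptions made for illustration.

```python
import json
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_sample(text, timestamp=None):
    """Build one flat, timestamped record (hypothetical field names)."""
    info = parse_meminfo(text)
    total = info.get("MemTotal", 0)
    available = info.get("MemAvailable", 0)
    used_pct = round(100.0 * (total - available) / total, 1) if total else None
    return {
        "time": time.time() if timestamp is None else timestamp,
        "mem_total_kb": total,
        "mem_available_kb": available,
        "mem_used_pct": used_pct,
    }


# Illustrative input only; on a live Linux system the text would come
# from reading /proc/meminfo instead.
SAMPLE = "MemTotal:       16000000 kB\nMemAvailable:    4000000 kB\n"

# One JSON object per sample is easy to append to a log or feed to a gauge.
print(json.dumps(memory_sample(SAMPLE, timestamp=0)))
```

The point of the design is the separation: the collector only emits small, self-describing records, and anything that draws gauges or keeps history sits on the other side of that boundary.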

So now we have a collection of our own tools, and I’ve been thinking about how to tie them together and present the information in a meaningful manner. I think I’ve solved the “tie together” portion. I have a few more experiments to run, but this should be a fairly low impact monitoring tool that provides live and historical data. We’ve been using it for demos and internal tests. It seems to fit the bill fairly well.

But the user interface for a text based dashboard is still an open question. We want something like top, but more meaningful at a system level. There’s htop, atop, top, and other tools. All do some of what we want. Most don’t do the specific things we want. They are aimed at a different audience.

This also all ties into our system control CLI (and GUI atop the CLI). Which also all ties in to our target config code. Which all ties in …

It’s an ongoing process. Part of my problem is that I don’t like kicking out code I am not completely happy with. For every project I’ve announced or talked about, there might be 20 or so in various stages of completion, which I am anywhere from unhappy with to moderately happy with.

Our internal testing systems now include 4 DV4 machines, 3 JR machines (1 JR5, 1 JR4, 1 JR3), and 8 compute nodes with both a QDR and 10GbE switch network. This is where our monitoring and control dashboard has to play. The goal is 1 pane of glass to see/control anything on the systems. Tiburon was a huge step in that direction (the level of effort to get the big memory machine booted up was to install the right network wire to the unit). The monitoring tools will tie into this. As will the target tool.

It’s coming together nicely, albeit slowly.
