Monitoring tools

We have a collection of tools that we use for various monitoring. Some are the classical standards (iostat, vmstat, …), some the somewhat more heavyweight (collectl, dstat), and some the simple (not in a bad way) graphical tools (munin, ganglia, …).
We’ve found that tools like Zabbix do a good job of killing some machines, as these tools often have memory leaks.
What we’ve not found, anywhere, is a good set of simple measurement tools that provide data in a form that allows easy inclusion into something akin to a dashboard. No, we’re not talking about Ganglia or Munin or the others. They aren’t dashboards.

Think of a screen to give you an overall view of your system at a glance. You might have speedometers, odometers, fuel gauges, various pressure sensors, … on a typical automobile. You might have analogues to these on the system dashboard.
What we’ve not found is anything akin to this dashboard in a graphical or text sense of the word. And the data coming from numerous sources … tends to be what the author wanted at that moment, not necessarily what made sense for us.
So now we have a collection of our own tools, and I’ve been thinking about how to tie them together and present the information in a meaningful manner. I think I’ve solved the “tie together” portion. I have a few more experiments to make, but this should be a fairly low-impact monitoring tool providing live and historical data. We’ve been using it for demos and internal tests, and it seems to fit the bill fairly well.
That leaves the user interface for a text-based dashboard: something like top, but more meaningful at a system level. There’s htop, atop, top, and other tools. All do some of what we want; most don’t do the specific things we want. They are aimed at a different audience.
This also all ties into our system control CLI (and GUI atop the CLI). Which also all ties in to our target config code. Which all ties in …
It’s an ongoing process. Part of my problem is that I don’t like kicking out code I am not completely happy with. For every project I’ve announced or talked about, there might be 20 or so in various stages of completion, ones I am anywhere from unhappy to moderately happy with.
Our internal testing systems now include 4 DV4 machines, 3 JR machines (1 JR5, 1 JR4, 1 JR3), and 8 compute nodes with both a QDR and 10GbE switch network. This is where our monitoring and control dashboard has to play. The goal is one pane of glass to see and control anything on the systems. Tiburon was a huge step in that direction (the entire effort to get the big-memory machine booted up was installing the right network wire to the unit). The monitoring tools will tie into this, as will the target tool.
It’s coming together nicely, albeit slowly.

4 thoughts on “Monitoring tools”

  1. I hear what you’re saying about tying things together, and that’s what I’ve tried to do with colmux, sort of. Since you’re already familiar with collectl, think of ‘collectl meets top’ or something like that. In this case, you can run almost ANY collectl command across a cluster of nodes (I’ve tested this at over 1K nodes) and present an integrated top-like view of that command’s output, sorted by the column of your choice!
    So not only do you get a traditional cluster-top command, you can do something similar to cluster-iostat, cluster-mstat, cluster-xyzstat. For those of you who are big fans of vmstat, using collectl’s --vmstat switch you can also do cluster-vmstat.
    But wait – there’s more! colmux can also play back recorded data, so you can look at historical data in the same top-like view.
    In fact, if you’re really into something fancier, you could hack up colmux’s output routine and replace it with your own. That way colmux can still collect and integrate all the data for you. Just a suggestion.
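[Editor’s note: the integrated, column-sorted view described in the comment above can be sketched in miniature. This is a hypothetical illustration of the idea, not colmux’s actual code; the node names and metric keys are made up.]

```python
def cluster_view(samples, sort_key):
    """Merge one row of metrics per node and sort descending by the
    chosen column -- the essence of an integrated cluster-top view."""
    rows = list(samples.items())
    rows.sort(key=lambda item: item[1].get(sort_key, 0.0), reverse=True)
    return rows

# Hypothetical per-node samples, e.g. gathered over sockets:
samples = {
    "node01": {"cpu": 91.0, "iowait": 0.5},
    "node02": {"cpu": 12.0, "iowait": 7.2},
    "node03": {"cpu": 55.0, "iowait": 1.1},
}

# Sort the cluster by CPU use, busiest node first:
for node, metrics in cluster_view(samples, "cpu"):
    print(f"{node}  cpu={metrics['cpu']:5.1f}  iowait={metrics['iowait']:4.1f}")
```

Swapping the `sort_key` (e.g. to "iowait") gives the cluster-iostat-style view from the same data, which is the appeal of separating collection from display.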

  2. @Mark
    Colmux and others aren’t quite what we needed. I am not saying negative things about collectl or colmux; in fact, I’ve developed a bunch of tools that take the plot file output and allow you to aggregate it, renormalize it, etc.
    What we are looking for is much lighter weight and much simpler on the data collection side. On the display side, we haven’t found a tool which does anything close to what we want for all the machines at once. dstat is close for a single machine, but not so good for many. Collectl/colmux’s displays aren’t what we want right now for this, and some of the transport concepts are not what we want to use for a number of reasons (which become important later on).
    Collectl is good, but its use case is different from what we want for these tools.

  3. I definitely hear what you’re saying, as different situations require different models. From a collectl/load perspective, my initial plan was to prototype it in perl and then rewrite it to make it more efficient. Much to my surprise, it came in at around 0.1% when collecting everything at 10-second samples and process/slab data at 60-second rates, so I felt that was good enough and the ease of modification offset the need for more efficiency.
    I do know of users who would agree that this is too much of a load while for others it’s no problem. As they say, it depends. 😉 Out of curiosity what kind of load are you trying to achieve? And at what sampling rate?
    I’d also be curious to hear your thoughts on transport, as I’d have thought socket-level communications was about as fast as one could get, at least over a TCP network. Or are you thinking of a native InfiniBand stack? Again, I’m always looking to improve things when it’s not too hard to do.
    From a cluster-size perspective, what are you talking about? Thousands of nodes? One thing I’d suggest if you do write your own tools is to use a push model like collectl does, so that all samples are taken at the same time across the cluster, rather than the pull model some tools use. I’ve found that by synchronizing samples as closely as possible with a high-resolution timer, I can come very close to taking samples everywhere on a cluster of any size, with two major benefits: the system noise is synchronized (I’ve actually measured a lower impact on fine-grained MPI jobs), and since all the samples are taken at the same time, it is much easier to analyze cross-cluster behaviors.
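[Editor’s note: a minimal sketch of the synchronized-sampling idea above, assuming node clocks are already aligned via NTP. The function names are mine, not collectl’s.]

```python
import time

def next_aligned_tick(interval_s, now=None):
    """Return the next wall-clock time that is an exact multiple of
    interval_s. Nodes with synchronized clocks computing this will all
    pick the same instant, so samples line up cluster-wide."""
    if now is None:
        now = time.time()
    return (now // interval_s + 1) * interval_s

def wait_for_tick(interval_s):
    """Sleep until the next aligned boundary, then return its timestamp,
    which doubles as the sample's cluster-wide key."""
    tick = next_aligned_tick(interval_s)
    time.sleep(max(0.0, tick - time.time()))
    return tick
```

Each node would loop: wait for the tick, read its counters, and push `(tick, data)` to the collector; because every node keys its sample to the same tick, rows from different nodes are directly comparable.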

  4. Something I’ve recently covered that you might be interested in: newer kernels on many-core systems are having some serious performance issues reading /proc, and as a result ALL tools suffer. I wrote something up here if you want to read more.
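[Editor’s note: one way to gauge the cost on your own kernel is to time repeated full reads of a /proc file. This is a quick hand-rolled microbenchmark of my own, not the methodology from the write-up referenced above; it works on any readable file, so the /proc paths are only an example.]

```python
import time

def mean_read_cost(path, iterations=100):
    """Time `iterations` full open/read/close cycles of `path` and
    return the mean cost per read in seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        with open(path, "rb") as f:
            f.read()
    return (time.perf_counter() - start) / iterations

# On a Linux box, compare a cheap file against a per-CPU one, e.g.:
#   print(mean_read_cost("/proc/uptime"))
#   print(mean_read_cost("/proc/stat"))
# On many-core systems the /proc/stat figure grows with core count.
```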

Comments are closed.