New monitoring tool, and a very subtle bug

I’ve been working on coding up some additional monitoring capability, and had an idea a long time ago for a very general monitoring concept. Nothing terribly original, not quite nagios, but something easier to use/deploy. Finally I decided to work on it today.

The monitoring code talks to a graphite backend. Could talk to statsd, or other things. In this case, we are using the InfluxDB plugin for graphite. I wanted an insanely simple local data collector. And I want it controllable in a very simple config file manner. This is to run on the client being monitored, and the data pushed back to the database.

Here is the basic config file layout:

# config-file-type: JSON 1
{
   "global" : {
      "log_to_file" : "1",
      "log_name" : "/tmp/metrics-$system.log",
    },

    "db" : {
      "host"    : "a.b.c.d",
      "port"    : "2003",
      "proto"   : "tcp"
    },

   "metrics" : {      
      "uptime" : {
          "command"   : "/home/landman/work/development/gui/plugins/uptime.pl",
          "interval"  : 5,
          "timeout"   : 2                
      },      
   } 
}

The database is pointed to by the db section, and the metrics are contained in the data structure as indicated. Command could be a script or a command line. Interval is the time in seconds between runs, and timeout is the maximum length in seconds before a child process is killed (preventing a run-away and accumulation of zombies).

The code reads this config, creates one thread per metric, opens its own connection to the database (yeah, potentially problematic for large numbers of metrics, will address that later). Then it takes the output of the command, does brain dead simple “parsing”.

The scripts look like this (can use any language, we don’t care)

#!/usr/bin/perl

use strict;
my ($rc,$u,$n);

chomp($rc = `cat /proc/uptime`);
if ($rc =~ /(\d+\.\d+)\s+(\d+\.\d+)/) {
   printf "uptime:%.2f\n",$1;
}

and the data they spit out is very simple as well.

landman@lightning:~/work/development/gui$ plugins/uptime.pl 
uptime:104612.03

Simple future optimizations include launching a process once that wakes up at a configurable time to return data and then goes back to sleep. Potentially important on busy systems.

The metrics.pl code then pulls this data in, slightly reformats it for graphite, and sends it off.

It, well, mostly … sort of … worked. I had to fix two bugs. Technically, I didn’t fix the bugs. I worked around them. They were not good bugs, and they are showing me I might need to rethink writing this code in Perl.

The first bug I caught was quite surprising. Using the IPC::Run module, which I’ve used for almost a decade, I create a run harness, and start the code running. Everything executes correctly. Output is generated correctly. Gets pulled into the program.

Notice how I didn’t say “gets pulled into the program correctly”. It took me a while to find this, and I had to resort to “first principles”.

 # I can't find where this bug is, but the last character of mvalue is wrong ...
 @c =split(//, $mvalue);
 pop @c;
 $mvalue = join('',@c);
 # ... so lop it off

For some reason, and I’ve not figured it out, we were getting a carriage return appended to the end of the output. Chomp, the “smart” code that lops off newline characters on input lines, was unable to handle this.

I only saw it in my debugging output, when output lines were corrupted. Something that should never happen. Did.

Ok. Si the code above splits the mvalue into its characters. I included a


$mvalue = join("|",@c);

in the code so I could see what it thought that should be. And sure enough, thats how I caught the bug that should not be.

The work around is hackish, but I can live with it until I can figure the rest out.

Its the next bug that is brutal. I have a work around. And its evil. Which is why I am thinking about rewriting in another language, as this may point to a bug in part of the core functionality. I need this to work quickly, so I’ll use the hack in the near term, but longer term, I need to have something that works and is easy to work with.

I am using IO::Socket::INET to connect a client to a remote socket. I am doing this inside of the threads::shared paradigm in Perl. For a number of reasons, Perl doesn’t have real threads … well it does and it doesn’t. Long story, but threads::shared is the best compromise, leveraging forking and some “magic”. Sockets generally work well in forked environments … at least servers do. I am not sure about clients now.

Brain dead simple constructors, nothing odd. Checking return values. All appears to be well.

then I do a send($msg) and …

… not a bloody thing. It “succeeds” but the data never shows up in the database.

So, here comes the hack. The send($msg) call is logically equivalent to “echo $msg | nc $host $port”, so replace that one line with the send, with this external call. See what happens.

Now data starts showing up.

Of course, the single threaded version of the testing code, where I built the code to do the actual sends, works great. Data shows up.

But not when the identical code (both calling the same methods in the module), is running in the threads::shared environment.

Grrrrr….

I’ll figure it out later. But this is a subtle bug. Very hard to characterize, and then I have to chase it down.

Viewed 51248 times by 6013 viewers

Facebooktwittergoogle_plusredditpinterestlinkedinmail