sios-metrics core rewritten

By joe

October 27, 2015 - 4 minutes read - 673 words

This was a long time coming. It was something I needed to do in order to build far better code, capable of using less network and less CPU power, and providing a better overall system.

In short, I ripped out the graphite bits and wrote a native interface to InfluxDB. This interface will also be adapted to kdb+ (32 bit edition) and graphite as time allows. In the process, I cleaned up a tremendous amount of code, removed lots of excess debugging bits, and fixed some very annoying problems.

I also changed the way data is transmitted. Part of the reason I ripped out the graphite bits was that I felt it encouraged a very suboptimal metric specification/transmission mechanism. Sure, there is a “pickling” version, but even that is highly inefficient. The mechanism I have now is far denser, though it is still not perfect. I’ve got a nice idea for an even denser mechanism (very easy to parse) that should work out nicely in the coming months, and it will require only a slight change to the output code.

The output pathway had been quite fragile, and it could not easily cope with a server going offline for a bit. I’ve improved this some, but I will do a better job of connecting to primary/secondary/tertiary servers in the coming months.

And, by the way, the configuration and plugin system is much better. The global system configuration now looks like this:

# config-file-type: JSON 1
{
   "global" : {
      "log_to_file" : "1",
      "log_name"    : "/tmp/metrics-$system.log",
      "run_dir"     : "/dev/shm"
   },
   "db" : {
      "default" : {
         "host"  : "localhost",
         "port"  : "8086",
         "proto" : "http",
         "db"    : "unison"
      },
      "second" : {
         "host"  : "192.168.101.250",
         "port"  : "2003",
         "proto" : "tcp",
         "db"    : "fastpath"
      }
   },
   "metrics" : {
      "plugin_dirs" : ["plugins/"]
   }
}
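For anyone who wants to read a file like this from another tool, here is a minimal sketch in Python. It is not how sios-metrics itself loads its configuration. It assumes the `# config-file-type` header is a whole-line comment that a strict JSON parser would reject, and it tolerates a trailing comma if one is present. The function names `load_config` and `db_targets` are hypothetical, and the comma cleanup is naive (it would also touch a string value that happened to contain a comma followed by a closing brace).

```python
import json
import re


def load_config(text):
    """Parse JSON that may carry '#' comment lines and trailing commas."""
    # Drop whole-line '#' comments such as the config-file-type header.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("#")]
    body = "\n".join(lines)
    # Naively drop a comma that directly precedes a closing brace or bracket.
    body = re.sub(r",(\s*[}\]])", r"\1", body)
    return json.loads(body)


def db_targets(config):
    """Return (name, proto, host, port, db) for each entry under "db"."""
    targets = []
    for name, entry in config.get("db", {}).items():
        # Ports are stored as strings in the file, so convert them here.
        targets.append(
            (name, entry["proto"], entry["host"], int(entry["port"]), entry["db"])
        )
    return targets


SAMPLE = """# config-file-type: JSON 1
{
   "global" : { "run_dir" : "/dev/shm", },
   "db" : {
      "default" : { "host" : "localhost", "port" : "8086",
                    "proto" : "http", "db" : "unison" },
   }
}
"""

if __name__ == "__main__":
    for target in db_targets(load_config(SAMPLE)):
        print(target)
```

Keeping the database entries as plain tuples makes it easy to walk them in order later, for example when trying a primary server before a secondary one.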

Note that you specify plug-in directories, and the code automagically searches them for “.json” files, which configure the plugins. An example of such a file is this:

# config-file-type: JSON 1
{
  "metric" : {
    "enabled"    : 1,
    "command"    : "plugins/cputemp.pl",
    "interval"   : 1,
    "timeout"    : 2,
    "persistent" : 1,
    "xmit"       : 1
  },
  "alerts" : {
    "hot" : {
      "condition" : "_coretemp_ > 80.0",
      "message"   : "Warning: CPU temp greater than 80",
      "severity"  : 5,
      "action"    : ["alert"]
    }
  }
}

In this file, I specify the plugin code (which can be in ANY language that can run on the machine), the sampling interval in seconds, the response timeout in seconds, whether or not the code is persistent (i.e. it runs as a long-lived process and sends output to STDOUT instead of being invoked each time it is used), and whether or not to transmit the results.

When I run the plugins (they must be able to run entirely on their own) I get something like this:

landman@lightning:~/work/development/sios-metrics$ plugins/cputemp.pl
#### sync:1445922362
cputemp,core=0,machine=lightning,socket=0 coretemp=54.0
cputemp,core=1,machine=lightning,socket=0 coretemp=55.0
cputemp,core=2,machine=lightning,socket=0 coretemp=56.0
cputemp,core=3,machine=lightning,socket=0 coretemp=53.0
#### sync:1445922363
cputemp,core=0,machine=lightning,socket=0 coretemp=54.0
cputemp,core=1,machine=lightning,socket=0 coretemp=54.0
cputemp,core=2,machine=lightning,socket=0 coretemp=56.0
cputemp,core=3,machine=lightning,socket=0 coretemp=53.0
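To make the output format concrete, here is a small Python sketch that groups metric lines under their sync marker and stamps each line with that marker's value. It assumes the number after `sync:` is a Unix epoch time in seconds, which matches the values above but is my reading rather than something the code documents. Appending the timestamp as a trailing field follows InfluxDB's line protocol, where a write with second precision accepts `measurement,tags fields timestamp`. Whether sios-metrics builds its payload this way is not stated here, and the function names are hypothetical.

```python
def batch_by_sync(lines):
    """Group metric lines under the most recent '#### sync:<epoch>' marker."""
    batches = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#### sync:"):
            # Everything after the first colon is the timestamp.
            current = int(line.split(":", 1)[1])
            batches.setdefault(current, [])
        elif current is not None:
            # Lines seen before any marker are ignored.
            batches[current].append(line)
    return batches


def to_write_body(batches):
    """Append each batch's timestamp to its lines, one metric per line."""
    out = []
    for ts, metrics in sorted(batches.items()):
        out.extend(f"{metric} {ts}" for metric in metrics)
    return "\n".join(out)


SAMPLE = """#### sync:1445922362
cputemp,core=0,machine=lightning,socket=0 coretemp=54.0
cputemp,core=1,machine=lightning,socket=0 coretemp=55.0
#### sync:1445922363
cputemp,core=0,machine=lightning,socket=0 coretemp=54.0
"""

if __name__ == "__main__":
    print(to_write_body(batch_by_sync(SAMPLE.splitlines())))
```

Sending one timestamp per batch instead of one per metric is part of what keeps the plugin output dense: the receiver fills it in once for every line that follows.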

This is my laptop BTW.

I have hooks in place to have the system respond to different signals (HUP to close and rotate logs, USR1 to reread configs).

Also, notice the “alerts” section. This code is a work in progress, but the idea is to decide upon alerts locally, at the appliance level. There are global/holistic issues, and local issues. Getting one system to handle/decide upon both is an exercise in futility. So local alerts will generate signals to the alerting system.

This is decidedly not a re-invention of the wheel. We have very different goals for this measurement, monitoring and alerting system than most of the others we’ve seen. These will unfold and become more obvious over the next several months.

Once I finish constructing the rest of the JSON files for the plugins and rewriting the relevant plugins, I’ll update the public repo with a new branch/tags. More soon.

And for those really interested, I spent far too long trying to figure out why I wasn’t seeing output in some of the plugins. Turns out $| is very important when running in a subshell. Go figure.
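For context on that last point: `$|` is Perl's autoflush variable, and setting it to a nonzero value flushes the selected output handle after every write. When output goes to a pipe rather than a terminal, it is normally block buffered, so a persistent plugin can run happily while its reader sees nothing. The same trap exists in other languages. Here is a rough Python sketch of a persistent plugin's output step with an explicit flush. The `emit` function and the shortened tag set are made up for illustration and are not part of sios-metrics.

```python
import sys
import time


def emit(readings, now, out=sys.stdout):
    """Write one sync marker plus one line per reading, then flush."""
    out.write(f"#### sync:{int(now)}\n")
    for core, temp in sorted(readings.items()):
        out.write(f"cputemp,core={core} coretemp={temp:.1f}\n")
    # Without this flush, output sent to a pipe can sit in a buffer and
    # the reader sees nothing until the buffer fills or the process exits.
    out.flush()


def run(read_temps, interval=1.0, iterations=None):
    """Call read_temps() every `interval` seconds and emit the result."""
    count = 0
    while iterations is None or count < iterations:
        emit(read_temps(), time.time())
        time.sleep(interval)
        count += 1


if __name__ == "__main__":
    # Hypothetical fixed readings standing in for a real sensor query.
    run(lambda: {0: 54.0, 1: 55.0}, interval=1.0, iterations=2)
```

Flushing once per sample, after the whole batch is written, keeps each sync block arriving at the reader in one piece.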