I needed to look at the processes on a machine I’d been spending time debugging: what was running, what state each process was in, its allocations, its IO, etc. Something was causing a hard panic, and it seemed correlated with an application issue.
I didn’t have a process space sampler, so I wrote one. Takes one sample per second right now (configurable) across the whole process space. Uses 1% CPU or so normally. I filter out a number of things I don’t care about (kernel threads and related worker bits).
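For a rough idea of what such a sampler involves, here is a minimal Python sketch, assuming a Linux /proc filesystem. It is not the SIOS-metrics code: the function names (`parse_status`, `kb`, `sample_once`, `run`) are made up for illustration, and it only collects process state and VmPeak/VmSize, not IO. Skipping entries that have no `VmSize` line is one common heuristic for dropping kernel threads, assumed here rather than taken from the actual filter.

```python
import os
import time


def parse_status(text):
    """Parse the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def kb(value):
    """Convert a status value such as '11000 kB' to an int (in kB)."""
    return int(value.split()[0])


def sample_once(proc_root="/proc"):
    """Return one sample: a list of dicts, one per user-space process."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a pid directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                status = parse_status(fh.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if "VmSize" not in status:
            continue  # assumption: no Vm* lines means a kernel thread
        rows.append({
            "pid": int(entry),
            "name": status.get("Name", "?"),
            "state": status.get("State", "?"),
            "vmpeak_kb": kb(status["VmPeak"]) if "VmPeak" in status else None,
            "vmsize_kb": kb(status["VmSize"]),
        })
    return rows


def run(interval=1.0, samples=3):
    """Print `samples` snapshots of the process space, `interval` seconds apart."""
    for _ in range(samples):
        now = time.time()
        for row in sample_once():
            print(now, row)
        time.sleep(interval)
```

Calling `run()` on a Linux box prints three one-second snapshots; the interval is a parameter, matching the configurable sample rate described above.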
Using this, I was able to get the aggregate memory for each type of application, along with a once-a-second play-by-play of attempted allocations, VmPeak/VmSize numbers, IO by each process, etc. This is very illuminating.
And, as with anything that generates quite a bit of output, I caught some interesting bugs in SIOS-metrics itself, and fixed them (well, mostly).
One of the major ones was in the way I grab output from persistent metric code. I had created a simple sync frame boundary that made detecting the last output in a stream very easy. But … with the amount of data I was generating, my sampling rate was such that I might copy over only a partial buffer (not a problem for smaller IO output). I noticed this when some of the metric lines came out truncated, with the rest of the line showing up under the next time stamp.
So I put an end marker in, and between the sync and end markers, I have a data frame of arbitrary size. I guess I could simplify my parsing code even more by computing the size and putting it in the sync and end marker. But this would complicate the metrics a bit, and I want the metric side to be as lightweight as possible.
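The sync/end framing can be sketched roughly as follows. This is an illustration in Python, not the SIOS-metrics parser: the marker strings `#SYNC` and `#END` and the `FrameReader` class are hypothetical, and the sketch assumes each marker sits on a line of its own. The point is that a frame is only handed back once its end marker has arrived, so a partial buffer copy just waits for the next read instead of producing a truncated line.

```python
SYNC = "#SYNC"  # hypothetical marker strings, not the ones SIOS-metrics uses
END = "#END"


class FrameReader:
    """Accumulate stream chunks and emit only frames that are complete."""

    def __init__(self):
        self.partial = ""  # trailing text that has no newline yet
        self.frame = None  # lines of the frame in progress, or None

    def feed(self, chunk):
        """Add a chunk; return the list of frames completed by it."""
        lines = (self.partial + chunk).split("\n")
        self.partial = lines.pop()  # last piece is an unfinished line (or "")
        done = []
        for line in lines:
            if line == SYNC:
                self.frame = []  # start (or restart) a frame
            elif line == END:
                if self.frame is not None:
                    done.append(self.frame)
                self.frame = None
            elif self.frame is not None:
                self.frame.append(line)  # data line inside a frame
        return done
```

Feeding it `"#SYNC\ncpu 1\nmem 2"` returns nothing, and the truncated `mem 2` line is only completed and returned once the following chunk delivers the rest of the line and the end marker.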
Once I fixed that, I was able to fix a few other bugs as well. Expect a commit on that later tomorrow.
That said, I caught an interesting er … feature … in influxdb on aggregation queries as you downsample. If you use a sum query over a larger range, it will sum everything in the smallest interval and present that back as the result. So what you get is neither a sum over the time nor an average of the sums over the smallest interval, which makes graphs based upon it problematic at best.
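To make the difference concrete, here is a small Python sketch with made-up numbers. It does not talk to influxdb at all; it only mimics one possible reading of the behavior above, where a sum grouped by a wider interval adds up every point that falls in the interval, compared with averaging the per-sample sums. The function names and the sample data are assumptions for illustration.

```python
from collections import defaultdict

# Made-up samples: two processes reporting 100 and 200 (say, kB) once a second.
samples = [(t, name, value)
           for t in range(4)
           for name, value in (("a", 100), ("b", 200))]


def sum_per_bucket(samples, width):
    """Sum every point that falls into each bucket of `width` seconds."""
    out = defaultdict(int)
    for t, _, value in samples:
        out[t - t % width] += value
    return dict(out)


def mean_of_per_sample_sums(samples, width):
    """Sum across processes at each timestamp, then average per bucket."""
    per_timestamp = sum_per_bucket(samples, 1)
    buckets = defaultdict(list)
    for t, total in per_timestamp.items():
        buckets[t - t % width].append(total)
    return {b: sum(v) / len(v) for b, v in buckets.items()}
```

With one-second buckets both functions report 300 per second. With a four-second bucket, `sum_per_bucket` reports 1200 while `mean_of_per_sample_sums` still reports 300, so a graph built on the plain sum changes scale as the range (and with it the bucket width) grows.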
I am looking at using kdb+ for this (32 bit version), with splayed tables for the storage (inbound data upserted to permanent storage), and a separate query engine using those files to drive the graphs in grafana. Grafana 3 is coming out soon with nice support for adding additional data sources, so my plan is to get a simple feed going from SIOS-metrics directly into kdb+ (this part shouldn’t be too painful), and then work out the grafana logic to talk to kdb+. Seems the grafana folk are quite interested in this themselves, so hopefully I can get something going quickly in my “free” time.