Have a nice CLI for InfluxDB

I tried the Node.js version and … well … it was horrible. Basic things didn’t work, which made life very annoying.

So, being a good engineering type, I wrote my own. It will be up on our site soon. Here’s an example:

./influxdb-cli.pl --host 192.168.5.117 --user test --pass test --db metrics


metrics> \list series

.----------------------------------.
| series name                      |
+----------------------------------+
| lightning.cpuload.avg1           |
| lightning.cputotals.idle         |
| lightning.cputotals.irq          |
| lightning.diskinfo.writes.sda    |
| lightning.nettotals.pktin.tap0   |
| lightning.nettotals.pktout.eth1  |
| lightning.swapinfo.free          |
| lightning.swapinfo.total         |
| lightning.diskinfo.wait.sda      |
   .
   .
   .

Nice, eh?

Now, a query:

metrics> select * from lightning.diskinfo.writes.sda limit 10

.--------------------------------------.
|     lightning.diskinfo.writes.sda    |
+------------+-----------------+-------+
| time       | sequence_number | value |
+------------+-----------------+-------+
| 1408138307 |               1 |     4 |
| 1408138306 |               1 |     2 |
| 1408138305 |               1 |     3 |
| 1408138304 |               1 |     0 |
| 1408138303 |               1 |     2 |
| 1408138302 |               1 |    39 |
| 1408138301 |               1 |     2 |
| 1408138300 |               1 |     0 |
| 1408138299 |               1 |     0 |
| 1408138298 |               1 |    15 |
'------------+-----------------+-------'
metrics> \quit

It’s not perfect, and I can’t change much yet, but a few more hours and this thing will be good enough to be very functional.

I used this already to fix something wrong with the Grafana display.

Note: there are other TSDBs that were recommended to me. Two in particular I will be looking into over the next few weeks, but I really wanted to get InfluxDB working the way I needed first.


Scalable Informatics 12 year anniversary

I had forgotten to mention, but we hit our 12 year mark on the 1st of August. We’ve grown from a small “garage” based company (really “basement-based” in Michigan, as garages aren’t heated in winter, nor cooled in summer here), with one guy doing consulting, cluster system builds, tuning, benchmarking, white paper writing … to a 10-person outfit building the world’s fastest and densest tightly coupled storage and computing systems.

We’ve got a great team, terrific customers, and wonderful products that are second to none.

I’ll talk about this in a later post, but we are running 12th anniversary specials on a number of our products (JackRabbit, siFlash, Unison racks).

I’ll have a reflection post on what we’ve accomplished during our 12 years in business, and what you might see coming soon and in the longer view.


Time series databases and system metrics

I am working on updating our FastPath appliance web management/monitoring GUI for the day job, trying to push data into databases for later analysis.

Many tools have been written on the collection side (statsd, fluentd, …), and some are actually pretty cool. My concern is the way these tools impose their analytical and storage opinions on the storage side. The data collection side isn’t an issue; if anything, it’s a breath of fresh air relative to what else I’ve seen.

The problem is that I want to collect very detailed metrics, store them, and then analyze them with real analytical tools later. I don’t want to pre-aggregate the data.

This puts a little higher load on the storage side, but this isn’t an issue as our FastPath manager nodes have ample processing/storage power on their own.

The issue is on the analytical side. There are a large number of dashboarding solutions out there, and some are quite good looking. But most are written client side in JavaScript/Node.js and, not so curiously, are great for small amounts of data. Once you start pushing serious amounts of data into them, they choke rather badly. Server-side dashboarding means moving bits around efficiently, which makes mobile device support more complex.

It’s a balance.

This said, I am experimenting with pulling data from collectl and dumping it into InfluxDB. I can ingest detailed plotfiles from it easily, and get the data in quickly. The dashboards can then extract the data and generate nice plots. I am working with Grafana and Tasseo for the moment, and I played a little with influga as well.
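As a rough sketch of what that ingest step looks like: InfluxDB of this era (0.8.x) accepts writes as JSON posted to `/db/<db>/series`. The helper below builds a one-point payload for one of the series shown earlier; the host, database, and credentials are taken from the CLI example above, and the actual POST is left commented since it assumes a live server.

```shell
#!/bin/bash
# Sketch: build an InfluxDB 0.8-style JSON write payload for a single
# collectl-derived sample (series name, epoch seconds, value).
make_payload() {
  local series="$1" epoch="$2" value="$3"
  printf '[{"name":"%s","columns":["time","value"],"points":[[%s,%s]]}]' \
    "$series" "$epoch" "$value"
}

payload=$(make_payload "lightning.diskinfo.writes.sda" 1408138307 4)
echo "$payload"

# Shipping it to the server from the CLI example (requires a live instance):
# curl -s -X POST "http://192.168.5.117:8086/db/metrics/series?u=test&p=test" \
#      -d "$payload"
```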

In all these cases, the time series database is the limiting factor. In Grafana, though, everything looks beautiful, as long as you adhere precisely to its opinionated view of how data should be queried (which means I need to adapt our storage of data to the presentation tool to make it work well).

I’d love to use kdb+ for the TSDB. Just need to figure out either how to plug it into influxdb backend as the storage DB, or create a graphite/carbon/ceres frontend for kdb+ that drives it. kdb+ is blisteringly fast, and purpose built for the types of analytics we want to use. If we could have Grafana and Tasseo talk directly to kdb+ that’s also an option.

Sadly I don’t have much time right now to do the research around what it would take to make this happen. Maybe in a few weeks.


Comcast finally fixed their latency issue

This has been a point of contention for us for years. Our office has multiple network attachments, using Comcast is part of it. This is the main office, not the home office.

Latency on the link, as measured by DNS pings, has always been fairly high, in the 2-3ms region, compared to our other connection (using a different provider and a different technology), which has been a consistent 0.5ms for the last 2 years.

I’ve got some nice long term logs of this, and I can clearly see that Comcast made some change a few weeks ago. Now our latency on the link is 0.3-0.4ms consistently.
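The kind of check behind those logs can be done with little more than ping and awk. This is a sketch, not our actual monitoring; the 200ms threshold matches the firewall failover behavior described in this post, and the target host is purely illustrative.

```shell
#!/bin/bash
# Sketch: pull the average RTT out of a ping summary and compare it to
# the threshold at which the firewall would fail over to the other link.

# Extract avg (ms) from a summary line like:
#   rtt min/avg/max/mdev = 0.312/0.384/0.455/0.071 ms
avg_rtt() {
  awk -F'/' '/^rtt/ {print $5}'
}

# "redirect" if the average exceeds the threshold (both in ms), else "ok".
check_link() {
  awk -v a="$1" -v t="$2" 'BEGIN { if (a > t) print "redirect"; else print "ok" }'
}

avg=$(ping -c 5 -q 8.8.8.8 2>/dev/null | avg_rtt)
echo "avg=${avg:-n/a} ms -> $(check_link "${avg:-0}" 200)"
```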

No more of these bursts to 200ms+ causing our firewall to redirect traffic out the slower but more stable link. Now this one is the faster, lower-latency path.

I’d love to get real GbE to the premises, but the cost is currently prohibitive, and it involves paying the providers to rip up the road to run a cable. Unless I use AT&T, that is, who already have a pipe under the road.


π kernel achieved ….

From kernel.org


Be on the lookout for ‘pauses’ in CentOS/RHEL 6.5 on Sandy Bridge

Probably on Ivy Bridge as well.

Short version. The pauses that plagued Nehalem and Westmere are baaaack. In RHEL/CentOS 6.5 anyway. A customer just ran into one.

We helped diagnose/work around this a few years ago when a hedge fund customer ran into this … then a post-production shop … then …

Basically the problem came in from the C-states. In some instances, the deeper the sleep state, the more likely the processor would not come out of it, or would get stuck in the lower levels. This would manifest as a stutter … a momentary transient pause that was not easily reproducible. In the truest definition of the word, it was a Heisenbug.

We could make the problems go away by reducing the space of C-state transitions available to the processors, and by telling the processors to be less aggressive about idling.
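The post doesn’t spell out the exact boot parameters; a common combination for restricting C-states on RHEL/CentOS 6-era kernels (an assumption on my part, not necessarily what was used here) looks like this on the kernel line in grub.conf:

```
# Illustrative grub.conf kernel line (version string is RHEL 6.5's stock kernel):
#   intel_idle.max_cstate=0  -> disable the intel_idle driver's deeper states
#   processor.max_cstate=1   -> cap ACPI C-states at C1
#   (idle=poll goes further still, at a real cost in power and heat)
kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg-root \
    intel_idle.max_cstate=0 processor.max_cstate=1
```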

We needed a little more this time, so we had to add the requisite kernel boot parameters to tweak idle and C-states, as well as have this code running in the background on startup.

#!/usr/bin/perl
use strict;
use warnings;

# Hold /dev/cpu_dma_latency open with a low latency target; the kernel's
# pm_qos layer then keeps the CPUs out of the deeper C-states.
my $lat = shift;
$lat = 0 unless defined $lat;
$lat = ($lat < 0   ? 0   : $lat);
$lat = ($lat > 250 ? 250 : $lat);

printf "Setting CPU latency to %i to control C state\n", $lat;
open(my $fh, ">", "/dev/cpu_dma_latency") or die "FATAL ERROR: unable to open /dev/cpu_dma_latency\n";
syswrite $fh, $lat or die "FATAL ERROR: unable to set C state\n";

# The request only stays in force while the file is held open, so loop forever.
while (1) { sleep 60; }

and run it in the background

nohup /opt/scalable/bin/set_cpu_lat.pl 0 > /var/log/c-state 2>&1 & 

Then we needed to make sure this was correct in terms of the processor states, so we lit up powertop. If you run the command without options, you get an instantaneous snapshot of the current state. And it shows all C0s when we are done.

… though …

If you run it with the --csv option, and look at the idle report, you can see the impact of your changes … or non-impact.
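Independent of powertop, the active constraint can be read back from the device node itself. A sketch (assumptions: root access, and a kernel that exposes the current pm_qos value through /dev/cpu_dma_latency as a binary 32-bit integer):

```shell
#!/bin/bash
# Sketch: decode the current cpu_dma_latency pm_qos value (microseconds).
read_dma_latency() {
  local dev="${1:-/dev/cpu_dma_latency}"
  if [ -r "$dev" ]; then
    # The kernel hands back a native-endian 32-bit integer; decode with od.
    od -An -t d4 -N4 "$dev" | tr -d ' '
  else
    echo "n/a"
  fi
}

echo "cpu_dma_latency: $(read_dma_latency) us"
```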

This is where the previous post about tuned came from. Assume that somewhere in the system is an

alias pure_evil tuned

or something like that

This is the obligatory Time Bandits reference

No seriously, tuned … just say no.

Customer will test with this now. But this echoes very much the previous problem on the Nehalem/Westmere platforms. As I remember, the original problem had a silicon-based component (an issue with a timer in the IOH/PCH or something) coupled with buggy software. Given that it isn’t in 6.4, I am just gonna call this a software bug and move on.


The best thing one can do with the tuned system is

yum remove tuned tuned-utils

This isn’t quite as bad as THP, but it’s close.


Soon … 12g goodness in new chassis

This is one of our engineering prototypes that we had to clear space for. A couple of new features I’ll talk about soon, but you should know that these are 12g SAS machines (will do 6g SATA of course as well).


Front of unit:



Note the new logo/hand bar. The rails are also brand new, and are set to enable easy slide in/out even with 100+ lbs of disk in them.


Backplanes:



We’ve aggregated the 15 backplanes into 5 physical units. Easier to install/manufacture, lower costs, tastes great, …

These are 12g ports, and the design is still our great direct-attached mechanism. This matters for performance, and we can always add an expander into the design as needed, outside of the backplane. Keeps costs down and performance way … way up.

Working on getting some 12g SSD goodness to do some testing. We’ve got early indicators of performance going, and yes, it is blowing our (collective) minds. This is a massive step forward past our current generation of siFlash in terms of raw performance, and that was already easily the world’s fastest storage unit.

As I said … soon …




Comcast disabled port 25 mail on our business account

We have a business account at home. I work enough from home that I can easily justify it. Fixed IP, and I run services, mostly to back up my office services.

One of those services is SMTP. I’ve been running an SMTP server, complete with antispam/antivirus/… for years. Handles backup for some domains, but is also primary for this site.

This is allowable on business accounts. Or it was allowable.

3 days ago, they seem to have turned that off. My wife noted that mail had … stopped.

So I started looking into it. Checked the firewall, checked the server. Tried telnetting into the mail port. Nothing. Tried the same thing from within the firewall. Worked.

Tried with a machine outside of the firewall, but before the cable modem.

Worked.

At this stage in the story, gentle reader, you will have to imagine me shaking my head in disbelief. Again.

ISPs, as a general rule, are evil, with rare exceptions. I want a wire, a dumb stinking wire. I don’t need any other security outside of my perimeter, I handle that stuff just fine. I need speed, I need reliability. I just need a damned wire.

I don’t need a nanny ISP telling me what ports I can and cannot have open.

Thankfully, there is an easy solution for this, and I’ve been slowly working in that direction for a while.

Move our DNS and mail service onto a cloud machine.

So I spent time, between walking the dog, making coffee, and making breakfast, doing just that. I had forgotten how wonderful setting up postfix is (it isn’t), especially our deep spam/virus filtering pipeline (not fun at all). I still have a few minor issues to iron out, but the mail system is back, and it’s coupled with the DNS system I wanted to set up anyway. And I used SSLmate to get some new certs for the email while I was at it.

I can easily see Comcast doing the same thing on port 80 or 443 in the future. So we are likely going to have to disaggregate more of our infrastructure and move it external to our site.

All I want is a dumb fast wire. Too bad Google Fibre won’t be showing up around here. But I bet I couldn’t use that for business either.

One would think that, with the advent of the cloud universe, there would be demand for dumb fast wires.


Fantastic lecture from Michael Crichton

This is Michael Crichton of The Andromeda Strain, Jurassic Park, and other stories. A fantastic storyteller, he absolutely nails his subject. The original was on his website, and I grabbed a copy from here.

One of the wonderful quotable paragraphs within is this:

And so, in this elastic anything-goes world where science (or non-science) is the handmaiden of questionable public policy, we arrive at last at global warming. It is not my purpose here to rehash the details of this most magnificent of the demons haunting the world. I would just remind you of the now-familiar pattern by which these things are established. Evidentiary uncertainties are glossed over in the unseemly rush for an overarching policy, and for grants to support the policy by delivering findings that are desired by the patron. Next, the isolation of those scientists who won’t get with the program, and the characterization of those scientists as outsiders and “skeptics” in quotation marks: suspect individuals with suspect motives, industry flunkies, reactionaries, or simply anti-environmental nutcases. In short order, debate ends, even though prominent scientists are uncomfortable about how things are being done.

When did “skeptic” become a dirty word in science? When did a skeptic require quotation marks around it?

A real scientist is, by the very definition of the word, a skeptic.
