π kernel achieved ….

From kernel.org


Be on the lookout for ‘pauses’ in CentOS/RHEL 6.5 on Sandy Bridge

Probably on Ivy Bridge as well.

Short version. The pauses that plagued Nehalem and Westmere are baaaack. In RHEL/CentOS 6.5 anyway. A customer just ran into one.

We helped diagnose/work around this a few years ago when a hedge fund customer ran into this … then a post-production shop … then …

Basically the problem came from the C-states. In some instances, the deeper the sleep state, the more likely the processor would fail to come out of it, or would get stuck in the lower levels. This would manifest as a stutter … a momentary transient pause that was not easily reproducible. In the truest definition of the word, it was a Heisenbug.

We could make the problems go away by reducing the space of C-state transitions available to the processors, and by telling the processors to be less aggressive about idling.

We needed a little more this time, so we had to add the requisite kernel boot parameters to tweak idle and cstate, as well as have this code running in the background on startup.


#!/usr/bin/perl
use strict;
use warnings;

my $lat = shift // 0;   # requested latency limit in microseconds, default 0

# clamp the request to the range 0 .. 250
$lat = ($lat < 0 ? 0 : $lat);
$lat = ($lat > 250 ? 250 : $lat);

printf "Setting CPU latency to %i to control C state\n",$lat;
open(my $fh,">/dev/cpu_dma_latency") or die "FATAL ERROR: unable to set C state\n";
syswrite $fh,$lat or die "FATAL ERROR: unable to set C state\n";
while (1) { sleep 60 ; } # loop forever, as the file needs to remain open to force the C-state correctly.

and run it in the background

nohup /opt/scalable/bin/set_cpu_lat.pl 0 > /var/log/c-state 2>&1 & 
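For comparison, here is a minimal Python sketch of the same idea, not the script we deployed. It assumes the documented PM QoS behavior of /dev/cpu_dma_latency: the request is written as a native 32-bit integer (in microseconds), and it only stays in force while the file descriptor is open. The function name and the `path` parameter are made up for illustration; `path` exists only so the function can be pointed at an ordinary file when experimenting.

```python
import os
import struct


def hold_cpu_latency(max_latency_us, path="/dev/cpu_dma_latency"):
    """Write a latency limit to the PM QoS device and return the open fd.

    The caller must keep the returned descriptor open: the kernel drops
    the request as soon as the file is closed.
    """
    # Clamp to the same 0..250 range the Perl script uses.
    max_latency_us = max(0, min(250, int(max_latency_us)))
    fd = os.open(path, os.O_WRONLY)
    # Assumption: the device accepts a native signed 32-bit value.
    os.write(fd, struct.pack("i", max_latency_us))
    return fd
```

Run as root, calling `hold_cpu_latency(0)` and then sleeping in a loop mirrors what the Perl script does; closing the descriptor (or exiting) releases the limit.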

Then we needed to make sure this was correct in terms of the processor states, so we lit up powertop. If you run the command w/o options, you get instantaneous snapshots. And it shows all C0’s when we are done.

… though …

If you run it with the --csv option, and look at the idle report, you can see the impact of your changes … or non-impact.

This is where the previous post about tuned came from. Assume that somewhere in the system is an

alias pure_evil tuned

or something like that

This is the obligatory Time Bandits reference

No seriously, tuned … just say no.

Customer will test with this now. But this echoes very much the previous problem on the Nehalem/Westmere platforms. As I remember, the original problem had a silicon-based component (an issue with a timer in the IOH/PCH or something) coupled with buggy software. Given that it isn’t in 6.4, I am just gonna call this a software bug and move on.


The best thing one can do with the tuned system is

yum remove tuned tuned-utils

This isn’t quite as bad as THP, but it’s close.


Soon … 12g goodness in new chassis

This is one of our engineering prototypes that we had to clear space for. A couple of new features I’ll talk about soon, but you should know that these are 12g SAS machines (will do 6g SATA of course as well).

Front of unit:

Note the new logo/hand bar. The rails are also brand new, and are set to enable easy slide in/out even with 100+ lbs of disk in them.


We’ve aggregated the 15 backplanes into 5 physical units. Easier to install/manufacture, lower costs, tastes great, …

These are 12g ports, and the design is still our great direct-attached mechanism. This matters for performance, and we can always add in an expander as needed into our design outside of the backplane. Keeps costs down and performance way … way up.

Working on getting some 12g SSD goodness to do some testing. We’ve got early indicators of performance going, and yes, it is blowing our (collective) mind. This is a massive step forward past our current generation of siFlash in terms of raw performance, and that was easily the world’s fastest storage unit.

As I said … soon …


Comcast disabled port 25 mail on our business account

We have a business account at home. I work enough from home that I can easily justify it. Fixed IP, and I run services, mostly to back up my office services.

One of those services is SMTP. I’ve been running an SMTP server, complete with antispam/antivirus/… for years. Handles backup for some domains, but is also primary for this site.

This is allowable on business accounts. Or it was allowable.

3 days ago, they seem to have turned that off. My wife noted that mail had … stopped.

So I started looking into it. Checked the firewall, checked the server. Tried telnetting into the mail port from outside. Nothing. Tried the same thing from within the firewall. Worked.

Tried with a machine outside of the firewall, but before the cable modem.
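The test at each of those points is just a TCP connect to port 25, which is what telnet does. A minimal sketch of the same probe follows, in case you want to script it from several vantage points. The function name is made up for illustration, and a connection that times out (silently dropped) and one that is refused both come back as `False`.

```python
import socket


def port_open(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running `port_open("mail.example.com")` (a placeholder hostname) from inside the firewall, from between the firewall and the cable modem, and from an outside host shows where the traffic stops.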


At this stage in the story, gentle reader, you will have to imagine me shaking my head in disbelief. Again.

ISPs, as a general rule, are evil, with rare exceptions. I want a wire, a dumb stinking wire. I don’t need any other security outside of my perimeter, I handle that stuff just fine. I need speed, I need reliability. I just need a damned wire.

I don’t need a nanny ISP telling me what ports I can and cannot have open.

Thankfully, there is an easy solution for this, and I’ve been slowly working in that direction for a while.

Move our DNS and mail service into a cloud machine.

So I spent time, between walking the dog, making coffee, making breakfast, doing just that. I had forgotten how wonderful setting up postfix is (it isn’t), especially our deep spam/virus filtering pipeline (not fun at all). Still have a few minor issues to iron out, but now the mail system is back, and it’s coupled with the DNS system I wanted to set up anyway. And I used SSLmate to get some new certs for the email while I was at it.

I can easily see Comcast doing the same thing on port 80 or 443 in the future. So we are likely going to have to disaggregate more of our infrastructure and move it external to our site.

All I want is a dumb fast wire. Too bad Google Fiber won’t be showing up around here. But I bet I couldn’t use that for business either.

One would think that, with the advent of the cloud universe, there would be demand for dumb fast wires.


Fantastic lecture from Michael Crichton

This is Michael Crichton of The Andromeda Strain, Jurassic Park, and other stories. Fantastic storyteller, he absolutely nails his subject. The original was on his website, and I grabbed a copy from here.

One of the wonderful quotable paragraphs within is this:

And so, in this elastic anything-goes world where science--or non-science--is the handmaiden of questionable public policy, we arrive at last at global warming. It is not my purpose here to rehash the details of this most magnificent of the demons haunting the world. I would just remind you of the now-familiar pattern by which these things are established. Evidentiary uncertainties are glossed over in the unseemly rush for an overarching policy, and for grants to support the policy by delivering findings that are desired by the patron. Next, the isolation of those scientists who won’t get with the program, and the characterization of those scientists as outsiders and “skeptics” in quotation marks--suspect individuals with suspect motives, industry flunkies, reactionaries, or simply anti-environmental nutcases. In short order, debate ends, even though prominent scientists are uncomfortable about how things are being done.

When did “skeptic” become a dirty word in science? When did a skeptic require quotation marks around it?

A real scientist is, by very definition, a skeptic.


But … GaAs is the material of the future … and always will be …

I read a note on IBM’s recent allocation of capital towards research projects. It had this tidbit in there:

III-V technologies
IBM researchers have demonstrated the world’s highest transconductance on a self-aligned III-V channel metal-oxide semiconductor (MOS) field-effect transistors (FETs) device structure that is compatible with CMOS scaling. These materials and structural innovation are expected to pave path for technology scaling at 7nm and beyond. With more than an order of magnitude higher electron mobility than silicon, integrating III-V materials into CMOS enables higher performance at lower power density, allowing for an extension to power/performance scaling to meet the demands of cloud computing and big data systems.

Well, there are a range of III-V materials. Not just GaAs.

One of the big issues is the lattice mismatch between Si and many of the III-V materials. The resulting strain introduces “artifacts” in the band structure, not to mention structural morphologies. This said, those artifacts may be what the engineers want. Aluminum phosphide (AlP) and gallium phosphide (GaP) are pretty well matched to Si. And the other properties aren’t bad.

But alas, AlP is quite toxic. GaP is possible, and has a growth process similar to that of silicon wafers. It’s used in LED applications today. InP is also quite possible and it is used today in other optical applications, such as laser diodes.

I remember that GaAs/InGaAs systems were starting to become interesting around the time I had finished up. I am not unhappy with the direction I took away from this, though it is interesting to review where it’s gone in the 18 years since I was more seriously involved.


Too simple to be wrong

I’ve been exercising my mad-programming skillz for a while on a variety of things. I got it in my head to port the benchmarks posted on julialang.org to perl a while ago, so I’ve been working on this in the background for a few weeks. I also plan, at some point, to rewrite them in q/kdb+, as I’ve been really wanting to spend more time with it.

The benchmarks aren’t hard to rewrite. No, that’s not been the challenge. The challenge has been to leverage tools I’ve not used much before, like PDL.

It boils down to this. Tools like Python etc. get quite a bit of attention for big data and analytical usage, while other tools, say some nice DSLs, possibly more appropriate to the tasks, get less attention. I wanted an excuse to spend more time with the DSLs.

And I am curious about the speed of them, and the core language. Perl isn’t slow as a language. The code is compiled down to an internal representation before execution, so as long as we don’t do dumb things (including using Red Hat builds of it), it should be reasonably fast. But more to the point, DSLs can provide significant programmer simplicity and performance benefits, to say the least, when used correctly.

So I set about to doing the port, and completed the basic elements of it. I ran the tests in C, Fortran, Perl, Python, Julia, and Octave on my laptop. The problems are toy sized problems though, and can’t be used for real comparisons … which is to the detriment of the presentation of the benchmarks on the Julia site. Actually, I’d argue that a set of real world problems, showing coding/development complexity, performance, etc. would be far better (and actually be quite useful for promoting Julia usage). FWIW, I am a fan of Julia, though I do wish for static compilation, to simplify distribution of a runtime version of code (lower space footprint).

For the perl port, I used relevant internal functions where it was wise to do so. Why should we recode quicksort, when there is already a built-in sort function (a mergesort by default in Perl since 5.8)? Where there were no internal functions, I looked at options from CPAN to provide the basis for the algorithm. Given that Python leveraged numpy, I thought PDL made sense to use in similar cases.

But I always started out with the original in pure perl to make sure the algorithm was correct. I used Python 3.4.0, Perl 5.18.2, gcc/gfortran 4.7.2, Octave 3.6.x, Julia 0.3.0.x.

One simple example coded up was the Fibonacci series computation. Usually used as a test of recursion. Code is relatively trivial.

Execution time is measured over m sets of N iterations. Timer resolution is +/- 0.625 ms, so N has to be large enough that the measured time is much larger than the tick resolution.
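A minimal sketch of the recursion test and the m-sets-of-N timing scheme is below. This is not the harness I used: the argument of 20, the set and iteration counts, and taking the best per-call time across sets are all assumptions made for illustration.

```python
import time


def fib(n):
    """Naive doubly recursive Fibonacci, used purely as a recursion test."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def best_time(func, arg, sets=3, iterations=20):
    """Best per-call time in seconds over `sets` batches of `iterations` calls.

    Timing a whole batch and dividing keeps each measurement well above
    the timer resolution.
    """
    best = float("inf")
    for _ in range(sets):
        start = time.perf_counter()
        for _ in range(iterations):
            func(arg)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / iterations)
    return best


print("fib(20) = %d, %.3f ms per call" % (fib(20), 1e3 * best_time(fib, 20)))
```

If the batch time comes out close to the tick resolution, raise `iterations` until it does not.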

lang      execution time (x 10^-3 s)
C         0.068
Fortran   0.074
Python    3.17
Julia     0.072
Perl      5.65

Interesting, but not surprising. What about a more computationally bound test, say sum reduction (as in the pi_sum test, which is quite similar to my Riemann zeta function test)?

lang      execution time (x 10^-3 s)
C         48.7
Fortran   48.6
Python    684.4
Julia     46.9
Perl      83.6

So how can Perl be within a factor of 2 or so of the compiled languages? What horrible things did I have to do to the code?

use PDL;

sub pisumvec {
    my $sum = 0.0;
    my $k;
    foreach my $i (1 .. 500) {
        $k    = sequence(10000,1) + 1;   # PDL vector holding 1 .. 10000
        $sum += sumover(1.0/($k*$k));    # vectorized sum of 1/k^2
    }
    return $sum;
}

A simple vector sum, repeated 500 times. Nothing complex here, the DSL is embedded in the language. The += in the sum line is to prevent the optimizer from making the inner loop go away, and be computed once.
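For checking results, a plain scalar version of the same reduction is handy. The sketch below is a reference loop, not one of the benchmarked codes, and the function and parameter names are made up for illustration. Note one deliberate difference: it resets the sum on every pass, so it returns the value of a single pass, whereas the PDL version above accumulates across passes and needs to be divided by the pass count before comparing.

```python
def pisum(passes=500, terms=10000):
    """Sum 1/k^2 for k = 1..terms, repeated `passes` times.

    The sum is reset on each pass, so the return value is that of one pass.
    """
    total = 0.0
    for _ in range(passes):
        total = 0.0
        for k in range(1, terms + 1):
            total += 1.0 / (k * k)
    return total
```

With the default 10000 terms a single pass comes to approximately 1.644834, which is pi^2/6 less a tail of roughly 1e-4.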

Nice. PDL has some cool powers in this case.

I also used it on the random matrix multiply bit.

lang      execution time (x 10^-3 s)
C         228.3
Fortran   904.6
Python    220.5
Julia     209.6
Perl      238.5

Ok, what’s surprising to me is the lower performance of the Fortran code. It is quite consistent … so I am guessing that we are hitting on an aliasing issue that isn’t apparent with the other codes. This has been a problem with Fortran for a long while, and can cause sudden performance loss on things that should be fast. Given that we were using the matmul intrinsic, this should be nearly optimal in performance.

Basically I am noting that Perl appears, in these microbenchmarks, cherry picked for the moment, to be holding its own.

The only outlier appears to be in the rand_mat_stat for Perl, and I think I might have made a coding error in it. Still looking it over (this is mostly for PDL exploration, and I am still trying to get my head around PDL).

But here’s where things go pear shaped. The Mandel code snippet. Basically it’s to compute the Mandelbrot set from (-2-1i) to (0.5+1i). We know what it should look like.

I decided to try to get cute with this. Always a mistake … learn from my errors oh young programmer, and do not hyperoptimize before you get it correct.

So here I have this awesome DSL, PDL. Instead of computing this an element at a time, why not compute the entire region at once? So I define a complex PDL, get the range/domains correctly. Iteration is trivial. The hard part is computing whether or not a particular point has escaped, as we wish to do this an image at a time. Turns out there are conditional functions that apply to the entire image, and generate “masks” or indices for indexing. This allows you to do things like $n->indexND($mask), and only increment the pointers that should be incremented (for the abs($z) <= 2).

Very cool stuff.

Computation is fast this way.

But it’s also wrong.

So I go back to doing a row at a time using PDL, and a simpler version of this.

Still wrong.


Do element at a time. Print out the counts, so I can see the image.

Yes, I do see the set, and it looks correct, quite like the images of the other sets I’ve seen.
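A minimal sketch of that element-at-a-time check is below. It is not the benchmark code or my port: the iteration cap of 80, the 0.1 grid step, and the function names are assumptions chosen for illustration. It counts iterations until abs(z) exceeds 2 and prints a crude character image so the shape can be eyeballed.

```python
def escape_count(c, max_iter=80):
    """Number of iterations of z -> z*z + c before abs(z) exceeds 2."""
    z = c
    for n in range(max_iter):
        if abs(z) > 2:
            return n
        z = z * z + c
    return max_iter  # never escaped: treat the point as inside the set


def mandel_grid(step=0.1, max_iter=80):
    """Escape counts over the region (-2-1i) to (0.5+1i), one list per row."""
    cols = int(round(2.5 / step)) + 1  # real axis: -2.0 .. 0.5
    rows = int(round(2.0 / step)) + 1  # imaginary axis: -1.0 .. 1.0
    return [[escape_count(complex(-2.0 + step * i, -1.0 + step * j), max_iter)
             for i in range(cols)]
            for j in range(rows)]


def render(grid, max_iter=80):
    """'#' for points that never escaped, '.' for everything else."""
    return "\n".join("".join("#" if n == max_iter else "." for n in row)
                     for row in grid)


print(render(mandel_grid()))
```

Comparing the counts point by point between two implementations is an easy way to see where they start to disagree.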

Ok, we are onto something.

Now look at the benchmark provided code.

Results look … well … wrong.

So I played with the version in Octave, and got the same results as the Python version. Which look wrong.

But the code is simply too simple to be wrong. Unless something else is going on. So I am delving into what this difference could be. Comparing the Octave and perl implementations. Making sure the code returns the same results for random sets of points, and then figuring out why or why not.

It’s too simple to be wrong, but … looking at the output, it definitely is. The question is which of these is wrong.


OS and distro as a detail of a VM/container

An interesting debate came about on the Beowulf list. Basically, someone asked if they could use Gentoo as a distro for building a cluster, after seeing a post from someone who did something similar.

The answer of course is “yes”, with the more detailed answer being that you use what you need to build the cluster and provide the cycles that you or your users will consume.

Hey, look, if someone really, truly wants to run their DOS application, Tiburon/Scalable OS will boot it. This isn’t an ego issue. We have customers actively booting everything from older Linux distros through SmartOS and Solaris rebuilds, and just about everything else in between.

But the interesting responses, to me anyway, came from the folks with one particular distro in mind. They wanted everyone to conform to their view, in that they believed the “enterpriseness” of their distro was a strong positive in their argument for one distro. Actually, it’s a strong negative, in that newer things that people want and need often don’t show up in it for years.

This point was driven home many times, but the people dug in.

I respect the viewpoints, and those making them, even if I disagree with them. CentOS and derivatives are no longer something anyone can ship in a commercial setting (seriously, read the site’s fine print), which in part was the last nail in the coffin for us. It’s very hard to add value to a system when you are not allowed to change a single bit (again read the fine print), never mind not being allowed to ship that system.

I view the OS or distro as a detail of the run or application. There is a set of distros based on Red Hat that make some of what we want to do very nearly impossible, and the new copyright and licensing fine print pushes the cost-benefit equation deep into the “look elsewhere” category. But for end users who have an app that depends on a particular distro, use that distro. Use what you need to use. There is no one size fits all.

Use what makes the most sense. We do.

We have customers using SmartOS (OpenSolaris derived) to run their apps. We have customers using OpenStack. Customers using other bits. Use whatever reduces the friction from start of planning through execution.

I look at the OS/distro as a detail of what it is you need to accomplish your job. And whether or not to containerize/VMize it as an engineering decision on the part of the user/administrator. Not as an imposition from on high.

Because the ease with which people can move from service to service makes mandates like that very self-defeating.


Scratching my head over a weird bonding issue

Trying to set up a channel bond into a 10GbE LAG. Set up bonding module, use the ‘miimon=200 mode=802.3ad’ options.

The switch was sending LACP packets, 1/sec to the NICs. The NICs bond formed. But it didn’t seem to negotiate the LACP circuit correctly with the switch. The switch never registered it.

I’ve not seen that one before.

With Mellanox, Arista, Cisco, others like that, the LACP circuit forms correctly and quickly. Here the bond came up on the machine and the LACP bond simply wasn’t active. Though I could see bootp packets over it, that’s all I saw.

And I don’t understand why.

Possible NIC incompatibility with the switch? Haven’t seen such a thing in a LONG time. This is a Brocade (a.k.a Extreme Networks) switch. I’ve not used EN in a long time, haven’t used their 10GbE kit.
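One thing worth scripting for the next attempt is a check of what the Linux side thinks it negotiated. The sketch below rests on an assumption about the text layout of /proc/net/bonding/<bond> on common kernels: that it lists a "Partner Mac Address" field for the aggregator, and that an all-zero value there means no LACP partner ever answered. The function names are made up for illustration.

```python
def lacp_partner_macs(bonding_text):
    """Return the partner MAC addresses found in /proc/net/bonding/<bond> text."""
    macs = []
    for line in bonding_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip().lower() == "partner mac address":
            macs.append(value.strip().lower())
    return macs


def lacp_negotiated(bonding_text):
    """True only if a partner is listed and none of the entries is all zeros."""
    macs = lacp_partner_macs(bonding_text)
    return bool(macs) and all(mac != "00:00:00:00:00:00" for mac in macs)
```

Feeding it the contents of /proc/net/bonding/bond0 (assuming the bond is named bond0) gives a quick yes/no on whether the host ever saw a partner, independent of what the switch reports.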

Will try again next week.
