Friday morning/afternoon code optimization fun

Every now and then I sneak a little time in to play with code. I don’t get to do as much coding as I wish these days … or … not the type of stuff I really enjoy (e.g. squeezing as much performance out of stuff as I can).

The Ceph team are rewriting some code, and they needed a replacement log function that didn’t require FP, and was faster/smaller than table lookup. I played with it a little, starting with FP code (Abramowitz and Stegun is a good source for the underlying math; my copy is dog-eared and falling apart). Then I wanted to see what could be done with pure integer code. Log functions can be implemented in terms of series in FP, or shifts and logical operations(!) for ints.

So pulling out a number of options, I did some basic coding, and summed the integer logs of the arguments 1 to 65535 inclusive. I compared this to simply casting the result of the DP log2 function to an int. The code is simple/straightforward, and fast.

#include <stdio.h>
#include <math.h>
#include <sys/time.h>
 
#define NUMBER_OF_CALIPER_POINTS 10
struct timeval ti,tf,caliper[NUMBER_OF_CALIPER_POINTS];
struct timezone tzi,tzf;
 
 
int log_2 (unsigned int v) 
  {  	
	register unsigned int r; // result of log2(v) will go here
	register unsigned int shift;
 
	r =     (v > 0xFFFF) << 4; v >>= r;
	shift = (v > 0xFF  ) << 3; v >>= shift; r |= shift;
	shift = (v > 0xF   ) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3   ) << 1; v >>= shift; r |= shift;
    r |= (v >> 1);
    return r;
  }
 
int main (int argc, char **argv)
 {
 	int i, x, milestone;
 
 	float logx;
 	int sx = 0 , sumlogx = 0 ;
 	double delta_t;
 
 	milestone = 0;
    gettimeofday(&caliper[milestone],&tzf);
    for(x=1; x<65536; x++) sx+=(int)log2((double)x);
    milestone++;
   	gettimeofday(&caliper[milestone],&tzf);
 
    for(x=1; x<65536; x++) sumlogx+=log_2((unsigned int)x);
    milestone++;
  	gettimeofday(&caliper[milestone],&tzf);
 
   	printf("library function sum: %i\n",sx);
	printf("local   function sum: %i\n",sumlogx);
  
   /* now report the milestone time differences */
   for (i=0;i<=(milestone-1);i++)
    {
      delta_t = (double)(caliper[i+1].tv_sec-caliper[i].tv_sec);
      delta_t += (double)(caliper[i+1].tv_usec-caliper[i].tv_usec)/1000000.0;
      printf("milestone=%i to %i: time = %.5f seconds\n",i,i+1,delta_t);
    }
 }

Compile and run it, and this is what I get

landman@metal:~/work/development/log$ gcc -o log2.x log2.c -lm ; ./log2.x 
library function sum: 917506
local   function sum: 917506
milestone=0 to 1: time = 0.00284 seconds
milestone=1 to 2: time = 0.00091 seconds

Milestone 0 to 1 is the library log function. Milestone 1 to 2 is the local version in the source. Now change out the log_2 function for a different variant

int log_2 (unsigned int v) 
{

int r; // result goes here

static const int MultiplyDeBruijnBitPosition[32] = 
  {
    0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
    8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
  };

v |= v >> 1; // first round down to one less than a power of 2 
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;

r = MultiplyDeBruijnBitPosition[(unsigned int)(v * 0x07C4ACDDU) >> 27];

return r;
}

and the results are even more interesting

landman@metal:~/work/development/log$ gcc -o log2a.x log2a.c -lm ; ./log2a.x 
library function sum: 917506
local   function sum: 917506
milestone=0 to 1: time = 0.00288 seconds
milestone=1 to 2: time = 0.00065 seconds

And it gets faster.

Not my code for the log_2 function, but it’s fun to play with things like this.
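Before swapping either variant in for the library call, it is worth checking more than one sum. What follows is a minimal sketch of such a check, not part of the original benchmark. The names log2_shift, log2_debruijn, log2_reference, count_mismatches and sum_logs are made up for illustration, and the halving-loop reference is an assumption about what "correct" means here (floor of log2, with 0 mapping to 0).

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1 from the post: narrow down the top set bit with compares and shifts. */
static unsigned log2_shift(uint32_t v)
{
    unsigned r, shift;

    r     = (v > 0xFFFF) << 4; v >>= r;
    shift = (v > 0xFF)   << 3; v >>= shift; r |= shift;
    shift = (v > 0xF)    << 2; v >>= shift; r |= shift;
    shift = (v > 0x3)    << 1; v >>= shift; r |= shift;
    r |= (v >> 1);
    return r;
}

/* Variant 2 from the post: smear the top bit downward, then multiply and look up. */
static unsigned log2_debruijn(uint32_t v)
{
    static const unsigned char pos[32] = {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return pos[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}

/* Hypothetical reference: count how many times v can be halved before reaching 0. */
static unsigned log2_reference(uint32_t v)
{
    unsigned r = 0;

    while (v >>= 1)
        r++;
    return r;
}

/* Number of inputs in [lo, hi] where either variant disagrees with the reference. */
static unsigned long count_mismatches(uint32_t lo, uint32_t hi)
{
    unsigned long bad = 0;
    uint32_t x;

    assert(lo <= hi);
    for (x = lo; ; x++) {
        unsigned want = log2_reference(x);
        if (log2_shift(x) != want || log2_debruijn(x) != want)
            bad++;
        if (x == hi)            /* explicit exit so hi == UINT32_MAX cannot wrap */
            break;
    }
    return bad;
}

/* Same quantity the benchmark prints: the sum of integer logs over [lo, hi]. */
static unsigned long sum_logs(uint32_t lo, uint32_t hi)
{
    unsigned long sum = 0;
    uint32_t x;

    assert(lo <= hi);
    for (x = lo; ; x++) {
        sum += log2_debruijn(x);
        if (x == hi)
            break;
    }
    return sum;
}
```

Calling count_mismatches(1, 65535) covers the range used in the timing runs, and sum_logs(1, 65535) should reproduce the 917506 printed above. Widening the range to all 32-bit inputs is a matter of patience rather than code.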


Inventory reduction @scalableinfo

It’s that time of year, when the inventory fairies come out and begin their counting. Math isn’t hard, but the day job would like a faster and easier count this year.

So, the day job is working on selling off existing inventory. We have 4 units ready to go out the door to anyone in need of 70-144TB usable storage at 5-6 GB/s per unit. Specs are as follows:

16-24 processor cores
128 GB RAM
48x {2,3,4} TB top mount drives
4x rear mount SSDs (OS/metadata cache)
Scalable OS (Debian Wheezy based Linux OS)
3 year warranty

As this is inventory reduction, the more inventory you take, the happier we are (and the less work that the inventory fairies have to do). We have 4 units to sell off as follows: 2x 72 TB usable, 1x 108 TB usable, and 1x 144 TB usable.

Contact the day job for more info, first come, first served.

And of course, feel free to order many portable PetaBytes!


The #PortablePetaByte : Coming to a data center near you!

As seen at SC14. We have our Portable PetaByte systems available for sale. Half rack to many racks, 1 PB and upwards, 20GB/s and up. Faster with SSDs. See the link above!


Three years

It’s been 3 years to the day since I wrote this.

As we’ve been doing before this happened, and after this happened, we are going to a TSO concert on the anniversary of the surgery. It’s an affirmation of sorts.

I can tell you that 3 years in, it has changed me in some fairly profound ways … I no longer take some things for granted. I try to spend more time with the family, do more things with them. I waste less time with trivialities.

Only two more years to go and we can call ourselves “survivors”.

Randall Munroe of xkcd fame has, again, done a fairly awesome job of illustrating what goes through your mind as a family member of a survivor. I think Randall and Megan hit their 5 year mark soon. The comic called “Two Years” really nailed it.


Systemd, and the future of Linux init processing

An interesting thing happened over the last few months and years. Systemd, a replacement init process for Linux, gained more adherents, and supplanted the older style init.d/rc scripting in use by many distributions. Ubuntu famously abandoned init.d style processing in favor of upstart and others in the past, and has been rolling over to systemd. Red Hat rolled over to Systemd. As have a number of others.

Including, surprisingly, Debian.

For those who don’t know what this is, think of it this way. When you turn your machine on, and it starts loading the OS, init is the first process that runs, and it handles starting up all the rest of the system. It’s important, and it needs to do its job well.

We care about it for Scalable OS, as we take control of the normal startup procedure to handle our use cases. We work well with init.d and a number of others right now. We’ll have to explore systemd a bit more, but in general I am not expecting anything earth shattering.

This is in part, because the vast majority of what we see in the init type systems … well … for lack of a better phrase, just sucks.

Linux has been transitioning to an event based system with udev for a while. Udev is a rule based mechanism to handle events, with a kernel and user space component. Woe be unto those who mess with the way udev wants to work, as the scripts behind it are … broken … badly … . I say this as someone who has tried, very hard in a number of cases, to fix the broken-ness. In many cases I’ve discovered it’s easier to ignore the broken section and add intelligence into the setup/config code to work around the udev brain death.

Specifically, Linux does a great job at diskless booting. That is, until you share some directories. That udev needs. And assumes it has private copies of. So you have to work around that. You can’t fix udev … your fixes won’t be accepted upstream, and will just break at next OS update. So it’s easier to hack around it, and use a very light diversionary sleight-of-bits touch to make sure it does what you want.

Another great example is md RAID1 OS drives, and RHEL6/CentOS6. We actually had to hack the whole initramfs approach to work around the broken udev module that brings up raids (and failed to correctly bring up raids for the system to fully start up).

Yeah, I know … open source makes it possible. Terrible implementation makes it necessary.

Upstart was a little more sane, but still had some issues.

init.d/rc in Debian 7 is reasonable, though we’ve still seen quite a bit of breakage.

This all goes to the philosophy of the distro. Are they trying to be everything to everyone, or be a very well crafted system for a set of purposes? Too many want the former, not enough the latter. Scalable OS is all about the latter. Make it boot, easily (not quickly now, but that’s coming), make it just work.

Systemd promises to make startup in the init process different, pluggable (think udev and its horror), and so forth. We’ll have to play with it to see if it is mostly harmless or not. I suspect it’s going to cause at least a little grief with our startup mechanism, so we’ll see if we need to work around it, or throw it away.

During startup, many distros have concepts where they read (assumed local) configuration files to set up file systems, networks, functions. This is a lousy thing to do for clouds, clusters, etc. You really want a distributed control mechanism that provides these config options. Scalable OS has this implicit within it. But to get to this distributed control layer, you need network access.

And this is where most distros are sheer and utter crap in their network setup code. We have a far better way built into Scalable OS, that was born of the frustration of dealing with the distros’ broken network config mechanisms. Generally speaking, you should never start a dhcp process on an ethernet port that doesn’t have a carrier present (after bringing the port up). Yet, this is exactly what most distros do by default. It gets even more interesting when you invoke udev, pci scanning in the kernel (done in a different order from a previous kernel, so items are discovered in a different order), so that some machines are absolutely unable to get back onto the network after a kernel update.
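To make the carrier rule concrete, here is a minimal sketch of such a check. It is not the Scalable OS implementation. It assumes the Linux sysfs layout, where /sys/class/net/<iface>/carrier holds 0 or 1, and the helper names link_has_carrier and should_start_dhcp are made up for illustration. Reading that file typically fails while the port is administratively down, which is why the port has to be brought up first; the sketch treats that case as "unknown" and declines to start DHCP.

```c
#include <assert.h>
#include <stdio.h>

/* Conventional sysfs location for network interfaces on Linux. */
#define SYSFS_NET_DIR "/sys/class/net"

/*
 * Hypothetical helper: returns 1 if the link reports a carrier, 0 if it
 * reports none, and -1 if the state cannot be read (no such interface,
 * port still down, path too long).
 */
static int link_has_carrier(const char *net_dir, const char *iface)
{
    char path[256];
    FILE *f;
    int n, c;

    assert(net_dir != NULL && iface != NULL);
    n = snprintf(path, sizeof path, "%s/%s/carrier", net_dir, iface);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;              /* path would not fit in the buffer */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    c = fgetc(f);               /* first byte is '0' or '1' when readable */
    fclose(f);

    if (c == '1')
        return 1;
    if (c == '0')
        return 0;
    return -1;
}

/* Policy from the text: only start DHCP when a carrier is actually present. */
static int should_start_dhcp(const char *net_dir, const char *iface)
{
    return link_has_carrier(net_dir, iface) == 1;
}
```

A startup script would call should_start_dhcp(SYSFS_NET_DIR, "eth0") (the interface name is only an example) after bringing the port up, and skip or retry later on a zero result. Taking the directory as a parameter is just a convenience so the logic can be exercised against a scratch directory.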

Yeah, we’ve seen/experienced this. Quite common with RHEL/CentOS kernels updating to ours. And we’ve got workarounds to deal with udev when we need to.

The question we have is, will systemd make this better? Worse? Not impact this? I suspect that the pci scan done by the kernel won’t change much; it’s simply a question of how systemd will respond to this. We know how udev/init.d respond to things, and we’ve done our process change to remove the terrible/useless sections wherever possible. Though we still, on occasion, get bit by udev race conditions.

Udev is a piece of work. Fantastic for small machines without much stuff. Absolutely, completely borked for machines with lots of stuff. We see some occasional, fantastic, non-controllable race conditions in udev processes, that init is handling. My hope is that systemd is far smarter than its predecessors. I hate having to tell people that the solution to this seemingly mad system is to reboot it. Yet udev will drive you to this. Hopefully we can move past that.

But, if not, we’ll do what we’ve done with the others, and work around it, disabling what gets in the way.

That coupled with our configuration mechanism (not quite CoreOS like, we aren’t using etcd right now, but we’ll be evaluating it and other options … build vs “buy”), and we’ll be fine. Actually far better than any system that depends upon external mechanisms (like Chef, Puppet, et al) to configure machines post installation.

Hardware is code, and should be automatically configured. This is what Scalable OS does, and I am hoping that systemd won’t get in our way.

Even if it does, Debian just forked over this. So I am not worried at all.

There are some folks saying this is the “end of Linux” or other such fluff.

Not likely, but in the end, the operating system is an implementation detail (machine/container as code, it’s merely a configuration option). As long as I can use the hardware well, I am happy. Right now, for better or worse, Linux has the best driver support in the market, albeit sometimes a maddening driver support (see binary only modules delivered by OEMs without a clue). There are other choices … I’d love to see better driver support for Illumos based machines, and *BSD (though these generally do have OK driver support).

But we need Infiniband support, we need 40GbE and above support, we need memory channel storage and NVMe support for our customers. Limited choices there now.

So systemd will be a challenge to get through, but I am not overly worried. I see the OS as a substrate upon which to run bare metal/containerized/VMed apps. Systemd shouldn’t impact that too much, and if it does, it will be swept away.


Brings a smile to my face

My soon to be 15 year old daughter was engrossed with something on her laptop yesterday. Thinking it was fan-fiction, I asked her what she was writing.

She knitted her brow for a moment, and looked up.

“It’s code combat, Dad,” she said, quite matter of factly.

I must have had a slightly startled expression on my face. I knew she had dabbled with it, and had recommended (/sigh) Python as a language, after she took (and aced) a Java class last year, as Python is inherently simpler. I would have loved to have introduced her to Perl, but she needs to figure out which tools to use on her own.

“Nice” I said. “What are you coding in?”

“Well” she said, “I had been using Python, but it was too annoying. So I started using Lua.”

I was stunned at several levels. None of which have anything to do with gender. Mostly having to do with what I thought I knew about what she wanted to do in her down time. And that she now has taught herself two computer languages for code combat and other activities.

And of course, the self-taught computer geek (started with Basic, Assembly, Fortran, C, …) in me was thinking “a chip off the old block”.

I don’t want to push her in any particular direction, she’s got to decide what she wants, and discover what she enjoys. And that appears to include learning new computer languages for programming contests and games.

I want to make sure I encourage this. Solving problems is fun, and coding should be fun. I think she’s getting this.


Learning to respect my gut feelings again

A “gut feeling” is, at a deep level, a fundamental sense of something that you can’t necessarily ascribe metrics to, you can’t quantify exactly. It’s not always right. It’s a subconscious set of facts, ideas, concepts that seem to suggest something below the analytical portion of your mind, and it could bias you into a particular set of directions. Or you could take it as an aberration and go with “facts”.

As an entrepreneur, I’ve had many gut feelings about what to do, when to do it. They aren’t always right. But when they are … they are usually whoppers.

In 2002, when Vipin and I were in his lab looking at 40 core DSP chips, and speaking aloud about building accelerators out of them … that was a gut feeling that the high performance computing market had no other choice than to go that route to continue to advance in performance. We built business plans, architected designs for the platform, pricing models, went to investors, told them things that later turned out to be remarkably prescient. No one saw fit to invest unfortunately.

This has been a feature of my career. Many very good ideas, that later turn out to be huge markets, developing almost exactly as we speculated, and we can’t get investment. It’s fundamentally, profoundly frustrating. Not to mention discouraging. But we soldier on.

In 2005, when we had sold the concept of remote computer cycles (what was to be later called “cloud”) to a large company in Michigan, again, we tried to get investment going. We had a large committed customer, a good business model, a good operational model, even some investors lined up if we could get the state of Michigan to commit as part of its tri-corridor process. They only need have put in a token amount, and that’s all we needed. I need not tell you this didn’t happen, and the reasons given were, sadly, laughably wrong at best.

Our gut feelings on both of these markets were that they were going to be huge.

To put it mildly, we were right. Very very right.

The next epiphany was on the cluster and storage side. We’d been designing and building clusters up until then with embedded storage. Dell decided it wanted to own clusters, and it worked on depriving the small folks of oxygen with pricing gymnastics. It was easy for them to write off coming in under cost on clusters, much harder for a small outfit to justify paying a customer to take hardware. My gut feeling at the time was that clusters would become an impossibly hard market to work in, so we focused upon where we could add our unique value. Storage and storage based systems it was.

Along the way, we’ve seen many opportunities, some looking very good but bugging me something fierce, and some looking bad on the surface, but having the qualities that we needed. So I went with my gut on whether or not to pursue those.

And we’ve grown, by quite a bit during that time. There is much to be said for subconscious analytics.

As we’ve grown, we’ve brought on more people to work on opportunities. Recently we’ve had some opps, which we’ve serviced, that ran strongly against my gut feeling. As we’ve seen these evolve, my gut was right: they’ve turned into (in some cases) bad deals for us.

Also, as we look at our capitalization efforts, I get similar feelings about particular directions and potential investors. I don’t mind lots of legalese. We have lawyers to deal with that. I mind games. If people play them now, it will be worse later on. If they won’t act in reasonable time frames and with reasonable terms, we need to move on.

It’s this gut feeling that has served us very well, that I temporarily overrode in the past … that I am bringing back in a big way.

We met with an erstwhile customer at the SC14 show. Makes great promises, sets additional hurdles. Never does business with us. I’d like to, but the cost of chasing this customer may simply be too high for us, for little return, at this stage of our life. If we were bigger, it would be less of an issue. If we had a large investor with a lot of committed capital in us, again, less of an issue to act on working with them. But we don’t as of yet. We have a particular hand of cards we can play, and some we need to discard in order to improve our hand. As much as I might like to play this hand with that card for that customer, my gut tells me to wait.

It’s tough, but I’m going back to my gut feelings on these. For customers, for investors, whatever. If I don’t get a good feeling, or if I see actions which on their own might be innocuous, but collectively would be predatory, I’ll rethink working with them.

The gut feeling is all about the value prop and the ROI on effort. Sometimes it’s dead on. Far more often for us than not. It’s time to use it again, in the large.


#SC14 day 2: @LuceraHQ tops @scalableinfo hardware … with Scalable Info hardware …

Report XTR141111 was just released by STAC Research for the M3 benchmarks. We are absolutely thrilled, as some of our records were bested by newer versions of our hardware with a newer software stack. Congratulations to Lucera, STAC Research for getting the results out, and the good folks at McObject for building the underlying database technology.

This result continues and extends Scalable Informatics’ domination of the STAC M3 results. I’ll check to be sure, but I believe we are now the hardware side of most of the published records.

What’s really cool about this is that you can get this power from Lucera if you don’t want or need to stand up your own kit, from us or our partners if you prefer to stand up your own private cloud, or combinations if you would like to take advantage of all the additional capabilities and functionality Lucera brings to the table.


Starting to come around to the idea that swap in any form, is evil

Here’s the basic theory behind swap space. Memory is expensive, disk is cheap. Only use the faster memory for active things, and aggressively swap out the less used things. This provides a virtual address space larger than physical/logical memory.

Great, right?

No. Here’s why.

1) swap makes the assumption that you can always write/read to persistent memory (disk/swap). It never assumes persistent memory could have a failure. Hence, if some amount of paged data on disk suddenly disappeared, well …

Put another way, it increases your failure likelihood, by involving components with higher probability of failure into a pathway which assumes no failure.

2) it uses 4k pages (on Linux). Just. Shoot. Me. Now. Ok, there are ways to tune this a bit, and we’ve done this, but … but … you really don’t want to do many many 4k IOs to a storage device. Even an SSD.

NVMe/MCS may help here. But you still have the issue number 1, unless you can guarantee atomic/replicated writes to the NVMe/MCS.

3) Performance. Sure, go ahead and allocate, and then touch every page of that 2TB memory allocation on your 128GB machine. Go ahead. I’ve got a decade or two to wait.

4) Interaction with the IO layer is sometimes buggy in surprising ways. If you use a file system, or a network attached block device (think cloud-ish), and you need to allocate a SKB or some additional memory to write the block out, be prepared for some exciting (and not in the good way) failures, some spectacular kernel traces that you would swear are recursive allocation death spirals.

“Could not write block as we could not allocate memory to prepare swap block for write …”

Yeah. These are not fun.

5) OOM is evil. There is just no nice way to put this. OOM is evil. If it runs, think “wild west”. kill -9 bullets get lobbed at, often, important things. Using ICL to trace what happened will often leave you agape with amazement at the bloodbath you see in front of you.

So towards this end, we’ve been shutting off paging whenever possible, and the systems have been generally faster and more stable. We’ve got some ideas on even better isolation of services to prevent self flagellation of machines. But the take home lesson we’ve been learning is … buy more ram … it will save you headache and heartache.
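If you do shut paging off, it helps to be able to confirm that a machine really has no swap configured. A minimal sketch, assuming the Linux /proc/meminfo format with its SwapTotal line reported in kB; the function names swap_total_kb and swap_is_off are made up for illustration and this is not code from our stack.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: scan a meminfo-style stream for the SwapTotal line
 * and return its value in kB, or -1 if no such line is found.
 */
static long swap_total_kb(FILE *meminfo)
{
    char line[256];
    long kb;

    assert(meminfo != NULL);
    while (fgets(line, sizeof line, meminfo) != NULL) {
        /* Whitespace in the format matches the variable padding in the file. */
        if (sscanf(line, "SwapTotal: %ld kB", &kb) == 1)
            return kb;
    }
    return -1;
}

/* Returns 1 if no swap is configured, 0 if some is, -1 if it cannot tell. */
static int swap_is_off(const char *meminfo_path)
{
    FILE *f = fopen(meminfo_path, "r");
    long kb;

    if (f == NULL)
        return -1;
    kb = swap_total_kb(f);
    fclose(f);
    if (kb < 0)
        return -1;
    return kb == 0;
}
```

Something like swap_is_off("/proc/meminfo") can then be wired into a health check, so a box that quietly picked up a swap device after a reinstall gets flagged.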


#sc14 T-minus 2 days and counting #HPCmatters

On the plane down to NOLA. Going to do booth setup, and then network/machine/demo setup. We’ll have a demo visualfx reel from a customer who uses Scalable Informatics JackRabbit, DeltaV, and (as the result of an upgrade yesterday) Unison.

Looking forward to getting everything going, and it will be good to see everyone at the show!
