Too simple to be wrong

I’ve been exercising my mad-programming skillz for a while on a variety of things. I got it in my head to port the benchmarks posted on julialang.org to perl a while ago, so I’ve been working on this in the background for a few weeks. I also plan, at some point, to rewrite them in q/kdb+, as I’ve been really wanting to spend more time with it.

The benchmarks aren’t hard to rewrite. No, that’s not been the challenge. The challenge has been to leverage tools I’ve not used much before, like PDL.

It boils down to this. Tools like Python etc. get quite a bit of attention for big data and analytical usage, while other tools, say some nice DSLs, possibly more appropriate to the tasks, get less attention. I wanted an excuse to spend more time with the DSLs.

And I am curious about their speed, and that of the core language. Perl isn’t slow as a language: the code is compiled down to an internal representation before execution, so as long as we don’t do dumb things (including using Red Hat builds of it), it should be reasonably fast. But more to the point, DSLs can provide significant programmer simplicity and performance benefits, to say the least, when used correctly.

So I set about doing the port, and completed the basic elements of it. I ran the tests in C, Fortran, Perl, Python, Julia, and Octave on my laptop. The problems are toy-sized, though, and can’t be used for real comparisons … which is to the detriment of the presentation of the benchmarks on the Julia site. Actually, I’d argue that a set of real-world problems, showing coding/development complexity, performance, etc., would be far better (and actually be quite useful for promoting Julia usage). FWIW, I am a fan of Julia, though I do wish for static compilation, to simplify distribution of a runtime version of code (lower space footprint).

For the perl port, I used relevant internal functions where it was wise to do so. Why should we recode quicksort when the builtin sort already provides an optimized sort? Where there were no internal functions, I looked at options from CPAN to provide the basis for the algorithm. Given that Python leveraged numpy, I thought PDL made sense to use in similar cases.
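
For instance, the benchmark’s quicksort test can lean directly on the builtin (a trivial sketch; the array size is my own choice):

my @unsorted = map { rand() } 1 .. 5000;        # random doubles
my @sorted   = sort { $a <=> $b } @unsorted;    # numeric comparator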

But I always started out with the original in pure perl to make sure the algorithm was correct. For the record, I used Python 3.4.0, Perl 5.18.2, gcc/gfortran 4.7.2, Octave 3.6.x, and Julia 0.3.0.x.

One simple example coded up was the Fibonacci series computation, usually used as a test of recursion. The code is relatively trivial.
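
In pure perl, the naive doubly recursive form the benchmark uses looks something like this (a minimal sketch):

sub fib {
    my $n = shift;
    return $n if $n < 2;                 # fib(0) = 0, fib(1) = 1
    return fib($n - 1) + fib($n - 2);    # deliberately naive recursion
}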

Execution time was measured over m sets of N iterations. Timer resolution is +/- 0.625 ms, so N has to be large enough that the measured time is much larger than the tick resolution.
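
The harness was along these lines (a sketch; the repetition count and the fib(20) argument are my assumptions, not the actual script):

use Time::HiRes qw(gettimeofday tv_interval);

# time $reps calls of $code, return the mean per-call time in ms
sub time_ms {
    my ($code, $reps) = @_;
    my $t0 = [gettimeofday];
    $code->() for 1 .. $reps;
    return 1000.0 * tv_interval($t0) / $reps;
}

printf "fib(20): %.3f ms/call\n", time_ms(sub { fib(20) }, 1000);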

lang       execution time (x 10^-3 s)
C          0.068
Fortran    0.074
Python     3.17
Julia      0.072
Perl       5.65

Interesting, but not surprising. What about a more computationally bound test, say sum reduction (as in the pi_sum test, which is quite similar to my Riemann zeta function test)?
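
For reference, the scalar kernel is just a nested loop, something like this in pure perl (a sketch mirroring the Julia benchmark’s structure):

# recompute sum(1/k^2, k = 1 .. 10000) 500 times over
sub pisum {
    my $sum;
    for my $j (1 .. 500) {
        $sum = 0.0;
        for my $k (1 .. 10000) {
            $sum += 1.0 / ($k * $k);
        }
    }
    return $sum;    # converges toward pi^2/6
}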

lang       execution time (x 10^-3 s)
C          48.7
Fortran    48.6
Python     684.4
Julia      46.9
Perl       83.6

So how can Perl be within a factor of 2 or so of the compiled languages? What horrible things did I have to do to the code?

use PDL;    # provides sequence() and sumover()

sub pisumvec {
    my $sum = 0.0;
    foreach my $i (1 .. 500) {
        my $k = sequence(10000, 1) + 1;      # k = 1 .. 10000
        $sum += sumover(1.0 / ($k * $k));    # vectorized sum of 1/k^2
    }
    return $sum;
}

A simple vector sum, repeated 500 times. Nothing complex here; the DSL is embedded in the language. The += in the sum line is there to prevent the optimizer from eliding the loop body and computing it just once.

Nice. PDL has some cool powers in this case.

I also used it on the random matrix multiply bit.
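
The core of that one is nearly a one-liner, since PDL overloads the x operator as matrix multiplication (a sketch; the 1000x1000 size is my assumption about the benchmark’s parameters):

use PDL;

my $a = random(1000, 1000);    # matrices of uniform random doubles
my $b = random(1000, 1000);
my $c = $a x $b;               # 'x' is matrix multiply in PDL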

lang       execution time (x 10^-3 s)
C          228.3
Fortran    904.6
Python     220.5
Julia      209.6
Perl       238.5

Ok, what’s surprising to me is the lower performance of the Fortran code. It is quite consistent … so I am guessing that we are hitting an aliasing issue that isn’t apparent with the other codes. This has been a problem with Fortran for a long while, and can cause sudden performance loss on things that should be fast. Given that we were using the matmul intrinsic, this should have been nearly optimal in performance.

Basically, I am noting that Perl appears, in these (admittedly cherry-picked) microbenchmarks, to be holding its own.

The only outlier appears to be in the rand_mat_stat for Perl, and I think I might have made a coding error in it. Still looking it over (this is mostly for PDL exploration, and I am still trying to get my head around PDL).

But here’s where things go pear-shaped: the Mandel code snippet. Basically, it computes the Mandelbrot set over the region from (-2 - 1i) to (0.5 + 1i). We know what it should look like.

I decided to try to get cute with this. Always a mistake … learn from my errors, oh young programmer, and do not hyperoptimize before you get it correct.

So here I have this awesome DSL, PDL. Instead of computing this an element at a time, why not compute the entire region at once? So I define a complex PDL, and get the range/domain correct. Iteration is trivial. The hard part is computing whether or not a particular point has escaped, as we wish to do this an image at a time. It turns out there are conditional functions that apply to the entire image, and generate “masks” or indices for indexing. This allows you to do things like $n->indexND($mask), and only increment the counters that should be incremented (those points still satisfying abs($z) <= 2).
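
The shape of the idea, sketched here with separate real/imaginary piddles and a boolean-mask add rather than indexND (PDL’s complex support varies by version; the function name and grid representation are my own choices):

use PDL;

# $cr/$ci: piddles holding the real/imaginary parts of c over the grid
sub mandel_image {
    my ($cr, $ci, $maxiter) = @_;
    my $zr = zeroes($cr->dims);          # z starts at 0
    my $zi = zeroes($cr->dims);
    my $n  = zeroes(long, $cr->dims);    # escape-time counts
    for (1 .. $maxiter) {
        my $alive = ($zr*$zr + $zi*$zi) <= 4;    # mask: |z| <= 2
        last unless any $alive;
        # z <- z^2 + c over the whole image; escaped points blow up
        # toward inf/nan, but the mask keeps their counts frozen
        ($zr, $zi) = ($zr*$zr - $zi*$zi + $cr, 2 * $zr * $zi + $ci);
        $n += $alive;                    # increment only the live points
    }
    return $n;
}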

Very cool stuff.

Computation is fast this way.

But it’s also wrong.

So I go back to doing a row at a time using PDL, and a simpler version of this.

Still wrong.

Gaak.

Do it an element at a time. Print out the counts, so I can see the image.
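
The element-at-a-time version is the easy one to trust (a sketch; the names are mine):

# escape time for a single point c = cr + ci*i
sub mandel_point {
    my ($cr, $ci, $maxiter) = @_;
    my ($zr, $zi) = (0.0, 0.0);
    for my $n (1 .. $maxiter) {
        ($zr, $zi) = ($zr*$zr - $zi*$zi + $cr, 2 * $zr * $zi + $ci);
        return $n if $zr*$zr + $zi*$zi > 4;    # |z| > 2: escaped
    }
    return $maxiter;    # never escaped within the iteration budget
}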

Yes, I do see the set, and it looks correct, quite like the images of the other sets I’ve seen.

Ok, we are onto something.

Now look at the benchmark provided code.

Results look … well … wrong.

So I played with the version in Octave, and got the same results as the Python version. Which look wrong.

But the code is simply too simple to be wrong. Unless something else is going on. So I am delving into what this difference could be. Comparing the Octave and perl implementations. Making sure the code returns the same results for random sets of points, and then figuring out why or why not.

It’s too simple to be wrong, but … looking at the output, it definitely is. The question is which of these is wrong.


OS and distro as a detail of a VM/container

An interesting debate came about on the Beowulf list. Basically, someone asked if they could use Gentoo as a distro for building a cluster, after seeing a post from someone who did something similar.

The answer of course is “yes”, with the more detailed answer being that you use what you need to build the cluster and provide the cycles that you or your users will consume.

Hey, look, if someone really, truly wants to run their DOS application, Tiburon/Scalable OS will boot it. This isn’t an ego issue. We have customers actively booting everything from older linux distros through SmartOS and solaris rebuilds, and just about everything else in between.

But the interesting responses, to me anyway, came from the folks with one particular distro in mind. They wanted everyone to conform to their view, believing that the “enterpriseness” of their distro was a strong argument for it. Actually, it’s a strong negative, in that newer things that people want and need often don’t show up in it for years.

This point was driven home many times, but the people dug in.

I respect the viewpoints, and those making them, even if I disagree with them. CentOS and derivatives are no longer something anyone can ship in a commercial setting (seriously, read the site’s fine print), which in part was the last nail in the coffin for us. It’s very hard to add value to a system when you are not allowed to change a single bit (again, read the fine print), never mind not being allowed to ship that system.

I view the OS or distro as a detail of the run or application. There are a set of distros based on Red Hat that make some of what we want to do very nearly impossible, and the new copyright and licensing fine print pushes the cost-benefit equation deep into the “look elsewhere” category. But for end users who have an app that depends on a particular distro: use that distro. Use what you need to use. There is no one size fits all.

Use what makes the most sense. We do.

We have customers using SmartOS (OpenSolaris derived) to run their apps. We have customers using openstack. Customers using other bits. Use whatever reduces the friction from start of planning through execution.

I look at the OS/distro as a detail of what you need to accomplish your job, and whether or not to containerize/VMize it as an engineering decision on the part of the user/administrator. Not as an imposition from on high.

Because the ease with which people can move from service to service makes mandates like that very self-defeating.


Scratching my head over a weird bonding issue

Trying to set up a channel bond into a 10GbE LAG: load the bonding module with the ‘miimon=200 mode=802.3ad’ options.
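
Roughly like this (a sketch; the file location is an assumption, not the actual config):

# /etc/modprobe.d/bonding.conf
options bonding miimon=200 mode=802.3ad

# then check negotiation state from the kernel's side
cat /proc/net/bonding/bond0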

The switch was sending LACP packets, one per second, to the NICs. The bond formed on the host, but it didn’t seem to negotiate the LACP circuit correctly with the switch. The switch never registered it.

I’ve not seen that one before.

With Mellanox, Arista, Cisco, and others like that, the LACP circuit forms correctly and quickly. Here the bond came up on the machine, but the LACP bond simply wasn’t active. Though I could see bootp packets over it, that’s all I saw.

And I don’t understand why.

Possible NIC incompatibility with the switch? I haven’t seen such a thing in a LONG time. This is a Brocade (a.k.a. Extreme Networks) switch. I’ve not used EN in a long time, and haven’t used their 10GbE kit.

Will try again next week.


New customers

We have a number of nice new customers that have been absorbing just about all of my time for the last few weeks. This is goodness.

One has our current generation FastPath Cadence SSD converged computing and storage system, and will be running kdb+ on it.

Another has a 1PB Unison parallel file system; while we did the previous 2TB write in 73 seconds with it, we’ve done some tuning and tweaking and are down to 68 seconds. With no SSDs in the primary data storage path. Yes, our spinning rust boxes are often as fast as or faster than various competitors’ SSD boxen. You should see how fast our SSD arrays are.

Another has a brand new Unison storage target for genomics we helped set up today.

Others have a new DeltaV pair for media service.

etc.

It’s been a busy month. A good month, but a busy one. Hopefully some coding time for me over the long holiday weekend.


M&A: PLX snarfed by … Avago ?

Ok, didn’t see this acquirer coming, but PLX being bought … yeah, this makes sense. Avago looks like they are trying to become the glue between systems, whether the glue is a data storage fabric, or communications fabric, etc.

PLX makes PCIe switches and other kit. PCIe switch and interconnection is the direction that many are converging to. Best end to end latencies, best per-lane performance, no protocol stack silliness to deal with. Just RDMA your data over into a buffer.

I had thought maybe Intel might grab them at some point as they were grabbing other network fabric bits.

More M&A anticipated. Much more.


M&A: SanDisk snarfs FusionIO for $1.1B USD

This is only the beginning folks … only the beginning.

See this.

FusionIO was, quite arguably, in trouble. They needed a buyer to take them to the next level, and to avoid being made completely irrelevant. SanDisk is a natural partner for them. They have the fab and chips, FusionIO has a product. SanDisk has a vision for a flash-only data center.

What’s interesting about this is that Fusion was sort of the last of the independent enterprise-class PCIe flash vendors. There are a few smaller shops running around. LSI sold its Nytro bits off to Seagate. HGST/WD bought our friends at Virident. Intel has their own PCIe card. Micron has theirs, but they are a relatively smaller player. I don’t think Samsung has a PCIe card version. Toshiba bought the remnants of OCZ, which included a PCIe card. One we tested in the past, and … well … hilarity ensued.

Trust me, don’t ask.

So, unless I am wrong, there really aren’t any independent enterprise class PCIe folks out there anymore. Which is interesting. I didn’t expect PCIe flash to last as long as it did … in part … because it is intrinsically non-hotswappable. So if you are deploying it in mission critical areas, you need to build pairs of servers in fail-over configs.

Virident has technology to reduce the impact of flash failure built into their controllers. Not sure anyone else does.

Back to the business side. I’d expect to see Violin go next. Like FusionIO, they are in trouble. They are, to a degree, attempting to expand into our area, and compete against tightly coupled machines with massive IO systems and heavily tuned kernels.

Possibly Nimble, though they are holding their own, not struggling the way Violin is.

Pure, Tegile, Tintri, Nutanix etc. have made acquisitions in this space more costly due to their IPO prep and valuations. I’d expect to see the smaller players in this space snarfed up this summer.

There is always a frothy acquisition and girding for battle around major paradigm changes. Spinning rust being used for lower cost things that tape was once used for, and tape being relegated to where it is headed is a massive change*. All flash data centers are on the way, and flash capacity is growing massively. I’m under NDA so I can’t say what I am aware of or from whom, but you should expect to hear of some very surprising things soon (assuming they come to fruition, which I am betting on).

And as a tangent, I noted that IBM, EMC, and HDS were all claiming the title of flash superiority as they shipped 17-19PB of flash last year. Curiously, we shipped a bit north of 0.5PB of flash last year, 1/34th to 1/40th of the big boys. And our flash is probably a bit faster.

* Yeah, I know, some industry stalwarts will disagree strongly with my characterization of tape and disk, not to mention flash … the proof is in the pudding as it were.


Selling inventory to clear space

[Update 16-June] We’ve sold the 64 bay FastPath Cadence (siFlash based), and now we have a few more 60 bay hybrid Ceph and FhGFS units, as well as a 48 bay front mount siFlash.

What’s coming in are many of our next gen 60 bay units, with a new backplane design, and we want to start running benchmarks with them ASAP. As we have limited space in our facility, we gotta make hard choices …

Email me (landman@scalableinformatics.com) if you’d like more info on what we have left. New units due in next week or so from our manufacturer, so we need to clear these out!

I feel like Crazy Eddie right now …


Well, in the course of new product development and testing, sometimes you need to make space in the lab by selling the current kit.

We are selling our record-setting unit, our Cadence system, easily the fastest in-market server and storage in a 4U container. A few others as well; see below.

The reason is that the next generation machines are coming in, and we can’t reuse (as in move over) the existing SSDs or backplanes (6G vs. next-gen bits).

For those who don’t remember, the Cadence is this unit (TSA-F4) we were showing off on the floor of SC13.

First come, first served, we need to clear the space quickly.

Other units we have in the lab we need to move out include a Unison Ceph system, and a 60 bay JackRabbit converged storage and computing unit.

Send an email to me (landman _AT_ scalableinformatics *DOT* com ) if you’d like to discuss.


Divestment: Violin sells off PCIe flash card

This article notes that Violin has divested itself of its PCIe flash card. This card was, to a degree, a shot across the Fusion IO/Virident/Micron bows. I don’t think it ever was a significant threat to them though.

Terms of the sale indicate about $23M in cash and the assumption of about $0.5M in liabilities, as well as the hiring of the team.

What is interesting is where it was sold.

Hynix. Yes, the memory chip/flash maker.

I’ve been saying for a while that the parts OEMs want to get more vertically integrated, and capture more of the up-market margin and business. PCIe flash cards would be up-market for Hynix, and they’d be able to find a willing reseller community to push this.

This is also one less distraction for Violin. Given their other issues, this may or may not make a difference.

I fully expect to hear more M&A and divestment announcements soon. Companies are girding themselves for the battles in the new market space ahead.


M&A: Seagate acquires LSI’s flash and accelerated bits from Avago

I’ve been saying for a while that M&A is going to get more intense as companies gird for the battles ahead.

I see component vendors looking at doing vertical integration … not necessarily to compete with their customers, but to offer them alternatives, reference designs, etc. and capture a portion of the higher margin businesses.

This move gives Seagate control over Sandforce controllers, and PCIe flash.

See this link for more info. As I had noted, LSI’s flash business was not core to Avago. Less head scratching at this point.

But it does make sense. Seagate, WD, and Toshiba all want to move up-market from pure component sales. This is why Seagate is investing in Xyratex appliances instead of shuttering it. I have other examples, but being under NDA means I can’t talk about them; suffice it to say that we are pretty sure this is not the last M&A of the season … rather, it is the beginning of a very interesting change in the market.


Massive, unapologetic firepower: 2TB write in 73 seconds

A 1.2PB single-mount-point Scalable Informatics Unison system, running an MPI job (io-bm) that just dumps data as fast as the little InfiniBand FDR network will allow.

Our test case. Write 2TB (2x overall system memory) to disk, across 48 procs. No SSDs in the primary storage. This is just spinning rust, in a single rack.

This is performance pr0n, though safe for work.

usn-01:/mnt/fhgfs/test # df -H /mnt/fhgfs/
Filesystem      Size  Used Avail Use% Mounted on
fhgfs_nodev     1.2P  895M  1.2P   1% /mnt/fhgfs


usn-01:/mnt/fhgfs/test # /opt/openmpi/1.8.1/bin/mpirun --allow-run-as-root --hostfile hostfile  -np 48 ./io-bm.exe -n 2048 -b 48  -w -f /mnt/fhgfs/test/files

...
 
Thread=00004: host=usn-01 time = 72.055 s IO bandwidth = 606.351 MB/s
Thread=00032: host=usn-06 time = 72.008 s IO bandwidth = 606.748 MB/s
Thread=00025: host=usn-05 time = 72.118 s IO bandwidth = 605.818 MB/s
Thread=00030: host=usn-06 time = 72.200 s IO bandwidth = 605.134 MB/s
Thread=00014: host=usn-03 time = 72.242 s IO bandwidth = 604.782 MB/s
Thread=00003: host=usn-01 time = 72.291 s IO bandwidth = 604.371 MB/s
Thread=00013: host=usn-03 time = 72.328 s IO bandwidth = 604.062 MB/s
Thread=00027: host=usn-05 time = 72.333 s IO bandwidth = 604.021 MB/s
Thread=00033: host=usn-06 time = 72.391 s IO bandwidth = 603.541 MB/s
Thread=00045: host=usn-08 time = 72.299 s IO bandwidth = 604.303 MB/s
Thread=00040: host=usn-07 time = 72.377 s IO bandwidth = 603.656 MB/s
Thread=00000: host=usn-01 time = 72.432 s IO bandwidth = 603.195 MB/s
Thread=00015: host=usn-03 time = 72.513 s IO bandwidth = 602.521 MB/s
Thread=00031: host=usn-06 time = 72.518 s IO bandwidth = 602.482 MB/s
Thread=00022: host=usn-04 time = 72.520 s IO bandwidth = 602.464 MB/s
Thread=00026: host=usn-05 time = 72.526 s IO bandwidth = 602.413 MB/s
Thread=00041: host=usn-07 time = 72.535 s IO bandwidth = 602.338 MB/s
Thread=00028: host=usn-05 time = 72.505 s IO bandwidth = 602.585 MB/s
Thread=00034: host=usn-06 time = 72.565 s IO bandwidth = 602.094 MB/s
Thread=00005: host=usn-01 time = 72.580 s IO bandwidth = 601.970 MB/s
Thread=00044: host=usn-08 time = 72.572 s IO bandwidth = 602.035 MB/s
Thread=00001: host=usn-01 time = 72.632 s IO bandwidth = 601.535 MB/s
Thread=00021: host=usn-04 time = 72.643 s IO bandwidth = 601.447 MB/s
Thread=00011: host=usn-02 time = 72.723 s IO bandwidth = 600.782 MB/s
Thread=00008: host=usn-02 time = 72.723 s IO bandwidth = 600.785 MB/s
Thread=00009: host=usn-02 time = 72.728 s IO bandwidth = 600.739 MB/s
Thread=00007: host=usn-02 time = 72.752 s IO bandwidth = 600.542 MB/s
Thread=00019: host=usn-04 time = 72.770 s IO bandwidth = 600.391 MB/s
Thread=00024: host=usn-05 time = 72.752 s IO bandwidth = 600.539 MB/s
Thread=00002: host=usn-01 time = 72.797 s IO bandwidth = 600.174 MB/s
Thread=00035: host=usn-06 time = 72.791 s IO bandwidth = 600.218 MB/s
Thread=00017: host=usn-03 time = 72.786 s IO bandwidth = 600.264 MB/s
Thread=00016: host=usn-03 time = 72.802 s IO bandwidth = 600.127 MB/s
Thread=00043: host=usn-08 time = 72.764 s IO bandwidth = 600.441 MB/s
Thread=00012: host=usn-03 time = 72.815 s IO bandwidth = 600.020 MB/s
Thread=00020: host=usn-04 time = 72.813 s IO bandwidth = 600.039 MB/s
Thread=00039: host=usn-07 time = 72.839 s IO bandwidth = 599.829 MB/s
Thread=00010: host=usn-02 time = 72.840 s IO bandwidth = 599.820 MB/s
Thread=00042: host=usn-08 time = 72.816 s IO bandwidth = 600.014 MB/s
Thread=00037: host=usn-07 time = 72.856 s IO bandwidth = 599.686 MB/s
Thread=00036: host=usn-07 time = 72.927 s IO bandwidth = 599.104 MB/s
Thread=00018: host=usn-04 time = 72.927 s IO bandwidth = 599.100 MB/s
Thread=00038: host=usn-07 time = 72.943 s IO bandwidth = 598.972 MB/s
Thread=00023: host=usn-04 time = 72.956 s IO bandwidth = 598.866 MB/s
Thread=00029: host=usn-05 time = 72.979 s IO bandwidth = 598.674 MB/s
Thread=00006: host=usn-02 time = 72.997 s IO bandwidth = 598.528 MB/s
Thread=00046: host=usn-08 time = 72.965 s IO bandwidth = 598.786 MB/s
Thread=00047: host=usn-08 time = 73.043 s IO bandwidth = 598.149 MB/s
Naive linear bandwidth summation = 28874.455 MB/s
More precise calculation of Bandwidth = 28711.174 MB/s

The naive number simply sums each thread’s average rate; the more precise figure presumably divides the total data written by the slowest thread’s wall time (2,097,152 MB / 73.043 s ≈ 28,711 MB/s). And the files:

usn-01:~ # df -H /mnt/fhgfs/
Filesystem      Size  Used Avail Use% Mounted on
fhgfs_nodev     1.2P  2.2T  1.2P   1% /mnt/fhgfs

usn-01:/mnt/fhgfs/test # du -m files*
43691	files.0
43691	files.1
43691	files.10
43691	files.11
43691	files.12
43691	files.13
43691	files.14
43691	files.15
43691	files.16
43691	files.17
43691	files.18
43691	files.19
43691	files.2
43691	files.20
43691	files.21
43691	files.22
43691	files.23
43691	files.24
43691	files.25
43691	files.26
43691	files.27
43691	files.28
43691	files.29
43691	files.3
43691	files.30
43691	files.31
43691	files.32
43691	files.33
43691	files.34
43691	files.35
43691	files.36
43691	files.37
43691	files.38
43691	files.39
43691	files.4
43691	files.40
43691	files.41
43691	files.42
43691	files.43
43691	files.44
43691	files.45
43691	files.46
43691	files.47
43691	files.5
43691	files.6
43691	files.7
43691	files.8
43691	files.9

This is just a single rack.
