More unix command line humor

Waaaay back in grad school in (mumble) late 80s/early 90s (/mumble), I started using Unix in earnest. Back then, my dad shared some funny Unix error messages which were double entendres … often quite entertaining, as the shell was effectively playing the straight man in a comedy duo. Without intentionally doing so (of course).

Nowadays, you can ask Siri about the air speed of an unladen swallow, and get something funny back, but that is because Siri has had that capability programmatically added. The old error messages are funny precisely because the humor is unintentional.

See this link for some of them.

With this background, late last week, I saw a reference to a BSD library function call run around work. The call is ffs, and a ‘man 3 ffs’ on my Mac shows something like this.

FFS(3)                   BSD Library Functions Manual                   FFS(3)

NAME
     ffs, ffsl, ffsll, fls, flsl, flsll -- find first or last bit set in a bit
     string

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <strings.h>

     int
     ffs(int value);
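For anyone who hasn't met the call before, here is a minimal sketch of what ffs computes, written in Julia rather than C. This is not the libc implementation; it only mirrors the documented behavior (1-based index of the least significant set bit, 0 when no bit is set), and the Julia function name is simply borrowed for illustration.

```julia
# Sketch of ffs() semantics, not the libc code:
# 1-based position of the lowest set bit, or 0 if the value has no bits set.
ffs(value::Integer) = value == 0 ? 0 : trailing_zeros(value) + 1

println(ffs(0))    # no bits set
println(ffs(1))    # lowest set bit is bit 1
println(ffs(12))   # 12 is 0b1100, so the lowest set bit is bit 3
```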

Ok. This is part of the background. The other part is the common abbreviation indicating exasperation, which is FFS.

Now that this background is in place, let’s see if we can get some humor.

The man page shows up on my systems (Mac, SmartOS, Linux, etc.) under section 3. But I was given a man page section of 3c, so

landman@lightning:~$ man 3c ffs
No manual entry for ffs in section 3c

[for the unix command line humor impaired, replace the ffs with the urban dictionary version of this and say that “No manual entry” line out loud with that substitution …. not at work, or in front of small children, or on the phone …]

Thank you, thank you, I’ll be here all week.


What reduces risk … a great engineering and support team, or a brand name?

I’ve written about approved vendors and the “one throat to choke” concept in the past. The short take from my vantage point as a small, not well known, but highly differentiated builder of high performance storage and computing systems … was that this brand specific focus was going to remove real differentiated solutions from the market, while simultaneously lowering the quality and support of products in the market. The concept of brand and marketing of a brand is about erecting barriers to market entry against the smaller folk who might have something of interest, and the larger folk who might come in with a different ecosystem.

Remember “no one gets fired for buying IBM”? Yeah, this is that.

The implication is that this might not be true for other vendors than IBM.

This post is not about IBM BTW. Not even remotely.

It’s about the concept of risk reduction in vendor selection.

And what real risk reduction means.

Let’s look at this in terms of say … RAID units. RAID, as a concept, is about distributing the risk of failure across N units, with a scheme that lets the system survive and operate (albeit at a reduced capability) in the event of a failure of a single unit. In some cases, in the event of a failure of two units. RAID is not a backup (yeah, it is likely time I repost this warning). RAID is about giving time to operators to replace a system component so that operations can continue.

Erasure coding is a somewhat more intensive version of this, but basically the same thing.

You make the (reasonable) assumption, that you will have a failure. You have an architecture in place, resilient to that failure. When a failure comes, within the design spec of that resiliency, you mitigate the impact of the failure if you follow protocol. Of course, if you have a failure outside of this spec, yes, you can lose data. Which is why we have layered protocols and systems for disaster recovery (replication, on and off-site).

All of this matters. Whether you are building storage systems, large computing systems, clouds, etc.

The attention to detail, the base engineering, the ability to support … find problem root causes, and meaningful remediations/work arounds … all of this matters. And from my own experience, running a company that did these things, it matters far more than brands do.

A brand is meant to be an abstract mental concept … that somehow represents how well a product should behave, and the support/engineering behind it. However, a brand is rarely that. It is really, just a name. There is little empirical evidence that shows that slapping a particular label on the outside of a box does anything to make it better/more stable. And if I’m wrong, I’d love to see the peer-reviewed studies of this (a cursory google search yielded a few results of popular anecdotal articles, with limited real analysis behind them).

My claim is that engineering matters. What you put into your design and implementation matters. What doesn’t matter, I claim, is the brand name on the box.

Sure, you can claim “they have better access to supply chain, OEMs, etc.”. And you may be right. Without revealing anything specific, I can tell you that this access doesn’t necessarily result in better outcomes.

Actually, if your boxes never have issues to begin with … well, you understand.

But more to the point. Architecture matters. Engineering matters. Support matters. Brand? Not so much.

If you or your company are making decisions based upon brands, it might be a good exercise to ask … “why” … is this being done? Is this risk reduction? If so, what risk can you quantifiably and empirically determine has been reduced? I am guessing this isn’t the real reason.

Is it comfort level with a vendor? That is, you know the brand names won’t go away, be sold off, or go into bankruptcy. Like IBM, Apple, Sun … er … oh wait.

What is the real reason that you have to buy vendor X?

And getting back to the RAID analogy above, in order to reduce risk, shouldn’t you have 2 vendors (at minimum) who can produce the same things, with different parts? It is possible that some parts may be in common … you can’t escape that, but hey, a VW and a Porsche both have pistons, and they are very different vehicles, engineered and built to different standards.

It’s too late for my old company … though I am getting support requests now to my personal email account … but in general, the question you need to ask yourself is, am I really reducing risk by concentrating risk? Will I really get better support from a behemoth who only wants to deal with massive customers, or from a smaller dedicated team of experts who are highly focused upon me, because … hey … I am core to their business, and they are invested in me?

The question is, do you want a brand, or do you want solid engineering and support, invested in your success? Not every smaller company is like that.

Scalable was.

And we lost.

Because people wanted the single brand, single throat to choke.

On the other side of this now, I see that the single throat to choke doesn’t result in better outcomes.

So … you are going to get failures. You should be engineering for this. Planning for this.

Who will support you better?

That is who you should buy from.

Anyone focusing on ease of procurement over quality of engineering needs to be pulled out of the decision and purchasing loop. Really.

My argument is that operational and project risk increases when you do this. I don’t have hard numbers to demonstrate this. Merely observations of this risk being realized in various forms. With the common aspect being the single large preferred vendor taking business that would likely have gone elsewhere.


On hackerrank and Julia

My new day job has me developing considerably less code than my previous endeavor, so I like to work on problems to keep these particular muscles in steady use. Happily, I get to do more analytics than ever before, so this at least is some compensation for the lower amount of coding.

When I work on coding for myself, I’ll play with problems from my research days, or small throw-away ones, like on Hackerrank.

The latter group of problems tends to be somewhat poorly specified, with the solutions being preferred as minimum viable, passing all tests in the allotted time. Elegance of solution is not directly scored. Quality of implementation is determined entirely in terms of attaining the same results that the authors did for their test cases.

I’ve found the last part of that to be dubious in the past, after finding some errors in a number of their tests. In one case in particular, their discussion section argued that the problem reduced to evaluating a simple closed form solution. While I followed their logic, I did not agree with it, and coded up a simple brute force test (the problem was small enough to admit this mechanism). Sure enough, I found that their answers (to this one particular problem) were, in fact, wrong.

So, I take their “answers” with some grains of salt. This doesn’t take away from the joy of working through solving problems with code. So I do that, and generally care less about their ranking and scoring. Though I do run my code through their checker to see if it would “pass” their tests.

Ok … that background set.

I’ve been enjoying watching and working with small bits of Julia language over the years. I include it in my Nlytiq development base as one of the core components. I’ve built automation around its build and module installation to create a useful environment for me, others … scientists, data scientists and engineers, etc. Not just Julia, but many other tools.

With this basis, I play with problems. One of the problems on Hackerrank was what they called a “megaprime” number. This is not the more common use of megaprime (million digit prime). Their megaprime number is a prime number whose individual digits are also prime (i.e. every digit is one of 2, 3, 5, or 7).

Their problem was to take a set of two numbers as input, and find all the megaprimes between them.

There is a simple algorithm to do this … just find the primes between the lower and upper bound, and then test each prime for the property megaprime-ness.

The megaprime checker is also remarkably simple to articulate … and if you have a very powerful and expressive language, fairly simple to code.

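function megaprime(x::Int64)
  D = digits(x);            # digits of x, least significant first
  SD = size(D)[1];          # number of digits
  P = find(isprime.(D));    # indices of the digits that are prime
  if SD != size(P)[1]       # at least one digit was not prime
    return false
  end
  return true
end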

Take a 64 bit Int as input, named x. First, break x up into its digits, and store them in D. Second, get the leading dimension of that D array. Third, apply the isprime function to each digit in D, and keep the indices of those that come back true, storing them in an array P. If the digits did not all come back as prime (i.e. P is shorter than D), then return false; otherwise, fall through and return true.

Ok, I got lazy with the isprime function there. I could have simply tested the digits to be 2, 3, 5, or 7. If they were none of these, then the megaprime test fails. This one is a bit more readable though, and likely nearly as fast (testing single digits for primality shouldn’t take long … no need to optimize this further).
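For the curious, the non-lazy version of the digit test might look something like the sketch below. The names PRIME_DIGITS and has_prime_digits are made up for illustration, and this only covers the digit check (the number itself still has to be prime). It sticks to Base functions, so it should behave the same on current Julia releases.

```julia
# Hypothetical helper: true when every decimal digit of x is 2, 3, 5, or 7.
# This replaces only the per-digit isprime call, not the primality test of x itself.
const PRIME_DIGITS = (2, 3, 5, 7)

has_prime_digits(x::Integer) = x > 0 && all(d -> d in PRIME_DIGITS, digits(x))

println(has_prime_digits(2357))   # all four digits are prime
println(has_prime_digits(13))     # contains a 1, so this fails
```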

Then the driver code:

# read first/last
x = [parse(Int64,ss) for ss in split(readline())];
first = x[1];
last = x[2];
count = 0::Int64;

# make sure the bounds are ordered
if last < first
  first,last = swap(first,last);
end

# all primes in [first,last]
R = primes(first,last);

Here I read a line from STDIN, and parse it from a string into an array of Int64 types. I store the two in the first and last variables, and then check to make sure they are correctly ordered, swapping if not.

The swap function …

function swap(x,y)
  return y,x
end

… yes … this is the way it should be written.

julia> swap(1,2)

In the driver section, I generated a list of primes between first and last storing that in R.

It is important to note that these are fairly fast functions, even on my 6 year old laptop.

julia> @time primes(1000000000)
  4.099736 seconds (9 allocations: 766.314 MB, 2.72% gc time)
50847534-element Array{Int64,1}:

Primes up to 1 billion in about 4 seconds on a fairly old machine. The machines they run this on in the cloud are somewhat faster, with more ram.

The primes to test are in the R array.

This is the meat of it. How to iterate over the R array in a meaningful way. Some folks will want to use list comprehensions and other CS constructs. These sometimes occlude intent and remove clarity. What I want to do is to loop over the elements of R.

So why not … I dunno … do that?
# now only have R items to inspect
for num ∈ R
  test = megaprime(num);
  if test
    count = count + 1;
  end
end

Notice the notation. This is exactly how I want code to read.

I could have used a list comprehension with
count = count + megaprime(t) for t ∈ R

if I changed megaprime to return 0 or 1. That is possible, but I thought this explicit loop would be easier to comprehend for the programmer (me).
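If you do want the loop-free flavor, here is one way it could look on a current Julia, as a rough sketch rather than what I submitted. To keep it self-contained it swaps in a small trial-division primality test (isprime_td, an assumption made here so that no package is needed), which is far slower than a real sieve over a large range. All three function names are invented for this example. Base's count(predicate, collection) does the tallying, so megaprime can keep returning a Bool.

```julia
# Assumption: trial division stands in for a proper prime sieve, purely to
# avoid depending on a package. Fine for small ranges, slow for big ones.
function isprime_td(n::Integer)
    n < 2 && return false
    n < 4 && return true          # 2 and 3
    n % 2 == 0 && return false
    i = 3
    while i * i <= n
        n % i == 0 && return false
        i += 2
    end
    return true
end

# Cheap digit check first, primality test second.
megaprime_td(n::Integer) =
    n >= 2 && all(d -> d in (2, 3, 5, 7), digits(n)) && isprime_td(n)

# Count megaprimes between the two bounds, in whichever order they arrive.
count_megaprimes(a::Integer, b::Integer) = count(megaprime_td, min(a, b):max(a, b))

println(count_megaprimes(1, 100))   # 2, 3, 5, 7, 23, 37, 53, 73 -> 8
```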

Yes, some will scream syntactic sugar. Some will complain bitterly over the removal of the test from the if then construct. That was due to a crash in the language I observed; simple memoization worked around that issue.

Ok. So I did this. It passed the run test on my machine.

landman@lightning:~/play/hr/mp$ /usr/bin/time ./mp.jl < inp
0.79user 0.70system 0:00.74elapsed 202%CPU (0avgtext+0avgdata 174512maxresident)k
0inputs+0outputs (0major+10105minor)pagefaults 0swaps

Remember, Julia is a compiled language, so this time includes startup/compilation time as well as execution time.

So I pressed the test button.

And it failed.

Well, more precisely, they indicated it failed.

“Timed out” was the error.

So I tried the “Run” button, so I could see if I could get some of the other tests run, download the inputs and outputs, and do a run to check.

landman@lightning:~/play/hr/mp$ cat inp2 out2
230711883 401853350
landman@lightning:~/play/hr/mp$ /usr/bin/time ./mp.jl < inp2
9.67user 0.66system 0:09.66elapsed 106%CPU (0avgtext+0avgdata 301256maxresident)k
0inputs+0outputs (0major+12135minor)pagefaults 0swaps

Works on a much larger problem, again, correctly.

But I get the timeout and … now … “too much output” message.

I am guessing that they have an old version of julia installed. 0.5.2 is current stable, with 0.6-RC series up. Hopefully they will update soon.

This said, my point about all of this is that Julia is a joy to use. No silly indentation, useful and expressive functions, rich and growing module ecosystem, multiple mechanisms to do things, compiled … did I mention, compiled?


The birthday problem (allocation collisions) for networks and MAC addresses

The birthday problem is fairly simple to state. There is at least a 50% probability (i.e. better than an even chance) that at least 2 of 23 randomly chosen people in a room have the same birthday. This comes from some elementary applications of probability, and is documented on Wikipedia.

While we care less about networks celebrating their annual journey around Sol, we care more about potential address collisions for statically assigned IP addresses. And, curiously, MAC addresses.

Ok, first some Julia code:

function P(N::Integer,M::Integer)
  prod = 1.0
  for i in 0:N-1
    prod = prod * float(M-i)/float(M)
  end
  return prod
end

Now cut and paste this into your REPL (julia command line, I’ll wait).

Now, we can replicate the P(A’) and P(A) calculations on the wikipedia page trivially.

julia> P(23,365)   # P(A')

julia> 1-P(23,365) # P(A)

And our mechanism is generic enough, that we can apply this to Classful networks with IPv4, and other things.

The equivalent calculations for an IPv4 class C network (/24), with 1 gateway, 1 broadcast address, so 254 usable addresses …

julia> P(20,254)    # P(A')

julia> 1-P(20,254)  # P(A)

That is, just 20 defined (random) addresses in the class C are enough to get a 53.6% probability that there will be at least one address collision.

You might think, “hey, this is for statically defined addresses, so who cares, we do DHCP everywhere”. Doesn’t matter though, as the DHCP server has to allocate an address, preferably an unused address. It has to test that address to see if it already exists (defensive coding … should really be done) in case another system allocated this address to itself.

Of course, you might say “Darn it, we’ll just use class B’s everywhere.” So let’s look at that again, remembering that a class B is 256*256 – 2 = 65536 – 2 = 65534 addresses. I mean, this is more than enough … right?

julia> P(302,65534)    # P(A')

julia> 1-P(302,65534)  # P(A)

Yeah … you only need 302 allocated addresses out of the pool of 65534 to get at least a 50% probability of a collision on allocation.

What about a class A? I mean really … we shouldn’t get collisions … we have sooo many addresses …

First, number of addresses

julia> 256^3 -2

so …

julia> P(4823,16777214)     # P(A')

julia> 1-P(4823,16777214)   # P(A)

Erp … just 4823 addresses out of 16M. In fact, if you look at it closely, you’ll realize that the crossover point goes as about the square root of the number of slots (IP address in this case).
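If you would rather not hunt for the crossover by hand, a small loop will find it. This is a sketch under the same assumptions as above (uniformly random picks, no holes in the space); the function name crossover is made up here. It multiplies in one factor at a time and stops at the first count where the "no collision" probability drops to 0.5 or below.

```julia
# Smallest number of random picks from M slots for which the probability of
# at least one collision reaches 50%. Same product as P(N,M), built term by term.
function crossover(M::Integer)
    p_unique = 1.0        # probability that all picks so far are distinct
    n = 0
    while p_unique > 0.5
        p_unique *= (M - n) / M
        n += 1
    end
    return n
end

for M in (365, 254, 65534, 16777214)
    println(M, " slots: ", crossover(M), " picks, sqrt(M) = ", round(sqrt(M), digits=1))
end
```

The counts it reports for these four spaces are 23, 20, 302 and 4823, each a bit above the plain square root of the number of slots.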

This of course, assumes that the space isn’t sparse, that all addresses are accessible. If the space is sparse, so that large chunks are never allocated or used for whatever reason, you wind up with an interesting problem. You reduce the size of the space by the size of the holes. Which reduces the number after which you will get a collision.


So why on earth would I spend time tackling something that is ostensibly a solved problem?

Well, there is another problem related to these, that people building out large data centers have likely seen.

MAC address space collisions.

Technically, MAC48 is obsolete. In practice, it is still very much in use. So we have to live with consequences of this. There are

julia> 256^6 

addresses in this space. Which, if we use our preceding rough approximation of the square root of that number, we’d get about 16M unique random MAC addresses before we hit a collision.


julia> P(19753662,281474976710656)     # P(A')

julia> 1-P(19753662,281474976710656)   # P(A)

Imagine, if you will, a network with 1M machines, each with 32 VMs and unique MAC addresses per VM. You will have greater than a 50% chance of a MAC48 collision on this.

This analysis assumes (of course) that the space isn’t sparse. It’s not like there is a vendor field … er … no … wait. There is.

First 3 octets in the address are the vendor OUI. Second 3 are the address. So technically, your pool is really out of 16777216 per vendor.

So … if this is a popular unit, say a LOM NIC, and the vendor wraps the MAC addresses, rather than requesting/paying for allocating another OUI segment …

You could get a collision. Similarly a very large number of VMs on many machines in the DC … you could get a collision.

The analysis gone through here is fairly naive, but it shows that such collisions are anything but improbable. Might be worth thinking about in the era of IoT. IPv6 could help with 3.4 x 10^38 addresses … we’d need order of magnitude 10^19 addresses to see higher probabilities of collisions … This might seem like a long way off, but … imagine if you are working on an allocator for these addresses, and you have a 50% or more probability of generating a collision. Which suggests that you are going to have to rethink how you allocate addresses, what state you are going to preserve (do you really want to store even a small fraction of what you allocated???), etc.
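For spaces this large, looping over every term is not practical, so the usual closed-form estimate is handy: the 50% point sits near sqrt(2 M ln 2), or about 1.18 times the square root of the number of slots. The snippet below is a back-of-the-envelope sketch of that approximation (birthday_estimate is a name invented here), again assuming uniformly random allocation over a space with no holes.

```julia
# Approximate number of random picks from M slots that gives a ~50% chance
# of at least one collision: n ≈ sqrt(2 * M * ln 2).
birthday_estimate(M::Real) = sqrt(2 * M * log(2))

println(birthday_estimate(2.0^48))    # MAC48: a bit under 20 million
println(birthday_estimate(2.0^128))   # IPv6: on the order of 10^19
```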

Someone once said something about N kilobytes being enough for any machine, or maximum of M computers as the total world market. Both scenarios had nothing to do with collisions, but were based upon some (wildly) incorrect assumptions about the future. While IPv6 looks like it will push out the pain for a while, there are other aspects to this that might need some looking at now, and planning for a future where we have the ability to easily extend the range as needed.

I mean, we have 10^19 addresses before we have to worry about higher probability of collisions … what could go wrong?


Now for your bidding pleasure, the contents of one company

This is an on-going process I won’t comment on, other than to provide a link to the bidding site.

There are numerous cool items in there.

  • Lot 2-57207: a 64 bay siFlash/Cadence machine with 64x 400GB SAS SSDs.
    Fully operational, SSDs very lightly used, extraordinarily fast unit.
  • Lot 2-57215: 2 mac minis (one was my desktop unit)
  • Lot 2-57216: My old Macbook pro, 750 GB SSD, 16 GB ram, NVidia gfx
  • Lot 2-57081: Mac pro tower unit
  • Lot 2-57232: a bunch of awesome monitors
  • Lot 2-57222: Mini 24U rack with PDUs
  • Lot 2-57015: Supermicro Twin 2U system (5 others just like it)
  • Lot 2-57100: a 40 core 256GB testbed machine

And many other computer systems, parts, etc. Full Unison, JackRabbit, siFlash units (bid what you want for them). Multiple laptops. Many chassis with backplanes. These are the systems that destroyed old records and set very hard to beat new ones. Available now.

Literally the physical asset contents of a business. If you don’t see something there you want, just ask, happy to see if they have bundled things together that maybe should be separate.

If you want to talk about purchasing IP assets, they are also for sale, but not at repocast. Reach out to me directly at joe _dot_ landman at the google mail _dot_ com address.

While this does make me sad, to see this pathway in process, it is a necessary step along the way. I’ve moved on. Hopefully these assets can help someone else with their needs.


One door has closed, another has opened

As I had written previously, my old company, Scalable Informatics, has closed. Read that posting to see why and how, but as with all things … we must move forward.

It is cliché to use the title phrase. But it is also true. We know the door that closed. It’s the door that has opened afterwards that I am focusing upon.

I have joined Joyent to work on, as it turns out, many similar things to what I did at Scalable. Building and supporting awesome hardware, helping provide operational capability to end users for the platform, working with partners and vendors to bring value to the Joyent public cloud, and many other things.

Due to the nature of what I’ll be working on, there is a very strong HPC component to it, though I don’t think the team is using the HPC word quite yet … that is fine. I’ve not left the market, I’ve not redefined the market so I wouldn’t leave it.

Joyent, for those who don’t know, have a set of very interesting product offerings, and a tremendous team around them. As a company, Joyent was a startup, recently acquired by Samsung Mobile. This provides part of what I craved, stability.

As I had noted in my previous posts, I was getting tired of running as hard as possible and then running head first into challenges that had nothing whatsoever to do with technical merit, deal economics, etc. but had everything to do with … well … useless and unreasonable measures of perception and stability.

I don’t have those issues here.

Moreover, I have a strong sense of where I can contribute to the success of the team, the mission objectives, etc.

From a technical standpoint, the approach to building a data center as a container engine/platform is absolutely brilliant. Your deployment winds up being trivial. Your performance, since you are running on bare metal, is tremendous.

That is one of the things that really drew me to Joyent years ago. There’s a right way to do containers/VMs and ways that aren’t right. Joyent does these right.

No system is perfect, all have challenges of one sort or the other. I am well aware of the challenges for the Triton platform, but I am simultaneously, far less concerned about them, as compared to other platforms.

In the lead up to Joyent, I spoke to many other groups about positions, and I’ll have a writeup on that later (without naming names, just talking about the process and some of the odd things that came up).

For an old HPC hand like me, I was at least a little amused by some of the other groups’ insistence that you didn’t need feature X for HPC and HPC like things … when the HPC world had pretty clearly spoken over the last decade in demanding feature X.

The position boiled down to “well, we don’t have it, so no one needs it.” Which, if you look at this, was marketing drivel being repeated by engineering resources.

Contrast this to Joyent, who detailed for me “we are here” and “this is where we need to be”, with a very well understood path between the two locations. There’s no “we know better”. There was quite a bit of “what do we have to do to get better, and be better.”

This is what I like. The team is phenomenal. Really. The tech … I’ve gushed on about it in the past in these pages … but it’s very good. There are a few tech challenges, but the cool thing is that I get to help make those less of a concern by helping to find solutions to them. Which I’ve done once before for this platform. And now I have a wider remit, and don’t have to worry myself with making my monthly numbers to pay the team.

You see why I like this yet?

Add to this, some of the needs from the parent company. Just like in the original Jurassic Park

… I know this stuff. I lived and breathed the stuff they need for most of my professional life.

And did I mention the awesome team? Great products? Incredible market and market ops?

It took a friend to reconnect me with the team. What was supposed to be a 30 minute call wound up taking all afternoon. I was positively giddy after it.

And I am now onboard. Running as hard as I can to catch up and not slow them down. Hopefully to help them accelerate, even harder and faster than before.

This is a good thing, stay tuned!

Note: I won’t be able to blog much about internal things (as I had hinted at, at Scalable). But I will likely drop hints on things that I am allowed to talk about, or that people might wish to pay attention to.

And remember that I’ve not left HPC, nor the market behind. This role encompasses everything I’ve done before … plus quite a bit more stuff.

This is good. Very good!


Hard disk shipments dropped 10% QoQ, 2% YoY

This jibes very well with what I’ve observed: decreasing demand for enterprise storage hard disks, or as I call them “Spinning Rust Drives” (or SRD), as compared with SSDs (Solid State Drives).

The summary is here with a key quote being

3.5″/2.5″ Enterprise HDDs: For a second consecutive quarter, total enterprise HDD sales declined. Both performance enterprise and nearline HDD volumes were down for the quarter, as total unit shipments fell to approximately 16-17 million.

Again, jibes well with what I’ve observed.

Mellanox has a good take on its blog, noting that

Flash keeps taking over

Every year, for the past four years, has been “The Year Flash Takes Over” and every year flash owns a growing minority of storage capacity and spend, but it’s still in the minority. 2017 is not the year flash surpasses disk in spending or capacity - there’s simply not enough NAND fab capacity yet, but it is the year all-flash arrays go mainstream. SSDs are now growing in capacity faster than HDDs (15TB SSD recently announced) and every storage vendor offers an all-flash flavor. New forms of 3D NAND are lowering price/TB on one side to compete with high capacity disks while persistent memory technologies like 3D-XPoint (while not actually built on NAND flash) are increasing SSD performance even further above that of disk. HDDs will still dominate low price, high-capacity storage for some years, but are rapidly becoming a niche technology.

This is a critical point. While SRD are dropping in volume, there is not enough SSD fab capacity to supply the market demand. Which, curiously, means that economics 101 is strong and we see a healthy supply/demand curve in play. Moreover, entry into this market (e.g. building fabs for NAND) is prohibitively expensive. This means that the SSD supply will be constrained for the foreseeable future.

This also opens the way up to other mass producible technologies. Given the contenders to replace NAND for SSD, I’d argue that the combination of the least expensive and most easily licensed/reproduced system (which might be NAND) will be the high growth bit for a while.

Apart from that fab entry cost.

Without reading the tea leaves too hard, it is pretty clear that SRD are headed in the direction of archives, and colder storage. Tape is still around for now, though I am not sure if anyone will be looking at it in a 2-5 year time frame … serialized storage technologies that don’t do reasonable jobs on seeky loads might not fare well.

Possibly SSD/Tape hybrids, with the SSD providing not really a cache per se, but a fractional capacity tier, and many parallel tape drives providing some semblance of many heads on a disk … but the issue then is that seeking to new “sectors” takes 10^2 to 10^3 seconds, as compared to a hard disk, where seeking is around 10^-2 seconds. That’s 4 to 5 orders of magnitude difference, and for active archives, I can’t imagine a potentially non-serial workload being deployed on such a device.

Even backups … you have the storage bandwidth wall (SBW) problem, where you have a small pipe bandwidth relative to your capacity. SBW measures time to move an amount of data.

$SBW = Capacity / Bandwidth$

1PB of data, at 100 MB/s (about the real writing rate of tapes) is about 10^7 seconds, or 1/3 of a year. Do you really want that for your read/write cycle? Even 10 units running in parallel for 1GB/s is 1/30 of a year, or about 12 days. And your data keeps growing.
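The arithmetic is easy to replay for your own capacity and pipe size. A minimal sketch, assuming decimal units (1 PB = 10^15 bytes, 1 MB/s = 10^6 bytes/s) and perfectly sustained bandwidth, which real tape will not give you; the function and constant names are just for illustration.

```julia
# Storage bandwidth wall: time to move `capacity` bytes at `bandwidth` bytes/s.
sbw_seconds(capacity, bandwidth) = capacity / bandwidth
sbw_days(capacity, bandwidth) = sbw_seconds(capacity, bandwidth) / 86_400

PB = 1e15       # bytes, decimal petabyte (assumed convention)
MBps = 1e6      # bytes per second, decimal megabyte per second

println(sbw_seconds(1 * PB, 100 * MBps))      # one drive at 100 MB/s, in seconds
println(sbw_days(1 * PB, 100 * MBps))         # the same, in days
println(sbw_days(1 * PB, 10 * 100 * MBps))    # ten drives in parallel, in days
```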

No, tape has a limited lifetime going forward IMO. Pipe size (the B) is one of the major issues, seek rate being the other. Low cost doesn’t matter if you can’t get your data off of it fast enough when you need to.

What I think I see going forward is people basically “sloshing” their data between storage systems, with HDD playing a large role going forward. Larger denser HDD could make integrating archive and backup fairly simple into an all flash storage system. Not as tiers (data motion is your enemy, as it reduces the usable value of Bandwidth in the SBW equation).

So I expect a long tail on the HDD. They won’t go away any time soon. Well, not all of them. 10k and 15k RPM drives are probably done for.


Selling #HPC things on ebay

Given that the (now former) day job has ended, I am selling some of the old day job’s assets on ebay. We’ve sold some siFlash, Unison, and have current listings for Arista and Mellanox switches. More stuff will be listed in short order, check it out here. Feel free to reach out to me at joe.landman at the google mail thingy if you want to talk about any of these things, or buy before I list them. Literally everything must go, no reasonable offers rejected (on ebay or via email).


I always love these breathless stories of great speed, and how VCs love them …

Though, when I look at the “great speed”, it is often on par with or less than Scalable Informatics sustained years before.

From 2013 SC13 show, on the show floor, after blasting through a POC at unheard of speed, and setting long standing records in the STAC-M3 benchmarks …

Article in question is in the Register. Some of the speeds and feeds:

  • 200 microsecs latency
  • 45GBps read bandwidth
  • 15GBps write bandwidth
  • 7 million IOPS

But then … a fibre connection. And … it’s an array. Not an appliance. So deduct points.

Respectable read bandwidth, though chances are they are doing this by reading compressed data and then counting the uncompressed data as what was read, leaving out the decompression step. Write perf is low, and should be higher. I would need more data on the IOPs to say one way or the other: how did they measure, etc.

FWIW, in 2012/2013, Scalable Informatics sustained 30+ GB/s read bandwidth on our siFlash unit for 128 threads of IO, and about 3M IOPs for 128 threads of random 8k reads. In 2015, we hit 24GB/s and 5M IOPs on v1 of Forte. v2 of Forte never saw the light of day because we ran out of money. Specs on that (estimated) are 50GB/s 10M IOPs in a 2U container.

And Scalable never got any VC love or cash. A shame, because we set very long standing records in a number of areas, that others are, to some degree, still catching up to, years later.

I’m reminded of a little bit of revisionist history put out by IBM at the time, that storage blogger Robin Harris recalled in his great StorageMojo blog.

Here is a paraphrase

They Say (random VC backed “performance” storage stealthy startup) Entry Into high performance storage Will Legitimize The Market.

The Bastards Say, Welcome.


pcilist: because sometimes you really, really need to know how your PCIe devices are configured

If you don’t know what I am talking about here, that’s fine. I’ll assume you don’t do hardware, or you call someone else when there is a hardware problem.

If you think “well gee, don’t we have lspci? so why do we need this?” then you probably have not really tried to use lspci to find this information, or didn’t know it was available.

Ok … what I am talking about.

When a PCIe bus comes up, the connections negotiate with the PCIe hub. They negotiate width (e.g. how many lanes of PCIe they will consume), speed (e.g. signalling speed in terms of GT/s), interrupts, etc. The PCIe hub presents this information to the OS, though in some cases, OSes like Linux choose to enumerate/walk the PCIe tree themselves … because … you know … BIOS bugs.

Ok. So these devices all autonegotiate as part of their initialization. Every now and then, you get a system where a card autonegotiates speeds or widths lower than expected. The driver generally provides the information on what it is capable of, and the PCIe hub, or OS structure, tells you the actual state.

Why is this important … you might rightly ask ?

So you have two machines. They connect over a simple network. The network speed is lower than the PCIe speed when the unit is operating at full capability. One simple estimate of the maximum possible speed of a PCIe system is to take the number of GT/s and multiply it by the width. Divide that by 10. That is your approximate bandwidth in GB/s.
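The rule of thumb above is easy to write down as a tiny Python sketch. The function name and the sample link configurations are mine, chosen only for illustration. The divide-by-10 lines up with the 8b/10b encoding used at 2.5 and 5 GT/s; 8 GT/s links use 128b/130b encoding, so for those this estimate is on the conservative side.

```python
def approx_pcie_bandwidth_gbytes(gt_per_s: float, lanes: int) -> float:
    """Rule-of-thumb ceiling in GB/s: (GT/s * lane count) / 10."""
    return gt_per_s * lanes / 10.0


# A few illustrative link configurations (speed in GT/s, width in lanes).
for gt_per_s, lanes in [(2.5, 1), (5.0, 4), (8.0, 8), (8.0, 16)]:
    estimate = approx_pcie_bandwidth_gbytes(gt_per_s, lanes)
    print(f"{gt_per_s} GT/s x{lanes}: ~{estimate:.2f} GB/s")
```

This is a ceiling, not a prediction. Protocol overhead and the device itself will pull the real number lower.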

So your two machines have fast network cards (note this also works for HBAs … heck … everything … though … be careful about the power control systems, as they may mess with some of these things). You start using iperf to generate traffic between the two machines. And you see it is way below where you expect.

So, you start looking for why this is the case.

Latest drivers: check
Up to date kernel: check
Switch behaving well: check

Hmmm …. Something is amiss.

Then you try between other machines. Every other machine to the non-suspect machine is giving you reasonable numbers.

The suspect machine is giving you crappy numbers to/from it.

In the network scenario, you also see many errors/buffer overruns. Which means that the kernel can’t empty/fill buffers fast enough. Which suggests some odd speed issue.

Ok … where do you look, and what do you look for?

Pat yourself on the back if your hand shot up and you said, with confidence, ‘lspci’. Or parsing the /sys/… tree by hand. Either will work. Let’s focus on lspci for the moment.
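For completeness, here is a rough Python sketch of the /sys route. The post does not say which files it has in mind, so this is my assumption: it reads the max_link_speed, current_link_speed, max_link_width and current_link_width attributes that reasonably recent Linux kernels expose under /sys/bus/pci/devices/. Older kernels may not have them, and the function name is mine.

```python
import os

# Assumed sysfs attribute names (present on reasonably recent Linux kernels).
LINK_ATTRS = ("max_link_speed", "current_link_speed",
              "max_link_width", "current_link_width")


def read_link_state(devices_dir="/sys/bus/pci/devices"):
    """Return {pci_address: {attr: raw string}} for devices exposing all four attrs."""
    result = {}
    for addr in sorted(os.listdir(devices_dir)):
        values = {}
        for attr in LINK_ATTRS:
            try:
                with open(os.path.join(devices_dir, addr, attr)) as fh:
                    values[attr] = fh.read().strip()
            except OSError:
                break  # not a PCIe device, or attribute missing on this kernel
        else:
            # Only reached when all four attributes were read successfully.
            result[addr] = values
    return result


if __name__ == "__main__" and os.path.isdir("/sys/bus/pci/devices"):
    for addr, v in read_link_state().items():
        print(addr, v["current_link_width"], "of", v["max_link_width"], "lanes,",
              v["current_link_speed"], "of", v["max_link_speed"])
```

The values come back as raw strings, since the exact speed format varies between kernel versions.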

Ok, great. Now what information within lspci output do you want, and which options do you use?

They suggest -m or -mm for machine parseable output.

I am going to avoid those options. Try them, and see why for yourself.

You see, to get the juicy bits you need, you will need to give 3 v’s. -vvv . And to get a little more info, add a -kb to get driver and other info.

Now, look at that joyous output. Again, what info do you need?

Look at LnkCap: and LnkSta:

That’s what you need.
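To make the LnkCap/LnkSta comparison concrete, here is a minimal Python sketch that pulls those two lines out of `lspci -vvv` style text and flags devices whose negotiated link is below what they are capable of. This is not the pcilist tool from this post. The function names and the sample text are made up for illustration, and the regular expression assumes the usual "Speed NGT/s, Width xN" layout of those lines.

```python
import re

# Matches e.g. "LnkCap: Port #0, Speed 8GT/s, Width x8, ..." and
# "LnkSta: Speed 5GT/s (downgraded), Width x4 (ok), ..."
LINK_RE = re.compile(r"^\s*(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_links(lspci_text):
    """Return {pci_id: {"LnkCap": (speed, width), "LnkSta": (speed, width)}}."""
    devices = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device; its first field is the PCI id.
            current = line.split()[0]
            devices[current] = {}
            continue
        m = LINK_RE.match(line)
        if m and current is not None:
            devices[current][m.group(1)] = (float(m.group(2)), int(m.group(3)))
    return devices


def degraded(devices):
    """List PCI ids whose negotiated speed or width is below their capability."""
    out = []
    for pci_id, links in devices.items():
        cap, sta = links.get("LnkCap"), links.get("LnkSta")
        if cap and sta and (sta[0] < cap[0] or sta[1] < cap[1]):
            out.append(pci_id)
    return out


# Hypothetical sample text in the shape of `lspci -vvv` output.
SAMPLE = """\
03:00.0 Serial Attached SCSI controller: Example HBA
\tCapabilities: [68] Express (v2) Endpoint, MSI 00
\t\tLnkCap:\tPort #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns
\t\tLnkSta:\tSpeed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive-
05:00.0 Ethernet controller: Example NIC
\tCapabilities: [a0] Express (v2) Endpoint, MSI 00
\t\tLnkCap:\tPort #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
\t\tLnkSta:\tSpeed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive-
"""

print(degraded(parse_links(SAMPLE)))  # ['05:00.0']
```

Note that a flagged device is not automatically broken. Empty root ports report a width of 0, and some cards drop their link speed when idle, so treat the list as a place to start looking.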

Wouldn’t it be nice if this were output in a nice simple, tabular form … so you could … I dunno … see your problem right away?

Well, your long wait is over! For only $19.95, and a quick trip to, you too can grab all this info incredibly quickly. Don’t believe me? Well then, have a gander:

landman@leela:~/work/development/pcilist$ sudo ./ 
PCIid   MaxWidth ActWidth MaxSpeed ActSpeed     driver       description
00:00.0        4        0        5                           Intel Corporation Haswell-E DMI2 (rev 02)
00:01.0        8        0        8      2.5         pcieport Intel Corporation Haswell-E PCI Express Root Port 1 (rev 02) (prog-if 00 [Normal decode])
00:02.0        8        1        8      2.5         pcieport Intel Corporation Haswell-E PCI Express Root Port 2 (rev 02) (prog-if 00 [Normal decode])
00:02.2        8        8        8        8         pcieport Intel Corporation Haswell-E PCI Express Root Port 2 (rev 02) (prog-if 00 [Normal decode])
00:03.0        8        0        8      2.5         pcieport Intel Corporation Haswell-E PCI Express Root Port 3 (rev 02) (prog-if 00 [Normal decode])
00:03.2        8        8        8        5         pcieport Intel Corporation Haswell-E PCI Express Root Port 3 (rev 02) (prog-if 00 [Normal decode])
00:1c.0        1        0        5      2.5         pcieport Intel Corporation Wellsburg PCI Express Root Port #1 (rev d5) (prog-if 00 [Normal decode])
00:1c.4        4        1        5      2.5         pcieport Intel Corporation Wellsburg PCI Express Root Port #5 (rev d5) (prog-if 00 [Normal decode])
02:00.0        1        1      2.5      2.5    snd_hda_intel Creative Labs SB Recon3D (rev 01)
03:00.0        8        8        8        8          mpt3sas LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
05:00.0        8        8        5        5            ixgbe Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
05:00.1        8        8        5        5            ixgbe Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
07:00.0        1        1      2.5      2.5                  ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03) (prog-if 00 [Normal decode])
80:02.0        8        8        8        8         pcieport Intel Corporation Haswell-E PCI Express Root Port 2 (rev 02) (prog-if 00 [Normal decode])
80:03.0       16       16        8      2.5         pcieport Intel Corporation Haswell-E PCI Express Root Port 3 (rev 02) (prog-if 00 [Normal decode])
82:00.0       16       16        8      2.5           nvidia NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) (prog-if 00 [VGA controller])
82:00.1       16       16        8      2.5    snd_hda_intel NVIDIA Corporation Device 0fbc (rev a1)

Notice here how the NVidia card throttled down. When you start using it actively, it throttles up in speed.

But, if you have a nice 40GbE card, say an mlx4_en based card, and you see 5GT/s and x4 on the width, that gets you to about 2GB/s maximum. So you’ll see somewhat less than that on your network.

This is what I saw today. And I wanted to make it easy to spot going forward.
