New tool to help visualize /proc/interrupts and info in /proc/irq/$INT/

This is a start, not ready for release yet, but already useful as a diagnostic tool. I wanted to see how my IRQs were laid out, as this has been something of a persistent problem. I’ve built some intelligence into our tool, but I need a way to see where the system is investing most of its interrupts.

I omit (on purpose) IRQs that have been assigned but have generated no interrupts. I haven’t (as of yet) worked the driver bits backward to reach the module/kernel driver responsible. I do plan to do this.

This is on one of our engineering systems in the lab.

Is it me, or does IOAT look borked? As I said, this is for visualization. I’ll get the code up on GitHub tomorrow. It uses sparklines, but where those aren’t available, it will return actual counts.

[edit: the sparklines don’t show up well in text cut and paste …]

root@ucp-01:~# ./ 
  IRQ Node     Mask                      Driver Counts on CPU23 to CPU0
    0    0   ffffff                       timer ????????????????????????
    3    0   03f03f                      serial ????????????????????????
    8    0   03f03f                        rtc0 ????????????????????????
    9    0   03f03f                        acpi ????????????????????????
   18   -1   ffffff                  i801_smbus ????????????????????????
   31    1   000001                        eth1 ????????????????????????
   32    1   000002                 eth1-TxRx-0 ????????????????????????
   33    1   000004                 eth1-TxRx-1 ????????????????????????
   34    1   000008                 eth1-TxRx-2 ????????????????????????
   35    1   000010                 eth1-TxRx-3 ????????????????????????
   38    1   000002                 eth2-TxRx-0 ????????????????????????
   39    1   000004                 eth2-TxRx-1 ????????????????????????
   40    1   000008                 eth2-TxRx-2 ????????????????????????
   41    1   000010                 eth2-TxRx-3 ????????????????????????
   45    0   03f03f mlx4-async@pci:0000:01:00.0 ????????????????????????
   46    0   000001             IR-PCI-MSI-edge ????????????????????????
   69    0   000001                      arcmsr ????????????????????????
   75    0   000010                      arcmsr ????????????????????????
   79    1   000100                      arcmsr ????????????????????????
   83    1   001000                      arcmsr ????????????????????????
   88    0   03f03f                0000:00:1f.2 ????????????????????????
  100    1   000001                     aacraid ????????????????????????
  101    1   000002                     aacraid ????????????????????????
  102    1   000004                     aacraid ????????????????????????
  103    1   000008                     aacraid ????????????????????????
  104    1   000010                     aacraid ????????????????????????
  105    1   000020                     aacraid ????????????????????????
  106    1   000040                     aacraid ????????????????????????
  107    1   000080                     aacraid ????????????????????????
  109    0   03f03f                   ioat-msix ????????????????????????
  111    0   03f03f                   ioat-msix ????????????????????????
  112    0   03f03f                   ioat-msix ????????????????????????
  113    0   03f03f                   ioat-msix ????????????????????????
  114    0   03f03f                   ioat-msix ????????????????????????
  115    0   03f03f                   ioat-msix ????????????????????????
  116    0   03f03f                   ioat-msix ????????????????????????
  117    0   03f03f                   ioat-msix ????????????????????????
  119    1   fc0fc0                   ioat-msix ????????????????????????
  121    1   fc0fc0                   ioat-msix ????????????????????????
  122    1   fc0fc0                   ioat-msix ????????????????????????
  123    1   fc0fc0                   ioat-msix ????????????????????????
  124    1   fc0fc0                   ioat-msix ????????????????????????
  125    1   fc0fc0                   ioat-msix ????????????????????????
  126    1   fc0fc0                   ioat-msix ????????????????????????
  127    1   fc0fc0                   ioat-msix ????????????????????????
  128    0   03f03f               snd_hda_intel ????????????????????????
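For anyone who wants to play with the idea before the code is posted, here is a minimal Python sketch. It is not the tool that produced the output above. The function names (`parse_interrupts`, `sparkline`, `report`) are made up for illustration, it ignores the node and affinity mask info under /proc/irq/$INT/, and it takes the last token of each description as a crude stand-in for the driver name. Like the real tool, it skips IRQs with no interrupts, lists CPUs from highest to lowest, and can fall back to raw counts.

```python
SPARK = "▁▂▃▄▅▆▇█"

def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (per_cpu_counts, description)}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())              # header row: CPU0 CPU1 ...
    rows = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for tok in fields[:ncpu]:             # some rows (ERR, MIS) carry fewer counts
            if not tok.isdigit():
                break
            counts.append(int(tok))
        rows[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return rows

def sparkline(counts):
    """Scale counts against the largest value in the row and map them to block glyphs."""
    peak = max(counts, default=0)
    if peak == 0:
        return SPARK[0] * len(counts)
    return "".join(SPARK[c * (len(SPARK) - 1) // peak] for c in counts)

def report(text, use_sparklines=True):
    """One line per numbered IRQ that has fired at least once, highest CPU first."""
    out = []
    for irq, (counts, desc) in parse_interrupts(text).items():
        if not irq.isdigit() or sum(counts) == 0:
            continue
        name = desc.split()[-1] if desc else ""
        ordered = counts[::-1]
        cells = sparkline(ordered) if use_sparklines else " ".join(map(str, ordered))
        out.append(f"{irq:>5} {name:>27} {cells}")
    return out
```

On a Linux box, feeding it the contents of /proc/interrupts, e.g. `print("\n".join(report(open("/proc/interrupts").read())))`, should give a rough version of the table above.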


Not sufficiently caffeinated for technical work today

I just spent 30 minutes trying to figure out why the 32-bit q process would run on one machine, while the identical tree and config would fail with a “license expired” error on my desktop (development box).

Turns out one should check for an old license file in one’s home directory.


I think I need to send an RFE for a ‘--low-coffee-mode’ option.


Not a fan of device mapper in Linux

Yeah, I know. It brings all manner of capabilities with it. It’s just the cost of these capabilities, when combined with other tools like, say, Docker, that makes me not want to use it.

To wit:

root@ucp-01:~# ls -alF /var/lib/docker/devicemapper/devicemapper/
total 52508
drwx------ 2 root root           80 Jan 29 22:38 ./
drwx------ 4 root root           80 Jan 29 22:38 ../
-rw------- 1 root root 107374182400 Jan 29 22:39 data
-rw------- 1 root root   2147483648 Jan 29 22:39 metadata

root@ucp-01:~# ls -halF /var/lib/docker/devicemapper/devicemapper/
total 52M
drwx------ 2 root root   80 Jan 29 22:38 ./
drwx------ 4 root root   80 Jan 29 22:38 ../
-rw------- 1 root root 100G Jan 29 22:39 data
-rw------- 1 root root 2.0G Jan 29 22:39 metadata

root@ucp-01:~# ls -salF /var/lib/docker/devicemapper/devicemapper/
total 52508
    0 drwx------ 2 root root           80 Jan 29 22:38 ./
    0 drwx------ 4 root root           80 Jan 29 22:38 ../
51820 -rw------- 1 root root 107374182400 Jan 29 22:39 data
  688 -rw------- 1 root root   2147483648 Jan 29 22:39 metadata

Sure, it takes up only 52MB of actual space. But it’s done a nice sparse allocation of 100GB for data and 2GB for metadata.

I am working on switching this off of device mapper (it has only given me grief in other contexts), and onto overlayfs as a better choice. It’s needed as this is one of our ramboot file systems, and I really don’t want to see that sparse file get filled.
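The gap between what `ls -l` and `ls -s` report above is the gap between a file’s apparent size and the blocks actually allocated to it. Here is a small Python sketch of the same check, assuming Linux (where `st_blocks` is counted in 512-byte units). The helper names `sizes`, `make_sparse`, and `demo` are made up for illustration.

```python
import os
import tempfile

def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # On Linux, st_blocks counts 512-byte units regardless of the filesystem block size.
    return st.st_size, st.st_blocks * 512

def make_sparse(path, size):
    """Create a file with the given apparent size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)

def demo(size=100 * 1024 * 1024):
    """Create a sparse file in a scratch directory and return its two sizes."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "data")
        make_sparse(path, size)
        return sizes(path)
```

Pointing `sizes()` at the `data` file in the listing above should give an apparent size of 107374182400 bytes against an allocation in the neighborhood of the 51820 blocks that `ls -s` shows. The danger is that the second number is free to grow toward the first.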


Radio Free HPC is (as usual) worth a listen

Good wrap-up of last year’s trends this week on the InsideHPC Radio Free HPC podcast.

We get a small mention around 10:50 or so. That’s not why it’s an especially good listen. The team arrived at many of the same conclusions we did last year, which is why we brought out Forte, and we have some additional products planned in that line for later on in the year.

Basically NVM and variants, NVMe, etc. are massively faster in both streaming and IOPS than traditional SSDs, in that they remove a whole set of protocol layers (SAS/SATA) that serve only to slow them down. Moreover, many of the SSD OEMs themselves are moving rapidly towards NVMe as their path forward. I had thought (this time last year) that 12g SAS would be the preferred modality with NVMe being interesting as a possible future tech.

But my mind was changed last year, as I saw far more effort in the NVM space than I did elsewhere. I saw the performance possibilities, and realized how much things were going to change. Even more than this, I had a number of discussions that I can’t repeat, with a number of large vendors, that convinced me that the direction we needed to focus on was NVMe.

NVMe changes many things. For one, the concept of RAID may be significantly less relevant here. This is not because NVM devices won’t fail; they most certainly will. But you’d be hard pressed to do RAID calcs at the rates you need to use these units in a “standard” RAID5 or RAID6. Many people are thinking of them in terms of being a memory tier, and that is fine, but they are devices with a limited number of writes, so you need to be quite careful with that mental model.

This will take years to fully sort out, and we have a number of cool ideas and some tech coming out to augment Forte and the like.

The team spent the first 12-13 minutes talking about NVM and related. As Henry and Dan noted, it’s something of a sleeper story, but we are putting it front and center with the Forte line. NVMe is here, it’s loud, it’s proud, and it’s gonna rock applications’ worlds.


When infinite resources aren’t, and why software assumes they are infinite

We’ve got customers with very large resource machines. And software that sees all those resources and goes “gimme!!!!”.

So people run.

And then more people use it. And more runs.

Until the resources are exhausted. And hilarity (of the bad kind) ensues.

These are firedrills. I get an open ticket that “there must be something wrong with the hardware”, when I see all the messages in console logs being pulled in from ICL saying “zOMG I am out of ram …. allocations of all sorts failing …. must exterminate processes!”.

Sorry, that last bit is because my daughter had me watching Dr. Who with her recently, and that nasty “exterminate” keeps running back into my head. Seriously, we need to instrument OOM killer in the kernel to send that to the audio port when it shoots something.

Ok, you might say, why not set up swap? I mean that’s what it is for … right?

Swap is a bandaid, and a BAAAD thing to do to a good machine with a large amount of RAM to begin with.

Machines aren’t infinitely elastic, they don’t have infinite resources. Many application codes seem to treat machines as if they are the only thing running, or the only instance running on the machine. Take a large enough machine, with many users, and this goes from slightly wrong to complete hogwash.

So I am looking to use a variety of technological measures to impose discipline upon the applications themselves. Hopefully without impacting performance.

A job queuing system with a strong interactive component probably makes a great deal of sense right now, but I think I need to talk with the team using this … as many of them might not like that concept (it’s interactive, right? So why do we need to submit jobs?). This is why I am looking at whether or not I can contain the problem with containers, or see if I need to go full-on VMs. For computationally heavy jobs, the VMs might be better … simpler failure domain. For more cooperative/smaller jobs, the containers might be better.

I know the solutions that have existed for decades in HPC circles, I’ve used most of them, configured/deployed/supported most of them for the last 20+ years.

What amuses me is the cyclical nature of these sorts of problems. Same problem, different type of domain. Not pure HPC any more, but big data analytics.


“Unexpected” cloud storage retrieval charges, or “RTFM”

An article appeared on HN this morning. In it, the author noted that all was not well with the universe, as their backup, using Amazon’s Glacier product, wound up being quite expensive for a small backup/restore.

The OP discovered some of the issues with Glacier when they began the restore (not commenting on performance, merely the costing). Basically, to lure you in, they provide very low up front costs. That is, until you try to pull the data back for some reason.

Then it starts getting expensive.

There were many comments about this: that his use case wasn’t the target use case; that his example was a poor one, as he didn’t RTFM (or the fine print, in this case) and thought “gee, $0.05 USD/GB storage”; and that the pricing algorithm is convoluted/painful.

There may be some truth to some aspects of these. The target market one is interesting, as is the pricing. We’ve had many customers talk to us about doing similar things in the cloud, and we’ve asked them what they would be willing to pay to recover their data. I wish I could capture the shocked expressions on their faces when we mention that. Pay to recover it? But it’s “$X USD/GB per month”.

No, no it isn’t.

And the DR/backup use case? Nope, not even close. Wrong tool. But people don’t pay attention to that. They pay attention to the “$X USD/GB per month” and figure they will adapt their use case to this.

So now, let’s have you recover 100TB of data because a data center went “boom”. How long will this take, and more importantly, how much will it cost? Well, $0.011 USD/GB for retrieval. So 10^5 GB x 0.011 USD/GB = $1.1 x 10^3 USD. Oh, and then there are the network fees atop this.
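The arithmetic above is easy to rerun for your own numbers. A minimal Python sketch follows. The retrieval rate is the one quoted above. The link speed used in the usage note is purely an assumed figure for illustration. The function names are made up, and the time estimate ignores any provider-side retrieval delay and the network fees.

```python
from decimal import Decimal

def retrieval_cost_usd(size_gb, usd_per_gb):
    """Retrieval fee only; network fees come on top of this.

    Pass the rate as a string (e.g. "0.011") so Decimal keeps it exact.
    """
    return Decimal(size_gb) * Decimal(usd_per_gb)

def transfer_days(size_gb, link_gbit_per_s):
    """Days needed to move size_gb (decimal GB) over a fully used link."""
    seconds = size_gb * 8 / link_gbit_per_s   # GB -> gigabits, then divide by rate
    return seconds / 86400
```

`retrieval_cost_usd(100_000, "0.011")` comes out at 1100 USD, matching the figure above. With an assumed dedicated 1 Gb/s link, `transfer_days(100_000, 1)` is a bit over 9 days, which is a long time to be without your data after a “boom”.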

My point should be fairly obvious. The “low low prices” are for very specific use cases, designed specifically to pull you in, and make it expensive for you to leave.

For the various benefits of cloud computing to be as useful and utilitarian as possible, you need the ability to roam between providers of capacity (commoditized) computing and storage.

Despite many protestations to the contrary, not only do you not have that today, but you are locked in, more firmly than in the past, with these systems.

So if you are looking at derisking, you have to contend not only with a massively larger attack surface, but also with possibly non-deterministic costs. This is superior … how?

Basically cloud is about using someone else’s resources, and paying them for the privilege, so you can reduce your capex, and load up on opex, which you should be able to scale up and down as you need. That is the theory. The issue comes when you need to alter the workflow to adapt to an issue … any issue … and you suddenly discover that the opex can be very … very large.

Balancing between these is going to be the game for many folks going forward. If you don’t have infrastructure in house, just like outsourcing other things, you are now far more dependent upon your supply lines. If you have a widely variable business demand, with a nearly constant data bolus, yeah, cloud is likely to be the most cost efficient, even with these other issues. Other use cases … not so much.


Container jutsu

Linux containers are all the rage, with Docker, rkt, lxd, etc. all in market to various degrees. You have companies like Docker, CoreOS, and Rancher all vying for mindshare, not to mention some of the plumbing bits by Google and many others.

I don’t think they are a fad, there is much that is good with containers, when they are done right.

To see how they are done right, have a good hard long look at SmartOS. There are many good things here, and a few bad things. The good things are a good underlying file system that is (to a degree) built with zones in mind, a high level integrated zone concept. The bad things are, well, drivers. And a somewhat incompatible user space relative to Linux.

Now look at Docker et al. These are control planes (really, that is it) for a set of technologies that need to play well together. This isn’t a bad thing, but building control planes, correctly, is hard enough as it is, without adding odd corner cases, strange bugs, and some good old-fashioned distro snobbishness.

I’ve been working on our v2 SIOS for a while now, trying to get a better integration of Docker and other things in there. We may eventually move to rkt for our work.

What I really like … as in really REALLY like, is RancherOS. It is almost SmartOS on Linux. Almost. So close. It’s a bit early, and it has some rough edges, but we are keeping our eye on it, as we should be able to use it once a few more things mature. What’s really cool about RancherOS is that the whole OS is running as a docker (global zone in the SmartOS context). You light up services in containers (again, like zones in SmartOS). The system boots up very fast, and starts its services, with similar mechanisms to SmartOS for state preservation. There are a few minor things, and we’ll be working with it in the lab from time to time to see if we can easily adapt everything to it.

For the moment though, we are sticking with the Debian 8.x base. RancherOS is actually Ubuntu 14.04 reworked, so it’s not that different.

The Debian base lets me add some things into my image that I can’t easily make work just yet with RancherOS. But I also run into some … er … fun things. Like this bug.

This said, we’ve got some things we are planning on for SIOS v2 going forward. Enhancing our container/VM support is very high on the list.


Hard filtering of calls

I find that, over time, my cell phone number has propagated out to spammers/scammers who want to call me up to sell me something. The US national do-not-call registry hasn’t helped. The complaints I’ve filed haven’t helped.

So I filter. My filtering algo looks like this:

if (number_is_known_person_or_org(phone_number)) {
    // known caller: pick up
}
else if (number_is_unknown(phone_number)) {
    filter_stage_2(phone_number);
}

function filter_stage_2(phone_number) {
    // I ignore 80% of numbers I don't know, let them go to
    // voicemail.  Voicemail is an excellent filter, if you
    // don't leave a VM, then it's obviously not important enough
    // for me to pay attention.
    if (roll_the_dice(bias_to_ignore=0.8)=="answer") {
        answer_phone(phone_number);
    }
    // otherwise: ignore it, and it goes to voicemail
}

function answer_phone(phone_number) {
    // You have 15 seconds to state your name, your
    // affiliation, and what this call is about.  Failure
    // to do this in 15 seconds, and I will hang up.
    // This means that all those calls for "the owner" or
    // other such BS, if they get answered, are hung up on
    if (the_other_party_explains_why_I_should_spend_my_precious_time()) {
        // keep talking
    }
    // otherwise: hang up
}

The 80% figure used to be 30%, then 50%. It’s been 80% for a while. In short order it is going to 100%. If it’s important enough to call, you can leave a message. I will get it, and if it’s important to me, I will respond.

Sadly, Android doesn’t quite have a “block calls from this number” in its basic phone app (gee … seems pretty obvious that most people want this …).

So far this year, both of the calls that I answered came from the same person, with the same script, and I patiently explained that they had to tell me who they were and what they were calling about. They refused, saying only “this call is recorded”, and giving me the name of an entity that shouldn’t be calling me.

It used to be “charities”, then it was scam artists, then it was others.

What is going on is convincing me that I do not need a public cell phone number, and that I should give out my voicemail to all who request my number.

About 3 years ago, we changed our home phone over to VOIP, and I turned off the ringer. This was a few months before the last election cycle, and suddenly we weren’t wasting time as a family fielding calls from places that shouldn’t be calling us. We don’t miss the home phone. It is exceptionally rare that we get a meaningful voicemail on it. Most are marketeers/scam artists.

The filter is pretty good, but I am thinking of simply ignoring any call I do not know the number for. If they leave a VM, I will get it, moments after the call itself, and I can decide whether or not to call them back.


Nutanix files for IPO

Short story here. I am not going to pore over their S-1 form to find interesting tidbits; others will do that, and are paid to do so.

They are the first of several, though I had thought that Dell would acquire them before they hit IPO. I am guessing that the combination of the price for them, plus the EMC acquisition stopped this conversation. So now Nutanix is going to IPO.

Nutanix is a software stack upon generic hardware. The hardware is usually lacklustre (original appliances being low/midrange Supermicro gear, “newer” appliances being mostly Dell boxen).

Real hyperconvergence is hardware designed with as few barriers as possible to data motion and performance. Pseudo-HCI is a software stack upon generic hardware. HC systems require very high performance, and very few barriers throughout the entire stack … you simply cannot slap random software on random hardware and hope for the best, as the architectural elements you ignored will usually be the first ones to bite you hard when you start pressing this under load.

Hyperconverged is well architected hardware and software. Anything else is marketing.


Toshiba contemplating spinning out NAND flash

This is remarkable if true, and if they follow through with it, it will change the landscape of Flash quite a bit.

Right now there are 43 major flash providers, and a few smaller ones. Building flash fabs is expensive; even given the demand and process improvements, there is still quite a bit of investment required to set one up.

Toshiba has some cool kit here, we’ve worked with it (and in full disclosure, we were talking about working more closely with them in the past). If they spin out the business, that will change the dynamic between Toshiba the enterprise storage company, and Toshiba the flash manufacturer. It could open doors for Samsung/Intel to do something with Toshiba the enterprise storage company. Or be an acquisition target for someone like Seagate. WD bought SanDisk, so it would make sense if Seagate grabbed this.

Given that storage OEMs like Seagate and WD are going for more vertical integration, having this core product supplier change its model somewhat definitely opens doors for the OEMs to pull them in. But these OEMs then also need to start looking at hyper-converged systems, as the array business is on a long secular decline as the market changes.
