Finally, a desktop Linux that just works

I’ve used Linux as my primary desktop for the last 16 years. In that time, I’ve also had laptops with various Windows flavors (95, XP, 2000, 7) and a MacOSX desktop. Before that, my first laptop (bought while working on my thesis) was a triple-boot job, with DOS, Windows 9x, and OS/2. I used the latter when I was traveling and needed to write; the thesis was written in LaTeX, so I could easily move everything back and forth between the laptop, my Indy at home, and my office Indigo.

During the SGI years, I used Irix mostly for desktop stuff, and it was very nice. It was, IMO, the best user interface I’d seen to date, Windows included. Far better than the Mac of that era (really … no comparison). The text editors mostly sucked, though … I wound up using nedit for almost everything.

After leaving SGI, I resolved that I would use desktop Linux in some form or other. I started out on a Dell laptop with Mandrake (the flavor of the day then). I moved on to SuSE (driven in part by a customer who used it). SuSE wasn’t actively unfriendly, it’s just that its UX was … well … not for the faint of heart.

None of these would be reasonable to give to my wife and daughter to use on their machines.

I moved from SuSE everywhere to CentOS on the servers and Ubuntu on the desktop and laptop around 2007 or so. CentOS seemed to make sense to me then for server bits. Ubuntu around 8.04 was really quite good.

But it started going downhill around 10.x. The UX became awful in 11.x and 12.x with the conversion to Unity.

I left the servers on CentOS, and moved the laptop and desktop to LinuxMint. This is an Ubuntu rebuild (Ubuntu itself being a Debian rebuild). Mint was focused on a very easy UX. You shouldn’t have to worry about stuff; it should all just work. I had not previously had that experience with Linux. Nor Windows, for that matter.

I started out around Mint 12 with Cinnamon, which is a reworking of the Gnome desktop into a paradigm I find comfortable. They also have a MATE version, reminiscent of the SuSE interface, but I really didn’t like that.

Mint was much better than Ubuntu, but sometimes I had interesting and astounding failures. Mint doesn’t believe in upgrades for one. Either you are on the long term support (LTS) release, or you are on the 6 month cycle. The latter is more “bleeding edge”, though you get support for up to 18 months. The former is “more stable” and you get longer support.

Some of the spectacular failures were around the NVidia graphics side. Nouveau, the open source NVidia driver, was not terribly good, and would, as often as not, hard-lock my machines. I had a devil of a time ripping it out of a few machines to replace it with the closed source but mostly working version.

I replaced the NVidia card in the office with an AMD card for a while, but AMD’s drivers were just terrible, and quite unstable if used in accelerated mode. This appeared to not be Linux specific, but more related to driver quality.

I moved the desktop in the office over to LMDE, which is the Linux Mint based upon the Debian base rather than the Ubuntu base. Slightly different basis, same experience. Generally very stable. Swapped in a newer NVidia card and drivers. Now it is rock solid.

Moved the home machine to Linux Mint 16 and still had some weird problems. It was annoying enough that it hit my productivity. 17 and then 17.1 came out to rave reviews. I decided to update one of my machines.

Two weeks later, after very heavy use, I can say a number of things:

1. Installation was a breeze. This is the first time I didn’t have to fiddle with boot line parameters to disable nouveau; it simply behaved correctly.
2. It worked with everything, with no fuss, out of the box, with the bare minimum of configuration on my part.
3. Stability. Oh … my … best … Linux … desktop … experience … ever

I can’t say enough good things about Linux Mint 17.1 Cinnamon edition. It really is the best desktop/laptop experience I’ve had to date, inclusive of the MacOSX machines.

I’ve got one outstanding annoyance on one machine, but it’s minor enough for me not to care so much.

Server side, we are rolling everything over to Debian. Or possibly the Devuan rebuild if I can’t get systemd to behave … though Mint 17.1 uses systemd and it doesn’t seem to suck.

This is definitely one that would work well for my family to use.

Viewed 8578 times by 1335 viewers

stateless booting

A problem I’ve been working on dealing with for a while has been the sad … well … no … terrible state of programmatically configured Linux systems, where the state is determined from a central (set of) source(s) via configuration databases, and NOT by local stateful configuration files. Madness lies in wait for those choosing the latter strategy, especially if you need to make changes.

All sorts of variations on this theme have been used over the last decade or so. Often programmatic tools like Chef or Puppet are used to push configuration to a system. This of course breaks terribly with new systems, and the corner cases they bring up.

Other approaches have been to mandate one particular OS and OS version, combined with a standard hardware configuration. Given how hardware is built by large vendors, that word “standard” is … interesting … to say the least.

Coraid may be going down

According to The Register. No real differentiation (AoE isn’t that good, and the Seagate/Hitachi network drives are going to completely obviate the need for such things).

We once used and sold Coraid to a customer. The Linux client side wasn’t stable. iSCSI was coming up and was actually quite a bit better. We moved over to it. This was during our build vs buy phase; we weren’t sure if we could build a better box. After getting one and using it for a customer, yeah, we were very sure ours were better.

On the performance side, they never really had anything significant.

Such is life, I hate watching companies go down, even if they are nominally competitors.

Anatomy of a #fail … the internet of broken software stacks

So I’ve been trying to diagnose a problem with my Android devices draining their batteries very quickly. And at the same time, I’ve been trying to understand why the address bar in Thunderbird has been taking a very long time to respond.

I had made a connection earlier today when I had noticed the 50k+ contacts in my contact list, of which maybe 2000 were unique.

I didn’t quite understand it. Why … no … where … were all these contacts coming from? And why were there so many duplicates?

In the brave new world of #IoT, we are going to have many interacting stacks. And these stacks are going to have bugs. And some of the failure modes are going to be … well … spectacular.

This is one such failure mode, that I happily caught in time.

Here is how I have pieced it together thus far. We had a runaway amplification of contacts due to some random buggy app. It may have been one of the sync bits in Thunderbird, or on my old iPhone and iPad, or on my new Android, or whatever.

It doesn’t matter what it was. What matters is what happened, and how the failure progressed.

And it shows why remarkably simple, and stupid (e.g. #IoT level) code can result in something akin to a positive feedback loop.

One of the buggy apps apparently either pulled an extra set of contacts from google, or pushed an extra set to google. Doesn’t matter what.

Google’s contact manager is dumb. It could be smarter. Far smarter. Say for example, if another app attempts to push a duplicate contact to it, instead of accepting it, it should simply move on to the next contact. Rinse and repeat.

This would have stopped what amounted to a denial of service on my devices, cold.
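For what it’s worth, the skip-a-duplicate policy is only a few lines of code. Here is a minimal sketch (the names and structure are entirely mine, not Google’s actual API) of a sync endpoint that ignores duplicate pushes instead of blindly appending them:

```c
#include <string.h>

#define MAX_CONTACTS 1000

/* A toy contact store. This is an illustration of the
 * "refuse duplicates on push" policy, nothing more. */
struct store {
    const char *email[MAX_CONTACTS];
    int count;
};

/* Returns 1 if the contact was added, 0 if it was a duplicate
 * (or the store is full) and was skipped. A linear scan is fine
 * for a sketch; a real store would use an index. */
int push_contact(struct store *s, const char *email)
{
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->email[i], email) == 0)
            return 0;               /* duplicate: ignore, don't append */
    if (s->count >= MAX_CONTACTS)
        return 0;
    s->email[s->count++] = email;
    return 1;
}
```

With that check on the server side, a buggy client can re-push the same contact forever and the store never grows.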

But it didn’t.

So some buggy app synced. And then resynced, and then resynced.

And the poor little androids and other devices spent more and more time syncing. And more and more battery syncing.

This is the ultimate in secondary denial of service attacks. Don’t attack the device; cause it to run out its power by leveraging its normal functionality. The denial of service comes through a shutoff due to a drained battery. A second order, or indirect, attack vector.

Neat, huh? This is what we have to look forward to.

For reasons beyond my comprehension, each sync resulted in a doubling of contacts. How often they have synced is not known. What is known is that I had well over 10k contacts for one person, that were identical.
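The arithmetic of doubling shows just how fast this runs away. A small helper (illustrative only, my own code) counting the doublings needed to exceed a target:

```c
/* If each buggy sync round trip doubles the contact set, growth
 * is 2^n. This counts how many doublings turn ONE contact into
 * more than `target` identical copies. */
int syncs_to_exceed(unsigned long target)
{
    unsigned long copies = 1;
    int syncs = 0;
    while (copies <= target) {
        copies *= 2;    /* one buggy sync: the whole set is duplicated */
        syncs++;
    }
    return syncs;
}
```

syncs_to_exceed(10000) returns 14: it takes only fourteen bad sync round trips to go from one contact to more than ten thousand identical copies. A positive feedback loop, exactly as described.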

So I cleaned that out.

And later today, I found I had that again.

So I stopped everything from syncing against it. Everything. It’s now a one way pull from Google’s contact manager.

Because Google’s contact manager is really, hilariously, stupid. The remove-duplicates function? Good idea. Though, I dunno, why not make it automatic?

But that’s not the main point of this. The real point is that IoT is going to be ripe for unintended abuse, not to mention intentional abuse. Denial of service at a level not comprehended before.

Tis a brave new world. Also known as, be careful what you wish for, you just might get it.

Software has eaten the world, and we might just regret letting this happen.

Drivers developed largely out of kernel, and infrequently synced

One of the other aspects of what we’ve been doing has been forward porting drivers into newer kernels, fixing the occasional bug, and often rewriting portions to correct interface changes.

I’ve found that subsystem vendors seem to prefer to drop code into the kernel very infrequently. Sometimes they are synced only once every few years. This leads to distro kernels often having terribly broken, and often very unstable, device support.

May work fine for a web server or other lightly loaded cloud like system, but when you push serious metal very hard, bad things happen to these kernels with their badly out of date device drivers. We know, we push them hard. So do our customers.

So I’ve been forward porting a number of drivers, and I gotta say … I really … really … am not having fun dealing with all the fail I see in the source. Our makefiles are chock full of patches to the kernels to handle these things.

I wish that a requirement for having a device driver in the Linux source tree were that it be no more than 6 months out of date with the current vendor driver revisions.

In our kernels, these things will just work. We’ll offer the patches back to the driver folks, but I don’t think they’ll want them. Past experience on this.

Parallel building debian kernels … and why it’s not working … and how to make it work

So we build our own kernels. No great surprise, as we put our own patches in, our own drivers, etc. We have a nice build environment for RPMs and .debs. It works, quite well. Same source, same patches, same make file driving everything. We get shiny new and happy kernels out the back end, ready for regression/performance/stability testing.

Works really well.

But …

but …

parallel builds (e.g. leveraging more than 1 CPU) work only for the RPM builds. The .deb builds, not so much.

Now the standard mechanism to build debian kernels involves some trickery including fakeroot, make-kpkg, and other things. These autogenerate Makefiles, targets, etc. based upon the rule sets.

Fine, no problem with this. I like autogenerated things. Actually, I often like programmatically generated things better than human generated things, as the latter invariably have crap you really don’t want in there. Not that the others don’t, but there is mysticism around the existence of some things in people’s build environments, versus empirical reality.

The canonical mechanism is to use CONCURRENCY_LEVEL=N for N=some integer.

Fine. Use it in the make file. And …

We have a stubborn single threaded build. It will not change.

Fine, let’s capture output and make it verbose. Look for the concurrency level in the output. See if something is monkeying with it.

scalablekernel@build:~/kernel/3.18$ grep CONCUR out
export CONCURRENCY_LEVEL=8
cd linux-"3.18" ; export CONCURRENCY_LEVEL=8 ; fakeroot make-kpkg -j8 --initrd --append-to-version=.scalable --added_modules=arcmsr,aacraid,igb,e1000,e1000e,ixgbe,virtio,virtio_blk,virtio_pci,virtio_net --overlay-dir=../ubuntu-package --verbose buildpackage --us
DEB_BUILD_OPTIONS="" CONCURRENCY_LEVEL=1 \

/sigh

I look to see where that is coming from. Looks like debian/ruleset/targets/common.mk. Which is, in turn, coming from /usr/share/kernel-package/ruleset/targets/common.mk. Look for concurrency and see this snippet:

debian/stamp/build/buildpackage: debian/stamp/pre-config-common $(REASON)
	@test -d debian/stamp       || mkdir debian/stamp
	@test -d debian/stamp/build || mkdir debian/stamp/build
	@echo "This is kernel package version $(kpkg_version)."
ifneq ($(strip $(HAVE_VERSION_MISMATCH)),)
	@echo "The changelog says we are creating $(saved_version)"
	@echo "However, I thought the version is $(KERNELRELEASE)"
	exit 1
endif
	echo 'Building Package' > stamp-building
	# work around idiocy in recent kernel versions
	# However, this makes it harder to use git versions of the kernel
	$(save_upstream_debianization)
	DEB_BUILD_OPTIONS="$(SERIAL_BUILD_OPTIONS)" CONCURRENCY_LEVEL=1 \
	  dpkg-buildpackage $(strip $(int_root_cmd)) $(strip $(int_us)) \
	  $(strip $(int_uc)) -j1 -k"$(pgp)" -m"$(maintainer) <$(email)>"
	rm -f stamp-building
	$(restore_upstream_debianization)
	echo done > $@


[starts banging head against desk again]

Force concurrency level to 1, AND then force -j1. Oh dear lord.

Lets see if switching these back to 8 helps (8 core machine).

Why … yes, yes it does …

Grrr

[Update] The deities of kernel building are not kind. It appears that parallel build in debian actually breaks other things that it should not break. I have a choice of a very slow (1+ hour) kernel + module build that works, or a fast (roughly 5-10 minute) kernel + module build that fails because of a causality violation (e.g. someone couldn’t figure out how to fix their code to run in parallel).

So, if I have time in the next month or so, I am going to find that very annoying serializer, take it out behind the barn, and put it out of my misery.

Amusing #fail

I use Mozilla’s Thunderbird mail client. For all its faults, it is still the best cross platform email system around. Apple’s mail client is a bad joke and only runs on Apple devices (go figure). Linux’s many offerings are open source and portable, but most don’t run well on my Mac laptop. I no longer use Windows apart from running it in a VirtualBox environment. And I would never go back to Outlook anyway (used it once, 15 years or so ago … never again).

Since I am using Thunderbird, and our dayjob mail leverages Google’s gmail system, I like to keep contacts in sync.

This is where the hilarity begins. And so does the #fail.

A long time ago, in a galaxy far, far away, contact management was easy. You had simple records, a single mail client or two. Everything sorta just worked … because, standards.

Then walled gardens arose. Keep the customer using your product. Prevent information outflow, but use information inflow. Break things in subtle ways.

Thus arose contact managers/importers, and things were again good in the world.

Until those in the walled gardens (Apple, Google) decided to break other things as, you know, they started to compete more.

Those contact importers for Thunderbird worked, but pretty soon the address bar slowed down. Type in an address and wait 10 seconds or so to autocomplete.

Mind you, this is on a 24 physical processor desktop system, with 48GB RAM, high end NVidia graphics, two displays, an SSD OS drive, and 1GB/s 5TB local storage. This is not a slow machine. It’s actually bloody fast. One of our old Pegasus units we no longer build. Easily the best desktop that ever graced a market, but it failed as a product because people want cheap crap on their desktop, not good crap.

Damn it, I am grousing.

Ok, back to the story.

So there I am, wondering why it’s taking 10+ seconds to autocomplete an address. It’s a database lookup, dammit; it should be indexed and fast. Unless … unless … they are doing something INSANE like, I dunno, not using a database with indices. That would manifest as a long delay searching a large “database”. So let me look at my address book. Over 10+ years, I’ve curated about 3.5k addresses, so I should see something not unlike that.

This is 50k of pure #fail.
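To see why an unindexed lookup hurts at that scale, here is a minimal sketch (my own toy code, not Thunderbird’s actual implementation) of autocomplete as a linear scan. Every keystroke compares against every entry, so a 50k-entry book does roughly 14x the work of a 3.5k one, before any results are even rendered:

```c
#include <string.h>

/* Toy linear autocomplete over an unindexed array: every entry is
 * compared on every keystroke, so cost scales with book size.
 * Returns the number of prefix matches; `comparisons` reports how
 * many entries the scan had to touch. Illustrative only. */
int count_prefix_matches(const char **book, int n, const char *prefix,
                         long *comparisons)
{
    int matches = 0;
    size_t plen = strlen(prefix);

    *comparisons = 0;
    for (int i = 0; i < n; i++) {
        (*comparisons)++;                     /* one strncmp per entry */
        if (strncmp(book[i], prefix, plen) == 0)
            matches++;
    }
    return matches;
}
```

With an index (or even a sorted array and a binary search), the cost per keystroke would be closer to the number of matches, not the number of entries.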

I don’t know whom to blame.

Happily, Google’s gmail has a find/merge duplicates function.

Start using that.

20 iterations later (it stops at 2500 copies of the same entry … go figure), this address book is down to 700 addresses, with no duplicates.

Oh. Dear. Lord.

So much #fail. So little time.

So I disabled the contact manager’s updating of Google’s contacts. This remains an unsolved, or poorly solved, problem.

Walled gardens suck.

The Interview (no, not that one!)

Rich at InsideHPC.com (you do read it daily, don’t you?) just posted our (long) interview from SC14. Have a look at it here (http://insidehpc.com/2015/01/video-scalable-informatics-steps-io-sc14/) .

As a reminder, Portable PetaBytes are for sale! And yes, the response has been quite good …

More soon … And no, we aren’t going to hack anyone

Micro, Meso, and Macro shifts

The day job lives at a crossroads of sorts. We design, build, sell, and support some of the fastest hyperconverged (aka tightly coupled) storage and computing systems in market. We’ve been talking about this model for more than a decade, and interestingly, the market for this has really taken off over the last 12 months.

The idea is very simple. Keep computing, networking, and storage very tightly tied together, and enable applications to leverage the local (and distributed) resources at the best possible speed. Provide scale out storage and compute capability, and the fastest possible communication infrastructure.

Make it so that people with ginormous data and computing needs have a fighting chance of actually being able to do their work in a reasonable period of time.

This is really what tightly coupled is all about. Hyperconvergence is bringing all these aspects together, and enabling the software to make effective use of it.

To distill the essence, this is about reducing the barriers to performance at every level, and designing systems for higher performance efficiency (e.g. more cost effective to run at scale), while increasing the density (e.g. reducing the number of systems you need to get performance).

But this isn’t the only thing changing. People are enamored of Big Data. Though, if you read various analyses, it appears there is a significant amount of self designed/built big data systems versus vendor packaged ones. And more to the point, the footprint of big data systems is of order of magnitude 10^3 systems. I don’t know what the distribution function is for this, but a 100% growth in these wouldn’t be terribly large in terms of system footprint.

Which, to a degree, begs the question as to why vendors are chasing such a ‘small’ market so hard.

I know, I know … its all the rage and Wikibon indicates that Hadoop is huge.

The estimate for 2013 was $2B USD, and with a 58.2% CAGR, a prediction of an approximately $3.16B USD market in 2014. This is the complete market, not just the software side.

What Hadoop represents is a change in thought processes on how to gain insight from pools of data. How to build better data driven models.

This is not to say that Hadoop is alone. SAS, SPSS, and many other statistical analytics packages have been used, for decades, to construct and test models. What has changed has been the leveraging of new technologies to store and query data at effectively arbitrary scale.

This is, IMO, the fundamental genius of these tools. And this is in part where the value proposition sits.

To distill this to the essence, its about lowering the friction between data storage, modeling, and testing.

While the journalists are using Hadoop to mean the data analytics market, there is an unfortunate tendency to conflate the two. I am pretty sure that Kx, SAS, etc. are all well represented in the analytics market. Specifically, I am wondering if the 10^3 number is badly undercounting the real size of the market.

Have a look at this poll from KDNuggets. This shows where (a self selecting group, so likely significant biases are shown) people responding indicate they are spending their time for analytics. As you can see, Hadoop is pretty low on the list and growing slowly. It and SAS appear to dominate the growth (again, self selecting data, so there is definite bias).

But chances are there are many people using Hadoop on the back end. Things like the R Connector, and for that matter many other Hadoop connectors, suggest it is being used as an analytic back end. Indeed, the push for a SQL interface, and now Spark (for in-memory distributed analytics), suggests that there is far more interest in utilization of this than is represented by the number reported.

Big data is all about being able to use this data at whatever scale you need, with as little friction and as few barriers as possible.

This is also the essence of tightly coupled computing. Bring computing, storage, and networking together in such a way as to reduce friction.

This convergence is interesting to say the least …

Friday morning/afternoon code optimization fun

Every now and then I sneak a little time in to play with code. I don’t get to do as much coding as I wish these days … or … not the type of stuff I really enjoy (e.g. squeezing as much performance out of stuff as I can).

The Ceph team are rewriting some code, and they needed a replacement log function that didn’t require FP, and was faster/smaller than a table lookup. I played with it a little, starting with FP code (Abramowitz and Stegun is a good source for the underlying math; my copy is dog-eared and falling apart). Then I wanted to see what could be done with pure integer code. Log functions can be implemented in terms of series in FP, or shifts and logical operations(!) for ints.

So, pulling out a number of options, I did some basic coding, and summed the logs of arguments 1 to 65535 inclusive. I compared this to simple casting of the DP log2 function. The code is simple, straightforward, and fast.

#include <stdio.h>
#include <math.h>
#include <sys/time.h>

#define NUMBER_OF_CALIPER_POINTS 10
struct timeval ti,tf,caliper[NUMBER_OF_CALIPER_POINTS];
struct timezone tzi,tzf;

int log_2 (unsigned int v)
{
register unsigned int r; // result of log2(v) will go here
register unsigned int shift;

r =     (v > 0xFFFF) << 4; v >>= r;
shift = (v > 0xFF  ) << 3; v >>= shift; r |= shift;
shift = (v > 0xF   ) << 2; v >>= shift; r |= shift;
shift = (v > 0x3   ) << 1; v >>= shift; r |= shift;
r |= (v >> 1);
return r;
}

int main (int argc, char **argv)
{
int i, x, milestone;

float logx;
int sx = 0, sumlogx = 0;
double delta_t;

milestone = 0;
gettimeofday(&caliper[milestone],&tzf);
for(x=1; x<65536; x++) sx+=(int)log2((double)x);
milestone++;
gettimeofday(&caliper[milestone],&tzf);

for(x=1; x<65536; x++) sumlogx+=log_2((unsigned int)x);
milestone++;
gettimeofday(&caliper[milestone],&tzf);

printf("library function sum: %i\n",sx);
printf("local   function sum: %i\n",sumlogx);

/* now report the milestone time differences */
for (i=0;i<=(milestone-1);i++)
{
delta_t = (double)(caliper[i+1].tv_sec-caliper[i].tv_sec);
delta_t += (double)(caliper[i+1].tv_usec-caliper[i].tv_usec)/1000000.0;
printf("milestone=%i to %i: time = %.5f seconds\n",i,i+1,delta_t);
}
}


Compile and run it, and this is what I get

landman@metal:~/work/development/log$ gcc -o log2.x log2.c -lm ; ./log2.x
library function sum: 917506
local   function sum: 917506
milestone=0 to 1: time = 0.00284 seconds
milestone=1 to 2: time = 0.00091 seconds

Milestone 0 to 1 is the library log function. Milestone 1 to 2 is the local version in the source.

Now change out the log_2 function for a different variant

int log_2 (unsigned int v)
{
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};

v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;

r = MultiplyDeBruijnBitPosition[(unsigned int)(v * 0x07C4ACDDU) >> 27];
return r;
}

and the results are even more interesting

landman@metal:~/work/development/log$ gcc -o log2a.x log2a.c -lm ; ./log2a.x
library function sum: 917506
local   function sum: 917506
milestone=0 to 1: time = 0.00288 seconds
milestone=1 to 2: time = 0.00065 seconds


And it gets faster.

Not my code for the log_2 function, but it’s fun to play with things like this.
