Talk from #Kxcon2016 on #HPC #Storage for #BigData analytics is up

See here, which was largely about how to architect high performance analytics platforms, and a specific shout out to our Forte NVMe flash unit, which is currently available in volume starting at $1 USD/GB.

Some of the more interesting results from our testing:

  • 24 GB/s of bandwidth, largely insensitive to block size.
  • 5+ million IOPS of random I/O, sensitive to block size.
  • 4k random reads (100% read) were well north of 5M IOPS.
  • 8k random reads were well north of 2M IOPS.
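
For context, numbers in this neighborhood are typically measured with something like fio; the talk doesn't name the tool, so the device path, queue depth, and job count below are assumptions, just a sketch of a 100% 4k random read job:

# Hypothetical fio job for 4k random reads against an NVMe block device.
# Device path, iodepth, and numjobs are illustrative assumptions, not our harness.
fio --name=randread-4k --filename=/dev/nvme0n1 --rw=randread --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=64 --numjobs=16 \
    --runtime=60 --time_based --group_reporting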

Over a single 100Gb IB connection, with our standard parallel file system (BeeGFS) running, we sustained 11.6 GB/s write and 11.8 GB/s read bandwidth.

Going to #KXcon2016 this weekend to talk #NVMe #HPC #Storage for #kdb #iot and #BigData

This should be fun! This is being organized and run by my friend Lara of Xand Marketing. Excellent talks scheduled, fun bits (Raspberry Pi-based kdb+!!!).

Some similarities with the talk I gave this morning, but more of a focus on specific analytics issues relevant for people with massive time series data sets and a need to analyze them.

Looking forward to getting out to Montauk … haven’t been there since I did my undergrad at Stony Brook. Should be fun (the group always is). Sneaking a day off on Friday to visit with my family, then driving out Saturday morning.

Gave a talk today at #BeeGFS User Meeting 2016 in Germany on #NVMe #HPC #Storage

… through the magic of Google Hangouts. I think they will be posting the talk soon, but you are welcome to view the PDF here.

Success with rambooted Lustre v2.8.53 for #HPC #storage

[root@usn-ramboot ~]# uname -r
3.10.0-327.13.1.el7_lustre.x86_64

[root@usn-ramboot ~]# df -h /
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           8.0G  4.3G  3.8G  53% /
[root@usn-ramboot ~]# 

[root@usn-ramboot ~]# rpm -qa | grep lustre
kernel-3.10.0-327.13.1.el7_lustre.x86_64
kernel-tools-3.10.0-327.13.1.el7_lustre.x86_64
kernel-devel-3.10.0-327.13.1.el7_lustre.x86_64
lustre-2.8.53_1_g34dada1-3.10.0_327.13.1.el7_lustre.x86_64.x86_64
kernel-tools-libs-devel-3.10.0-327.13.1.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.8.53_1_g34dada1-3.10.0_327.13.1.el7_lustre.x86_64.x86_64
kernel-headers-3.10.0-327.13.1.el7_lustre.x86_64
lustre-osd-ldiskfs-2.8.53_1_g34dada1-3.10.0_327.13.1.el7_lustre.x86_64.x86_64
kernel-tools-libs-3.10.0-327.13.1.el7_lustre.x86_64
lustre-modules-2.8.53_1_g34dada1-3.10.0_327.13.1.el7_lustre.x86_64.x86_64

This means that we can run Lustre 2.8.x atop Unison.

Still pre-alpha, as I have to get an updated kernel into this, as well as update all the drivers.

These images don’t simply have Lustre in them; they also have BeeGFS, and we’ll have a few more goodies by the time beta rolls around in a few weeks.

It’s not perfect, but we have CentOS/RHEL 7.2 and Lustre integrated into SIOS now

Lustre is infamous for its kernel specificity, and it is, sadly, quite problematic to get running on a modern kernel (3.18+). This has implications for quite a large number of things, including whole subsystems that were only partially back-ported to earlier kernels … which quite often misses very critical bits for stability/performance.

I am not a fan of back-porting for features; I am a fan of updating kernels for features. But that is another issue I’ve talked about in the past.

A large government lab customer wanted a Lustre version of our NVMe system. I won’t discuss specifics of this scenario other than to note that we have to get Lustre onto the system.

SIOS is our OS and boot loader for diskless/stateless systems, and for our stateful systems. We don’t like using the distro-native installers, due to their sheer fragility (anaconda is something you really want to avoid at almost any cost). Debian’s is large and unwieldy; we cannot get it to do what we want.

So SIOS handles putting bits down on disk/ramdisk image. This part works beautifully.

The issue is at startup. Debian’s initramfs is a very well engineered/designed system, and it just works correctly. I don’t have to mess around with very much to make it do what we need.

Dracut, on the other hand … is … well … dracut. If there were an option to rip it out and replace it with Debian’s initramfs, I’d do that. I spent the better part of the last week debugging dracut, finding places where reality and the documentation did not match up. Tracing booting one step at a time. Re-building our dracut module (first built in 2012 for CentOS/RHEL 6 issues of a similar nature), and discovering that the documentation of what gets called, and when it gets called … is woefully wrong.

It took a week. A whole week to debug the monstrosity. I needed a mixture of grub options, dracut.conf settings, and other bits to make it work.
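
To give a flavor of what was involved, the knobs look roughly like this; the module name is hypothetical, but the dracut.conf directives and the rd.* kernel command line options are stock dracut:

# /etc/dracut.conf.d/sios.conf -- the "sios" module name is hypothetical
add_dracutmodules+=" sios "      # pull our custom module into the initramfs
hostonly="no"                    # build a generic (non-host-specific) image
compress="xz"

# grub kernel command line options that help while tracing the boot:
#   rd.shell             drop to a shell in the initramfs on failure
#   rd.debug             trace the initramfs scripts (very verbose)
#   rd.break=pre-mount   stop just before the root filesystem is mounted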

It’s not fragile. With our current stable kernel, there is a nice dracut “bug” which causes significant spamming of the logs, and really long boot times.

Testing it on the NVMe machine now. I wonder how much performance and stability I am giving up by going back to an ancient, buggy kernel that we need to use in order to support Lustre. It would be much better if the Lustre patches for 3.18 and beyond were easily accessible (they aren’t; you have to hunt them down). It would also be good if we didn’t need to mess with the whole kernel build process in order to build Lustre. No problem if the user space depends upon the kernel bits and headers. Definitely a problem that you have to build them all together, all at once. I seem to remember grousing about this 8+ years ago as well.
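
To illustrate the build coupling I’m grousing about: the server-side build has to be pointed at the matching (patched) kernel tree, roughly like this (paths are placeholders):

# Sketch of building Lustre against a specific patched kernel tree (paths are placeholders).
cd lustre-release
sh autogen.sh
./configure --with-linux=/usr/src/kernels/3.10.0-327.13.1.el7_lustre.x86_64
make rpms    # kernel modules and user space get built together, all at once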

/sigh

reason #31659275 not to use java

As seen on Hacker News, linking to an Ars Technica article, this little tidbit.

This is the money quote:

In a sense, then, the damage is already done. But the upcoming trial is still relevant. If Google can’t win on fair use, it would be a second blow to the old notion that code should be re-usable. The enormous damages figure can’t be ignored, either. Oracle has suggested it should get $8.8 billion worth of profits from Android, as well as another $475 million in actual damages. That’s not chump change, even to Google. For smaller companies, a big Oracle win will make it clear that if they lose a lawsuit over unauthorized use of APIs, they could suffer an extinction-level event.

I know it seems obvious now to Google and to others, but mebbe … mebbe … they should rethink building a platform in a non-open language?

I’ve talked about OSS-type systems in terms of business risk for well more than a decade. OSS intrinsically changes the risk model, so that you do not have a built-in dependency upon another stack that could go away at any moment. Most people view OSS as a way to reduce costs, but really, the de-risking part is critical.

Like it or not, Java is not OSS. It is not even close to OSS. You can’t use it without importing risk into your platform.

As Google is learning, to the chagrin of everyone everywhere, this risk can be not simply substantial, but potentially existential. Not that I think Oracle will dismantle Google. But the risk of continuing with any form of Java is now so much more than any conceivable benefit … and this has got to be true now for each and every single Java developer, who may need a license from Oracle to use Java APIs.

How about them apples.

Unfortunately, this isn’t simply restricted to Java as the article notes. Any ‘proprietary’ API, or copyrighted API is now subject to similar actions. At least in the US.

Think about that in terms of the software development and delivery model in the US.

/shakes head in disbelief at the incredibly short-sighted nature of the legal actions, and their long-term implications across the industry

isn’t this the definition of a Ponzi scheme?

From this article at the WSJ detailing the deflation of the tech bubble in progress now.

Venture capitalist Bill Gurley of Benchmark described this phenomenon at length in a recent blog post, in which he alleged that dirty term sheets allow some companies to continue raising money at higher valuations by promising bigger payoffs to new investors at the expense of older investors. That ultimately could render worthless shares held by employees and even some founders.

A Ponzi scheme is like this:

A Ponzi scheme is a fraudulent investment operation where the operator, an individual or organization, pays returns to its investors from new capital paid to the operators by new investors, rather than from profit earned through legitimate sources. Operators of Ponzi schemes usually entice new investors by offering higher returns than other investments, in the form of short-term returns that are either abnormally high or unusually consistent.

Every now and then you get an eye opener

This one came while we were conditioning a Forte NVMe unit and I was running our OS install scripts. Running dstat in a window to watch the overall system …

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  2   5  94   0   0   0|   0    22G| 218B  484B|   0     0 | 363k  368k
  1   4  94   0   0   0|   0    22G| 486B  632B|   0     0 | 362k  367k
  1   4  94   0   0   0|   0    22G| 628B  698B|   0     0 | 363k  368k
  2   5  92   1   0   0| 536k  110G| 802B 2024B|   0     0 | 421k  375k
  1   4  93   2   0   0|   0    22G| 360B  876B|   0     0 | 447k  377k

Wait … is that 110 GB/s (second line from the bottom, in the writ column)? Wow …

Likely a measurement oddity. But it made me do a double take.

new SIOS feature: compressed ram image for OS

Most people use squashfs, which creates a read-only (immutable) boot environment. Nothing wrong with this, but it forces you to have an overlay file system if you want to write. Which complicates things … not to mention when you overwrite too much and run out of available inodes on the overlayfs. Then your file system becomes “invalid” and Bad-Things-Happen™.
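
For contrast, the usual squashfs-plus-overlay arrangement looks something like the following sketch (paths are illustrative):

# Typical squashfs + overlayfs layering (paths illustrative).
mkdir -p /ro /rw /newroot
mount -t squashfs -o loop /boot/rootfs.squashfs /ro    # immutable lower layer
mount -t tmpfs tmpfs /rw                               # writable upper layer in RAM
mkdir -p /rw/upper /rw/work
mount -t overlay overlay \
      -o lowerdir=/ro,upperdir=/rw/upper,workdir=/rw/work /newroot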

At the day job, we try to run as many of our systems out of RAM disks as we can. Yeah, it uses up a little RAM. And no, it’s not enough to cause a problem for our hyperconverged appliance users.

I am currently working on the RHEL 7/CentOS 7 base for SIOS (our Debian 7 and 8 bases already work perfectly, and our Ubuntu 16.04 base is coming along as well). Our default platform is the Debian 8 base, for many reasons (engineering, ease of support, etc.).

SIOS, for those who are not sure, is our appliance OS layer. For the most part, it’s a base Linux distribution, with our kernel and our tools layered atop. It enables us to provide the type of performance and management that our customers demand. The underlying OS distro is generally a detail, and not a terribly relevant one, unless there is some major badness engineered into the distro from the outset.

SIOS provides an immutable booting environment, in that all writes to the OS file system are ephemeral. They last only for the lifetime of the OS uptime. Upon reboot, a pristine and correct configuration is restored.

This is tremendously powerful, in that it eliminates the roll-back process if you mess something up. Even more so, it completely eliminates boot drives from the non-administrative nodes in your system. And with the other parts of SIOS Tiburon tech, we have a completely decentralized and distributed booting and post-boot configuration system. All you need are working switches.

More on that in a moment.

First the cut-n-paste from a machine.

ssh root@n01-1g 
Last login: Wed Apr 27 18:22:18 2016 from unison-poc

[root@usn-ramboot ~]# uptime
 18:46:17 up 25 min, 2 users, load average: 0.46, 0.34, 0.38

[root@usn-ramboot ~]# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/zram0      7.8G  6.2G  1.6G  80% /

[root@usn-ramboot ~]# cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core)


Yes, that is a compressed RAM device, with an ext4 file system atop it. It looks like it is using 6.2 GB of space …

[root@usn-ramboot ~]# cat /sys/block/zram0/orig_data_size
6772142080

… but the actual compressed size is 3.3GB

[root@usn-ramboot ~]# cat /sys/block/zram0/compr_data_size
3333034374

I know zram was intended more for swap than for OS drives, but it’s a pretty nice use of it. FWIW, the squashfs version is a little larger, and the mount is more complex.
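
If you want to play with the idea by hand, outside of SIOS, the setup is roughly this (sizes and mount point are illustrative, not our build scripts):

# Hand-rolled zram + ext4 experiment (sizes and paths illustrative, not our build scripts).
modprobe zram num_devices=1
echo lz4 > /sys/block/zram0/comp_algorithm   # choose the compressor before sizing
echo 8G  > /sys/block/zram0/disksize
mkfs.ext4 -q /dev/zram0
mkdir -p /mnt/zram
mount /dev/zram0 /mnt/zram

# Same sysfs attributes as above for checking the compression ratio:
cat /sys/block/zram0/orig_data_size /sys/block/zram0/compr_data_size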

Now for the people clamoring about how we preserve useful state, like, I dunno, config files, logs, etc.

First, logs. These should always be forwarded, if possible, to logging databases for analysis/event triggering. You can use something as simple as rsyslog for this, or something more robust like SIOS-metrics (to be renamed soon). Not everything goes through the syslog interface, and some subsystems like to write to strange places, or try to be your logging stack (systemd anyone?). SIOS-metrics will include mechanisms to help vampire out the data from the tools that are currently hoarding/hiding it. This includes, BTW, reluctant hardware like RAID cards, HBAs, IB HCAs, etc.
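
The rsyslog part really is a one-line forwarding rule; the collector hostname below is a placeholder:

# /etc/rsyslog.d/forward.conf -- ship everything to a central collector.
# "loghost.example.com" is a placeholder for your logging database/collector.
*.*  @@loghost.example.com:514    # @@ = TCP, a single @ would be UDP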

Second, configs. There are many ways to attack this problem, and SIOS allows you to use any/all/some of them. That is, we aren’t opinionated about which tool you want to use (yet). This will change, as we want the config to come from the distributed database, so we’ll have more of a framework in place soon for this, with a DB client handling things. Right now, we create a package (a script and/or tarball usually, though we are looking at doing this with containers) which has things pre-configured. Then we copy the script/tarball/container to the right location after boot and network config, and proceed from there. I should note that our initial network configuration is generally considered to be ephemeral. We configure networking on most of our units this way via programmatic machinations. This allows us to have very complex and well-tuned networking environments dynamically altered, with a single script/tarball/container effecting this. It enables us to trivially configure/distribute optimized Ceph/GlusterFS/BeeGFS/Lustre configs (and maybe GPFS some day).
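
In its simplest (tarball) form, that post-boot step amounts to something like the sketch below; the filenames and the final hook script are illustrative, not our actual tooling:

# Simplest form of the post-boot config drop (names and paths are illustrative).
scp admin-node:/srv/configs/n01-1g-config.tar.gz /tmp/
tar -xzf /tmp/n01-1g-config.tar.gz -C /     # lays down the pre-built /etc pieces
systemctl restart network                   # re-apply the tuned network config
/opt/sios/bin/apply-config.sh               # hypothetical per-node configuration hook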

As I noted, the base distro is generally a detail. One we try to ignore, but sometimes, when we have to put in lots of time to work around engineered-in breakage and technical debt in the distro … it’s less fun.

More soon.

there are times

that try my patience. Usually with poorly implemented filtering tools of one form or another.

SPF is an anti-spoofing mechanism: it identifies which machines are allowed to send email for your domain.
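
For concreteness, an SPF record is just a TXT record on the domain, and dig shows what resolvers actually see; the record below is an illustrative policy, not this domain’s actual one:

# Example SPF TXT record (illustrative policy, not this domain's actual record):
#   example.com.  IN  TXT  "v=spf1 mx a ip4:192.0.2.0/24 -all"

# What the rest of the world sees:
dig +short TXT example.com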

The tools that purport to test it? Not so good. I get conflicting answers from various tools for a simple SPF record. The online (interactive) tester seems to work, and shows me my config is working nicely.

The email tester shows it is working nicely.

The SPF policy framework for postfix goes ::shrug::

Some corporate SPF framework, with minimal visibility and no support for non-customers (the ones whose email it is misclassifying), claims there is a problem.

The DKIM bits seem to work. I’ve not set up DMARC (yet) though I might.

Curiously, all of this is for the $dayjob, using the Google mail system. For this system, no such issues. Everything seems to work.

Honestly, I think it is time for people to set up an emailtest@domainname so that it becomes easy to diagnose problems with legitimate email. I just wrestled with an earlier header problem (which wasn’t our problem per se, but I am trying to be helpful). Now I have other folks simply rejecting mail for no apparent reason.

Stuff like this wastes my time and effort, and makes technology far less fun.

I have more important things to do than to waste on this.
