Invariant under change of notation

This was the “joke” about tensors that one of my graduate school professors told us when we were trying to grok a sudden notational shift. Took some hard thinking, and then we sorta got it. Well enough to work out a problem. Hopefully to be useful in later life.

Well, 17 years (wow… that long?) later, I am writing some quick code to transform a data set extracted in XML into another data set. Basically I took the XML and, through the magic of Perl, turned it into a nice database, which I can then play with. Next I am going to extract it to Excel.

What has this got to do with HPC? This is the topcrunch data, and yes, I will make all the tools available on my site after finishing them.

What am I doing? Well, the topcrunch folks have lots of good data in their database, and in trying to analyze and understand it, I am discovering that letting users enter whatever they want in fields can be an interesting adventure. The topcrunch data is great, and I am much indebted to the folks (Prof. David Benson at UCSD) who made it available. Extracting meaning from this data set is an object lesson in transformation.

So I am trying to write a simple-minded transformation engine. Basically I want to map things like product names into technologies, insert useful columns, etc. That is, for a set of data X, I want to apply a transformation T to get a new set of data that is a slightly different view of the same data. I am changing notation. I am creating a tensor operator.

Ok… enough patting myself on the back here. This is how it works.

We start out with a little XML

<transform>
  <field name="cpu_interconnects">
    <condition>
      '/rapidarray/i' => 'fabric=rapidarray'
    </condition>
  </field>
</transform>

that contains the transform mini-language. It is that thing in the middle, the condition. It is quite Perlish, so let me translate.

If the data contained in the field “cpu_interconnects” matches the pattern rapidarray, ignoring case, then set a field named fabric in the output data to rapidarray, creating that field if it does not already exist.
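To make that translation concrete, here is a minimal sketch of how one such rule might be applied to a single record. It is written in Python rather than the Perl the tool itself uses, and the function name `apply_rule`, the dictionary-per-row layout, and the sample row are all assumptions for illustration, not the actual code or real topcrunch data.

```python
import re

def apply_rule(record, field, pattern, assignment, flags=0):
    """Return a copy of record with 'name=value' applied if field matches pattern.

    Hypothetical helper: record is a dict holding one row of data.
    """
    out = dict(record)                      # leave the input row untouched
    value = record.get(field, "")           # a missing field behaves like an empty string
    if re.search(pattern, value, flags):
        target, _, new_value = assignment.partition("=")
        out[target] = new_value             # creates the output field if it is absent
    return out

# Made-up row for illustration only.
row = {"cpu_interconnects": "Cray RapidArray"}
result = apply_rule(row, "cpu_interconnects", "rapidarray",
                    "fabric=rapidarray", re.IGNORECASE)
print(result)
```

A row whose cpu_interconnects value does not match comes back unchanged, which is what you want when many rules are run over the same data.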

Ok, why on earth would I want to do this when I could use some SQL …. Ok, that should answer your question right there. I know SQL, I just don’t like it, and it isn’t that portable across DBs (MySQL commands don’t work that well on PostgreSQL or SQLite or Oracle or … argue if you want, I have been bitten by this enough that I wrote DBIx::SimplePerl so I don’t have to think about SQL dialects anymore).

This way I can encode my transforms within XML, relatively easily, and off I go. Sure, it’s not a more general engine. It doesn’t need to be. It is Pareto optimal with respect to the other methods, given the constraint that I want it to be quick, simple, and useful.

So now I can use it to extract processor model data, network fabric data, clock speed, etc… and hopefully turn the topcrunch data set into something easier for me to extract meaning from.

Isn’t that a direction we in HPC need to be going in? Making it easier to extract meaning from data? Paraphrasing Hamming, we need to spend more time thinking about what our results mean than about the results themselves. Tools which make it easier to get to that meaning are good. Things like Matlab and Octave. Tools which let you stop worrying about the details and focus upon what you want to do. Like Perl, and XML::Smart, and DBIx::SimplePerl.

Now, is there a Microsoft connection? Of course, and it is positive. The end result of all this work will be a nice set of tools to generate and use Excel spreadsheets from these data sets, because Excel is also one of those tools that lets you spend more time thinking about meaning than about specific numbers, though it still lets you drill into the numbers. If only these connections were a bit better, so that I could integrate my transformation logic as part of the spreadsheet (no, VBScript is not an option, let’s simply not go there) and thus make it even easier … (insert sounds of massive hints being dropped).

Update: In order to make life easier and the tool more useful, I am going to do some format shifts and code changes. Now we are going to be able to have regex and evaluatable rules like this:

<condition>(AMD|Intel)=>cpu_vendor=$1</condition>

This means I can bring the full power of Perl into this. It is a relatively minor change in the code that buys a potentially large amount of leverage.
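As a rough illustration of what the new rule form implies, the sketch below splits a rule on `=>`, matches the left side as a regex, and fills `$1`-style references on the right from the captured groups. Again this is Python standing in for Perl, it only does capture substitution rather than evaluating arbitrary code, and the helper name `apply_capture_rule`, the `processor` field, and the sample row are assumptions for illustration.

```python
import re

def apply_capture_rule(record, field, rule):
    """Apply a 'pattern=>name=template' rule to one row (hypothetical helper)."""
    pattern, _, assignment = rule.partition("=>")
    target, _, template = assignment.partition("=")
    out = dict(record)
    match = re.search(pattern.strip(), record.get(field, ""))
    if match:
        # Replace $1, $2, ... in the template with the matching captured group;
        # a group that did not participate in the match becomes an empty string.
        out[target.strip()] = re.sub(
            r"\$(\d+)",
            lambda m: match.group(int(m.group(1))) or "",
            template.strip(),
        )
    return out

# Made-up row for illustration only.
row = {"processor": "AMD Opteron 2.4 GHz"}
result = apply_capture_rule(row, "processor", "(AMD|Intel)=>cpu_vendor=$1")
print(result)
```

One rule now covers every vendor listed in the alternation, instead of one rule per vendor as in the first format.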

Simple and powerful.
