<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The coming bi(tri?)furcation in HPC, part 1</title>
	<atom:link href="http://scalability.org/?feed=rss2&#038;p=1590" rel="self" type="application/rss+xml" />
	<link>http://scalability.org/?p=1590</link>
	<description>not so random musings and mutterings about high performance computing</description>
	<lastBuildDate>Tue, 07 Sep 2010 21:40:24 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<item>
		<title>By: This could be game changing for lots of users &#171; scalability.org</title>
		<link>http://scalability.org/?p=1590&#038;cpage=1#comment-31202</link>
		<dc:creator>This could be game changing for lots of users &#171; scalability.org</dc:creator>
		<pubDate>Tue, 13 Jul 2010 12:30:38 +0000</pubDate>
		<guid isPermaLink="false">http://scalability.org/?p=1590#comment-31202</guid>
		<description>[...] I noted a little more than a year ago that HPC was about to fragment. This announcement is going to accelerate the process. [...]</description>
		<content:encoded><![CDATA[<p>[...] I noted a little more than a year ago that HPC was about to fragment. This announcement is going to accelerate the process. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://scalability.org/?p=1590&#038;cpage=1#comment-29432</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Thu, 11 Jun 2009 12:06:46 +0000</pubDate>
		<guid isPermaLink="false">http://scalability.org/?p=1590#comment-29432</guid>
		<description>@Chris:

  GPU accelerators should be treated more like vector processors ... like vector processors, they are quite sensitive to memory access patterns.  When you hit the right pattern, you get good performance (assuming your code is integer or single precision based).  They still have issues in double precision.

   I played with the Fixstars Cell (GA-180) we are selling in the Pegasus GPU+Cell.   It is basically a PC on a card, with 4 GB RAM and 2x PowerXCell 8i (think Roadrunner Cell units).  Writing code for it is relatively easy, though I still have to learn how to make effective use of the SPUs with the compilers.  This is what I find for a slightly modified code:

AMD: 2.3 GHz Shanghai


landman@pegasus-a3g:~/rzftest$ time ./rzf-amd.exe 
pi = 3.141592644040497 
error in pi = 0.000000009549296 
relative error in pi = 0.000000003039635 

real	0m0.740s
user	0m0.736s
sys	0m0.004s



PowerXCell PPU 2.8 GHz


[landman@pxcab rzftest]$ time ./rzf-cell.exe
pi = 3.141592644040497 
error in pi = 0.000000009549296 
relative error in pi = 0.000000003039635 

real	0m2.794s
user	0m2.784s
sys	0m0.007s


SPU on PowerXCell 8i:


[landman@pxcab rzftest]$ time ./rzf-spu.exe 
pi = 3.141592644040497 
error in pi = 0.000000009549296 
relative error in pi = 0.000000003039635 

real	0m6.087s
user	0m0.001s
sys	0m0.006s


So as far as acceleration goes, there is a learning curve there as well: with this slightly modified code, the PPU run takes about 3.8x as long as the Shanghai run in wall-clock time, and the SPU run about 8.2x as long.  I think it is likely a general rule of thumb that in the vast majority of cases, acceleration will take some effort to achieve.  Our experiments with CUDA yielded similar initial results ... only after we understood how to approach the architecture were we able to make effective use of it.
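
For anyone who wants to try this kind of test, here is a minimal sketch of what a run like the above might be doing.  It is an assumption, not the actual rzf source: the printed error (about 9.55e-9) is consistent with summing the first 10^8 terms of zeta(2) = pi^2/6, where the truncation error in pi is roughly 3/(pi*N).  The function names (pi_from_zeta2, report) and the choice of Python are hypothetical, picked only for illustration.

```python
import math


def pi_from_zeta2(n_terms):
    """Estimate pi from the first n_terms of zeta(2) = sum 1/n^2 = pi^2/6."""
    total = 0.0
    # Add the smallest terms first to limit floating-point rounding error.
    for n in range(n_terms, 0, -1):
        total += 1.0 / (n * n)
    return math.sqrt(6.0 * total)


def report(n_terms):
    """Return (estimate, absolute error, relative error) against math.pi."""
    estimate = pi_from_zeta2(n_terms)
    error = math.pi - estimate
    return estimate, error, error / math.pi


if __name__ == "__main__":
    # Assumed term count, kept small so the pure-Python loop finishes quickly.
    estimate, error, relative = report(1000000)
    print("pi = %.15f" % estimate)
    print("error in pi = %.15f" % error)
    print("relative error in pi = %.15f" % relative)
```

Calling report(10**8) should land near the figures above, though a plain Python loop over that many terms will take far longer than the compiled runs.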

In the case of the SPUs, in aggregate, we should be able to approach 100 GFLOPS double precision for 8 of them, so roughly 12 GFLOPS per SPU, which is not that far off an AMD or Intel processor core.   SPUs have very little local memory, and the PPU manages memory access for them, so this usually winds up being a bottleneck for codes that haven&#039;t been re-architected for it.  My experiment above can&#039;t be construed as a measure of the speed of a PPU or an SPU, but it can help set expectations: the speed increases that are possible are not automatic, and require a code re-architecture.

And this is true on CUDA, with SSE, with ...

Basically it&#039;s going to take some effort to get it there.</description>
		<content:encoded><![CDATA[<p>@Chris:</p>
<p>  GPU accelerators should be treated more like vector processors &#8230; like vector processors, they are quite sensitive to memory access patterns.  When you hit the right pattern, you get good performance (assuming your code is integer or single precision based).  They still have issues in double precision.</p>
<p>   I played with the Fixstars Cell (GA-180) we are selling in the Pegasus GPU+Cell.   It is basically a PC on a card, with 4 GB RAM and 2x PowerXCell 8i (think Roadrunner Cell units).  Writing code for it is relatively easy, though I still have to learn how to make effective use of the SPUs with the compilers.  This is what I find for a slightly modified code:</p>
<p>AMD: 2.3 GHz Shanghai</p>
<p>landman@pegasus-a3g:~/rzftest$ time ./rzf-amd.exe<br />
pi = 3.141592644040497<br />
error in pi = 0.000000009549296<br />
relative error in pi = 0.000000003039635 </p>
<p>real	0m0.740s<br />
user	0m0.736s<br />
sys	0m0.004s</p>
<p>PowerXCell PPU 2.8 GHz</p>
<p>[landman@pxcab rzftest]$ time ./rzf-cell.exe<br />
pi = 3.141592644040497<br />
error in pi = 0.000000009549296<br />
relative error in pi = 0.000000003039635 </p>
<p>real	0m2.794s<br />
user	0m2.784s<br />
sys	0m0.007s</p>
<p>SPU on PowerXCell 8i:</p>
<p>[landman@pxcab rzftest]$ time ./rzf-spu.exe<br />
pi = 3.141592644040497<br />
error in pi = 0.000000009549296<br />
relative error in pi = 0.000000003039635 </p>
<p>real	0m6.087s<br />
user	0m0.001s<br />
sys	0m0.006s</p>
<p>So as far as acceleration goes, there is a learning curve there as well: with this slightly modified code, the PPU run takes about 3.8x as long as the Shanghai run in wall-clock time, and the SPU run about 8.2x as long.  I think it is likely a general rule of thumb that in the vast majority of cases, acceleration will take some effort to achieve.  Our experiments with CUDA yielded similar initial results &#8230; only after we understood how to approach the architecture were we able to make effective use of it.</p>
<p>In the case of the SPUs, in aggregate, we should be able to approach 100 GFLOPS double precision for 8 of them, so roughly 12 GFLOPS per SPU, which is not that far off an AMD or Intel processor core.   SPUs have very little local memory, and the PPU manages memory access for them, so this usually winds up being a bottleneck for codes that haven&#8217;t been re-architected for it.  My experiment above can&#8217;t be construed as a measure of the speed of a PPU or an SPU, but it can help set expectations: the speed increases that are possible are not automatic, and require a code re-architecture.</p>
<p>And this is true on CUDA, with SSE, with &#8230;</p>
<p>Basically it&#8217;s going to take some effort to get it there.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris Samuel</title>
		<link>http://scalability.org/?p=1590&#038;cpage=1#comment-29430</link>
		<dc:creator>Chris Samuel</dc:creator>
		<pubDate>Thu, 11 Jun 2009 09:24:46 +0000</pubDate>
		<guid isPermaLink="false">http://scalability.org/?p=1590#comment-29430</guid>
		<description>I think GPUs (the most likely accelerators that people will look at) are still hampered by memory bandwidth - but I don&#039;t know how much longer that will be the case.  When I talked to an nVidia guy the other week, he didn&#039;t think there was much on the way to help with that for the foreseeable future.

Of course (a) if there were, he might not have been at liberty to talk about it and (b) there are plenty of people for whom GPUs may be good enough (yes, NAMD, I&#039;m looking at you).. ;-)</description>
		<content:encoded><![CDATA[<p>I think GPUs (the most likely accelerators that people will look at) are still hampered by memory bandwidth &#8211; but I don&#8217;t know how much longer that will be the case.  When I talked to an nVidia guy the other week, he didn&#8217;t think there was much on the way to help with that for the foreseeable future.</p>
<p>Of course (a) if there were, he might not have been at liberty to talk about it and (b) there are plenty of people for whom GPUs may be good enough (yes, NAMD, I&#8217;m looking at you).. <img src='http://scalability.org/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
</channel>
</rss>
