Generating an "optimal" circuit from a language construct

We use high level languages to provide an abstraction against the hardware, OS, system services we want to use. The compiler is responsible for this mapping. So when I write a simple loop

for(i=1;i<N;i++)
{
a[i] = b[i] +c[i];
}


or for you Fortran types out there

do i=0,N-1
a(i)=b(i)+c(i)
enddo


The compiler will turn that into an assembly language loop, which loads an iteration counter (i) into a register, either load registers with a[i], b[i], and c[i], or do memory operations to load compute and then save.
While I am sitting here thinking about how to build a fast circuit to do this computation, and not this shouldn’t be hard, it occured to me that the compiler has a fixed, albeit general, model of the processor within it.
Well of course. How else could it generate instructions, and optimize them …
Ok. Now remove the constraint of a fixed model. The issues will be similar, but now you get to ask a question of how best to represent this particular operation as a circuit. I am wondering how much sense it makes to go in a partially opposite direction, where once you know where the code spends its time, you can replace the time expensive versions with circuits to do the equivalent computing. This is not a great epiphany, Stretch and a few others have been doing it.
The other issue is that not all codes spend large amounts of time in one or two functions. And if they did, you have something akin to an Amdahl’s law in operation. Even if you could make the most time expensive portions vanish in an infinitely fast processor, you are limited by the time-inexpensive portions which now dominate the calculation.
So while this is an interesting direction, I am not sure it is the best, or the most general one.
Right now we use APIs to control APUs from a central program. We pre-define their functions, and then issue ostensibly library calls to them. I think we have something there. But the APU needs to be tightly coupled to the machine for this to work. Long latencies, low memory access bandwidth, forcing a particular programming model are bound to fail in the end, as they would not be general enough.
What you need is to make the APU function like a library, or as an extended instruction processor that augments the existing system, in such a way that the end user simply uses a library. This strongly suggests focusing upon primatives as a method to do calculations. Create an GEMM primative. Now with this, you can build matrix calcs. Create a blastwordsearch primative to scan through data in ram, in parallel. Now you can build a better/faster blast.
With all the excitement around these systems, it would be a shame to pick the wrong approach. Telling a chemist or a biologist that they have to start using VHDL to program their machine is going to make many of them go elsewhere.
But thats not the most interesting aspect of this. These APUs are applicable to far more than HPC. Indeed, the market for APUs on specific point function cards (GPUs, IO controllers) is huge. HPC while a large market, may simply be a drop in the bucket. The question is how to best leverage them, and the little programming example I pointed to at the top shows the issue. You don’t want the compiler to generate the circuit, any more than you want to craft a new circuit by hand. You want BLAS and BLAS-like things so you can work at a higher level, and assemble programs.
Now is where we run into the chicken and egg parts. Where do these libraries come from, where is the API for the APUs? This is the right question, and rather than building many companies trying to turn point specific APUs into CPUs, it might make sense to build companies who want to build applications atop the APUs, and would have an inclination to do the right thing. This also presumes a common/standard interface and board to work with. Eventually we well get there.
The idea is not simply to generate optimal code, or an optimal circuit. But to enable high level programming to speed programmer productivity, while helping to build out the innards and accelerating the tasks. Otherwise you doom a compiler to searching a vast combinatorial space every time a program is built, specifically for optimal performance criteria. Would make the Trace MultiFlow’s compilations, which took days, look tame by comparison. Optimal may not be the right approach, in circuits or performance. Near optimal may be better, more correct, and easier for programmers to deal with.