Blogging my first post from Stanford, and I thought I would start with a nice parallel programming algorithm. In our SIMD programming course we were taught how to implement sorting on a parallel processor. We found that sorting networks like the bitonic sorter and the odd-even sorter perform much better than parallel versions of quicksort, the fastest serial algorithm.
The odd-even transposition network shown below performs a compare-&-swap in each stage, and in n stages the entire input is sorted. It is also very efficient on a linear-array processor since it involves only nearest-neighbor communication (I implemented it on the Kestrel SIMD processor). Thus the odd-even transposition network keeps communication cost to a bare minimum, and communication, we know, is ultimately the limiting factor in parallel algorithms as we scale the number of processing elements.
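To make the structure concrete, here is a minimal serial sketch of the network (not the Kestrel implementation): n alternating rounds of compare-&-swap on adjacent pairs, which is exactly the pattern that maps onto nearest-neighbor links of a linear array.

```python
def odd_even_transposition_sort(a):
    """Sort a sequence with n rounds of alternating compare-&-swap phases."""
    a = list(a)
    n = len(a)
    for rnd in range(n):
        # Even rounds compare pairs (0,1), (2,3), ...;
        # odd rounds compare pairs (1,2), (3,4), ...
        for i in range(rnd % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

On a real linear array, each round's compare-&-swaps would run in parallel, one per pair of neighboring PEs, so the whole sort finishes in n parallel steps.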
The potential speedup is unbounded as long as we can process an input of size n with a chip of n PEs. But in reality the algorithm lives in a virtual, variable-sized space, unlike the processor chip, which must handle the problem within its physical constraints. So the interesting question thrown at us was: how can you extend this wonderful parallel sorting algorithm when the input is larger than the resources? (Assume we have plenty of memory but no extra PEs.)
One of the big goals was to extend the algorithm while retaining that nearest-neighbor communication. My approach is to carry over existing work by first sorting locally with the fastest serial sorting algorithm available (I used a quicksort-style implementation, since the target was a SIMD processor). This local sort is done only once; the rest of the algorithm relies solely on the locally-sorted property to move elements into place.
The merge phase, instead of exploding in communication, reuses the odd-even transposition network: the compare-&-swap building block is replaced by a merge block that takes two locally sorted inputs and splits them back into their correct halves. At every stage we have thus replaced a small computation with a larger one. This is where the self-replicating structure becomes impressive, because the merge block is itself nothing but a transposition network with k*(k-1)/2 stages, where k = ⌈n/m⌉, m is the number of PEs, and ⌈.⌉ denotes the ceiling (the least integer greater than or equal to n/m); the merge is still serial computation between two processors. So, to summarize the performance, the algorithm takes O(n*⌈n/m⌉^2), which at face value looks like a shot in the arm for serial processors. However, any practical implementation of quicksort or another serial sorting algorithm runs into similar problems as the problem size scales, because of caches, memory, and pipelines, and would therefore still perform slower than the parallel implementation.
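The two phases above can be sketched together in a short serial simulation. This is my own illustrative version, not the Kestrel code: each "PE" holds one block, the local sort happens once, and then the odd-even transposition runs at block granularity, with the compare-&-swap replaced by a merge-split (the left neighbor keeps the lower half, the right keeps the upper half).

```python
def merge_split_sort(blocks):
    """Odd-even transposition at block granularity.

    Phase 1: each block is sorted locally (done once).
    Phase 2: p rounds of odd-even exchanges between neighboring blocks,
    where each exchange merges two sorted blocks and splits the result.
    """
    blocks = [sorted(b) for b in blocks]  # local sort phase
    p = len(blocks)
    for rnd in range(p):
        for i in range(rnd % 2, p - 1, 2):
            # Stand-in for a linear-time merge of two sorted neighbors.
            merged = sorted(blocks[i] + blocks[i + 1])
            mid = len(blocks[i])
            blocks[i], blocks[i + 1] = merged[:mid], merged[mid:]
    return blocks
```

After p rounds the concatenation of the blocks is fully sorted; on the array each round's merge-splits would run in parallel between neighboring PEs.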
There is always a temptation to think this approach is no better than a bitonic sorter, but considering the communication required on a linear-array processor, this approach scales with the problem size in much the same way. The one approach I still find fascinating is the use of good online sorting algorithms that build structures like a min-heap (Ref: On Balancing Sorting on a Linear Array, Yen-Chun Lin). Online algorithms can provide efficient solutions because parallel processors are still used as co-processors for the CPU (try imagining an OS on a GPU and you will see why) and are connected to it through an I/O mechanism like PCIe, which, despite interconnect scaling, remains slower than the clock of the co-processor.
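A minimal sketch of why the online idea helps, using Python's standard heapq as a stand-in for the array-resident heap structure (this is my illustration, not Lin's algorithm): each element costs only O(log n) to insert on arrival, so the sorting work can be overlapped with the slow host-to-co-processor transfer instead of waiting for the full input.

```python
import heapq

def online_sort(stream):
    """Sort elements as they trickle in over a slow link.

    Each arrival is absorbed with one O(log n) heap push, so by the time
    the transfer finishes, draining the heap in order is all that remains.
    """
    heap = []
    for x in stream:            # one cheap insert per arriving element
        heapq.heappush(heap, x)
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

The point is not the heap itself but the overlap: the insert cost hides behind the I/O latency that the PCIe link imposes anyway.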
To conclude this long post (which I hope did not make you doze off): coming up with parallel algorithms, and thinking about the practical issues of writing them, is necessary to beat the slowdown that serial processors are headed toward. As a computer architect, this is a really fun time to research these things, and I hope to post more frequently about more interesting topics.
