Blogging my first post from Stanford, and I thought I would start with a nice parallel programming algorithm. In our SIMD programming course we were taught how to implement sorting on a parallel processor. We found that sorting networks like the bitonic sorter and the odd-even sorter perform much better than parallel versions of quicksort, the fastest serial algorithm.
The odd-even transposition network shown below performs a compare-&-swap in each stage, and in n stages the entire input is sorted. It is also very efficient on a linear-array processor since it involves only nearest-neighbor communication (I implemented it on the Kestrel SIMD processor). Thus the odd-even transposition network keeps communication cost to a bare minimum, and communication, we know, is ultimately the limiting factor in parallel algorithms as we scale the number of processing elements.
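To make the structure concrete, here is a minimal serial sketch of the network (not the Kestrel implementation): n alternating rounds of compare-&-swap on adjacent pairs, which is exactly the pattern that maps onto nearest-neighbor links of a linear array.

```python
def odd_even_transposition_sort(a):
    """Sort a sequence with n rounds of alternating compare-&-swap phases."""
    a = list(a)
    n = len(a)
    for rnd in range(n):
        # Even rounds compare pairs (0,1), (2,3), ...;
        # odd rounds compare pairs (1,2), (3,4), ...
        for i in range(rnd % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

On a real linear array, each round's compare-&-swaps would run in parallel, one per pair of neighboring PEs, so the whole sort finishes in n parallel steps.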
The potential speedup is unbounded as long as we can process an input of size n with a chip of n PEs. But in reality the algorithm lives in a virtual, variable-sized space, unlike the processor chip, which must handle the problem within its physical constraints. So the interesting question thrown at us was: how can you extend this wonderful parallel sorting algorithm when the input is larger than the resources? (Assume we have plenty of memory but no extra PEs.)
One of the big goals was to extend the algorithm while retaining that nearest-neighbor communication. My approach is to carry over existing work by first sorting locally with the fastest serial sorting algorithm available (I used a quicksort-style implementation, since the target was a SIMD processor). This local sort is done only once; the rest of the algorithm relies solely on the locally-sorted property to move elements into place.
The merge phase, instead of exploding in communication, reuses the odd-even transposition network: the compare-&-swap building block is replaced by a merge block that takes two locally sorted inputs and splits them back into their correct halves. At every stage we have thus replaced a small computation with a larger one. This is where the self-replicating structure becomes impressive, because the merge block is itself nothing but a transposition network with k*(k-1)/2 stages, where k = ⌈n/m⌉, m is the number of PEs, and ⌈.⌉ denotes the ceiling (the least integer greater than or equal to n/m); the merge is still serial computation between two processors. So, to summarize the performance, the algorithm takes O(n*⌈n/m⌉^2), which at face value looks like a shot in the arm for serial processors. However, any practical implementation of quicksort or another serial sorting algorithm runs into similar problems as the problem size scales, because of caches, memory, and pipelines, and would therefore still perform slower than the parallel implementation.
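The two phases above can be sketched together in a short serial simulation. This is my own illustrative version, not the Kestrel code: each "PE" holds one block, the local sort happens once, and then the odd-even transposition runs at block granularity, with the compare-&-swap replaced by a merge-split (the left neighbor keeps the lower half, the right keeps the upper half).

```python
def merge_split_sort(blocks):
    """Odd-even transposition at block granularity.

    Phase 1: each block is sorted locally (done once).
    Phase 2: p rounds of odd-even exchanges between neighboring blocks,
    where each exchange merges two sorted blocks and splits the result.
    """
    blocks = [sorted(b) for b in blocks]  # local sort phase
    p = len(blocks)
    for rnd in range(p):
        for i in range(rnd % 2, p - 1, 2):
            # Stand-in for a linear-time merge of two sorted neighbors.
            merged = sorted(blocks[i] + blocks[i + 1])
            mid = len(blocks[i])
            blocks[i], blocks[i + 1] = merged[:mid], merged[mid:]
    return blocks
```

After p rounds the concatenation of the blocks is fully sorted; on the array each round's merge-splits would run in parallel between neighboring PEs.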
There is always a temptation to think this approach is no better than a bitonic sorter, but considering the communication required on a linear-array processor, this approach scales with the problem size in much the same way. The one approach I still find fascinating is the use of good online sorting algorithms that build structures like a min-heap (Ref: On Balancing Sorting on a Linear Array, Yen-Chun Lin). Online algorithms can provide efficient solutions because parallel processors are still used as co-processors for the CPU (try imagining an OS on a GPU and you will see why) and are connected to it through an I/O mechanism like PCIe, which, despite interconnect scaling, remains slower than the clock of the co-processor.
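A minimal sketch of why the online idea helps, using Python's standard heapq as a stand-in for the array-resident heap structure (this is my illustration, not Lin's algorithm): each element costs only O(log n) to insert on arrival, so the sorting work can be overlapped with the slow host-to-co-processor transfer instead of waiting for the full input.

```python
import heapq

def online_sort(stream):
    """Sort elements as they trickle in over a slow link.

    Each arrival is absorbed with one O(log n) heap push, so by the time
    the transfer finishes, draining the heap in order is all that remains.
    """
    heap = []
    for x in stream:            # one cheap insert per arriving element
        heapq.heappush(heap, x)
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

The point is not the heap itself but the overlap: the insert cost hides behind the I/O latency that the PCIe link imposes anyway.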
To conclude this long post (which I hope did not make you doze off): coming up with parallel algorithms, and thinking about the practical issues of writing them, is necessary to beat the slowdown that serial processors are headed toward. As a computer architect, this is a really fun time to research these things, and I hope to post more frequently about more interesting topics.
