Modern GPUs, like the NVIDIA G80 graphic processors, open a completely new field to optimize embarrassingly parallel algorithms. With the introduction of general purpose programming framework such as NVidia CUDA, it is possible to make full use of the computational power of such devices. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization. Some of the most notable ones are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy whose organization and performance implications are radically different from those of general purpose CPU's; and optimizing programs at the instruction-level for the GPU.

In this presentation, I'll discuss different approaches for optimizing the memory usage and access patterns for GPUs and propose a class of memory layout optimizations that can take full advantage of the unique memory hierarchy of NVIDIA CUDA. Furthermore, we analyze the performance increase by fully unrolling the innermost loop of the algorithm and propose guidelines on how to best unroll a program for the GPU. In particular, even that loop unrolling is a common optimization, the performance improvement on a GPU derives from a completely different aspect of this architecture.

To demonstrate these optimizations, we picked an embarrassingly parallel algorithm used by the Gravit gravity simulator that calculates the far field forces. The algorithm allows us to demonstrate and to explain the performance increase gained by loop unrolling, as well as, the memory hierarchy optimizations. Our results show that our approach is quite effective. After applying our technique to the algorithm used in the Gravit gravity simulator, we observed a 1.27x speedup compared to the baseline GPU implementation. This represents a 87x speedup to the original CPU implementation.