Our work is primarily concerned with the infrastructure and search techniques for empirical code optimization. The goal is to be able to easily substitute different code generators and search engines in the tuning process. To that end, we have developed a simple syntax for specifying the way to generate code to perform the evaluation of the code variants. From this specification, the test case generator produces C code and a Makefile which are used to generate the timing executable. To allow easily switching search techniques, we split the application-specific aspects of the evaluation process into a shell script, which has a simple specification of the search bounds and constraints embedded in comments at the top. This way, the search engine doesn't need to know the details of the code generator, build, or test case generation.
As a simple example of the kind of search space we are dealing with, the figure above shows the results of running an exhaustive search over two dimensions: block sizes up to 128 and unrolling up to 128. The x and y axes represent block size and unrolling amount, while the z axis represents the performance in Mflop/s of the generated code. The code being optimized is an implementation of square matrix-matrix multiplication (N=400). The dimension size is relatively small to allow faster evaluations at each point while being large enough to ensure repeatable measurements using PAPI. In general, we see the best results along the blocking axis with a low unrolling amount as well as along the diagonal where blocking and unrolling are equal, but there are also peaks along areas where the block size is evenly divisible by the unrolling amount. The best performance was found with block size 80 and unrolling amount 2. This code variant ran at 1459 Mflop/s compared to 778 Mflop/s for the naive version compiled with gcc.
More recently, we have been working on evaluating various search techniques to determine which work best in an empricial optimization context. We have evaluted some classical techniques like Genetic Algorithms, Simulated Annealing, and Particle Swarm Optimization, as well as some more ad-hoc techniques like a modified orthogonal search. To date, the code we have been tuning comes from dense linear algebra and the code generators have been limited to the ROSE LoopProcessor and the POET code generator, but in the future we would like to experiment with different kinds of code and different code generators to determine how that affects the search process.