The architectural model developed in the current SLOPE implementation is rather simple (but not simplistic) in several respects. First, it assumes zero-overhead instruction scheduling. This is clearly not the case, although pipelined execution can approximate it. Second, it does not account for register pressure during execution. Third, it does not yet model advanced execution techniques such as software pipelining and multithreading. Ignoring these techniques and compiler implementation details leads to quantitative results that may differ, perhaps substantially, from those of current high-end machines. Last but not least, this approach still needs to be validated against a real architecture.
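To make the zero-overhead assumption concrete, the following is a minimal sketch (not SLOPE's actual implementation; operation latencies are illustrative): with unlimited issue width, no scheduling overhead, and no register pressure, the predicted cycle count of a dependence graph reduces to its critical path through operation latencies.

```python
# Hypothetical per-operation latencies (cycles), for illustration only.
LATENCY = {"add": 1, "mul": 3, "div": 20, "load": 2}

def critical_path(ops, deps):
    """ops: {name: kind}; deps: {name: [predecessor names]}.
    Returns the idealized cycle count: the latest finish time when every
    operation starts as soon as all of its predecessors have finished."""
    finish = {}
    def done(op):
        if op not in finish:
            start = max((done(p) for p in deps.get(op, [])), default=0)
            finish[op] = start + LATENCY[ops[op]]
        return finish[op]
    return max(done(op) for op in ops)

# a = x * y; b = a + z  ->  3 + 1 = 4 cycles under this idealization.
print(critical_path({"a": "mul", "b": "add"}, {"b": ["a"]}))  # -> 4
```

A real machine adds issue-width limits, structural hazards, and spill code, which is precisely why such a model yields trends rather than exact cycle counts.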
Nevertheless, this approach allows the development of quantitative architectural performance trends and hence allows architecture designers to make informed decisions about how to allocate transistors most efficiently. In the case above, designers could weigh the complexity and power consumption (for example) of an arithmetic unit that pipelines all operations (including divides) against simply replicating standard arithmetic units. This information also gives developers an idea of the performance gains a given code would see on a proposed “future” machine.
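A back-of-the-envelope comparison illustrates the divide trade-off just mentioned (the 20-cycle latency and unit count are assumptions for illustration, not measurements): for a stream of independent divides, a fully pipelined divider finishes one result per cycle after the first, while replicated non-pipelined dividers each stay busy for the full latency.

```python
import math

DIV_LATENCY = 20  # hypothetical divide latency, in cycles

def pipelined_time(n):
    # Fully pipelined divider with an initiation interval of 1 cycle:
    # the last of n divides issues at cycle n-1 and finishes DIV_LATENCY later.
    return (n - 1) + DIV_LATENCY

def replicated_time(n, units):
    # 'units' non-pipelined dividers; each batch occupies a unit for
    # the full latency before the next divide can start on it.
    return math.ceil(n / units) * DIV_LATENCY

n = 100
print(pipelined_time(n))      # -> 119 cycles
print(replicated_time(n, 4))  # -> 500 cycles
```

Under these assumed numbers, one pipelined unit outperforms four replicated ones on throughput, which is the kind of quantitative trend the text argues a designer can extract before committing transistors.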
We have described SLOPE, a system for performance prediction and architecture sensitivity analysis using source-level program analysis and scheduling techniques. SLOPE provides a very fast qualitative analysis of the performance of a given kernel code. We have experimented with a real scientific code that engineers and scientists use in practice. The results yield important qualitative performance-sensitivity information that can be used to allocate computing resources judiciously for maximum efficiency and/or to help guide the application of compiler transformations such as loop unrolling.
2. SPEC - www.spec.org/
3. Snavely, A., Carrington, L., Wolter, N., Labarta, J., Badia, R., Purkayastha, A. “A Framework for Application Performance Modeling and Prediction,” In Proceedings of the 2002 ACM/IEEE Supercomputing Conference (SC’02), 2002.
4. Saavedra, R. H., Smith, A. J. “Measuring Cache and TLB Performance and Their Effect on Benchmark Run Times,” IEEE Transactions on Computers, vol. 44, no. 10, Oct. 1995.
5. Kerbyson, D. J., Hoisie, A., Wasserman, H. J. “Modeling the Performance of Large-Scale Systems,” Keynote paper, UK Performance Engineering Workshop (UKPEW03), July 2003.
6. The ASCI Purple Benchmark Codes - www.llnl.gov/asci/purple/benchmarks/limited/umt/umt1.2.readme.html#Cod...
7. This intermediate representation of the Open64 [10] infrastructure consists of an annotated abstract-syntax-tree representation of the input source code, similar to other high-level representations of source code.
8. Banerjee, U., Eigenmann, R., Nicolau, A., Padua, D. “Automatic Program Parallelization,” Proceedings of the IEEE, 1993.
9. Mowry, T. “Tolerating Latency in Multiprocessors through Compiler-Inserted Prefetching,” ACM Transactions on Computer Systems, 16(1), pp. 55-92, Feb. 1998.
10. The Open64 Compiler and Tools - sourceforge.net/projects/open64/
11. Pipelining allows the functional unit allocated to an operation to become available again in the cycle after the operation is issued.