David Patterson of UC Berkeley has a formulaic summary of the serial performance problem: "The power wall + the memory wall + the ILP wall = a brick wall for serial performance." Thus, the heroic line of development followed by materials scientists and computer designers to increase serial performance now yields diminishing returns. Computer architects have been forced to turn to parallel architectures to continue making progress. Parallelism can be exploited by adding more independent CPUs, data-parallel execution units, additional register sets (hardware threads), more independent memory controllers to increase memory bandwidth (which requires more output pins), and bigger caches. Computer architects can also incorporate different kinds of execution units, which dramatically improve some computations but not others (e.g., GPU-like units that excel at structured data parallelism, or streaming execution units with local memory, as Cray did in many of its early machines [3]). Heterogeneity need not mean only completely different "abstract" execution-unit models; it may also mean incorporating computation engines that share the same instruction set architecture but differ in performance and power consumption. All of these approaches take advantage of dramatically higher on-chip interconnect data rates.
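As a purely illustrative sketch of how software can exploit two of these resources at once, the C fragment below spreads a simple vector update across cores with OpenMP threads while keeping the inner loop in a form the compiler can map onto each core's data-parallel (SIMD) units. The problem size, the operation, and the use of OpenMP are assumptions made here for illustration, not details from the text.

    /* Illustrative only: thread-level parallelism across cores (OpenMP)
     * combined with a vectorizable loop body for each core's SIMD units.
     * Build with an OpenMP-capable compiler, e.g.: cc -O2 -fopenmp example.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)                    /* arbitrary, illustrative problem size */

    int main(void)
    {
        float *x = malloc(N * sizeof *x);
        float *y = malloc(N * sizeof *y);
        if (!x || !y)
            return 1;

        for (long i = 0; i < N; i++) {     /* initialize inputs */
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        /* Iterations are divided among hardware threads/cores; the simple
         * loop body is a natural target for data-parallel execution units. */
        #pragma omp parallel for simd
        for (long i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %.1f, up to %d threads\n", y[0], omp_get_max_threads());
        free(x);
        free(y);
        return 0;
    }

Code of this shape is what makes the many-core designs discussed below attractive: its performance grows with the number of cores and the width of the data-parallel units rather than with the sophistication of any single core.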
Moore's law will grant computer architects ever more gates for the foreseeable future, and the challenge is to use them to deliver performance and power characteristics fit for the intended purpose. Figure 1 below illustrates a few hardware design choices. In 1(a), a client configuration might consist of two large "out-of-order" cores (OoCs), incorporating all the ILP of current processors to run existing programs, together with many smaller "in-order" cores (IoCs) for software that can exploit a high degree of parallelism. Why many IoCs rather than correspondingly fewer of the larger OoCs? The reason is that, when parallel software can scale with the number of cores, gates spent on out-of-order execution yield poorer performance returns than the same gates spent on additional, simpler in-order cores. The server configuration in 1(b) incorporates many more IoCs and a "custom" core (say, a cryptographic processor). Finally, multi-core computers benefit not just raw performance but also reliability and power management, so embedded processors will undergo a similar architectural shift, as illustrated in 1(c).
[Figure 1: (a) a client configuration with two large out-of-order cores and many small in-order cores; (b) a server configuration with many more in-order cores plus a custom core; (c) an embedded configuration.]
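A back-of-the-envelope calculation makes the OoC-versus-IoC trade-off concrete. It rests on the often-cited rule of thumb, commonly attributed to Fred Pollack, that a core's serial performance grows roughly with the square root of the area devoted to it; this rule is an assumption added here for illustration, not something stated in the text. Take a gate budget equal to sixteen small in-order cores:

    P_{\text{one large OoC}} \approx \sqrt{16}\,P_{\text{IoC}} = 4\,P_{\text{IoC}}
    P_{\text{sixteen IoCs}}  \approx 16\,P_{\text{IoC}} \quad \text{(on software that scales with cores)}

On software that scales, the sixteen small cores deliver roughly four times the throughput of one large core built from the same budget; on purely serial code they deliver roughly a quarter of it. That asymmetry is why the client configuration in 1(a) keeps a couple of large OoCs for existing serial programs alongside many IoCs for parallel ones.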