CTWatch
February 2007
The Promise and Perils of the Coming Multicore Revolution and Its Impact
John L. Manferdelli, Microsoft Corporation

System Software Architecture

Many-core computers are more like "data-centers-on-a-chip" than traditional computers. System software will change to manage resources on these systems effectively, and the system software function itself will be decomposed and rationalized to provide more reliability and manageability. General purpose operating systems (which have not fundamentally changed since system and application software separated with the advent of "time shared" computers in the 1950s) will change as much as development tools.

To understand why, consider the following. Supercomputing applications are typically assigned dedicated system-wide resources for each application run. This allows applications to tune algorithms to available resources: knowledge of the actual CPU and memory resources available to the application at runtime can drastically improve a sophisticated application's performance (database systems do a good job of this right now, and so often avoid, or outright deceive, current operating systems in order to control real resources); a short illustration of this kind of run-time resource discovery appears at the end of this section. By contrast, most commercial operating systems "time multiplex" the hardware resources [8] to provide good utilization of expensive resources, and they anticipate that an application will run on a fairly narrow spectrum of architectures. Older operating systems also suffer from service, program, and device isolation models which are no longer appropriate but made perfect sense given earlier assumptions:

  1. Current operating systems manage devices with a uniform device driver model; placing all such drivers in the same address space simplifies I/O programming for applications and optimizes performance, but it creates huge OS kernels with management and security problems.
  2. Time shared operating systems model security under a single authority (the "root" or "Administrator") who personally installs all software that is shared or requires OS modification, knows all the users personally, and can determine a uniform security and resource allocation policy across (relatively simple) user programs. Today's computers operate in multiple trust domains, and different programs need different levels of protection and security policy; there are so many devices, and some are so complex, that no single authority can possibly manage them uniformly and safely. Right now, a buggy device driver used by one program jeopardizes all programs, while high-performance applications using special hardware (high speed graphics, for example) prefer to manage the device directly without incurring the sometimes catastrophic degradation caused by "context switches" in the OS.
  3. Homogeneous operating systems are usually designed for one of three modes of operation: high throughput, high reliability, or strong real-time guarantees. General purpose OSs fall into the first category, an OS designed to run a central phone switch in a major location falls into the second, and an entertainment or media device falls into the third. It is difficult to design a single scheduler that serves all three environments, but these computers will have each of these applications running simultaneously.
  4. Most general purpose operating system configurations contain "everything any application could want." This has dramatically increased OS complexity, decreasing utility and slowing down all application development.
  5. Most operating systems, again to simplify programming, have a "chore scheduling" model in which each independent "thread of execution" is scheduled by the OS. This means that every "chore" switch incurs a context switch into the kernel, which is very expensive. The OS scheduler, which knows nothing about the individual application, must "guess" at what is best to do next. Historically, operating systems gave an application about one millisecond to run before interrupting it, and rescheduling and switching to another thread might have taken 100 instructions. On a 1-MIPS machine, a millisecond is about 1,000 instructions, so a thread could do roughly 1,000 instructions of useful work per switch and the system overhead was a very acceptable 10%. On a very fast machine, a millisecond amounts to a few million instructions, and it is very hard to write general purpose programs in which every thread has that large a "quantum" of useful, highly concurrent work. This forces programs with high concurrency to structure themselves into bigger, less parallel subtasks or suffer catastrophic performance. The solution is to let a "runtime," linked into the application, handle the vast majority of "chore" switches without OS intervention; these runtimes can have very detailed knowledge of the actual hardware configuration and make resource and scheduling decisions appropriately. A minimal sketch of such a user-level runtime follows this list.
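
To make the last point concrete, the following is a minimal sketch of a user-level "chore" runtime, written in C++ (the C++11 standard library used here postdates this article, and the class and its names are purely illustrative, not a description of any particular product). Chores are ordinary function objects queued and run by a small, fixed pool of OS threads, so moving from one chore to the next is a queue operation and a function call rather than a kernel context switch.

    // Illustrative sketch of a user-level "chore" runtime: switching between
    // chores happens entirely in user space; only the few worker threads are
    // ever visible to the OS scheduler.
    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    class ChoreRuntime {
    public:
        explicit ChoreRuntime(unsigned workers) {
            for (unsigned i = 0; i < workers; ++i)
                workers_.emplace_back([this] { worker_loop(); });
        }
        ~ChoreRuntime() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& t : workers_) t.join();
        }
        // Submitting a chore is far cheaper than creating an OS thread for it.
        void submit(std::function<void()> chore) {
            { std::lock_guard<std::mutex> lk(m_); queue_.push_back(std::move(chore)); }
            cv_.notify_one();
        }
    private:
        void worker_loop() {
            for (;;) {
                std::function<void()> chore;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                    if (queue_.empty()) return;   // shutting down and no work left
                    chore = std::move(queue_.front());
                    queue_.pop_front();
                }
                chore();   // the "chore switch" is just a call, no kernel transition
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::deque<std::function<void()>> queue_;
        bool done_ = false;
        std::vector<std::thread> workers_;
    };

An application can create one such runtime sized to the machine and submit thousands of fine-grained chores to it; the OS only ever schedules the handful of worker threads.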

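The resource awareness mentioned earlier, tuning work to the CPUs actually present rather than to an assumed architecture, is also easy to sketch. The fragment below (again C++ and purely illustrative) asks the standard library how many hardware threads exist at run time and sizes its worker pool accordingly.

    // Illustrative only: discover the hardware threads available at run time
    // and partition the work across them, instead of assuming a fixed machine.
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        // hardware_concurrency() reports the number of hardware threads the
        // implementation can run concurrently; 0 means "unknown".
        unsigned hw = std::thread::hardware_concurrency();
        unsigned workers = hw ? hw : 1;   // fall back to one worker if unknown
        std::printf("detected %u hardware threads; using %u workers\n", hw, workers);

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < workers; ++i) {
            pool.emplace_back([] {
                // each worker would process its share of the data here
            });
        }
        for (auto& t : pool) t.join();
        return 0;
    }

A real application would also want to ask about memory and cache topology, but even this much lets the same program adapt to machines with very different core counts.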

