Prospective students, please read this if you are interested in joining my group.


Programming and Compilation for Parallelism and Locality

As VLSI processor technology matures, parallelism, locality, and bandwidth conservation become ever more critical. However, current programming models and compilers do not explicitly address these issues, which leads to reduced performance and low programmer productivity. My first attempt at tackling these issues was as a member of the Brook stream language development team. We designed a language for scientific computing that exposed parallelism and locality to the programmer, and worked on a sophisticated optimizing compiler targeting Merrimac. The language eventually shifted focus toward programmable graphics processors and was released to the public domain as BrookGPU. Currently, I am taking part in the development of the Sequoia programming model and software system, which builds on our experience with stream programming and Brook. A Sequoia programmer can explicitly reason about and express locality and parallelism at multiple levels of the machine hierarchy. The result is a high-performance application that can easily be ported to a variety of traditional and emerging architectures.
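
To give a flavor of this style, here is a minimal C sketch of hierarchy-aware decomposition in the spirit of Sequoia. This is not Sequoia syntax; the function names, tile size, and power-of-two assumption are purely illustrative:

    #include <stddef.h>

    /* Hierarchy-aware matrix multiply in the spirit of Sequoia
     * (illustrative C, not Sequoia syntax). A task either subdivides
     * its working set until it fits a smaller, faster memory level,
     * or runs a leaf kernel once the data is local. Assumes n is a
     * power of two and ld is the leading dimension of the matrices. */

    #define LEAF_TILE 64  /* chosen so three tiles fit in local store */

    static void matmul_leaf(size_t n, size_t ld, const float *A,
                            const float *B, float *C) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = C[i * ld + j];
                for (size_t k = 0; k < n; k++)
                    acc += A[i * ld + k] * B[k * ld + j];
                C[i * ld + j] = acc;
            }
    }

    static void matmul_task(size_t n, size_t ld, const float *A,
                            const float *B, float *C) {
        if (n <= LEAF_TILE) {                  /* fits: run locally   */
            matmul_leaf(n, ld, A, B, C);
            return;
        }
        size_t h = n / 2;                      /* else: subdivide;    */
        for (size_t i = 0; i < 2; i++)         /* the (i,j) subtasks  */
            for (size_t j = 0; j < 2; j++)     /* are independent and */
                for (size_t k = 0; k < 2; k++) /* could run in parallel */
                    matmul_task(h, ld, A + (i * ld + k) * h,
                                       B + (k * ld + j) * h,
                                       C + (i * ld + j) * h);
    }

In Sequoia the subdivision, the parallel spawning of subtasks, and the data movement between memory levels are all made explicit and tunable per machine; plain C can only hint at that structure.
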
Merrimac Streaming Supercomputer

Merrimac uses stream architecture and advanced interconnection networks to deliver an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality through a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence, a processing node with a fixed amount of (expensive) bandwidth can support an order of magnitude more (inexpensive) arithmetic units. This in turn allows a given level of performance to be achieved with fewer nodes (for example, a 1-PFLOPS machine with just 8,192 nodes), resulting in greater reliability and simpler system management. Merrimac is designed as a streaming scientific computer that scales from a $20K 2-TFLOPS workstation to a $20M 2-PFLOPS supercomputer. As the lead architect, I work on all aspects of the system, from the hardware architecture, through the compiler and programming language, to the applications and algorithms.
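
The bandwidth argument is easy to see in miniature. In the following C sketch (illustrative only, not Merrimac code), two producer-consumer kernels are applied to a stream of elements; keeping the intermediate value in a register instead of a memory-resident stream halves the off-chip traffic for this pair, and deeper kernel pipelines save correspondingly more:

    #include <stddef.h>
    #include <math.h>

    /* Two-pass version: the intermediate stream 'tmp' travels through
     * memory, so each element costs four off-chip accesses
     * (read, write, read, write). */
    static void two_pass(const float *in, float *tmp,
                         float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            tmp[i] = sqrtf(in[i]);           /* kernel 1 */
        for (size_t i = 0; i < n; i++)
            out[i] = 2.0f * tmp[i] + 1.0f;   /* kernel 2 */
    }

    /* Streamed version: the intermediate value stays local, so each
     * element costs only two off-chip accesses. */
    static void streamed(const float *in, float *out, size_t n) {
        for (size_t i = 0; i < n; i++) {
            float t = sqrtf(in[i]);          /* kernel 1: held locally */
            out[i] = 2.0f * t + 1.0f;        /* kernel 2: consumes it  */
        }
    }

Merrimac's register hierarchy is designed to capture exactly this kind of producer-consumer locality, and across much larger working sets than a few scalar registers.
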
System Reliability and Dependability

As the number of devices per computational node grows and as computer systems rely on ever more nodes for high performance, reliability, and especially soft-error tolerance, will become critical even for single processors and consumer computer systems. Taking this into account, I am developing techniques for the Merrimac processor that ensure correct output through rigorous fault detection and recovery mechanisms. The main goal of the proposed schemes is to conserve the critical and costly off-chip bandwidth and on-chip storage resources while maintaining high peak and sustained performance. We achieve this by adding reconfigurability to the Merrimac processor and allowing the software system and programmer to take advantage of it. These techniques apply to compute-intensive architectures in general and are not limited to Merrimac.
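
As a concrete, if simplified, illustration of the detect-and-recover idea, here is a generic C sketch (not the actual Merrimac scheme): a side-effect-free kernel is executed twice, the outputs are compared, and a mismatch triggers re-execution:

    #include <stddef.h>
    #include <string.h>

    /* Generic kernel-level fault detection and recovery (illustrative;
     * not the actual Merrimac mechanism). Because a stream kernel is
     * side-effect-free and deterministic, re-execution is safe and the
     * two outputs must match bit-for-bit in a fault-free run. */

    typedef void (*kernel_fn)(const float *in, float *out, size_t n);

    static int run_checked(kernel_fn k, const float *in,
                           float *out, float *shadow, size_t n) {
        for (int attempt = 0; attempt < 3; attempt++) {
            k(in, out, n);                 /* primary execution   */
            k(in, shadow, n);              /* redundant execution */
            if (memcmp(out, shadow, n * sizeof(float)) == 0)
                return 0;                  /* outputs agree: accept */
            /* mismatch: assume a transient (soft) error and retry */
        }
        return -1;                         /* persistent fault: give up */
    }

Note that this naive version doubles the memory traffic of every kernel; conserving exactly that off-chip bandwidth is what the schemes described above target.
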
Scalar Processor Architecture and Compilation

Most current scalar processors still rely on an ISA that was designed in the late 1970s and does not reflect the many changes in modern architectures. Therefore, I am exploring ISA designs that expose abstract, scalable, forward- and backward-compatible representations of modern internal microarchitectures to the compiler. The hope is that better communication mechanisms will allow the hardware and compiler to cooperate in achieving high performance, instead of the compiler tricking the hardware into achieving its goals or the hardware dynamically rediscovering information that is readily available to the compiler. One example of a cooperative ISA mechanism that I developed deals with register allocation: the Spills, Fills, and Kills technique allows hardware to rely on compiler-communicated liveness information to improve performance and reduce energy consumption.
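
The flavor of the mechanism can be shown with a toy model. The encoding below is invented for illustration and is not the actual Spills, Fills, and Kills ISA:

    #include <stdbool.h>

    /* Toy model of compiler-communicated liveness. Each source
     * operand carries a 'kill' bit set by the compiler at the last
     * use of a value, so the hardware learns a register is dead the
     * moment it is read for the final time. (Invented encoding,
     * illustrative only.) */

    enum { NUM_REGS = 8 };

    typedef struct {
        int  dst, src1, src2;  /* architectural register numbers   */
        bool kill1, kill2;     /* compiler: last use of src1/src2? */
    } Insn;

    static bool live[NUM_REGS];  /* hardware's view of liveness */

    static void issue(Insn i) {
        if (i.kill1) live[i.src1] = false;  /* dead after this read:  */
        if (i.kill2) live[i.src2] = false;  /* reclaim, don't spill   */
        live[i.dst] = true;                 /* destination turns live */
    }

With the kill bits, the hardware can reuse a dead physical register immediately and avoid writing its stale value back, rather than conservatively spilling it or rediscovering deadness dynamically.
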
Microarchitecture

High-performance single-threaded execution will remain critical in the future, even as processors turn more and more to data-parallel execution units. Despite this increasing use of parallelism, most applications contain significant portions of control code that is difficult to parallelize, and certain key algorithms simply have no known parallel formulation. In addition to my research on scalar architecture and compilation, I have targeted several aspects of optimizing single-thread execution at the microarchitecture level. These include the novel eXtended Block Cache for efficient and effective instruction supply, techniques for better dynamic instruction scheduling, and various predictive techniques for both performance and hardware efficiency. My work also includes exploring how existing and new microarchitecture features can be exposed in an abstract way to the compiler and programmer. As an example, I am currently extending a traditional general-purpose processor core with stream-architecture mechanisms to form a hybrid processor that can efficiently execute both control-intensive and compute-intensive code; our goals include reusing existing microarchitectural components whenever possible.
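
The following C sketch shows the kind of partitioning such a hybrid core targets (the application and the split are hypothetical): the outer convergence loop is branchy, data-dependent control code suited to the conventional core, while the inner update is a regular, data-parallel kernel suited to the stream-style units:

    #include <stddef.h>

    /* Compute-intensive side: a regular Jacobi-style update that maps
     * well onto stream execution units. Uses two buffers so every
     * iteration of the loop is independent (no carried dependence). */
    static float jacobi_step(const float *x, float *xn,
                             const float *b, size_t n) {
        float delta = 0.0f;
        xn[0] = x[0];                /* fixed boundary values */
        xn[n - 1] = x[n - 1];
        for (size_t i = 1; i + 1 < n; i++) {
            xn[i] = 0.5f * (x[i - 1] + x[i + 1]) + b[i];
            delta += (xn[i] - x[i]) * (xn[i] - x[i]);
        }
        return delta;
    }

    /* Control-intensive side: data-dependent termination and buffer
     * management -- branchy scalar work for the conventional core. */
    static size_t solve(float *x, float *xn, const float *b,
                        size_t n, float tol, size_t max_iters) {
        size_t iter;
        for (iter = 0; iter < max_iters; iter++) {
            if (jacobi_step(x, xn, b, n) < tol)  /* kernel "offload" */
                break;
            float *t = x; x = xn; xn = t;        /* swap buffers     */
        }
        return iter;
    }
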