Prospective students, please read this if you are interested in joining my group.
As VLSI processor technology matures, parallelism,
locality, and bandwidth conservation become more
critical. However, current programming models and compilers do not
explicitly address these issues, which leads to reduced performance and
low programmer productivity. My first attempt at tackling these issues
was as a member of the team that developed the Brook stream
language. We designed Brook as a language for scientific computing
that exposes parallelism and locality to the programmer, and built a
sophisticated optimizing compiler targeting Merrimac. The language
eventually shifted focus towards programmable graphics processors,
and was released to the public domain as BrookGPU.
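To illustrate the core idea without Brook syntax, here is a minimal
C++ sketch (map_kernel and the scaling kernel are my own, purely
illustrative names): kernels are pure functions applied element-wise
to streams, so cross-element parallelism and producer-consumer
locality are explicit rather than rediscovered by the compiler or
hardware.

```cpp
#include <cstddef>
#include <vector>

// A "stream" is an ordered collection of records; a "kernel" is a pure
// function applied independently to each element.  Because a kernel
// touches only its own element, every application may run in parallel,
// and producer-consumer locality between kernels is explicit.
template <typename In, typename Out, typename Kernel>
std::vector<Out> map_kernel(const std::vector<In>& in, Kernel k) {
    std::vector<Out> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = k(in[i]);  // independent per-element work
    return out;
}

int main() {
    std::vector<float> a(1024, 1.0f);
    // A scaling kernel: no hidden aliasing, no implicit global state.
    auto b = map_kernel<float, float>(a, [](float x) { return 2.0f * x; });
    return b[0] == 2.0f ? 0 : 1;
}
```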
Currently, I am taking part in the development of the Sequoia
programming model and software system, which builds on our
experience with stream programming and Brook. A Sequoia
programmer explicitly reasons about and expresses locality and
parallelism at multiple levels of the memory hierarchy. The result is
a high-performance application that can be ported easily to a variety
of traditional and emerging architectures.
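To give a flavor of the model, the following is a hand-written C++
sketch, not Sequoia syntax; the leaf capacity and all names are
invented. A computation is phrased as a task that recursively
subdivides its working set until a leaf fits in an inner level of the
memory hierarchy:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical capacity of the "inner" memory level (e.g., a local store).
constexpr std::size_t kLeafElems = 256;

// Leaf task: operates on data small enough to live entirely in the
// inner level, so all of its traffic is local.
double sum_leaf(const double* x, std::size_t n) {
    return std::accumulate(x, x + n, 0.0);
}

// Inner task: recursively partitions the working set; each recursion
// level corresponds to moving a block one level down the hierarchy.
double sum_task(const double* x, std::size_t n) {
    if (n <= kLeafElems) return sum_leaf(x, n);
    std::size_t half = n / 2;             // the two halves are independent,
    return sum_task(x, half)              // so they expose parallelism as
         + sum_task(x + half, n - half);  // well as locality
}

int main() {
    std::vector<double> v(1 << 16, 1.0);
    return sum_task(v.data(), v.size()) == static_cast<double>(v.size())
               ? 0 : 1;
}
```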
- Compilation for Explicitly Managed Memory Hierarchies, PPoPP'07 (T. J. Knight, J. Y. Park, M. Ren, M. Houston, M. Erez, K. Fatahalian, A. Aiken, W. J. Dally, P. Hanrahan)
- Sequoia: Programming the Memory Hierarchy, SC'06 (K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, P. Hanrahan)
Merrimac uses a stream architecture and advanced interconnection
networks to deliver an order of magnitude more performance per unit
cost than cluster-based scientific computers built from the same
technology. Organizing the computation
into streams and exploiting the resulting locality using a register
hierarchy enables a stream architecture to reduce the memory bandwidth
required by representative applications by an order of magnitude or
more. Hence a processing node with a fixed bandwidth (expensive) can
support an order of magnitude more arithmetic units (inexpensive).
This in turn allows a given level of performance to be achieved with
fewer nodes (for example, a 1-PFLOPS machine with just 8,192 nodes),
resulting in greater reliability and simpler system management.
Merrimac is designed to be a streaming scientific computer that can be
scaled from a $20K 2 TFLOPS workstation to a $20M 2 PFLOPS
supercomputer.
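The arithmetic behind this claim is simple; as a back-of-the-envelope
sketch (the symbols are mine, and the factor of ten is the
application-level reduction cited above):

```latex
F_{\text{sustained}} \;\le\; \frac{B_{\text{node}}}{\beta}
```

where B_node is a node's off-chip memory bandwidth in bytes per second
and beta is the off-chip traffic per arithmetic operation in bytes.
Cutting beta tenfold raises the bound by the same factor, so the same
(expensive) bandwidth feeds ten times the (inexpensive) arithmetic and
a given aggregate target needs roughly ten times fewer nodes; at
1 PFLOPS over 8,192 nodes, each node sustains about 128 GFLOPS.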
As Merrimac's lead architect, I work on all aspects of the system,
from the hardware architecture, through the compiler and programming
language, to the applications and algorithms.
- Executing Irregular Scientific Applications on Stream Architectures, ICS'07 (M. Erez, J. Ahn, J. Gummaraju, M. Rosenblum, W. J. Dally)
- Tradeoff between Data-, Instruction-, and Thread-level Parallelism in Stream Processors, ICS'07 (J. Ahn, W. J. Dally, M. Erez)
- The Design Space of Data-Parallel Memory Systems, SC'06 (J. Ahn, M. Erez, W. J. Dally)
- Merrimac -- High-Performance, Highly-Efficient Scientific Computing with Streams, Stanford University Ph.D. dissertation (M. Erez)
- Fault Tolerance Techniques for the Merrimac Streaming Supercomputer, SC'05 (M. Erez, N. Jayasena, T. J. Knight, W. J. Dally)
- Scatter-Add in Data Parallel Architectures, HPCA-11 2005 (J. Ahn, M. Erez, W. J. Dally)
- Analysis and Performance Results of a Molecular Modeling Application on Merrimac, SC'04 (award paper) (M. Erez, J. Ahn, A. Garg, W. J. Dally, E. Darve)
- Stream Register Files with Indexed Access, HPCA-10 2004 (N. Jayasena, M. Erez, J. Ahn, W. J. Dally)
- Merrimac: Supercomputing with Streams, SC'03 (W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonté, J. Ahn, N. Jayasena, U. J. Kapasi, A. Das, J. Gummaraju, I. Buck)
As the number of devices per computational node grows
larger and as computer systems rely more and more on multiple nodes
for high performance, reliability aspects, especially soft-error
tolerance, will become critical even for single processors and
consumer computer systems. With this in mind, I am developing
techniques for the Merrimac processor that ensure correct output
through rigorous fault detection and recovery mechanisms. The main
goal of the proposed
schemes is to conserve the critical and costly off-chip bandwidth and
on-chip storage resources, while maintaining high peak and sustained
performance. We achieve this by adding reconfigurability to the
Merrimac processor and allowing the software system and programmer to
take advantage of it. These techniques apply to compute-intensive
architectures in general and are not limited to Merrimac.
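As a simplified illustration of the flavor of software-managed
redundancy (a sketch under my own assumptions, not the actual Merrimac
mechanism), a kernel can be executed twice with only compact checksums
compared, so detection consumes little additional off-chip bandwidth:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Compact fingerprint of a result stream; comparing checksums instead
// of full outputs keeps the detection traffic small.
uint64_t checksum(const std::vector<uint32_t>& v) {
    uint64_t h = 1469598103934665603ull;             // FNV-1a, 64-bit
    for (uint32_t x : v) { h ^= x; h *= 1099511628211ull; }
    return h;
}

// Run the kernel twice; a checksum mismatch signals a transient fault,
// in which case we re-execute until two runs agree (recovery).
template <typename Kernel>
std::vector<uint32_t> run_checked(Kernel k, const std::vector<uint32_t>& in) {
    for (;;) {
        std::vector<uint32_t> r1 = k(in), r2 = k(in);
        if (checksum(r1) == checksum(r2)) return r1;  // agreement: accept
        // disagreement: a soft error corrupted one run; retry
    }
}

int main() {
    auto square = [](const std::vector<uint32_t>& v) {
        std::vector<uint32_t> out(v.size());
        for (std::size_t i = 0; i < v.size(); ++i) out[i] = v[i] * v[i];
        return out;
    };
    std::vector<uint32_t> in{1, 2, 3, 4};
    return run_checked(square, in)[3] == 16 ? 0 : 1;
}
```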
Most current scalar processors still rely on an ISA that was designed
in the late 1970s and does not reflect the many changes in processor
microarchitecture since then. Therefore, I am exploring ISA designs that
expose abstract, scalable, forward and backward compatible
representations of internal modern microarchitectures to the
compiler. The hope is that better communication mechanisms will allow
the hardware and compiler to cooperate in achieving high performance,
as opposed to the compiler tricking the hardware to achieve its goals
or the hardware dynamically rediscovering information that is readily
available to the compiler. One example of a cooperative ISA mechanism
that I developed deals with register allocation. The Spills, Fills, and Kills technique allows hardware to rely on compiler-communicated liveness
information to improve performance and reduce energy consumption.
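To make the idea concrete, here is a toy C++ model of a register
renamer (entirely illustrative; the interface is invented). A
compiler-communicated kill on an operand's last read lets hardware
reclaim the physical register immediately, instead of holding it until
the architectural register is redefined:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Toy register renamer.  Without liveness hints, a physical register
// is freed only when its architectural register is redefined; with a
// compiler-supplied "kill" on the last read, it is freed at once.
struct Renamer {
    std::vector<int> free_list;        // available physical registers
    std::unordered_map<int, int> map;  // architectural -> physical

    explicit Renamer(int num_phys) {
        for (int p = num_phys - 1; p >= 0; --p) free_list.push_back(p);
    }

    int read(int arch, bool kill) {    // source operand
        int phys = map.at(arch);
        if (kill) {                    // last use: reclaim immediately
            free_list.push_back(phys);
            map.erase(arch);
        }
        return phys;
    }

    int write(int arch) {              // destination operand
        auto it = map.find(arch);
        if (it != map.end()) free_list.push_back(it->second);  // redefined
        assert(!free_list.empty() && "physical registers exhausted");
        int phys = free_list.back(); free_list.pop_back();
        map[arch] = phys;
        return phys;
    }
};

int main() {
    Renamer r(4);
    r.write(0);                 // r0 = ...
    r.write(1);                 // r1 = ...
    r.read(0, /*kill=*/true);   // ... = r0; last use frees its register
    r.write(2);                 // r2 = ... can reuse the freed register
    return 0;
}
```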
High-performance single-threaded execution will remain critical in the
future, even as processors turn more and more to data-parallel
execution units. Regardless of this increasing use of parallelism,
most applications contain significant portions of control code that is
difficult to parallelize, and certain key algorithms simply have no
known parallel representation. In addition to my research on scalar
architecture and compilation, I have also targeted several aspects of
optimizing single-thread execution at the microarchitecture
level. These include the novel eXtended Block Cache for efficient and
effective instruction supply, techniques for better dynamic
instruction scheduling, and various predictive techniques for both
performance and hardware efficiency. My current research includes
exploring how existing and new microarchitecture features can be exposed
in an abstract way to the compiler and programmer. As an example, I am
currently working on extending a traditional general-purpose processor
core with stream-architecture mechanisms to form a
hybrid processor that can efficiently execute both control-intensive
and compute-intensive code. Our goals include re-using existing
microarchitectural components whenever possible.
- The Design Space of Data-Parallel Memory Systems, SC'06 (J. Ahn, M. Erez, W. J. Dally)
- Memory Cache Bank Prediction, US Patent #6,880,063, 2005 (A. Yoaz, R. Ronen, L. Rappoport, M. Erez, S. Jourdan, R. Valentine)
- Fast Branch Misprediction Recovery Method and System, US Patent #6,757,816, 2004 (A. Yoaz, G. Pribush, F. Gabbay, M. Erez, R. Ronen)
- System and Method for Early Resolution of Low Confidence Branches and Safe Data Cache Accesses, US Patent #6,757,816, 2004 (A. Yoaz, M. Erez, R. Ronen)
- Cache Memory Bank Access Prediction, US Patent #6,694,421, 2004 (A. Yoaz, R. Ronen, L. Rappoport, M. Erez, S. Jourdan, R. Valentine)
- eXtended Block Cache, HPCA-6 2000 (S. Jourdan, L. Rappoport, Y. Almog, M. Erez, A. Yoaz, R. Ronen)
- Speculation Techniques for Improving Load Related Instruction Scheduling, ISCA-26 1999 (A. Yoaz, M. Erez, R. Ronen, S. Jourdan)