Oblivious Processors

We introduce Phantom, a new secure processor that obfuscates its memory access trace. To an adversary who can observe the processor’s output pins, all memory access traces are computationally indistinguishable (a property known as obliviousness). We achieve obliviousness through a cryptographic construct known as Oblivious RAM (ORAM).

We first improve an existing ORAM algorithm and construct an empirical model for its trusted storage requirement. We then present Phantom, an oblivious processor whose novel memory controller aggressively exploits DRAM bank parallelism to reduce ORAM access latency and scales well to a large number of memory channels. Finally, we build a complete hardware implementation of Phantom on a commercially available FPGA-based server, and through detailed experiments show that Phantom is efficient in both area and performance. Accessing 4KB of data from a 1GB ORAM takes 26.2µs (13.5µs until the data is available), a 32× slowdown over accessing 4KB from regular memory, while SQLite queries on a population database see a 1.2–6× slowdown.

Phantom is the first demonstration of a practical oblivious processor and can provide strong confidentiality guarantees when offloading computation to the cloud.

Gate Level Information Flow Tracking (GLIFT)

From datacenters and mobile devices to aircraft, automobiles, and banks, many systems require new ways of tightly constraining the flow of information such that untrusted and malicious devices can never affect critical system tasks and cryptographic secrets can never leak. To create such systems, information flow controls must be incorporated from the ground up, including at hardware design time, in order to provide a formal basis for a system's root of trust.

My thesis demonstrates a new method for constructing and analyzing architectures capable of tracking all information flow within the machine, including timing and covert channels. The key to this approach is a novel gate-level information flow tracking (GLIFT) method, which provides a way to compose complex logical structures with well-defined information flow properties. Starting from a simple NAND gate, we build up more complex structures including muxes, control logic, registers, and finally a full microprocessor.
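To give the flavor of the technique (this is my own minimal sketch, not the thesis's exact formulation): the shadow logic for a two-input NAND marks its output as tainted only when a tainted input can actually change the output value, which is what makes gate-level tracking precise rather than merely conservative.

```python
def nand(a, b):
    """Plain two-input NAND on bits."""
    return 1 - (a & b)

def nand_glift(a, ta, b, tb):
    """GLIFT shadow logic for NAND: ta/tb are the taint bits of a/b.
    The output is tainted only if a tainted input can actually affect
    the output. E.g. if b = 0 and b is untainted, the output is 1 no
    matter what a is, so taint on a does NOT propagate."""
    out = nand(a, b)
    t_out = (ta & tb) | (ta & b) | (tb & a)
    return out, t_out
```

Composing these shadow functions gate by gate yields taint-tracking logic for arbitrary circuits, though composition across gates can be conservative relative to the whole-circuit optimum.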

The resulting Execution Lease processor provides programmers (i.e., system software) with explicit control over all information flows, and I am currently working on implementing a micro-kernel that uses our proposed ISA to construct information-tight partitions. Finally, by formalizing the GLIFT technique as a static analysis tool, we can prove that the kernel and hardware together enforce a security policy for all possible user-level processes and external inputs. A short summary of this work is available as an IEEE Micro Top Picks paper.

Languages for Secure Hardware Design

I have recently started to collaborate with Xun Li and Ben Hardekopf to formalize common secure-hardware design patterns in the form of a type system for a hardware description language. We propose Caisson, a finite-state-machine-based language with which we can design complex processors that maintain strict information flow control even in the presence of pipelining, caches, and other micro-architectural features. Our initial work was presented at the PLAS 2010 workshop, and a more complete version is currently under submission.
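To illustrate the underlying idea (this is a generic label-lattice check, not Caisson's actual syntax or typing rules): each wire or register carries a security label, and a design is rejected whenever data would flow from a higher label to a lower one.

```python
# Toy static information-flow check in the spirit of a secure-HDL
# type system: every signal carries a security label, and an
# assignment dst := src is well-typed only if the source's label
# flows to (is no higher than) the destination's.
LOW, HIGH = 0, 1

def flows_to(src_label, dst_label):
    """LOW may flow anywhere; HIGH may only flow to HIGH."""
    return src_label <= dst_label

def check_assignments(labels, assignments):
    """labels: signal name -> label; assignments: (dst, src) pairs.
    Returns the assignments that would leak information."""
    return [(d, s) for d, s in assignments
            if not flows_to(labels[s], labels[d])]
```

A real secure-HDL type system must additionally account for implicit flows through control state, which is exactly where structuring the design as explicit finite state machines helps.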

3D Die-Stacking for Security

I am working with Jonny Valamehr to examine the use of novel technologies such as 3D integration to add introspection hardware in the last stages of processor fabrication. 3D integration decouples the function and economics of security policy enforcement from the underlying computing hardware; as a result, security enhancements become manufacturing options applicable only to those systems that require them. In our ACSAC paper, we present a set of circuit-level primitives that can be used by the 3D control plane to manipulate signals on the base computation plane.

Program Analysis on Heterogeneous Systems

As programmers are asked to manage more complicated parallel machines, it is likely that they will become increasingly dependent on tools such as multi-threaded data race detectors, memory bounds checkers, dynamic dataflow trackers, and various performance profilers to understand and maintain their software. Rather than performing all the instrumentation and analysis on the main processor, we explore the idea of using the increasingly high-throughput board-level interconnect available on many systems to offload analysis to a parallel off-chip accelerator.

We have developed Hardgrind, a prototype system that maps a DMA-based analysis engine, sitting on a PCI-mounted FPGA, into the Valgrind instrumentation framework. Using this prototype, we characterize the potential of such a system both to accelerate existing software development tools and to enable a new class of heavyweight dynamic analysis. While many issues remain with this approach, we demonstrate that program analysis speedups of 29% to 440% can be achieved today with strictly off-the-shelf components on state-of-the-art tools, and we carefully quantify the bottlenecks to illuminate several new opportunities for further architectural innovation.

Accelerator for Dynamic Program Analysis

Many dynamic analysis schemes rely on the ability to associate metadata (or "tags") with all of virtual or physical memory. If one wishes to store large 32-bit tags, multiple tags per data element, or tags at the granularity of bytes rather than words, then directly storing one tag on chip per byte or word (in a cache or otherwise) can be an expensive proposition. We show that dataflow tags naturally exhibit a very high degree of spatial-value locality, an observation we can exploit by storing metadata on ranges of addresses (covering non-aligned contiguous spans of memory) rather than on individual elements. In fact, a small 128-entry on-chip range cache (with area equivalent to 4KB of SRAM) hits more than 98% of the time on average.

The key to this approach is our proposed method by which ranges of tags are kept in cache in an optimally RLE-compressed form, queried at high speed, swapped in and out with secondary memory storage, and (most importantly for dataflow tracking) rapidly stitched together into the largest possible ranges as new tags are written on every store, all while correctly handling unaligned and overlapping ranges. We examine the effectiveness of this approach by simulating its use in definedness tracking (covering both the stack and the heap), in tracking network-derived dataflow through a multi-language web application, and through a synthesizable prototype on an FPGA.
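The split-and-stitch behavior can be modeled in a few lines of Python (the class and method names are mine; the hardware additionally bounds the number of cached ranges and spills evicted ranges to memory):

```python
class RangeTagStore:
    """Simplified model of a range-based tag store: tags are kept on
    half-open [start, end) address ranges rather than per byte. A
    write splits any overlapping ranges, then stitches the new range
    together with equal-tag neighbors to keep ranges maximal."""

    def __init__(self):
        self.ranges = []   # sorted, disjoint (start, end, tag) triples

    def write(self, start, end, tag):
        out = []
        for s, e, t in self.ranges:
            if e <= start or s >= end:        # no overlap: keep as-is
                out.append((s, e, t))
            else:                             # overlap: keep parts outside
                if s < start:
                    out.append((s, start, t))
                if e > end:
                    out.append((end, e, t))
        out.append((start, end, tag))
        out.sort()
        # stitch adjacent ranges carrying the same tag into one range
        stitched = []
        for s, e, t in out:
            if stitched and stitched[-1][2] == t and stitched[-1][1] == s:
                stitched[-1] = (stitched[-1][0], e, t)
            else:
                stitched.append((s, e, t))
        self.ranges = stitched

    def read(self, addr):
        for s, e, t in self.ranges:
            if s <= addr < e:
                return t
        return None                           # no tag recorded
```

Because most stores write the same tag their neighbors already carry, the stitch step keeps the total number of ranges small, which is exactly the spatial-value locality the range cache exploits.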


Internships

One summer at NEC Labs, I worked with Hari Cadambi on scaling MapReduce for many-core architectures. Back in college at IIT Guwahati, I interned at the EDA Group in Torino (Italy) with Professors Enrico Macii and Luca Benini, and at the VLSI-ASIC group at CDAC, Trivandrum, designing hardware for fun and profit.