Selected Projects

Relaxed Dependence Tracking for Parallel Runtime Support

Memory access dependence tracking is a fundamental component for building many runtime support systems. Existing forms of runtime support slow programs significantly in order to track (i.e., detect or control) an execution’s cross-thread memory access dependences accurately. This paper investigates the potential for runtime support to hide latency introduced by dependence tracking, by tracking dependences in a relaxed way—meaning that not all dependences are tracked accurately. The key challenge in relaxing dependence tracking is to preserve both the program’s semantics and the runtime support’s guarantees. We present an approach called relaxed tracking (RT) and demonstrate its practicality by building two types of RT-based runtime support. Our evaluation shows that RT hides much of the latency incurred by dependence tracking, although RT-based runtime support incurs costs and complexity in order to handle relaxed dependence information. By demonstrating how to relax dependence tracking to hide latency while preserving correctness, this work shows the potential for addressing a key cost of dependence tracking, thus advancing knowledge in the design of parallel runtime support.

Paper: Pdf
Talk: Slides (Google Slides)
Source Code: includes the implementations of RT, the RT-based recorder, and the RT-based STM.

Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics

Transactional memory offers a high-level abstraction for writing concurrent programs. Existing software transactional memory (STM) systems are impractical because they add high overhead and often provide weak progress guarantees and/or weak semantics. LarkTM is a new software transactional memory system. It is fast and provides strong atomicity/isolation. It is significantly faster than DeuceSTM, IntelSTM, and NOrec; scales well; and provides strong progress guarantees that are not usually found in other systems.

Paper: Pdf
Talk: Pdf, pptx
Source Code: includes the implementations of LarkTM-O, LarkTM-S, NOrec (JVM), and IntelSTM (JVM).
Benchmark: STAMP (Java version)

  • Designed and implemented a low-overhead STM system with strong semantics and strong progress guarantees.
  • Reduced STM overhead to 40%, compared with 188% and 232% for two state-of-the-art high-performance STMs.
  • Noted by the PPoPP committee: “We believe it has a good potential to impact future Java TM systems.”
  • Artifact was accepted by the PPoPP AEC.

Memcached Design on High Performance RDMA Capable Interconnects

Interest in data retrieval and analysis has grown tremendously. Data lookups are expensive, and Memcached, a distributed memory caching layer, is used to speed them up. Memcached is implemented using traditional BSD sockets, and although the socket interface provides portability, it entails additional processing and multiple message copies. Meanwhile, High-Performance Computing (HPC) has adopted advanced interconnects (e.g., InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE) that provide low network latency, high bandwidth, and low CPU overhead. This project provides a novel design of Memcached for RDMA-capable networks. It extends the existing open-source Memcached software and makes it RDMA-capable.

  • Integrated Memcached with the INCR communication library for InfiniBand.
  • Benchmarked Memcached + InfiniBand against Memcached + sockets/10GigE, Memcached + sockets/SDP, and IPoIB.
  • Reduced Memcached’s latency for retrieving 4 KB of data by a factor of four compared with 10GigE, and improved throughput by a factor of six.