

# 2023 ACE/SRC Annual Review | OCTOBER 4-5, 2023





Semiconductor Research Corporation











### **ACE Themes & Tasks**

The goal of the ACE Center is to devise novel technologies for scalable distributed computing that will improve the performance and the energy efficiency of diverse applications by 100x over the expected computer systems of 2030.

Distributed computing in 2030 will be defined by the need to process vast swaths of data for insights in a timely manner. Minimizing data movement to curtail energy consumption in an energy-conscious Earth will be the overriding constraint. The compute infrastructure will be a seamless hierarchy of compute centers from edge to geo-distributed mega-datacenters. Each compute center will contain a large number of heterogeneous hardware accelerators, and tasks of unprecedented small granularity will seamlessly ship computation to where data is. To further minimize data movement, key data will be replicated or re-materialized on demand. The computational environment will be highly dynamic, with the constant introduction of new classes of accelerators for barely-emerging workloads, and of new applications/protocols that could benefit from yet-to-be-conceived accelerators.

| THEME 1                | HETEROGENEOUS COMPUTING PLATFORMS                                            |
|------------------------|------------------------------------------------------------------------------|
| 3134.001               | Evolvable Distributed Accelerators                                           |
| 3134.002               | Composable Distributed Acceleration                                          |
| 3134.003               | Making Distributed Accelerator Ensembles Usable: Multitenancy & Code Mapping |
| 3134.004               | Energy-efficiency Driven CPU-centric Nodes                                   |
| THEME 2                | DISTRIBUTED EVOLVABLE MEMORY & STORAGE                                       |
| 3134.005               | Scalable heterogeneous Memory Hierarchies                                    |
| 3134.006               | Scalable Management of distributed Memory & Storage Assets                   |
| 3134.007               | Near- and In- memory/storage Acceleration                                    |
| THEME 3                | FINE-GRAINED COMMUNICATION AND COORDINATION                                  |
| 3134.008               | An Accelerator-Rich Datacenter Architecture and Beyond                       |
| 3134.009               | An Evolvable Network Stack                                                   |
| 3134.010               | A Self-balancing Planet-Scale Distributed Runtime                            |
| 3134.011               | In-network Computing                                                         |
| 3134.012               | Hardware-Supported Intelligent Distributed Data Stores                       |
| THEME 4                | SECURITY, PRIVACY AND CORRECTNESS                                            |
| 3134.013               | Data-centric Security that Evolves with Threat Models and Systems            |
| 3134.014               | Domain-specific TEE on Evolving heterogeneous Accelerators                   |
| 3134.015               | Security and Privacy Assurance                                               |
| 3134.016               | Design for Verification of Evolvable Hardware Accelerators                   |
| THEME 5                | DEMONSTRATORS                                                                |
| 3134.017               | Demonstrator 1: A Reconfigurable Multi-Accelerator Compute Engine            |
| 3134.018               | Demonstrator 2: A heterogeneous Large Cluster with Specialized Intelligence  |
| 3134.019               | Demonstrator 3: Applications Benchmark                                       |

## **ACE PI Directory**

| NAME & AFFILIATION            | ASSOCIATED TASKS               | EMAIL                    |
|-------------------------------|--------------------------------|--------------------------|
| Josep Torrellas               | All                            | torrellas@cs.uiuc.edu    |
| Center Director, Illinois     |                                |                          |
| Tarek Abdelzaher              | 3134.019                       | zaher@illinois.edu       |
| Illinois                      |                                |                          |
| Mohammad Alian                | 3134.005, 3134.007, 3134.011   | alian@ku.edu             |
| University of Kansas          |                                |                          |
| Adam Belay                    | 3134.010, 3134.013, 3134.018   | abelay@csail.mit.edu     |
| MIT                           |                                |                          |
| Manya Ghobadi                 | 3134.008, 3134.009, 3134.011   | ghobadi@csail.mit.edu    |
| MIT                           |                                |                          |
| Rajesh K. Gupta               | 3134.002, 3134.004, 3134.016   | rgupta@ucsd.edu          |
| UCSD                          |                                |                          |
| Christoforos Kozyrakis        | 3134.003, 3134.008, 3134.009,  | christos@cs.stanford.edu |
| Stanford University           | 3134.014, 3134.018, 3134.019   |                          |
| Tushar Krishna                | 3134.001, 3134.002, 3134.008,  | tushar@ece.gatech.edu    |
| Georgia Tech                  | 3134.001, 3134.017, 3134.018   |                          |
| Arvind Krishnamurthy          | 3134.006, 3134.009, 3134.011,  | arvind@cs.washington.edu |
| University of Washington      | 3134.012, 3134.018             |                          |
| José F. Martínez              | 3134.004, 3134.005, 3134.006,  | martinez@cornell.edu     |
| Cornell                       | 3134.007, 3134.017, 3134.018   |                          |
| Charith Mendis                | 3134.002, 3134.003, 3134.009,  | charithm@illinois.edu    |
| Illinois                      | 3134.017                       |                          |
| Subhasish Mitra               | 3134.015, 3134.016, 3134.017   | subh@stanford.edu        |
| Stanford                      |                                |                          |
| Muhammad Shahbaz              | 3134.003, 3134.004, 3134.009,  | mshahbaz@purdue.edu      |
| Purdue University             | 3134.011, 3134.012, 3134.017,  |                          |
|                               | 3134.018                       |                          |
| Gookwon Edward Suh            | 3134.013, 3134.014, 3134.015,  | suh@ece.cornell.edu      |
| Cornell                       | 3134.017, 3134.018             |                          |
| Steven Swanson                | 3134.005, 3134.006, 3134.007,  | swanson@cs.ucsd.edu      |
| UCSD                          | 3134.017, 3134.018             |                          |
| Michael Taylor                | 3134.001, 3134.004, 3134.017   | profmbt@uw.edu           |
| University of Washington      |                                |                          |
| Mircea Radu Teodorescu        | 3134.004, 3134.013, 3134.014,  | teodorescu.1@osu.edu     |
| Ohio State University         | 3134.015, 3134.016, 3134.017,  |                          |
|                               | 3134.018                       |                          |
| Mohit Tiwari                  | 3134.013, 3134.014, 3134.015,  | tiwari@austin.utexas.edu |
| University of Texas at Austin | 3134.017, 3134.018             |                          |
| Minlan Yu                     | 3134.008, 3134.009, 3134.010,  | minlanyu@g.harvard.edu   |
| Harvard University            | 3134.011, 3134.012, 3134.018   |                          |
| Zhengya Zhang                 | 3134.001, 3134.002, 3134.003,  | zhengya@eecs.umich.edu   |
| University of Michigan        | 3134.017, 3134.018, 3134.019   |                          |
| Zhiru Zhang                   | 3134.001, 3134.002, 3134.003,  | zhiruz@cornell.edu       |
| Cornell University            | 3134.004, 3134.007, 3134.014,  |                          |
|                               | 3134.016, 3134.017             |                          |

# **Center Administration**

| NAME                   | ROLE                  | EMAIL                  |
|------------------------|-----------------------|------------------------|
| Josep Torrellas        | Center Director       | torrella@illinois.edu  |
| Minlan Yu              | Assistant Director    | minlanyu@g.harvard.edu |
| Mircea Radu Teodorescu | Director of Logistics | teodorescu.1@osu.edu   |
| Jill Peckham           | Executive Director    | jpeckham@illinois.edu  |

# **Students Presenting Project Deep Dives**

| RESEARCH SCHOLAR                                                                                     | PRESENTATION TITLE & BIO                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Jovan Stojkovic<br>Illinois<br>Email: jovans2@illinois.edu<br>PI: Josep Torrellas<br>Avail for Hire Date: Internship Summer 2024 | <b>Theme 1 Deep Dive Presentation:</b> <i>Server Design in the Age of Microservices and Serverless Computing</i><br><b>Bio:</b> Jovan Stojkovic is a fourth-year PhD student at the University of Illinois at Urbana-Champaign, advised by Professor Josep Torrellas. His research focuses on cloud computing data platforms and deployment paradigms, such as microservices and serverless computing. He explores ways to make systems fast, reliable, and efficient in a holistic manner: from the hardware up to the platform and application layers. Jovan's work has been published at top-tier computer architecture conferences, such as ISCA, ASPLOS and HPCA. He was awarded the Kenichi Miura Award for excellence in High-Performance Computing. Prior to joining UIUC, Jovan completed his undergraduate studies at the University of Belgrade and graduated as the best student of his class. |
| Xiyuan Zhang<br>UCSD<br>Email: xiz032@ucsd.edu<br>PI: Rajesh Gupta<br>Available for Hire Date: 5/2024 | <b>Theme 1 Deep Dive Presentation:</b> <i>ML on the Edge: Physics-Informed Data Denoising for Real-Life</i><br><b>Bio:</b> Xiyuan Zhang is a Ph.D. student in Computer Science and Engineering at the University of California, San Diego. She is advised by Prof. Rajesh Gupta and Prof. Jingbo Shang. Prior to UCSD, she obtained her B.S. degree in Computer Science with honors from Zhejiang University in 2020. Her research interests are in robust and efficient machine learning for sensing systems. She has received the Qualcomm Innovation Fellowship and has been selected as a CPS Rising Star. She has also held internships at AWS AI Labs, MIT and UC Davis. |
| Suyash Mahar<br>UCSD | <b>Theme 2 Deep Dive Presentation:</b> <i>Telepathic Datacenters: Efficient and High-Performance RPCs using Shared CXL Memory</i><br><b>Bio:</b> Suyash Mahar is a fourth-year Ph.D. student at UC San Diego interested in datacenter memory efficiency. He has worked with Google, Meta, and Intel on datacenter efficiency, studying their memory hierarchy and acceleration opportunities. Before starting his Ph.D. program, he worked on architecture and safety of persistent memories at the University of Virginia, CMU, and Technion. His work on memory systems has appeared in EuroSys, ASPLOS, PACT, and ICCD. |
| Mark Zhao<br>Stanford | <b>Theme 3 Deep Dive Presentation:</b> <i>End to End Optimization of Large-scale ML Training</i><br><b>Bio:</b> Mark is a Ph.D. student at Stanford, advised by Christos Kozyrakis. His research centers around improving the scalability, performance, and security of systems for datacenter-scale applications such as machine learning. He was recently a visiting researcher at Meta, where he worked on data infrastructure for ML training. Mark was selected as an MLCommons ML and Systems Rising Star, and he is supported by a Stanford Graduate Fellowship and a Meta Ph.D. Fellowship. |

ACE CENTER FOR EVOLVABLE COMPUTING

# **Theme 1 & Application Benchmarks Poster Session**

| RESEARCH SCHOLAR                         | POSTER DETAILS                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  | <b>Title:</b> Building Evolvable Accelerators for Sparse Data Processing<br><b>Abstract:</b> As general-purpose scaling yields diminishing benefits and modern applications become increasingly data intensive, there has been a surge of research focused on using specialized hardware to accelerate sparse workloads. This poster presents our recent research efforts on building efficient yet versatile sparse accelerators, which aim to strike a balance between domain specialization and adaptability to accommodate the rapidly evolving application requirements and technological capabilities. We will begin with GraphLily, an FPGA-based graph processing overlay leveraging the GraphBLAS abstraction to accelerate a rich set of graph processing algorithms. Next, we will demonstrate our latest efforts to develop a versatile sparse accelerator that supports a broader range of sparse linear algebra kernels and compute patterns. Additionally, we will outline our ongoing work on developing a unified abstraction to support a multitude of sparse formats that are customized for varying degrees and patterns of sparsity. |

### **Gerasimos Gerogiannis** Illinois



Email: gg24@illinois.edu PI: Josep Torrellas Avail for Hire Date: 6/2026

**Title:** SPADE: A Flexible and Scalable Accelerator for SpMM and SDDMM

**Abstract:** The widespread use of Sparse Matrix Dense Matrix Multiplication (SpMM) and Sampled Dense Matrix Dense Matrix Multiplication (SDDMM) kernels makes them candidates for hardware acceleration. However, accelerator design for these kernels faces two main challenges: (1) the overhead of moving data between CPU and accelerator (often including an address space conversion from the CPU's virtual addresses) and (2) marginal flexibility to leverage the fact that different sparse input matrices benefit from different variations of the SpMM and SDDMM algorithms. To address these challenges, this paper proposes SPADE, a new SpMM and SDDMM hardware accelerator. SPADE avoids data transfers by tightly coupling accelerator processing elements (PEs) with the cores of a multicore, as if the accelerator PEs were advanced functional units, allowing the accelerator to reuse the CPU memory system and its virtual addresses. SPADE attains flexibility and programmability by supporting a tile-based ISA that is high level enough to eliminate the overhead of fetching and decoding fine-grained instructions. To prove the SPADE concept, we have taped out a simplified SPADE chip. Further, simulations of a SPADE system with 224 to 1792 PEs show its high performance and scalability. A 224-PE SPADE system is on average 2.3x, 1.3x and 2.5x faster than a 56-core CPU, a server-class GPU, and an SpMM accelerator, respectively, without accounting for the host-accelerator data transfer overhead. If such overhead is taken into account, the 224-PE SPADE system is on average 43.4x and 52.4x faster than the GPU and the accelerator, respectively. Further, SPADE has a small area and power footprint.

**CoAuthors:** Gerasimos Gerogiannis, Serif Yesil, Damitha Lenadora, Dingyuan Cao, Charith Mendis and Josep Torrellas
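For readers unfamiliar with the two kernels named in the SPADE abstract, the following is a minimal software reference sketch of SpMM and SDDMM over a CSR-encoded sparse matrix. It is not SPADE's tile-based ISA or hardware design; the `CSR` tuple layout, the function names `spmm` and `sddmm`, and the small example matrices are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# Assumed sparse layout for this sketch: CSR as (row_ptr, col_idx, values).
CSR = Tuple[List[int], List[int], List[float]]


def spmm(a: CSR, b: List[List[float]]) -> List[List[float]]:
    """Sparse x dense: C[i][:] accumulates A[i][k] * B[k][:] over the nonzeros of row i."""
    row_ptr, col_idx, vals = a
    n_cols = len(b[0])
    c = [[0.0] * n_cols for _ in range(len(row_ptr) - 1)]
    for i in range(len(row_ptr) - 1):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            k, a_ik = col_idx[p], vals[p]
            for j in range(n_cols):
                c[i][j] += a_ik * b[k][j]
    return c


def sddmm(s: CSR, x: List[List[float]], y: List[List[float]]) -> CSR:
    """Sampled dense x dense: for each nonzero S[i][j], compute
    S[i][j] * dot(X[i], Y[j]). Positions where S is zero are never computed."""
    row_ptr, col_idx, vals = s
    out = []
    for i in range(len(row_ptr) - 1):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            dot = sum(xi * yj for xi, yj in zip(x[i], y[j]))
            out.append(vals[p] * dot)
    return row_ptr, col_idx, out


# Hypothetical 2x3 sparse matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
A = ([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # dense 3x2
C = spmm(A, B)                                 # [[11.0, 14.0], [9.0, 12.0]]

X = [[1.0, 1.0], [2.0, 0.0]]                   # dense 2x2
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # dense 3x2
D = sddmm(A, X, Y)                             # values: [1.0, 4.0, 0.0]
```

Both loops visit only the stored nonzeros, so the memory access pattern depends on the sparsity structure of the input. That dependence is the property the abstract points to when it says different input matrices benefit from different algorithm variations.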



Email: ag82@illinois.edu PI: Charith Mendis

Avail for Hire Date: 5/15/2024

**Title:** FLuRKA: Fast fused Low-Rank & Kernel Attention

**Abstract:** Many efficient approximate self-attention techniques have become prevalent since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods. Each of these methods has its own strengths. We observe these strengths synergistically complement each other and exploit these synergies to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA provide sizable performance gains over these approximate techniques and are of high quality. We theoretically and empirically evaluate both the runtime performance and quality of FLuRKA. Our runtime analysis posits a variety of parameter configurations where FLuRKA exhibit speedups and our accuracy analysis bounds the error of FLuRKA with respect to full-attention. We instantiate three FLuRKA variants which experience empirical speedups of up to 3.3x and 1.7x over low-rank and kernel methods respectively. This translates to speedups of up to 30x over models with full-attention. With respect to model quality, FLuRKA can match the accuracy of low-rank and kernel methods on GLUE after pre-training on WikiText-103. When pre-training on a fixed time budget, FLuRKA yield better perplexity scores than models with full-attention.



Email: damitha2@illinois.edu PI: Charith Mendis Avail for Hire Date: 5/2026

**Title:** SENSEi: Input Sensitive primitive compositions for GNNs

**Abstract:** Graph neural networks (GNNs) have become an important class of neural networks that have gained popularity in domains such as social and financial network analysis. As a result, there have been many frameworks and optimization techniques proposed in the literature to accelerate GNNs. However, getting consistent high performance across many input graphs with different sparsity patterns and embedding sizes has remained difficult. In this paper, we observe that different algebraic reassociations of GNN computations lead to interesting input-sensitive performance characteristics. We use these observations to introduce novel dense and sparse matrix primitive compositions targeting convolution-based and attention-based GNNs and show how their profitability changes with the input graph, embedding size, and target hardware. We developed SENSEi, a system that uses a data-driven adaptive strategy to select the best composition given the input graph and embedding sizes. Our evaluations on a wide range of graphs and embedding sizes show that SENSEi achieves speedups on canonical convolution and attention-based GNNs, of up to 2.049× and 1.153× on graph convolutional networks, and up to 51.123× and 6.868× on graph attention networks, on CPUs and GPUs respectively, compared to the widely used Deep Graph Library. We also show that our technique generalizes and gives speedups to other convolution (SGC, TAGCN) and attention (GATv2, GaAN) based GNN variants, and that the decisions made by SENSEi do not change across sampled graphs, enabling it to support sampled variants. Further, we show that the compositions yield notable synergistic performance improvements on top of other established sparse optimizations, such as sparse matrix tiling, by evaluating against a well-tuned baseline.

**CoAuthors:** Vimarsh Sathia, Gerasimos Gerogiannis, Serif Yesil, Josep Torrellas, Charith Mendis
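The idea of algebraic reassociation in the SENSEi abstract can be illustrated with the simplest case: a graph-convolution layer computes Â·X·W, and matrix multiplication is associative, so (Â·X)·W and Â·(X·W) give the same result at different cost. The sketch below is not SENSEi itself. The abstract describes a data-driven selection strategy, whereas this example uses a plain multiply-add count as a stand-in, and the names `matmul`, `gcn_layer_costs`, and `cheaper_order` are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product, used only to check that both orders agree."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer_costs(nnz: int, n_nodes: int, in_dim: int, out_dim: int) -> Dict[str, int]:
    """Approximate multiply-add counts for the two association orders.
    The sparse product costs nnz * (embedding width at that point);
    the dense product X @ W costs n_nodes * in_dim * out_dim either way."""
    dense = n_nodes * in_dim * out_dim
    return {
        "aggregate_first": nnz * in_dim + dense,   # (A @ X) @ W
        "transform_first": dense + nnz * out_dim,  # A @ (X @ W)
    }


def cheaper_order(nnz: int, n_nodes: int, in_dim: int, out_dim: int) -> str:
    """Pick the order with the lower estimated cost (assumed FLOP-count model)."""
    costs = gcn_layer_costs(nnz, n_nodes, in_dim, out_dim)
    return min(costs, key=costs.get)


# Hypothetical 3-node graph with self-loops (7 nonzeros), 2 input features, 1 output.
A_hat = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
X = [[1, 2], [3, 4], [5, 6]]
W = [[1], [-1]]

left = matmul(matmul(A_hat, X), W)    # aggregate first
right = matmul(A_hat, matmul(X, W))   # transform first
choice = cheaper_order(nnz=7, n_nodes=3, in_dim=2, out_dim=1)
```

Under this simple count, the choice reduces to comparing the input and output embedding widths. The abstract reports that real profitability also depends on the input graph and the target hardware, which is why a fixed rule like this one is only a starting point.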



PI: Muhammad Shahbaz Avail for Hire Date: 5/2024

Title: µManycore: A Cloud-Native CPU for Tail at Scale

Abstract: Microservices are emerging as a popular cloud-computing paradigm. Microservice environments execute typically-short service requests that interact with one another via remote procedure calls (often across machines), and are subject to stringent tail-latency constraints. In contrast, current processors are designed for traditional monolithic applications. They support global hardware cache coherence, provide large caches, incorporate microarchitecture for long-running, predictable applications (such as advanced prefetching), and are optimized to minimize average latency rather than tail latency. To address this imbalance, this paper proposes µManycore, an architecture optimized for cloud-native microservice environments. Based on a characterization of microservice applications, µManycore is designed to minimize unnecessary microarchitecture and mitigate overheads to reduce tail latency. Indeed, rather than supporting manycore-wide hardware cache coherence, µManycore has multiple small hardware cache-coherent domains, called Villages. Clusters of villages are interconnected with an on-package leaf-spine network, which has many redundant, low-hop-count paths between clusters. To minimize latency overheads, µManycore schedules and queues service requests in hardware, and includes hardware support to save and restore process state when doing a context switch. Our simulation-based results show that µManycore delivers high performance. A cluster of 10 servers with a 1024-core µManycore in each server delivers 3.7× lower average latency, 15.5× higher throughput, and, importantly, 10.4× lower tail latency than a cluster with iso-power conventional server-class multicores. Similarly good results are attained compared to a cluster with power-hungry iso-area conventional server-class multicores.

**CoAuthors:** Jovan Stojkovic, Muhammad Shahbaz, Josep Torrellas





Email: misra8@illinois.edu PI: Tarek Abdelzaher Avail for Hire Date: 5/2026

### AND

Sakshi Tayal Illinois



Email: stayal2@illinois.edu PI: Tarek Abdelzaher Avail for Hire Date: 8/2024

**Title:** Adaptive Precision Inference for Audio Signal Classification

Abstract: Quantization techniques have shown great promise in reducing the inference time and memory footprint of Deep Neural Networks (DNNs), which is critical to real-time cyber-physical systems that run in resource-constrained environments. Some notable schemes include post-training quantization, pre-training quantization, and learnable dynamic precision quantization. Due to the dynamic nature of the operating environment of IoT devices, static fixed-point inference across the lifetime of the inference engine results in sub-optimal accuracy on out-of-distribution inputs. We propose a temporally dynamic precision inference engine for real-time audio signal classification that learns an efficient precision selection scheme that defers casting of each layer to runtime. The resulting precision is contingent on the theoretical properties of an initial fixed number of audio frames. The cost of precision selection is amortized over a predefined time period, leading to an overall reduction in the number of arithmetic operations. Our framework achieves performance equivalent to static fixed-point precision quantization per inference and is robust to a wider range of input variations. The framework consists of an LSTM convolution encoder using spectrograms as the input, followed by dense weight layers for classification. The model casts the initial 32-bit fixed-point layer following each predefined time period. This results in dynamic precision across layers and time. The input signal is downsampled to 1500 Hz to isolate relevant frequencies and improve performance. Initial experiments performed on an ARM Cortex-M3 show that casting has a negligible contribution to the runtime. Our preliminary implementation achieves a classification accuracy of 74% for classification between two vehicles. Further hyperparameter tuning is expected to increase the accuracy.
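The runtime cast from a wide fixed-point representation to a narrower one can be sketched as follows. This is a generic round-and-saturate conversion written for illustration; the function name, the bit layouts in the example, and the rounding rule are assumptions, and the abstract does not specify how the framework selects or applies precisions.

```python
def cast_fixed_point(raw, src_frac_bits, dst_total_bits, dst_frac_bits):
    """Re-quantize a signed fixed-point integer to a narrower format.

    raw            -- integer holding the value scaled by 2**src_frac_bits
    dst_total_bits -- width of the destination word, including the sign bit
    dst_frac_bits  -- number of fractional bits in the destination format
    """
    shift = src_frac_bits - dst_frac_bits
    if shift > 0:
        # Drop low-order bits, adding half an LSB first to round to nearest.
        rounded = (raw + (1 << (shift - 1))) >> shift
    elif shift < 0:
        rounded = raw << -shift
    else:
        rounded = raw
    # Saturate to the representable range of the destination word.
    lowest = -(1 << (dst_total_bits - 1))
    highest = (1 << (dst_total_bits - 1)) - 1
    return max(lowest, min(highest, rounded))


# 0.75 stored with 16 fractional bits, cast to an 8-bit word with 6 fractional bits.
weight_q16 = int(0.75 * (1 << 16))
weight_q6 = cast_fixed_point(weight_q16, 16, 8, 6)
```

Because the cast is a shift, an add, and a clamp per value, its cost is small compared with the layer computation itself, which is in line with the abstract's observation that casting contributes negligibly to runtime.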





Email: vsathia2@illinois.edu PI: Charith Mendis Avail for Hire Date: 5/2024

**Title:** *Exploring and Exposing Redundancy-Aware Optimizations for Temporal Graph Neural Networks*

**Abstract:** We address optimization challenges in the realm of dynamic graphs by focusing on Temporal Graph Attention Networks (TGATs). Despite their effectiveness in predictive tasks, existing optimization methods for Graph Neural Networks (GNNs) fall short when applied to TGATs and TGNNs. To bridge this gap, we detail optimization opportunities in TGOpt, which exploit redundancies in temporal node embedding computations. Our results led to inference speedups of up to 4.9× on CPU and 2.9× on GPU, with notable gains of 6.3× on the CPU for the Reddit Posts dataset.

We then introduce TGLite, a lightweight framework to enable the efficient construction of TGNN models on Continuous-Time Dynamic Graphs (CTDGs). To capture message flow dependencies and accommodate temporal attributes, we introduce the *TBlock* abstraction. TBlocks serve as a central representation on which many different operators can be defined, such as temporal neighborhood sampling, scatter/segmented computations, as well as optimizations tailored to CTDGs. On 4 existing TGNN models, TGLite is able to accelerate runtime performance of training (1.06–3.43×) and inference (1.09–4.65×) across different experimental settings when compared against the TGL framework.
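One way to picture the redundancy in temporal node embedding computations mentioned above is that the same (node, timestamp) pair can be requested many times, both within a batch and across batches. The sketch below deduplicates and memoizes such requests. It is an illustration only: the class and function names are hypothetical, the abstract does not list TGOpt's specific techniques, and reuse is only valid under the assumption that the model weights and the relevant graph history are unchanged between requests.

```python
class EmbeddingCache:
    """Memoizes embeddings keyed by (node, timestamp). Hypothetical helper."""

    def __init__(self, compute_fn):
        self._compute = compute_fn  # the expensive embedding computation
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def embed(self, node, timestamp):
        key = (node, timestamp)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._compute(node, timestamp)
        return self._cache[key]


def embed_batch(cache, queries):
    """Compute each distinct (node, timestamp) query once, then scatter results."""
    unique = list(dict.fromkeys(queries))  # order-preserving deduplication
    results = {query: cache.embed(*query) for query in unique}
    return [results[query] for query in queries]


# Stand-in for the real embedding function; it records how often it is called.
calls = []

def toy_embedding(node, timestamp):
    calls.append((node, timestamp))
    return node * 10 + timestamp

cache = EmbeddingCache(toy_embedding)
first = embed_batch(cache, [(1, 5), (2, 5), (1, 5)])
second = embed_batch(cache, [(1, 5)])
```

In this toy run the embedding function executes twice for four requests: the duplicate inside the first batch is removed by deduplication, and the repeat in the second batch is served from the cache.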



Email: jianming.tong@gatech.edu PI: Tushar Krishna Avail for Hire Date: 1/2024

**Title:** SUSHI: Model-System-Accelerator Co-Design for Real-Time Latency/Accuracy Navigation in Edge Applications

Abstract: A growing number of applications depend on Machine Learning (ML) functionality and benefit from both higher quality ML predictions and better timeliness (latency) at the same time. A growing body of research in computer architecture, ML, and systems software literature focuses on reaching better latency/accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy while preserving delivered accuracy. All of them, however, yield improvements for a single static point in the latency/accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that uses (activates) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler, SushiSched, that controls which SubNets to serve and what to cache in real time. Combined, they are vertically integrated into SUSHI, an inference serving stack. For the stream of queries, SUSHI yields up to a 25% improvement in latency and a 0.98% increase in served accuracy. SUSHI can achieve up to 78.7% off-chip energy savings.

**CoAuthors:** Athinagoras Skiadopoulos, Zhiqiang Xie, Mark Zhao, Saksham Agarwal, Johann Hauswald, Jacob Adelmann, David Ahern, Carlo Contavalli, Michael Goldflam, Raghu Raja, Daniel Walton, Rachit Agarwal, Shrijeet Mukherjee, Christos Kozyrakis





Email: tianshi3@illinois.edu PI: Tarek Abdelzaher Avail for Hire Date: TBD

**Title:** SudokuSens: Enhancing Deep Learning Robustness for IoT Sensing Applications using a Generative Approach

**Abstract:** This poster introduces SudokuSens, a generative framework for automated generation of training data in machine-learning-based Internet-of-Things (IoT) applications, such that the generated synthetic data mimic experimental configurations not encountered during actual sensor data collection. The framework improves the robustness of resulting deep learning models, and is intended for IoT applications where data collection is expensive. The work is motivated by the fact that IoT time-series data entangle the signatures of observed objects with the confounding intrinsic properties of the surrounding environment and the dynamic environmental disturbances experienced. To incorporate sufficient diversity into the IoT training data, one therefore needs to consider a combinatorial explosion of training cases that are multiplicative in the number of objects considered and the possible environmental conditions in which such objects may be encountered.

Our framework substantially reduces these multiplicative training needs. To decouple object signatures from environmental conditions, we employ a Conditional Variational Autoencoder (CVAE) that allows us to reduce data collection needs from multiplicative to (nearly) linear, while synthetically generating (data for) the missing conditions. To obtain robustness with respect to dynamic disturbances, a session-aware temporal contrastive learning approach is taken. Integrating these two approaches, SudokuSens significantly boosts the robustness of deep learning for IoT applications. We show that SudokuSens is general enough to benefit a variety of downstream neural network architectures and improve the performance of multiple temporal activity classification tasks.

CoAuthors: Tarek Abdelzaher




Email: billywty@umich.edu PI: Zhengya Zhang Avail for Hire Date: 4/2027

**Title:** A high-bandwidth, energy-efficient chiplet interface for composable acceleration platform

**Abstract:** Integrating heterogeneous chiplets within a single package is emerging as a promising and cost-effective strategy for constructing new compute platforms capable of handling a wide spectrum of workloads. Designing energy-efficient chiplet interfaces that satisfy the high bandwidth demands of various applications is an intricate task. In this study, we present a high-performance interface design produced by an automated design flow. We initially target the UCIe standard. The interface design features auto-calibration and built-in self-testing to enable seamless adaptation. The automated design flow is streamlined through a series of automated steps, from I/O cell synthesis and automatic place and route (APR) to the generation of bump maps and distribution of clock signals. The automation will contribute to the subsequent development of an I/O interface generator to expedite the chiplet design cycle.

CoAuthors: Wei Tang, Zhengya Zhang




Email: yaoy4@illinois.edu PI: Josep Torrellas Avail for Hire Date: Summer 2024

**Title:** Optimize Graph Attention Network Training and Inference on CPUs

Abstract: Traditional deep neural networks (DNNs), such as convolutional neural networks, are applicable only to Euclidean data, such as a grid of pixels in an image, and lack the power to process non-Euclidean data, such as graphs. A graph neural network (GNN) is a type of DNN that specializes in processing graph-structured data. GNNs are becoming popular and have wide application domains such as recommender systems, social networks, and knowledge graphs. However, the performance of these heavily memory-bound GNNs on CPUs can be limited by the stress on memory. Typically, a GNN layer, such as in GraphSAGE and Graph Convolutional Network (GCN), is composed of a memory-intensive aggregation phase, where each vertex collects information from its neighbors, and a compute-intensive update phase, where a deep learning operator such as a fully-connected layer processes the collected information. Graph Attention Network (GAT) is a special type of GNN that incorporates attention on its edges to learn the importance of the neighbors for each vertex. It gives substantial performance improvement at the cost of increased computational complexity. However, this also potentially introduces room for optimization using layer-fusion techniques, where we can accelerate its execution on CPUs by fusing the attention-calculation phase and the aggregation phase so that memory accesses can be overlapped with computation and DRAM traffic can be significantly reduced. Therefore, in this project, we explore different possible ways, including layer fusion, to optimize full-batch GAT training and inference on CPUs.

**CoAuthors:** Zhangxiaowen Gong, Christopher W. Fletcher, Christopher J. Hughes, Josep Torrellas



### **Junkang Zhu** Univ. of Michigan



Email: jkzhu@umich.edu PI: Zhengya Zhang Avail for Hire Date: 10/2024 **Title:** An Evolvable and Composable Chiplet Design for Future Heterogeneous Machine Learning and Big data Processing

Abstract: Future machine learning and big data processing require both intensive computation power and support for extensive heterogeneous computation kernels. New hardware accelerators for such applications rely on evolvability and composability to fulfill these demands. Evolvability allows an accelerator to be reconfigured and reprogrammed to support a wide range of heterogeneous computation kernels. With composability, an accelerator can host multiple heterogeneous computation kernels, and multiple accelerators can communicate and coordinate with each other for extensible and scalable hyperscale computing. We present ECOM, an evolvable and composable chiplet design for future heterogeneous machine learning and big data processing. The chiplet design consists of CPU tiles and evolvable CGRA tiles. The CPU tiles provide programmability to schedule heterogeneous workloads and reconfigure the CGRA tiles. Each evolvable CGRA tile contains an array of programmable processing elements (PEs) connected through a reconfigurable interconnect network. An evolvable CGRA can be reconfigured and reprogrammed to effectively and efficiently perform computation and data movement in a variety of kernels. Different computation kernels can be mapped onto one or multiple evolvable CGRA tiles to compose larger computation tasks. The composable mapping is highly flexible and can be achieved across PEs in a CGRA tile and across CGRA tiles in a chiplet. Furthermore, multiple chiplets equipped with high-bandwidth standard interfaces can communicate and coordinate with each other for hyperscale machine learning and big data processing.

### **Theme 2 Poster Session**

### **RESEARCH SCHOLAR**

Narangerelt Batsoyol UCSD



Email: nbatsoyo@ucsd.edu PI: Steven Swanson Avail for Hire Date: NA

**Title:** DPU-accelerated Near-Storage Data Filtering **Abstract:** In the context of data-intensive applications, transferring large datasets over constrained network links (such as the Internet) often results in performance bottlenecks. To address this issue and improve overall system performance, we introduce a novel framework leveraging Data Processing Units (DPUs) for near-storage data filtering. DPUs are specialized system-on-chip solutions that integrate high-performance CPUs, network interfaces, and programmable acceleration engines. These units facilitate the computational offloading of data filtering tasks, allowing data to be processed closer to where it is stored. Our framework operates transparently, requiring no alterations to the existing storage infrastructure, thereby maintaining flexibility and security isolation. This approach is especially well-suited for applications dealing with rapidly growing data volumes, such as database queries on Parquet files stored in data lakehouses, scientific research analytics, and the preparation of machine learning training sets. By performing data filtering close to storage, we achieve substantial reductions in data transfer volume, thereby optimizing overall system performance.

**POSTER DETAILS** 

| <image/> <image/> <image/> | <b>Title:</b> <i>Telepathic Datacenters: Fast RPCs With Shared CXL</i><br><i>Memory</i><br><b>Abstract:</b> Compute Express Link (CXL) enables memory<br>sharing between devices, presenting opportunities to rethink<br>application-to-application communication within data<br>centers. We propose utilizing CXL to optimize remote<br>procedure calls (RPCs) in microservices. Current RPCs suffer<br>from high overheads stemming from serialization,<br>deserialization, and data copying, which consume up to 27%<br>of CPU cycles. We aim to mitigate this "data center tax" by<br>designing RPC frameworks that leverage CXL shared<br>memory. Our approach relies on CXL shared memory for<br>communication and falls back to RDMA/TCP for datacenter<br>scale requests. This CXL/RDMA hierarchy can be used to<br>create a unified virtual address space in the datacenter,<br>enabling true zero-copy messaging via pointer passing.<br>Realizing the potential of CXL shared memory poses<br>challenges, including isolation, signaling, and orchestration.<br>In this study, we implement communication channels and<br>several sandboxing, isolation, and protection mechanisms,<br>benchmark against microservice workloads, and prototype<br>replacements for network services. Finally, we explore<br>different failure models for shared memory in which the<br>server and client can fail independently.<br><b>CoAuthors:</b> Suyash Mahar, Zifeng Zhang, Steven Swanson |
|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Amin Mamandipoor<br>Kansas | <b>Title:</b> <i>SmartDIMM: In-Memory Acceleration of Upper Layer I/O Protocols</i><br><b>Abstract:</b> With high-throughput I/O devices deployed in datacenter servers, DRAM is on the path of processing the layered, asynchronous I/O software stack. In this setting, the buffer devices of memory modules are an ideal place for inline acceleration of upper-layer I/O protocols (ULPs). In this work, we architect SmartDIMM, a platform for near-memory acceleration of ULPs. We prototype SmartDIMM using Samsung AxDIMM and implement the end-to-end offload of Transport-Layer Security (TLS) and (de)compression, two key datacenter ULPs that are categorized under datacenter tax operations. We compare the performance of SmartDIMM with CPU, SmartNIC, and PCIe-based accelerator offload implementations. Our results show that TLS offload on SmartDIMM outperforms the CPU implementation as well as SmartNIC and PCIe-based offload configurations. Compared to a server that executes (de)compression and (en/de)cryption on CPU, SmartDIMM delivers 21.0% to 10.28× higher requests per second and 36.3% to 88.9% lower memory bandwidth utilization. <b>CoAuthors:</b> John Salihu, Mohammad Alian |







From JUMP 2.0 PRISM Center collaborating with ACE on project 005 Email: as8hu@virginia.edu PI: Kevin Skadron / José F. Martínez **Title:** *Membrane: A PIM-based Architecture to Accelerate Database OLAP Queries*

Abstract: This work explores the application of processing-in-memory (PIM) techniques to Online Analytical Processing (OLAP) database workloads. We explore how to map queries onto subarray-level PIM, which enables parallelism across subarrays and banks. We systematically explore mapping strategies and trade-offs between bit-serial/element-parallel and bit-parallel/element-serial designs adapted from the prior Sieve and Fulcrum architectures, respectively. We find that join operations do not map well to subarray-level PIM architectures, and thus we need to use a software prejoin/denormalization method to transform join operations to selection/filter operations. We also learn that certain operations, such as aggregation, remain better served using the CPU. Thus, we propose a cooperative approach for analytic query processing between CPU and PIM. We then explore several dimensions in the design space of PIM architectures, including different ways to perform filter operations, and a new way to return data to the CPU. We conclude that a traditional columnar-database layout with a scalar processing element in the PIM-enabled subarrays (Membrane-H) for the table scan, combined with a rank-level unit (RLU) for gathering the selected elements, is the best configuration. An evaluation of end-to-end query processing on the popular analytic benchmark SSB at scale factor 100 (a 60GB database) yields a 45.39× geometric-mean speedup over a hand-optimized AVX-512 implementation of SSB.

**CoAuthors:** Lingxi Wu, Kevin Gaffney, Martin Prammer, Helena Caminal, Yimin Gao, Ashish Venkat, José F. Martínez, Jignesh Patel, Kevin Skadron
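The prejoin/denormalization step described in the abstract, which turns a join into a selection over a single wide table, can be sketched in plain Python. This is a software illustration of the data transformation only, not of the PIM hardware; the helper names are hypothetical, and the column names follow the usual SSB naming as an assumption.

```python
def prejoin(fact_rows, dim_rows, foreign_key, primary_key):
    """Denormalize: copy each matching dimension row's attributes into the fact row."""
    dim_by_key = {dim[primary_key]: dim for dim in dim_rows}
    wide_rows = []
    for fact in fact_rows:
        dim = dim_by_key[fact[foreign_key]]
        merged = dict(fact)
        merged.update({k: v for k, v in dim.items() if k != primary_key})
        wide_rows.append(merged)
    return wide_rows


def scan_filter(rows, predicate):
    """A pure table scan with a selection predicate (the part suited to PIM)."""
    return [row for row in rows if predicate(row)]


lineorder = [
    {"lo_orderdate": 1, "lo_revenue": 100},
    {"lo_orderdate": 2, "lo_revenue": 250},
    {"lo_orderdate": 1, "lo_revenue": 40},
]
dates = [
    {"d_datekey": 1, "d_year": 1993},
    {"d_datekey": 2, "d_year": 1994},
]

# Done once, ahead of query time: the join disappears from the query itself.
wide = prejoin(lineorder, dates, "lo_orderdate", "d_datekey")

# At query time, the former join predicate is an ordinary filter on the wide table.
selected = scan_filter(wide, lambda row: row["d_year"] == 1993)

# Aggregation over the selected rows is left to the CPU, as in the cooperative split.
total_revenue = sum(row["lo_revenue"] for row in selected)
```

The trade-off is storage: the wide table repeats dimension attributes in every fact row, in exchange for queries that consist only of scans and filters.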



Email: mis015@ucsd.edu PI: Steven Swanson Avail for Hire Date: 9/2024

Title: CXL-based SSD-autonomic scheduling system

Abstract: NAND flash memory-based solid-state drives (SSDs) have been widely used in data centers due to their better performance compared with hard disk drives (HDDs). However, SSDs do not always provide low access latency, which can be attributed to their background jobs and uneven workload distribution. This results in unstable performance and adversely affects the quality of service (QoS) requirements. To address the issue of SSDs' long tail latency, we propose an SSD-autonomic distributed scheduling system based on the new cache-coherent memory access protocol, Compute Express Link (CXL). The system employs CXL.mem and CXL.cache to provide high-performance state communication, which allows SSDs to handle scheduling work. By offloading scheduling work from host CPUs to processors in SSDs, the computing capacity required for scheduling work naturally scales with the storage capacity, even when storage devices are disaggregated. Additionally, scheduling work does not interfere with the main workloads processed on host CPUs. Since SSDs' processors manage the scheduling work, scheduling decisions can be made instantly based on SSDs' internal states, which are not visible to host CPUs for most commodity market SSDs. CXL also enables low-overhead request redirection. With a carefully designed backup method and suitable concurrent data structures, requests can be redirected to other SSDs while the original SSD is busy processing normal requests or background jobs, mitigating the effect on latency.





Email: ct652@cornell.edu PI: José F. Martínez Avail for Hire Date: 2027 **Title:** Increasing the Efficiency of Associative Processors via CMOS-Compatible Hybridization

Abstract: Associative processors (APs) have recently reemerged as an appealing architecture that provides vast amounts of data-level parallelism. Internally, APs carry out arithmetic and logic operations on very long vectors (tens of thousands of elements or more) via sequences of bulk search and update operations, without the need for ALU circuitry. Emerging memory technologies (EMTs) could further enhance APs through gains in density and energy efficiency. However, EMTs often suffer from slower write speeds, higher write energy costs, or lower endurance. High write latencies and wearout levels, in particular, can be lethal to APs' performance and endurance, as most arithmetic and logic operations involve multiple bulk updates. In this work, for the first time, we propose a hybrid CMOS-EMT AP solution that reaps the energy and area advantages of EMTs in addition to the performance and endurance benefits of CMOS. A small fraction of the total AP vector register storage is implemented in CMOS, which the microarchitecture engages selectively to take advantage of CMOS' faster and more resilient writes. At the same time, a FeFET-based organization serves as the primary storage of the vector registers, resulting in significant area-delay-power (ADP) improvement over a full-CMOS implementation. All of this is transparent to the programmer, as it requires no changes to the ISA or the program. We evaluate our proposed mechanism using a sophisticated cycle-approximate execution-driven simulation infrastructure. Results show that our hybrid AP design is barely 1% slower than a full CMOS implementation while at the same time achieving a 2.29× ADP$^{-1}$ improvement over a pure FeFET design (1.11× ADP$^{-1}$ improvement over pure CMOS) and essentially eliminating the FeFET design's endurance disadvantages.

**CoAuthors:** Socrates Wong, Dayane Reis, Xiaobo Sharon Hu, Michael Niemier, José Martínez
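To make "arithmetic via sequences of bulk search and update operations" concrete, the sketch below simulates a textbook-style bit-serial, in-place vector addition on an associative array. It is a software model for illustration, not the hybrid CMOS-EMT design from the abstract; the function name and the row layout (fields `a`, `b`, and a one-bit carry `c`) are assumptions.

```python
def ap_add(a_vals, b_vals, width):
    """Element-wise b := (a + b) mod 2**width using only search/update passes."""
    rows = [{"a": a, "b": b, "c": 0} for a, b in zip(a_vals, b_vals)]

    # Full-adder truth-table entries whose outputs differ from their inputs:
    # (a_bit, b_bit, carry_in) -> (new b_bit, new carry).
    # The order matters: no updated row may match a pass that runs later.
    passes = [
        ((0, 0, 1), (1, 0)),
        ((0, 1, 1), (0, 1)),
        ((1, 1, 0), (0, 1)),
        ((1, 0, 0), (1, 0)),
    ]

    for bit in range(width):
        for pattern, (new_b, new_c) in passes:
            # Bulk search: tag every row whose bits match the pattern.
            tagged = [
                row for row in rows
                if ((row["a"] >> bit) & 1, (row["b"] >> bit) & 1, row["c"]) == pattern
            ]
            # Bulk update: write the result bits into all tagged rows at once.
            for row in tagged:
                row["b"] = (row["b"] & ~(1 << bit)) | (new_b << bit)
                row["c"] = new_c
    return [row["b"] for row in rows]


sums = ap_add([3, 5, 7], [1, 2, 1], width=4)
```

Each bit position costs four search/update pairs regardless of vector length, which is where the data-level parallelism comes from. It also shows why write cost dominates: every pass ends in a bulk write, the operation that the abstract identifies as slow and wear-inducing in emerging memory technologies.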



Email: johnson.chinedu@ku.edu PI: Mohammad Alian Avail for Hire Date: 5/2024

**Title:** Userspace Networking in gem5

Abstract: Full-system simulation of computer systems is critical to capture the complex interplay between various hardware and software components in future systems. Modeling the network subsystem is indispensable to the fidelity of the full-system simulation due to the increasing importance of scale-out systems. The network software stack has undergone major changes over the last decade, and kernel-bypass networking stacks and data-plane networks are rapidly replacing the conventional kernel network stack. Nevertheless, the current state-of-the-art architectural simulator, gem5, still uses kernel networking, which precludes realistic network application scenarios. In this work, we first show the limitation of gem5's current network stack in achieving a high network bandwidth. Then we enable a kernel-bypass networking stack on gem5. We extend gem5's NIC hardware model and device driver to enable support for userspace device drivers to run the DPDK framework. We also implement a network load generator hardware model in gem5 to generate various traffic patterns and perform per-packet timestamp and latency measurements without introducing packet loss. We develop a suite of five networking micro-benchmarks for stress testing the host network stack. These applications can run on both gem5 and a real system with a fast turnaround for gem5. Our experimental results show that enabling userspace networking improves gem5's network bandwidth by 5.4× compared with the current Linux software stack. We characterize the performance differences when running the DPDK network stack on a real system and gem5 and evaluate the sensitivity of DPDK performance to various system and microarchitecture parameters. This work is the first step in refactoring the networking subsystem in gem5.

**CoAuthors:** Siddharth Agarwal, Derrick Quinn, Nikita Lazarev, Mohammad Alian



Email: ky362@cornell.edu PI: José Martínez Avail for Hire Date: 5/2024

**Title:** VersaTile: Flexible Tiled Architectures via Processing-Using-Memory Cores

Abstract: As modern applications demand more data, processing-in-memory (PIM) and processing-using-memory (PUM) architectures have emerged to address the challenges of data movement and parallelism. In this paper, we propose VersaTile, a heterogeneous, fully CMOS-based tiled architecture that combines conventional out-of-order (OoO) superscalar CPUs and PUM cores, both leveraging the RISC-V ISA and its standard vector extensions for vector-SIMD execution. VersaTile fosters collaboration between multiple low-latency CPUs and high-throughput PUM cores by sharing the same software stack and adopting a CPU programming and compilation frontend. Moreover, we introduce PUM Fusion, a mechanism enabling the aggregation of multiple PUM cores' memory arrays into a single vector super-unit with modest hardware support and no programming effort, to pursue optimal performance across a wide range of applications. We provide a detailed case study including a scalable floorplan example, as well as a comprehensive evaluation over various design points. Our experiments show that when only using PUM cores, VersaTile can achieve, on average across the Phoenix benchmark suite and 3D convolution, a 5.7× speedup with respect to area-equivalent OoO CPU cores with SIMD ALUs (up to 23×), and 4.6× with respect to an equivalent-sized monolithic PUM baseline (up to 29×). For applications with both DLP (vector) and ILP (scalar) regions, VersaTile can use PUM and OoO cores collaboratively to achieve better performance than solely using either one of them, by up to 4.4×.

# Theme 3 & Application Benchmarks Poster Session

### **RESEARCH SCHOLAR**

Charles Block Illinois



Email: coblock2@illinois.edu PI: Josep Torrellas Avail for Hire Date: 2027

### **POSTER DETAILS**

Title: Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM

Abstract: Sparse matrix times dense matrix multiplication (SpMM) is commonly used in applications ranging from scientific processing to graph neural networks. Often, when this operation is performed in a distributed system, the communication costs dominate due to poor data reuse. Prior work has investigated algorithms that execute data transfers in a sparsity-unaware manner or in a sparsity-aware manner. In the former category, techniques such as collectives or shifting algorithms are employed to transfer data in a coarse-grained manner without considering the input sparsity pattern. In the latter category, the locations of input sparse matrix nonzeros determine asynchronous, fine-grained accesses. Although both can be effective, each of these approaches has pitfalls. On the one hand, sparsity-unaware transfers can lead to unnecessary data transfers. On the other hand, sparsity-aware transfers typically carry a high software overhead and require more network round-trips. We claim that a combination of the two communication flavors can produce a more efficient distributed SpMM kernel. Towards this goal, we utilize MPI collectives for larger, contiguous data transfers, and finer-grained asynchronous one-sided communications for residual data. We propose and implement an algorithm, Two-Face, which partitions the input into a collective portion and a one-sided portion. We describe how this algorithm can be calibrated, and detail its implementation using MPI and OpenMP. We evaluate Two-Face against several baselines using large real-world sparse matrices and show that Two-Face displays an average speedup of 1.99× over the next-best baseline. Additionally, we compare Two-Face's scaling behavior to our best performing baseline and show that Two-Face scales well with the number of nodes in the system.

**CoAuthors:** Gerasimos Gerogiannis, Charith Mendis, Ariful Azad, Josep Torrellas
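The idea of splitting communication into a collective portion and a one-sided portion can be sketched as a per-block decision on the dense matrix rows that a node needs. The rule below is a deliberately simple stand-in: the function name, the fixed density threshold, and the block granularity are all assumptions, whereas the abstract describes a calibrated partitioning algorithm whose details are not given here.

```python
def classify_blocks(needed_rows, block_size, num_rows, threshold=0.5):
    """Assign each block of dense-matrix rows to a communication flavor.

    needed_rows -- set of dense-matrix row indices touched by this node's
                   sparse nonzeros (their column indices)
    Returns a plan with blocks fetched whole via a collective, and blocks
    from which only the listed rows are fetched via one-sided gets.
    """
    plan = {"collective": [], "one_sided": {}}
    for start in range(0, num_rows, block_size):
        end = min(start + block_size, num_rows)
        needed = sorted(r for r in needed_rows if start <= r < end)
        if not needed:
            continue  # nothing in this block is used, so transfer nothing
        if len(needed) / (end - start) >= threshold:
            # Most of the block is used: one large contiguous transfer.
            plan["collective"].append((start, end))
        else:
            # Only a few rows are used: fetch them individually.
            plan["one_sided"][(start, end)] = needed
    return plan


plan = classify_blocks(needed_rows={0, 1, 2, 3, 9}, block_size=4, num_rows=12)
```

In this example the first block is fully used and goes to the collective portion, the middle block is skipped entirely, and the last block contributes a single row fetched one-sidedly, which avoids moving three unneeded rows.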





Email: ajaybr@mit.edu PI: Manya Ghobadi Avail for Hire Date: 8/2025

**Title:** *LAKEPLACID: Compiling Datacenter Applications to the Microsecond Latency Regime* 

Abstract: We present LAKEPLACID, a compiler-based framework that enables data center applications with legacy TCP/UDP sockets to achieve µs-scale latency with minimal programmer effort. LAKEPLACID leverages an important but perhaps overlooked observation: depending on the workload, only a small fraction of the code impacts the overall performance of some networking applications. We refer to this small fraction of the application as the EliteCode, and use a custom-designed compiler to automatically identify parts of the application logic that belong to the EliteCode. LAKEPLACID automatically transforms the EliteCode to run inside the kernel using an optimized TCP/UDP network stack. To enable data center operators to fine-tune EliteCode's behavior, LAKEPLACID's approach is parameterized, reconfigurable, and automatic. We implement three µs-scale applications using LAKEPLACID: Memcached, NGINX, and an echo server. Our evaluations demonstrate that LAKEPLACID achieves 2.55µs median round-trip latency, which is on par with the performance of eRPC and Demikernel. LAKEPLACID realizes this low latency while requiring the developer to change only  $\approx 0.5\%$  of the code, leaving the rest of the code optimizations to its compiler.

CoAuthors: Manya Ghobadi and Saman Amarasinghe





Email: girfan@mit.edu PI: Adam Belay Avail for Hire Date: NA

#### AND

Zain Ruan MIT



Email: zainruan@mit.edu PI: Adam Belay Avail for Hire Date: NA

Title: Towards Self-Balancing Cloud Storage

Abstract: In today's cloud, the best available option for high performance storage is to dedicate a locally attached flash device to a specific workload. This is necessary because flash performs poorly when it is shared across tenants. Unfortunately, this leaves the flash device mostly idle because it is unlikely that enough demand will be generated to saturate it. We propose a new self-balancing approach to cloud storage that can efficiently share flash resources among many tenants over the network. Our goal is to drive up utilization while delivering performance that is equivalent to or better than locally attached flash. However, to realize this vision, we must overcome two challenges. First, sharing flash often leads to hotspots, which can cause long delays in accessing disk blocks. Second, when mixing reads and writes on the same device, flash suffers from a collapse in throughput and higher tail latency. To resolve these problems, we propose fine-grained request steering and adaptive block replication/migration to prevent hotspots and segregate reads and writes onto specific flash devices. Our preliminary analysis suggests our solution has the potential to improve utilization by 300% without impacting performance.

CoAuthors: Zhenyuan Ruan, Adam Belay





**Title:** Information-Theoretic Variational Graph Auto-Encoders for Unsupervised Belief Representation Learning and Ideology Detection

Abstract: This project proposes a novel unsupervised algorithm for belief representation learning in social networks that jointly embeds users and content items into an underlying belief space, facilitating a number of downstream tasks, such as stance detection, stance prediction, and ideology mapping. We propose the Information-Theoretic Variational Graph Auto-Encoder (InfoVGAE) that learns to project both users and content items (e.g., posts that represent user views) into an appropriate disentangled latent space. To better disentangle latent variables in that space, we develop a total correlation regularization module and a Proportional-Integral (PI) control module, and adopt a rectified Gaussian distribution to ensure orthogonality. The latent representation of users and content can then be used to quantify their ideological leaning and predict their stances on issues. We evaluate the performance of the proposed InfoVGAE on three real-world datasets, of which two are collected from Twitter and one from the U.S. Congress voting database. The evaluation results show that our model outperforms state-of-the-art unsupervised models, reducing user clustering errors by 10.5% and achieving 12.1% higher F1 scores for ideological separation of content items. In addition, we discuss the scalability bottleneck of the proposed InfoVGAE algorithm and potential improvements to speed up the proposed belief representation learning algorithm.

**CoAuthors:** Huajie Shao, Dachun Sun, Xinyi Liu, Ruijie Wang, Yuchen Yan, Jinyang Li, Shengzhong Liu, Hanghang Tong, and Tarek Abdelzaher



Title: eZNS: An Elastic Zoned Namespace for Commodity

Abstract: Emerging Zoned Namespace (ZNS) SSDs, providing the coarse-grained zone abstraction, hold the potential to significantly enhance the cost-efficiency of future storage infrastructure and mitigate performance unpredictability. However, existing ZNS SSDs have a static zoned interface, making them unable to adapt to workload runtime behavior, unable to scale with underlying hardware capabilities, and prone to interference among co-located zones. Applications either under-provision the zone resources, yielding unsatisfactory throughput, create over-provisioned zones and incur costs, or experience unexpected I/O latencies. We propose eZNS, an elastic-zoned namespace interface that exposes an adaptive zone with predictable characteristics. eZNS comprises two major components: a zone arbiter that manages zone allocation and active resources on the control plane, and a hierarchical I/O scheduler with read congestion control and write admission control on the data plane. Together, eZNS enables the transparent use of a ZNS SSD and closes the gap between application requirements and zone interface properties. Our evaluations over RocksDB demonstrate that eZNS outperforms a static zoned interface by 17.7% and 80.3% in throughput and tail latency, respectively.

CoAuthors: Chenxingyu Zhao, Ming Liu, and Arvind

| <image/> <image/> <image/>              | <b>Title:</b> <i>NetEye: Extending the capabilities of a</i><br><i>Programmable Switch using Time-Shifted Streams</i><br><b>Abstract:</b> Managing and securing networks requires<br>collecting and analyzing network traffic in real time. To this<br>end, network operators often rely on telemetry systems and<br>machine learning models to monitor the state of their<br>network. These systems rely on programmable data plane<br>targets to scale query execution. They offer high packet-<br>processing speeds, but their limited computing and memory<br>resources necessitate employing approximation techniques<br>(e.g., sampling, sketches, and iterative refinement) that affect<br>accuracy. In this paper, we explore a different way to increase<br>the computational capacity of a programmable switch to<br>increase the accuracy of a given system. We augment the<br>recirculation path of a packet by leveraging the additional<br>computational and storage capabilities of a modern near-<br>switch device. Packet recirculation helps us in resolving queries and<br>classifying packets with a minimal hit on accuracy while<br>incurring an acceptable delay. We introduce a buffer-based<br>packet-header collection and storage architecture, named<br>NetEye, that allows us to store packets worth of data streams<br>in an efficient manner. On the near-switch device, we employ<br>compression mechanisms, which reduce storage overhead by<br>92% and network bandwidth by 6.4%, allowing for dynamic<br>resource usage on the switch. Consequently, our system<br>scales to more than fifteen simultaneous queries<br>without compromising accuracy.<br><b>CoAuthors:</b> Enkeleda Bardhi |
|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <complex-block><image/></complex-block> | <b>Title:</b> <i>Efficient Recovery from Faults in Leaderless</i><br><i>Distributed Systems</i><br><b>Abstract:</b> In the high-performance realm of modern<br>distributed systems, resilience against failures such as crashes<br>and network partitions poses a significant challenge. The<br>solution lies partly in data distribution across nodes and their<br>durable mediums, particularly with the rising prevalence of<br>low-latency persistent memories. The complexity increases in<br>leaderless distributed systems that permit client requests to<br>be served by multiple nodes. This work introduces a novel<br>system, IASO, designed for efficient recovery in leaderless<br>distributed systems equipped with persistent memory. IASO<br>allows systems to harness the high performance typically<br>associated with leaderless configurations, while also<br>providing resilience against failures under Linearizable<br>consistency and various persistency models.<br><b>Coauthors:</b> Burak Ocalan, Fabien Chaix, Ramnatthan<br>Alagappan, Josep Torrellas |

| Sudarsanan Rajasekaran               | <b>Title:</b> <i>CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters</i>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|--------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| MIT                                  | <b>Abstract:</b> We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an affinity graph that finds a series of time-shift values to adjust the communication patterns of jobs sharing the same network link so that they are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that, compared to state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number of ECN-marked packets in the cluster by up to 33x. <b>Coauthors:</b> Manya Ghobadi and Aditya Akella |
| Athinagoras Skiadopoulos<br>Stanford | <b>Title:</b> <i>High-throughput and Flexible Host Networking via</i><br><i>Control and Data Path Physical Separation</i><br><b>Abstract:</b> End-host network stacks can offer high<br>performance or protocol flexibility, but not their combination.<br>This limitation can largely be attributed to the tight<br>integration of the network data and control path in current<br>solutions. We argue that physical separation of the data and<br>control path enables a performant and flexible host network<br>stack. We present a co-designed hardware NIC and software<br>stack that can execute arbitrary transport protocols anywhere<br>(e.g., in kernel in a CPU, in user space in a CPU, or even in<br>specialized packet processing accelerators), while asserting<br>control over a zero-copy data path directly between the NIC<br>and the memory of arbitrary devices (e.g., CPUs, GPUs, or<br>other storage/compute components).<br><b>CoAuthors:</b> Zhiqiang Xie, Mark Zhao, Saksham Agarwal,<br>Johann Hauswald, Jacob Adelmann, David Ahern, Carlo<br>Contavalli, Michael Goldflam, Raghu Raja, Daniel Walton,<br>Rachit Agarwal, Shrijeet Mukherjee, Christos Kozyrakis |
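The CASSINI abstract above mentions time-shift values that interleave the communication patterns of jobs sharing a link. As a rough illustration of that one idea, the sketch below brute-forces a shift for two jobs whose traffic is modeled as 0/1 activity per time slot over a common period. This is not CASSINI's geometric abstraction or its affinity graph; the slot model and the names `overlap` and `best_shift` are hypothetical.

```python
def overlap(a, b, shift):
    """Count slots where job a and job b (delayed by `shift`) both send.

    a, b: equal-length lists of 0/1 flags, one per time slot of a
          shared period; 1 means the job is communicating in that slot.
    """
    n = len(a)
    return sum(a[t] * b[(t - shift) % n] for t in range(n))


def best_shift(a, b):
    """Return the smallest delay of job b that minimizes link contention."""
    return min(range(len(a)), key=lambda s: overlap(a, b, s))


job_a = [1, 1, 0, 0]   # communicates in the first half of the period
job_b = [1, 1, 0, 0]   # same pattern, so unshifted they collide fully
shift = best_shift(job_a, job_b)
print(shift, overlap(job_a, job_b, shift))  # 2 0
```

Delaying the second job by two slots places its communication phase in the first job's idle phase, so the two no longer contend for the link.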



Email: ruijiew2@illinois.edu PI: Tarek Abdelzaher Avail for Hire Date: 5/2024

**Title:** Online Inference Acceleration by Learning to Sample and Refresh on Streaming Temporal Graphs

Abstract: This paper studies online link prediction on streaming temporal graphs, aiming to efficiently update deployed models on freshly acquired temporal data to ensure sustained long-term performance. State-of-the-art methods fall short in retaining and adapting informative knowledge distilled from existing data onto freshly gathered data for online updates, as they either cater exclusively to offline scenarios where all training data is available upfront or lack sufficient modeling of temporal information and temporal graph structures during online updates. We propose a temporal meta-training framework, namely OnlineSAFE, that extracts enduringly valuable knowledge across data collection periods during the offline phase and efficiently fine-tunes the model to encode newly emerging patterns during the online phase. To this end, we design a bi-level optimization to meta-learn the model parameters that ensure sustained long-term performance and adaptability to new data, where outer/inner loops are nested to optimize the global model parameters and the fine-tuning procedure, respectively. Considering the potentially distinct distribution exhibited in the new data, we analyze and derive an empirical bound based on the PAC-Bayes theory to enhance the stability and generalizability of the online updating process. Furthermore, we investigate a simple but effective sample reduction heuristic that accelerates online updates by bypassing edge samples that lack additional information. Extensive experiments on four real-world streaming graphs demonstrate the effectiveness and efficiency of OnlineSAFE, compared with 17 state-of-the-art baselines.

**CoAuthors:** Tarek Abdelzaher, Charith Mendis



Email: ewarraic@purdue.edu PI: Muhammad Shahbaz Avail for Hire Date: 6/2025

**Title:** Ultima: Robust and Tail-Optimal All-Reduce for Distributed Deep Learning

Abstract: Distributed Deep Learning (DDL) is the de facto standard for training large-scale models (comprising billions of parameters) that form the backbone of numerous mainstream enterprise applications. Central to DDL's efficiency is the synchronization process, where model gradients are exchanged among workers of the distributed cluster. However, this synchronization is often hampered by stragglers (workers that lag behind), leading to system-wide delays. To overcome this, we introduce Ultima, a DDL framework that capitalizes on deep-learning models' inherent resiliency against some degree of gradient loss. Ultima introduces a novel approach by embracing a time-bounded, unreliable transport mechanism for DDL communication as a way to address the stragglers. Ultima pairs this transport with a novel Transpose Allreduce collective algorithm which curbs the propagation of gradient loss when using the unreliable time-bounded transport. Additional design choices in Ultima further disperse the occurred losses, contributing to the system's overall resilience. Our evaluations show that Ultima is able to achieve speed-ups of up to 60% in straggler-prone environments over state-of-the-art DDL frameworks and preserves comparable performance with these frameworks in optimal lossless environments.

**CoAuthors:** Omer Shabtai, Shay Vargaftik, Lalith Suresh, Matty Kadosh, Muhammad Shahbaz
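The Ultima abstract rests on the idea that aggregation can stop waiting at a time bound and tolerate the gradients that did not arrive. The sketch below shows only that general idea on a single aggregator. It is not Ultima's transport or its Transpose Allreduce algorithm; the arrival-time model and the name `bounded_average` are assumptions for illustration.

```python
def bounded_average(contributions, deadline):
    """Average the gradients that arrived by `deadline`, dropping the rest.

    contributions: list of (arrival_time, gradient) pairs, where each
                   gradient is a list of floats of the same length.
    Returns the element-wise mean over on-time gradients and the
    number of workers that contributed.
    """
    on_time = [grad for arrival, grad in contributions if arrival <= deadline]
    if not on_time:
        raise ValueError("no gradients arrived before the deadline")
    count = len(on_time)
    mean = [sum(values) / count for values in zip(*on_time)]
    return mean, count


workers = [
    (1.0, [1.0, 2.0]),
    (2.0, [3.0, 4.0]),
    (9.0, [100.0, 100.0]),  # straggler: arrives after the time bound
]
mean, count = bounded_average(workers, deadline=5.0)
print(mean, count)  # [2.0, 3.0] 2
```

The straggler's contribution is lost rather than waited for, so the step completes at the deadline with an average over the two on-time workers.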



Email: william.won@gatech.edu PI: Tushar Krishna Avail for Hire Date: 1/2025

**Title:** ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

Abstract: As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates enabling simulating target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With such capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and gives meaningful insights when designing and deploying distributed training platforms at scale.

**CoAuthors:** Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna



Avail for Hire Date: 9/2025

**Title:** *Efficient Offloading Channel for DPU*

**Abstract:** In this poster, we first identify four cross-PCIe Host-DPU communication primitives by analyzing the architectural peculiarities of a DPU SoC and systematically characterize their capabilities and limitations. We then design and implement a Host-DPU offloading channel by carefully synthesizing these underlying primitives and tailoring them to our requirements. Essentially, the channel operates as an adapter interface, a collection of elastic communication abstractions, and an execution framework.



**Title:** Gigaflow - An Accelerator for the Slow Path at the End Host

**Abstract:** Packet-processing data planes at the end-hosts have been enhanced in performance over the last decade to the point that, nowadays, they are increasingly implemented in hardware (e.g., in SmartNICs and programmable switches). However, little attention is given to the slow path residing between the data plane and the control plane, as it is not typically considered performance-critical. Recent research indicates that due to the growth in physical network bandwidth and topological complexity of modern networks, the slow path is set to become a new key bottleneck in Software-Defined Networks (SDN). We present the design and implementation of a new Domain Specific Accelerator (DSA) for the slow path at the end-host that sits between the hardware-offloaded data plane and the logically-centralized control plane. Our accelerator aims to capture most of the CPU-bound slow path traffic on virtual switches (flow cache misses from user traffic), thus reducing the load on end-host CPUs. We implement our slow path accelerator as a new caching layer in Open vSwitch and implement its hardware offload using NetFPGA on Xilinx Alveo data center accelerators.

**CoAuthors:** Venkat Kunaparaju, Ben Pfaff, Gianni Antichi, Muhammad Shahbaz

## **Theme 4 Poster Session**

### **RESEARCH SCHOLAR POSTER DETAILS**

Dingyuan Cao

**Title:** ElaCache: Fine-Grain Dynamic Partitioning of Illinois Coherence Directories in Multiprocessors

**Abstract:** Cache side channel attacks pose severe security challenges in the multi-tenant cloud environment. To mitigate this vulnerability, several cache partitioning schemes have been proposed to provide isolation between mutually untrusted domains. However, none of the existing cache partitioning schemes incorporates the coherence directory into its partition, thus leaving this attack surface unprotected. This can lead to side-channel leakage across security domains. In this work, we propose ElaCache, a novel partitioning scheme that partitions both the extended directory (ED) and the traditional directory (TD) in order to provide strong isolation in the cache hierarchy. To provide such isolation without impacting the application's performance, we utilize an indirection structure to provide fine-grained partitioning, while leveraging incoming memory traffic to profile applications' resource needs and adjust allocation sizes accordingly. Experiments show that ElaCache achieves good performance relative to an unprotected cache hierarchy while providing strong isolation.

Email: dc29@illinois.edu PI: Josep Torrellas Avail for Hire Date: 5/2026


| Saranyu Chattopadhyay | Title: Pre-silicon G-QED Verification |
|-----------------------|---------------------------------------|
| Stanford | <b>Abstract:</b> G-QED (Generalized Quick Error Detection) is a highly thorough pre-silicon verification technique that significantly boosts design productivity. G-QED can be applied to any digital design that satisfies the following conditions: (1) actions, architectural states and idling, similar to instructions, software-visible states and idling in processors, can be defined; and (2) the content of each architectural state element can be read by an action to produce corresponding design outputs. G-QED is provably sound and complete, i.e., it detects all logic bugs without any false fails, within the capabilities of existing Bounded Model Checking (BMC) tools. Results on a wide range of processor and hardware accelerator designs demonstrate the effectiveness and practicality of G-QED. For an industrial case study using production-ready AI engines, G-QED detected 9 new critical bugs (in addition to all bugs detected by the industrial verification flow) with a drastic productivity boost: 3 person-weeks of verification effort using G-QED vs. 1 person-year using the industrial verification flow.<br><b>CoAuthors:</b> Keerthikumara Devarajegowda, Bihan Zhao, Florian Lonsing, Brandon A. D'Agostino, Ioanna Vavelidou, Vijay D. Bhatt, Sebastian Prebeck, Wolfgang Ecker, Caroline Trippel, Clark Barrett, Subhasish Mitra |
| <b>Moein Ghaniyoun</b><br/>Ohio State | <b>Title:</b> <i>TEESec: Pre-Silicon Vulnerability Discovery for Trusted Execution Environments</i> <b>Abstract:</b> Trusted execution environments (TEE) are CPU hardware extensions that provide security guarantees for applications running on untrusted operating systems. The security of TEEs is threatened by a variety of microarchitectural vulnerabilities, which have led to a large number of demonstrated attacks. While various solutions for verifying the correctness and security of TEE designs have been proposed, they generally do not extend to jointly verifying the security of the underlying microarchitecture. We present TEESec, the first pre-silicon framework for discovering microarchitectural vulnerabilities in the context of trusted execution environments. TEESec is designed to jointly and systematically test the TEE and underlying microarchitecture against data and metadata leakage across isolation boundaries. We implement TEESec in the Chipyard framework and evaluate it on two open-source RISC-V out-of-order processors running the Keystone TEE. Using TEESec we uncover 10 distinct vulnerabilities in these processors that violate TEE security principles and could lead to leakage of enclave secrets. <b>CoAuthors:</b> Kristin Barber, Yuan Xiao, Yinqian Zhang, Radu Teodorescu |



# Mohammad Rahmani Fadiheh<br/>Stanford



Email: fadiheh@stanford.edu PI: Subhasish Mitra Avail for Hire Date: 2024

**Title:** A Scalable Solution for End to End Formal Verification of Millions Gate Designs

**Abstract:** Scalability is the biggest hurdle in functional verification: bug escapes increase with design size, even though a major share of project time is already spent on verifying the design. We present a novel, provably complete and scalable verification approach that can handle very large designs (over a million gates) that would otherwise not fit into a commercial formal tool. Instead of creating separate hand-crafted abstractions for each verified sub-component in a large design, our approach uses a generic abstract model to reduce the overall complexity, thereby saving time and effort. The proposed approach (1) does not require an understanding of the implementation details of the sub-component to be abstracted, thereby drastically reducing verification time and effort; and (2) guarantees complete, scalable verification for designs of over a million gates. We believe that, with some design discipline, false counterexamples can also be avoided in this approach. Preliminary results show that our approach can handle academic and industrial designs that are too large to load into any off-the-shelf formal verification tool, including NVIDIA's 16M-gate AI accelerator. Our novel abstraction technique reduces the design size (number of gates) by 10-20x. Our technique enabled the detection of new as well as previously detected bugs in these designs.

**CoAuthors:** Saranyu Chattopadhyay, Caroline Trippel, Clark Barrett, Subhasish Mitra





Email: mu94@cornell.edu PI: Edward Suh Avail for Hire Date: 7/2025

**Title:** Efficient Memory Protection for Secure Machine Learning

**Abstract:** Machine learning, especially deep learning, is a data-intensive application that can potentially consume private or sensitive data, which demands strong security protection. A promising approach to provide strong confidentiality and integrity guarantees, even under untrusted system software and potential physical tampering, is to rely on trusted hardware to create a trusted execution environment (TEE). One important facility provided by TEEs is the protection of sensitive data values and access patterns in untrusted off-chip memory (DRAM). However, current memory protection techniques incur a high overhead. In this poster, we describe our proposed techniques to lower the off-chip memory protection overhead. First, to protect the confidentiality and integrity of data in DRAM, TEEs use memory encryption and integrity verification, which incur a high performance overhead because they require additional memory accesses for protection metadata such as version numbers (VNs) and MACs. To mitigate this, we exploit the simple access patterns of machine learning algorithms to generate the VNs on-chip, and optionally use MACs that protect chunks of larger granularity. Accordingly, we propose the MGX and SoftVN memory protection schemes for accelerator and processor TEEs, respectively, and show significant overhead reduction across a variety of deep learning benchmarks. Second, we study confidentiality leakage via the memory access pattern side channel in deep learning recommender systems, specifically via the indices of the categorical embedding tables. Typically, TEEs employ ORAM schemes to obfuscate the memory access pattern, which incurs a huge overhead, especially for large table sizes. In this work, we propose to use an alternative to embedding tables, Deep Hash Embedding (DHE), to eliminate the input-dependent memory access pattern, as this technique has a deterministic access pattern with similar accuracy. We show some preliminary results on the overhead reduction obtained by combining DHE and ORAM schemes for different table sizes.
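To illustrate why a hash-based embedding has an input-independent access pattern, the following is a minimal sketch, not the authors' implementation. It assumes a toy setup: `K` hash functions built from SHA-256, a hash range `M`, and a single fixed random layer `W` standing in for the trained network; all of these names and sizes are hypothetical. The point is that every ID touches the same weights in the same order, whereas a table lookup reads a row selected by the ID.

```python
import hashlib
import math
import random

K = 8          # number of hash functions (hypothetical choice)
M = 1_000_003  # range of each hash function (hypothetical choice)
DIM = 4        # size of the output embedding (hypothetical choice)


def hash_features(item_id: int) -> list[float]:
    """Encode an ID as K hash values, each scaled into [-1, 1]."""
    feats = []
    for seed in range(K):
        digest = hashlib.sha256(f"{seed}:{item_id}".encode()).digest()
        h = int.from_bytes(digest[:8], "big") % M
        feats.append(2.0 * h / (M - 1) - 1.0)
    return feats


# Fixed random weights stand in for a trained network (assumption).
_rng = random.Random(0)
W = [[_rng.uniform(-0.5, 0.5) for _ in range(K)] for _ in range(DIM)]


def dhe_embed(item_id: int) -> list[float]:
    """Compute an embedding from hashes; no ID-indexed table is read.

    Every call reads all of W in the same order, so the sequence of
    memory accesses does not depend on item_id.
    """
    x = hash_features(item_id)
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
```

Calling `dhe_embed(42)` and `dhe_embed(7)` performs the same sequence of reads over `W`; only the computed values differ. A real deployment would also need the hashing and arithmetic themselves to be free of data-dependent timing, which this sketch does not address.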

**CoAuthors:** Weizhe Hua, Akhilesh Parag Marathe, Wenjie Xiong, Zhiru Zhang, and G. Edward Suh





Email: wei.1276@osu.edu PI: Radu Teodorescu Avail for Hire Date: 2027

**Title:** Adversarial Attacks on Machine Learning-Based Hardware Prefetchers

**Abstract:** Machine learning-based data prefetchers have emerged as a promising solution for capturing irregular memory access patterns more effectively than traditional rule-based or table-based prefetchers. However, ML models are known to be vulnerable to so-called adversarial attacks, in which inputs are manipulated to induce models to produce outputs that are beneficial to an adversary. Moreover, in order to accommodate irregular memory prefetch requests, most machine learning-based prefetchers have implemented cross-page prediction. This enables attackers to construct adversarial memory access sequences that deceive a victim prefetcher model into making a prefetch request for a page that should be inaccessible to the attacker. We present the first comprehensive study of adversarial attacks on ML prefetchers. Evaluation on five different state-of-the-art ML-based prefetchers shows that adversarial attacks can be constructed with high success rates.

**CoAuthors:** Moein Ghaniyoun, Radu Teodorescu

**Zirui Neil Zhao** Illinois

Email: ziruiz6@illinois.edu PI: Josep Torrellas Avail for Hire Date: 4/2024

**Title:** Everywhere All at Once: Co-Location Attacks on Public Cloud FaaS

**Abstract:** Microarchitectural side-channel attacks exploit shared hardware resources and pose severe threats to modern cloud environments. Achieving physical host co-location with a victim, a crucial step in these attacks, is challenging due to the widespread adoption of the virtual private cloud (VPC) and the ever-growing size of data centers. Moreover, cloud computing is increasingly moving towards Function-as-a-Service (FaaS) environments, characterized by highly dynamic function instance placements and limited control for attackers. In this paper, we present the first comprehensive study of risks and techniques for co-location attacks in public FaaS environments. We develop two novel physical host fingerprinting techniques and propose a new, inexpensive methodology for large-scale instance co-location verification. Utilizing these techniques, we conduct an extensive study on Google Cloud Run, uncovering exploitable instance placement behaviors. Leveraging our findings, we devise a highly effective strategy for function instance launching that achieves 100% co-location probability and covers 59% to 100% of victim instances in three major Cloud Run data centers.

**CoAuthors:** Adam Morrison, Christopher W. Fletcher, Josep Torrellas