April 3, 2023

Semiconductor Research Corporation (SRC), Durham, NC 27703

Semiconductor Research Corp. (SRC) Global Research Collaboration (GRC) is soliciting white papers in the Artificial Intelligence Hardware (AI Hardware) research program. The principal goal of this program is to create new highly efficient AI platforms to enable neuro-inspired, cognitive, and learning abilities which will be required to address the vast range of future data types and workloads as intelligence is enabled from edge devices to the cloud.

The AIHW research needs are described in five major categories:

- 1. Architectures for Energy Efficient AI Acceleration
- 2. Modeling, Analysis, and Simulation/Emulation of AI Hardware for Early System Exploration
- 3. HW/SW Co-design of AI Compute Systems
- 4. Secure and Robust AI Hardware
- 5. Interplay of AI and System Architecture/Microarchitecture Design

Each of these major categories is broken down into several sub-categories which describe the need in more detail. Even so, these are written to be broad in nature to not restrict the investigator's approach. There is no priority order for either the major or minor needs that follow. In each category, there may be applications from large systems to small (data center and the edge/end node) and investigators should consider this in their submissions. Members are looking for significant innovations, for example, 100X improvement in energy-performance efficiency or other key metrics for systems for emergent AI applications.

The use of appropriate benchmarks and metrics to assess how far the effort advances the state-of-the-art will be a key part of the evaluation process. It is important that performance and efficiency metrics such as "TOPS/W" (tera ops/Watt) and "% utilization" of hardware be qualified as "peak," "sustained," or "average". The primary metrics should include a performance metric, a power efficiency metric, and a mapping efficiency metric. Examples are, respectively, the end-to-end wall-clock execution time for a set of benchmarks, the energy consumed by the hardware on a benchmark set, and the utilization of the hardware resources during the execution. A breakdown of any metrics for training vs. inference helps identify the suitability of the innovation for deployment in different settings such as cloud, edge, mobile, etc. Appropriate metrics should be used to establish the impact of the advances in each setting. For instance, total throughput and throughput per watt might be metrics for data center applications while optimal energy usage might be more appropriate for the edge/end node. Reporting the accuracy of the results, and/or reporting the metrics at iso-accuracy, is an important factor for understanding the benefits of approximate computing techniques such as reduced-precision floating point (FP).

In addition to what is mentioned above, some metrics for consideration include:

- Inference accuracy (%)
- Inference robustness to antagonistic inputs
- Inference/unit of energy (per uJ/mJ/J/kJ)
- Training/unit of energy (model training/J)
- Throughput: inferences per unit time, training per unit time
- HW cost metric: MACs (or equivalent) required per unit time
- Memory metrics: local/global memory requirements (access time, latency, bandwidth, average per unit time and total energy per inference)
- Statistical performance guarantees
- Robustness, Fairness, and Explainability metrics
- Scalability across edge to cloud platforms
- Adaptability to different applications: custom vs. generic AI acceleration
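As an illustration only, the primary metrics described above (a performance metric, a power efficiency metric, and a mapping efficiency metric) could be derived from a single benchmark run roughly as follows. This is a minimal sketch, not a prescribed methodology: the `BenchmarkRun` fields, the function names, and the sample numbers are hypothetical, and "utilization" is assumed here to mean sustained throughput divided by the stated peak throughput.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """Measurements from one benchmark run (hypothetical field names)."""
    total_ops: float      # operations actually executed during the run
    wall_clock_s: float   # end-to-end wall-clock execution time, in seconds
    energy_j: float       # energy consumed by the hardware, in joules
    inferences: int       # number of inferences completed
    peak_tops: float      # stated peak throughput of the hardware, in tera-ops/s


def sustained_tops(run: BenchmarkRun) -> float:
    """Sustained (not peak) throughput in tera-ops per second."""
    return run.total_ops / run.wall_clock_s / 1e12


def sustained_tops_per_watt(run: BenchmarkRun) -> float:
    """Sustained TOPS/W; ops per joule equals (ops per second) per watt."""
    return run.total_ops / run.energy_j / 1e12


def utilization_pct(run: BenchmarkRun) -> float:
    """Average utilization, assumed to be sustained throughput over peak."""
    return 100.0 * sustained_tops(run) / run.peak_tops


def inferences_per_joule(run: BenchmarkRun) -> float:
    """Inferences completed per unit of energy."""
    return run.inferences / run.energy_j


def inferences_per_second(run: BenchmarkRun) -> float:
    """Inference throughput over the whole run."""
    return run.inferences / run.wall_clock_s


if __name__ == "__main__":
    # Made-up numbers for illustration, not measurements of any real system.
    run = BenchmarkRun(
        total_ops=2.0e15,
        wall_clock_s=100.0,
        energy_j=5000.0,
        inferences=1_000_000,
        peak_tops=40.0,
    )
    print(f"sustained TOPS:         {sustained_tops(run):.2f}")
    print(f"sustained TOPS/W:       {sustained_tops_per_watt(run):.2f}")
    print(f"average utilization %:  {utilization_pct(run):.1f}")
    print(f"inferences per joule:   {inferences_per_joule(run):.1f}")
    print(f"inferences per second:  {inferences_per_second(run):.1f}")
```

Labelling each value as "sustained" or "average" in the function names mirrors the request that such figures be qualified; a peak figure would instead come from the hardware specification rather than from a measured run.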

The needs in the AIHW space cover a broad range of applications, including high performance processors for data centers, automotive, industrial, mobile and edge node computing and communication, and healthcare. Investigators are encouraged to link the results of their work with a potential application to help show the relevance of the proposed work.

SRC has released the Decadal Plan for Semiconductors (<a href="www.src.org/about/decadal-plan/">www.src.org/about/decadal-plan/</a>) which describes five "Seismic Shifts" facing the electronics industry in the coming decade. Research should address issues arising from one of them:

- Smart Sensing: The Analog Data Deluge
- Memory & Storage: The Growth of Memory and Storage Demands
- Communication: Communication Capacity vs. Data Generation
- Security: ICT Security Challenges
- Energy Efficient Compute: Energy vs. Global Energy Production

SRC and a consortium of industry experts have refined these seismic shifts into the Microelectronic and Advanced Packaging Technologies (MAPT) Roadmap. MAPT is a critical multidisciplinary field with the potential to transform the design and manufacture of future microchips. These advances build upon breakthroughs in advanced packaging, 3D monolithic and 2.5D/3D heterogeneous integration, electronic design automation, nanoscale manufacturing, and energy-efficient computing. The interim report is now available (https://srcmapt.org/) and will be used as a guide for future research activities.

The Global Research Collaboration (GRC) division of SRC focuses on research in a time frame five or more years ahead of technology release. Research on advanced tools and techniques such as modeling, simulation, and characterization can be of value with implementation timelines as short as one to two years after project completion. This time frame represents the "sweet spot" for pre-competitive, collaborative research, after which the industry focuses on proprietary development for technology differentiation. Successful research proposals will need to match this timing.

Moving forward, the SRC is also embarking on an effort to broaden participation in its funded research programs. This aggressive agenda will help us drive meaningful change in advanced information and communication technologies that seem impossible today. In the programs we lead, we must increase the participation of women and under-represented minorities as well as strike a balance between U.S. citizens and those from other nations, creating an inclusive atmosphere that unlocks the talents inherent in all of us. Please visit <a href="https://www.src.org/about/broadening-participation/">https://www.src.org/about/broadening-participation/</a> for more information about the 2030 Broadening Pledge.

Investigators who are funded will be expected to publish at top-tier conferences, including but not limited to ISSCC, VLSI, HPC, ISCA, MICRO, HPCA, ESSCIRC, and ESWEEK (CASES, CODES+ISSS, & EMSOFT).

### **CONTRIBUTORS**

AMD Ganesh Dasika

GLOBALFOUNDRIES Ted Letavic, Nick Viggiani, Richard Bolster

IBM Krishnan Kailas, Matt Ziegler

Intel Omesh Tickoo, Michael Kishinevsky, Greg Chen, Rosario Cammarota, Jeff Parkhurst, Subarna Tripathi

MediaTek Jenwei Liang

NXP Ben Eckermann, Adam Fuks

Qualcomm Ramesh Chauhan, Francois Atallah

Samsung Joon Ho Song, Dongkyun Kim, Jonghun Lee

Siemens EDA Russell Klein, Neil Hand

Texas Instruments Mahesh Mehendale, Nagendra Gulur

SRC John Oakley

## 1 Architectures for Energy Efficient AI Acceleration

Accelerating future AI systems may benefit from architectures, circuits, and/or devices beyond today's conventional computing approaches. New architectures or extensions of existing approaches that depart from the deep learning neural network paradigm may provide significant performance and/or power improvements for certain applications. Novel circuits and/or devices may also unlock capabilities unattainable from conventional circuit design and CMOS technology. At the datacenter system level, the challenge of integrating multiple chips or approaches to achieve the equivalent of multi-chip heterogeneous integrated systems is of high importance for the future of AI computing.

General challenges include but are not limited to: Energy-efficient end-to-end system architectures and partitioning (cloud to sensor) and optimizing energy/bandwidth/latency tradeoffs at all levels within the computational hierarchy (data center, gateway, and edge/end node).

These challenges are most acute at both ends of the AI computational hierarchy (e.g. Datacenter, Edge AI, and TinyML-type applications). Devices on the edge/end nodes are typically heavily resource constrained with stringent cost, performance, power, communication latency, and bandwidth limitations. Also, all edge/end node AI and microcontroller functionality typically resides on a single die and is implemented on older process nodes to gain access to integrated NVM and high-performance analog, creating additional area/power efficiency challenges. Research is needed to optimize the interplay of on-chip sensing, compute, and off-chip communication requirements at the edge/end node.

In the datacenter, high throughput is crucial, but must be balanced by power efficiency. Datacenter computing environments must combine energy efficient processor designs, multi-chip/module communication for data movement and memory access, and the flexibility/programmability to support diverse workloads. AI computation at the center of the cloud is highly limited by data access (bandwidth, latency, storage) and by data movement (I/O bandwidth, power), and is often thermally bounded. Extensions to existing approaches as well as novel architectures such as AI compute-in-memory that address fundamental limitations are of interest.

- 1.1 New AI architectures, including but not limited to those using emerging devices and circuits, e.g., reduced precision/dynamic range computation, in-memory and near-memory computing based on charge-based and resistance-based memory devices, other NVM devices, mixed signal techniques, compute-in-DRAM, compute-in-cache, etc.
- 1.2 System-level integration solutions for emerging architectures, e.g., SoC, 3D, heterogeneous integration packaging, interchip/module communication, partitioning, etc.
- 1.3 Neuromorphic computing: algorithms and hardware for biologically plausible neuron models and learning rules, such as spiking neural networks, spike timing dependent plasticity, and bio-plausible deep learning
- 1.4 Probabilistic and approximate computing: use for AI/Machine Learning architectures as well as acceleration of probabilistic AI
- 1.5 High/Hyper-dimensional computing: algorithms, practical applications, energy efficient architectures
- 1.6 AI architectures using quantum computing
- 1.7 Resource efficient training and inference at the edge: self-teaching/adaptation/optimization/incremental-training of initial algorithms to local application conditions/needs within the strict computational/memory/power/costs constraints imposed by edge hardware/software including incremental learning systems (e.g. TinyML-type applications and reduced precision systems)
- 1.8 End-to-end optimization schemes that span system-algorithm-architecture-circuit-technology stacks for minimizing energy per decision without compromising accuracy, throughput, and cost (power, area, performance), security/privacy constraints for AI systems consisting of sensors, pre- and post-processors, communication networks, and AI computer hardware

## 2 Modeling, Analysis, and Simulation/Emulation of AI Hardware for Early System Exploration

End-to-end performance and energy efficiency of AI systems are determined by various components including memory subsystem, I/O, on-chip and off-chip network, in addition to core AI computation. Challenges include, but are not limited to, characterizing and modeling long running AI computations that often take days/weeks to complete. Novel methods for modeling, simulation and emulation are essential for early design-space exploration of next generation AI systems. Finally, a better understanding of the theoretical behavior and limits of AI is needed to better guide the design of AI systems.

- 2.1 AI workload analysis and characterization
- 2.2 Efficient techniques for end-to-end performance/power/reliability modeling, simulation, emulation, and exploration of AI systems
- 2.3 Benchmarks for emerging AI applications, and metrics for comparing AI systems (including applications in 2.4)
- 2.4 Application-level understanding and profiling of new AI applications including:
  - recent deep learning networks (e.g. graph convolutional networks, energy-based models, foundation models, large language models)
  - techniques for machine reasoning
  - neuro-symbolic approaches
  - emerging application domains: examples include mmWave sensing, Industry 4.0, etc.
- 2.5 Modeling infrastructure and techniques for AI computation at the edge/end node, including sensors, applications in 2.4, and more
- 2.6 Analysis and comparison of theoretical limits of algorithms and compute efficiency of AI systems (e.g. understanding theoretical limits of precision, sparsity, and compression)
- 2.7 Architecture and data representation optimization techniques (e.g. quantization search)

## 3 HW/SW Co-design of AI Compute Systems

Interactions and dependencies between hardware and software are integral for achieving high performance on AI workloads. These two fields of study cannot be decoupled. Topics of interest include compilers that map deep learning models to CPU, GPU, and accelerator hardware with reduced data movement; training algorithms (e.g. NAS) that are hardware-cognizant in their optimizations; and enabling traditionally non-AI applications with AI.

- 3.1 Compilers and run-time management that optimize data storage, compute performance and power efficiency in compute in/near memory for reduced data movement
- 3.2 Run-time management of a large number of accelerators/cores including virtualization and security of AI computation
- 3.3 Co-design of AI exploration, smart sensing, and training at the edge/end node
- 3.4 Systems supporting efficient self-supervised learning algorithms
- 3.5 Co-design of AI and HPC and other scientific applications (e.g. AI-based surrogate models)
- 3.6 Co-design of CPU-friendly AI model training and inference algorithms including using AI-specific ISA extensions
- 3.7 Co-design of AI accelerators and interconnect/communication for power-performance-memory trade-offs

## 4 Secure and Robust AI Hardware

Machine Learning has made enormous strides in recent years in its ability to train models and infer results with higher degrees of accuracy than many other types of algorithms. However, potential stumbling blocks for machine learning adoption in many applications are the issues of fairness, robustness, privacy, and explainability. Many machine learning algorithms are somewhat of a "black box", with no easy way to determine why the algorithm produced the specific output. Explainability is key to challenging an AI/ML-based decision, especially in safety-critical applications from a SOTIF (Safety of The Intended Functionality) perspective. This may be required, for example, to understand whether a correct decision was made in scenarios such as why a loan application was rejected by an AI/ML-based application, or why an autonomous vehicle in an accident decided to drive the route it did. Another important vector is achieving privacy in AI hardware architectures.

- 4.1 Methods and architectures that return a result and a rationale for that result, or that add explainability to existing AI/ML-based solutions
- 4.2 Architectures and algorithms to add fairness into machine learning algorithms and architectures while maintaining best possible performance and accuracy, even when trained with biased data
- 4.3 Architectures robust against natural variations of input data to ensure stability of machine learning and AI decisions. Also included under this are architectures capable of uncovering corruption/bias of training phase data and model integrity
- 4.4 Enhancing robustness by building prior knowledge about the task to be learned and/or about the training data into the ML solution (e.g. training with a potentially limited set of input data supplemented by rules-based data, and/or prewiring the neural network, and/or data synthesis to enlarge training data sets)
- 4.5 Architectures with the ability to assess the functionality of their AI/ML process, so that a system with functional safety requirements can identify a malfunction and establish appropriate safety actions
- 4.6 Privacy and confidentiality preserving AI architectures and systems. Included in this are methods for anonymizing and securing training data (e.g. Homomorphic Deep Learning)

## 5 Interplay of AI and System Architecture/Microarchitecture Design

Advances in AI/ML can significantly impact system design in at least two ways. First, AI/ML-based or AI/ML-inspired components can be directly used in hardware designs. For example, branch predictors, prefetchers, and other hardware predictors can be based on ML models or can be optimized using ML models; scheduling and resource management at the core, chip, node, and data center levels can be based on ML and improve over heuristic-based approaches. Second, AI/ML can be part of the system design process itself, e.g., architecture and microarchitecture exploration and RTL generation, providing optimizations at the system, architecture and micro-architecture levels that improve over traditional hardware design methods and flows.

On the other hand, hardware and systems for AI/ML can benefit from groundbreaking advances in system-level architecture, memory systems and optimizations across multiple levels of the hardware/software stack that can directly impact future AI hardware on different design targets: performance, energy efficiency, security, etc. This interplay of AI and system level design is fundamental for design, construction, and management of intelligent self-optimizing systems.

- 5.1 AI-based or AI-inspired components that can be used in hardware designs (e.g., hardware predictors, prefetchers, resource management controllers, etc.)
- 5.2 AI methods for optimization of hardware designs at the system, architecture and micro-architecture levels (e.g. communication, multi-media & graphics; excluding CAD software optimizations, which are part of the CADT thrust)
- 5.3 AI-based design and optimization of AI accelerators and their integration in bigger systems
- 5.4 Synergistic advances in system design and AI/ML to improve performance, energy-efficiency, reliability/robustness, and security (e.g. AI/ML based techniques for detecting anomalous behavior of system components in intelligent sensors)
- 5.5 AI-assisted run-time and control systems, operating system, and hardware for thread scheduling, DVFS, power state transitions and other run-time orchestration and hardware resource management
- 5.6 Memory compression schemes to maximize efficiency of an AI accelerator in a system with constrained memory capacity or bandwidth (e.g. quantization schemes, loss-less and lossy power/hardware-efficient entropy encoding schemes).