The Stanford Software Research Lunch is a weekly event, held on Thursdays, where students and researchers present their latest work to their peers. Talks are open to anybody, but regular attendees are expected to give a presentation on their work.
Mailing list: software-research-lunch@lists.stanford.edu (subscribe via mailman)
Calendar: ical
Format: The lunch is held every week during the fall, winter, and spring quarters. The first week of every quarter is an organizational lunch where people can sign up to give a talk. If you'd like to give a talk, please contact Rohan Yadav.
Past quarters: Fall 2023, Spring 2023, Winter 2023, Fall 2022, Winter 2021, Fall 2020, Winter 2020, Fall 2019, Spring 2019, Winter 2019, Fall 2018, Spring 2018, Winter 2018, Fall 2017, Spring 2017, Winter 2017, Fall 2016.
Ordering Food: For suggestions for those ordering food for the lunch, see here.
1/9: Organizational Lunch
Time: Thursday, January 9, 2025, 12 noon - 1pm
Location: CoDa E401
Organizational lunch. Come sign up to give a talk during the quarter.
Food:
1/16: Cypress: Task-Based Tensor Computations on Modern GPUs
Time: Thursday, January 16, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Rohan Yadav
Abstract: Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and interfaces of these fixed-function units continue to change. NVIDIA’s latest Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit and an asynchronous matrix multiplication unit. Efficiently utilizing these units requires a fundamentally different programming style than previous architectures, where programmers must develop complex warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called tasks that operate on tensors, and are free of communication and synchronization. Cypress programs are bound to the target machine through a mapping specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and 0.80x-0.98x the performance of the currently best-known Flash Attention implementation, while eliminating all aspects of explicit data movement and asynchronous computation from application code.
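To give a feel for the programming style the abstract describes, here is a deliberately simplified sketch in Python, not Cypress's actual syntax: the task decorator, the mapping dictionary, and the processor/memory names are all invented for illustration, but they mirror the separation between sequential task code and a mapping specification.

    import numpy as np

    # Hypothetical illustration of a task-based program with sequential
    # semantics: the "algorithm" is written as plain tasks over tensors,
    # with no explicit data movement or synchronization.

    def task(fn):
        """Mark a function as a task (illustrative stand-in, not Cypress syntax)."""
        fn.is_task = True
        return fn

    @task
    def gemm(a, b):
        # The task body only expresses the math; where it runs and where its
        # operands live is decided by a separate mapping specification.
        return a @ b

    @task
    def bias_add(c, bias):
        return c + bias

    # A separate, illustrative "mapping" says where tasks run and where
    # tensors are materialized (names like "tensor_core" / "shared_memory"
    # are placeholders, not real Cypress mapping targets).
    mapping = {
        "gemm":     {"processor": "tensor_core", "operands": "shared_memory"},
        "bias_add": {"processor": "cuda_core",   "operands": "registers"},
    }

    a = np.random.rand(128, 64).astype(np.float32)
    b = np.random.rand(64, 32).astype(np.float32)
    bias = np.random.rand(32).astype(np.float32)

    # Sequential semantics: the program reads like ordinary sequential code;
    # a compiler/runtime would consult `mapping` to build the producer-consumer
    # pipeline between asynchronous units.
    print(bias_add(gemm(a, b), bias).shape)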
Food:
1/23: Efficient Optimization with Encoded Ising Models
Time: Thursday, January 23, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Devrath Iyer
Abstract: Many promising computing substrates, including quantum computers, oscillator-based computers, and P-computers, solve constrained combinatorial optimization problems by minimizing energy functions called Ising models. Because Ising solvers explore an unconstrained search space, Ising models for many popular optimization problems must include penalty terms to raise the energy of infeasible solutions that would otherwise appear optimal. We observe that for some problems, Ising solvers spend the majority of computation time exploring these invalid states and often never find a feasible solution. We introduce the Encoded Ising Model (E-I model), an extension to the Ising model that uses a digital encoding circuit to vastly reduce the proportion of time a solver spends exploring invalid states. We present FUSE, a software framework that enables the description of such functions and automatically lowers them to a P-computer. Our formulation reduces the number of iterations to a solution by a factor of 7.2-52,000 and achieves up to 100.0% higher estimated success probability relative to baseline formulations.
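As a toy illustration of why penalty terms leave a solver wandering through infeasible states (the costs, penalty weight, and one-hot constraint below are made up for this example, and 0/1 QUBO-style variables are used instead of ±1 spins for readability):

    import itertools

    # Toy energy function for picking exactly one of three options.
    # Illustrative only: the weights and penalty strength are invented.
    costs = [3.0, 1.0, 2.0]      # objective: prefer the cheapest option
    penalty = 4.0                 # weight of the one-hot constraint penalty

    def energy(s):
        objective = sum(c * x for c, x in zip(costs, s))
        # Penalty term raises the energy of states violating sum(s) == 1.
        constraint = penalty * (sum(s) - 1) ** 2
        return objective + constraint

    states = list(itertools.product([0, 1], repeat=3))
    feasible = [s for s in states if sum(s) == 1]
    print("fraction of states that are feasible:", len(feasible) / len(states))
    for s in sorted(states, key=energy):
        print(s, energy(s))
    # Only 3 of the 8 states are feasible; an unconstrained solver also
    # visits the other 5, which is the overhead an encoded (E-I)
    # formulation aims to eliminate.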
Food:
1/30: Compiling Recurrences over Dense and Sparse Arrays
Time: Thursday, January 30, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Shiv Sundram
Abstract: We present a framework for compiling recurrence equations into native code for problems across linear algebra, bioinformatics, and graph analysis. In our framework, users specify a system of recurrences, the types of data structures that store inputs and outputs, and scheduling commands for optimization. Our compiler then lowers these specifications into native code that respects the dependencies in the recurrence equations. Our compiler can generate code over both sparse and dense data structures, and determines whether the recurrence system is solvable with the provided scheduling primitives. We evaluate the performance and correctness of the generated code on several recurrences, from domains as diverse as dense and sparse matrix solvers, dynamic programming, graph problems, and sparse tensor algebra. We demonstrate that the generated code performs competitively with hand-optimized implementations in libraries. However, these handwritten libraries target specific recurrences, specific data structures, and specific optimizations. Our system, on the other hand, automatically generates implementations from recurrences, data formats, and schedules, giving it more generality than library approaches.
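For a concrete, if simple, example of the kind of recurrence such a system takes as input, the edit-distance recurrence below is transcribed directly from its defining equations; the loop nest is one schedule that respects its dependencies, which is the sort of code the compiler described above would generate over dense or sparse formats.

    # Edit distance as a recurrence:
    #   D[i][0] = i,  D[0][j] = j
    #   D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1,
    #                 D[i-1][j-1] + (a[i-1] != b[j-1]))
    # The loop nest below is one valid schedule: it visits cells in an
    # order that satisfies the dependencies of the recurrence.
    def edit_distance(a: str, b: str) -> int:
        m, n = len(a), len(b)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i
        for j in range(n + 1):
            D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i - 1][j] + 1,
                              D[i][j - 1] + 1,
                              D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return D[m][n]

    print(edit_distance("kitten", "sitting"))  # 3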
Food:
2/6: Energy-Efficient ML Using Hyperdimensional Computing
Time: Thursday, February 6, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Chaeyoung Lee
Abstract: We present HyperCam, an energy-efficient image classification pipeline that enables computer vision tasks onboard low-power IoT camera systems. HyperCam leverages hyperdimensional computing to perform training and inference efficiently on low-power microcontrollers. We implement a low-power wireless camera platform using off-the-shelf hardware and demonstrate that HyperCam achieves an accuracy of 93.60%, 84.06%, 92.98%, and 72.79% on MNIST, Fashion-MNIST, Face Detection, and Face Identification tasks, respectively, while significantly outperforming other classifiers in resource efficiency. Specifically, it delivers inference latency of 0.08-0.27s while using 42.91-63.00KB of flash memory and 22.25KB of RAM at peak. Among machine learning classifiers such as SVM, XGBoost, MicroNets, MobileNetV3, and MCUNetV3, HyperCam is the only one that achieves competitive accuracy while maintaining a competitive memory footprint and an inference latency that meet the resource requirements of low-power camera systems.
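A minimal sketch of the hyperdimensional-computing idea behind this pipeline (the random-projection encoder, dimensionality, and toy data below are illustrative and not HyperCam's actual design): features are encoded as high-dimensional bipolar vectors, class prototypes are bundled sums of training encodings, and inference is a nearest-prototype lookup.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 10_000           # hypervector dimensionality
    N_FEATURES = 64      # toy feature size (not HyperCam's real encoder)

    # Random bipolar projection shared by training and inference.
    projection = rng.choice([-1, 1], size=(N_FEATURES, D))

    def encode(x):
        # Project a feature vector into a bipolar hypervector.
        return np.sign(x @ projection)

    def train(xs, ys, num_classes):
        # Class prototypes: bundle (sum) the encodings of each class's examples.
        prototypes = np.zeros((num_classes, D))
        for x, y in zip(xs, ys):
            prototypes[y] += encode(x)
        return np.sign(prototypes)

    def classify(prototypes, x):
        # Inference: nearest prototype by dot-product similarity.
        return int(np.argmax(prototypes @ encode(x)))

    # Toy data: two Gaussian blobs standing in for image features.
    xs = np.concatenate([rng.normal(0.0, 1.0, (50, N_FEATURES)),
                         rng.normal(2.0, 1.0, (50, N_FEATURES))])
    ys = np.array([0] * 50 + [1] * 50)
    protos = train(xs, ys, num_classes=2)
    print(classify(protos, rng.normal(2.0, 1.0, N_FEATURES)))  # expect 1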
Food:
2/13: Benchmarking Code Reasoning Capabilities of Large Language Models for Semantic Equivalence Checking
Time: Thursday, February 13, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Anjiang Wei
Abstract: While large language models (LLMs) excel in code generation and repair, their ability to perform deep semantic reasoning remains underexplored. In this work, we assess LLMs’ capabilities in equivalence checking, a fundamental problem in programming languages that asks whether two programs produce the same output for all inputs. We introduce EquiBench, a dataset of 2400 program pairs across multiple languages, including Python, C, CUDA, and x86-64 assembly, covering six equivalence categories. These pairs are systematically generated using compiler transformations, static analysis, and formal verification to ensure semantic correctness, so that solving them requires deep semantic reasoning rather than reliance on superficial syntactic variations. Experiments show that state-of-the-art LLMs, including OpenAI o3-mini, achieve only 78.0% accuracy, a moderate improvement over random guessing (50%) but far from demonstrating robust semantic reasoning. EquiBench establishes a challenging benchmark for advancing deep program understanding in LLMs.
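To make the task concrete, here is a toy pair in the spirit of the benchmark (not taken from EquiBench itself): the two functions are syntactically different but agree on every non-negative integer input, and deciding that is exactly the kind of semantic judgment the models are asked to make.

    # Two syntactically different programs: are they semantically equivalent?
    def sum_to_n_loop(n: int) -> int:
        total = 0
        for i in range(n + 1):
            total += i
        return total

    def sum_to_n_closed_form(n: int) -> int:
        return n * (n + 1) // 2

    # A quick sanity check over a few inputs; a real equivalence judgment
    # must hold for *all* inputs, which is what makes the problem hard.
    assert all(sum_to_n_loop(n) == sum_to_n_closed_form(n) for n in range(100))
    print("agree on tested inputs")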
Food:
2/20: Optimizing DNN Training with SP-ization
Time: Thursday, February 20, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Colin Unger
Abstract: Abstract redacted due to in-progress work.
Food:
2/27: Letting users write evaluators for Fix, while guaranteeing correctness
Time: Thursday, February 27, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Keith Winstein
Abstract: Fix is a new architecture for 'serverless' computing that Yuhan Deng and Akshay Srivatsan are leading in our group. It's based on an operating system that understands the computational relationships between data, and the dataflow of computations. In this system, each task is a fine-grained block of deterministic machine code run in a hermetic environment of deterministically addressed data-dependencies. The OS guarantees that all computations are reproducible (even if they try hard not to be). In this system, user code can't load files from disk, can't talk over the network, and won't necessarily decide when to memoize a computation: these are the responsibility of the runtime evaluator and scheduler.
What I'd like to talk about is our efforts to let users write their own evaluators/schedulers, based on their understanding of a particular computing job -- but without letting an incorrect evaluator ever produce an incorrect answer. There's some similarity to Halide schedulers and Legion mappers here. We're trying to 'operationalize' the operational semantics of the OS abstractions by letting users write evaluators that effectively prove that a given evaluation is correct. We'd like to prove the soundness of the module that checks these proofs -- e.g., it should be impossible for the user to write an evaluator that 'proves' that two different byte strings are equivalent. I would welcome any feedback from this group, or interest in helping us write this module in a way that admits a mechanized proof of soundness.
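The flavor of the runtime's guarantees can be sketched as follows; this is an illustrative analogy in Python, not Fix's actual interface, and the memo table, hashing scheme, and run_task helper are invented for the example. Because task bodies are deterministic and identified by the content of their code and inputs, the runtime is free to memoize them or re-run them anywhere and obtain the same answer.

    import hashlib
    import json

    # Illustrative sketch (not Fix's actual interface): a computation is
    # named by the hashes of its code and inputs, so a runtime can safely
    # memoize it and reproduce it on any machine.
    memo_table = {}

    def content_address(*parts) -> str:
        return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

    def run_task(task_name: str, fn, *inputs):
        # The task is identified by its name and the addresses of its inputs,
        # never by wall-clock time, the filesystem, or the network.
        key = content_address(task_name, [content_address(x) for x in inputs])
        if key not in memo_table:
            memo_table[key] = fn(*inputs)   # deterministic, hermetic body
        return memo_table[key]

    print(run_task("word_count", lambda s: len(s.split()), "the quick brown fox"))
    print(run_task("word_count", lambda s: len(s.split()), "the quick brown fox"))  # memoized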
Food:
3/6: User-extensible and Productive Programming of Specialized Hardware
Time: Thursday, March 6, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Yuka Ikarashi
Abstract: As single-core performance has reached its limit, exploiting the peak performance of heterogeneous accelerators and specialized instructions has become crucial in many applications. Compilers struggle to keep pace with the diverse and rapidly evolving hardware targets, and automatic optimization often fails to guarantee state-of-the-art performance. Consequently, high-performance libraries are still commonly coded and optimized by hand, at great expense, in low-level C and assembly. User-schedulable languages (USLs) have been proposed to address this challenge by decoupling algorithms from scheduling. I will share our work on Exo, a USL based on the principle of exocompilation, which externalizes hardware-specific code generation and the implementation of scheduling libraries to user code, decoupled from the compiler. Additionally, I will discuss other projects that borrow ideas from USLs and the lessons we have learned from the industry adoption of Exo.
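For readers unfamiliar with user-schedulable languages, here is a toy illustration of the algorithm/schedule split (plain Python, not Exo's actual API): the two matmul variants compute the same thing, and a USL's job is to let the tiled version be derived from the simple one via scheduling directives rather than rewritten by hand.

    import numpy as np

    # Not Exo's actual API: a toy illustration of separating an algorithm
    # from its schedule. The "algorithm" states what is computed; a
    # "schedule" rewrites it (here, loop tiling) without changing meaning.

    def matmul_algorithm(A, B, C):
        n, k, m = A.shape[0], A.shape[1], B.shape[1]
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    C[i, j] += A[i, p] * B[p, j]

    def matmul_tiled(A, B, C, tile=32):
        # One possible "scheduled" version: same arithmetic, reordered for
        # locality. A USL derives this rewrite from scheduling directives
        # instead of requiring the programmer to write it by hand.
        n, k, m = A.shape[0], A.shape[1], B.shape[1]
        for ii in range(0, n, tile):
            for jj in range(0, m, tile):
                for pp in range(0, k, tile):
                    for i in range(ii, min(ii + tile, n)):
                        for j in range(jj, min(jj + tile, m)):
                            for p in range(pp, min(pp + tile, k)):
                                C[i, j] += A[i, p] * B[p, j]

    A = np.random.rand(48, 48); B = np.random.rand(48, 48)
    C1 = np.zeros((48, 48)); C2 = np.zeros((48, 48))
    matmul_algorithm(A, B, C1)
    matmul_tiled(A, B, C2)
    print(np.allclose(C1, C2))  # the schedule preserves the algorithm's meaning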
Food:
3/13: TBD
Time: Thursday, March 13, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Shadaj Laddad
Abstract: TBD
Food:
3/20: Early Termination for Hyperdimensional Computing Using Inferential Statistics
Time: Thursday, March 20, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Pu (Luke) Yi
Abstract: Hyperdimensional Computing (HDC) is a brain-inspired, lightweight computing paradigm that has shown great potential for inference on the edge and on emerging hardware technologies, achieving state-of-the-art accuracy on certain classification tasks. HDC classifiers are inherently error resilient and support early termination of inference to approximate classification results. Practitioners have developed heuristic methods to terminate inference early for individual inputs, reducing the computation required for inference at the cost of accuracy. These techniques lack statistical guarantees and may unacceptably degrade classification accuracy or terminate inference later than is needed to obtain an accurate result. We present Omen, the first dynamic HDC optimizer that uses inferential statistics to terminate inference early while providing accuracy guarantees. To realize Omen, we develop a statistical view of HDC that reframes HD computations as statistical sampling and testing tasks, enabling the use of statistical tests. We evaluate Omen on 19 benchmark instantiations of four classification tasks. Omen is computationally efficient, delivering up to 7.21–12.18× inference speed-ups over an unoptimized baseline while incurring only a 0.0–0.7% drop in accuracy. Omen outperforms heuristic methods, achieving an additional 0.04–5.85× inference speed-up over the unoptimized baseline relative to heuristic methods, while maintaining higher or comparable accuracy.
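A rough sketch of the statistical view described above (not Omen's exact test; the chunk size, bound multiplier, and toy data are invented): the per-dimension contributions to each class score are treated as samples, and inference stops once a simple confidence bound separates the leading class from the runner-up.

    import numpy as np

    rng = np.random.default_rng(1)
    D = 10_000                                          # hypervector dimensionality
    prototypes = rng.choice([-1, 1], size=(3, D))       # toy class hypervectors
    query = np.sign(prototypes[2] + 0.3 * rng.normal(size=D)).astype(int)

    # Sketch of statistics-based early termination (not Omen's exact test):
    # per-dimension products query[d] * prototype[c, d] are treated as
    # samples; stop once a crude confidence bound separates the leader
    # from the runner-up.
    def classify_early(query, prototypes, chunk=500, z=4.0):
        seen = 0
        sums = np.zeros(len(prototypes))
        sums_sq = np.zeros(len(prototypes))
        while seen < query.shape[0]:
            lo, hi = seen, min(seen + chunk, query.shape[0])
            contrib = prototypes[:, lo:hi] * query[lo:hi]
            sums += contrib.sum(axis=1)
            sums_sq += (contrib ** 2).sum(axis=1)
            seen = hi
            means = sums / seen
            stderr = np.sqrt(np.maximum(sums_sq / seen - means ** 2, 1e-12) / seen)
            order = np.argsort(means)
            best, second = order[-1], order[-2]
            # Terminate early if the leader's lower bound clears the runner-up's upper bound.
            if means[best] - z * stderr[best] > means[second] + z * stderr[second]:
                break
        return int(np.argmax(means)), seen

    label, dims_used = classify_early(query, prototypes)
    print(label, dims_used, "of", D, "dimensions examined")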
Food: