The Stanford Software Research Lunch is a weekly event, held on Thursdays, where students and researchers present their latest work to their peers. Talks are open to anybody, but regular attendees are expected to give a presentation on their work.
Mailing list: software-research-lunch@lists.stanford.edu (subscribe via mailman)
Calendar: ical
Format: The lunch is held every week during the fall, winter, and spring quarters. The first week of every quarter is an organizational lunch where people can sign up to give a talk. If you'd like to give a talk, please contact Rohan Yadav.
Past quarters: Fall 2023, Spring 2023, Winter 2023, Fall 2022, Winter 2021, Fall 2020, Winter 2020, Fall 2019, Spring 2019, Winter 2019, Fall 2018, Spring 2018, Winter 2018, Fall 2017, Spring 2017, Winter 2017, Fall 2016.
Ordering Food: For suggestions for those ordering food for the lunch, see here.
1/9: Organizational Lunch
Time: Thursday, January 9, 2025, 12 noon - 1pm
Location: CoDa E401
Organizational lunch. Come sign up to give a talk during the quarter.
Food:
1/16: Cypress: Task-Based Tensor Computations on Modern GPUs
Time: Thursday, January 16, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Rohan Yadav
Abstract: Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and interfaces of these fixed-function units continue to change. NVIDIA’s latest Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit and an asynchronous matrix multiplication unit. Efficiently utilizing these units requires a fundamentally different programming style than previous architectures, where programmers must develop complex warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called tasks that operate on tensors, and are free of communication and synchronization. Cypress programs are bound to the target machine through a mapping specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and 0.80x-0.98x the performance of the currently best-known Flash Attention implementation, while eliminating all aspects of explicit data movement and asynchronous computation from application code.
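To give a feel for the programming style the abstract describes, here is a deliberately simplified sketch in Python, not Cypress's actual syntax: the task decorator, the mapping dictionary, and the processor/memory names are all invented for illustration, but they mirror the separation between sequential task code and a mapping specification.

    import numpy as np

    # Hypothetical illustration of a task-based program with sequential
    # semantics: the "algorithm" is written as plain tasks over tensors,
    # with no explicit data movement or synchronization.

    def task(fn):
        """Mark a function as a task (illustrative stand-in, not Cypress syntax)."""
        fn.is_task = True
        return fn

    @task
    def gemm(a, b):
        # The task body only expresses the math; where it runs and where its
        # operands live is decided by a separate mapping specification.
        return a @ b

    @task
    def bias_add(c, bias):
        return c + bias

    # A separate, illustrative "mapping" says where tasks run and where
    # tensors are materialized (names like "tensor_core" / "shared_memory"
    # are placeholders, not real Cypress mapping targets).
    mapping = {
        "gemm":     {"processor": "tensor_core", "operands": "shared_memory"},
        "bias_add": {"processor": "cuda_core",   "operands": "registers"},
    }

    a = np.random.rand(128, 64).astype(np.float32)
    b = np.random.rand(64, 32).astype(np.float32)
    bias = np.random.rand(32).astype(np.float32)

    # Sequential semantics: the program reads like ordinary sequential code;
    # a compiler/runtime would consult `mapping` to build the producer-consumer
    # pipeline between asynchronous units.
    print(bias_add(gemm(a, b), bias).shape)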
Food:
1/23: Efficient Optimization with Encoded Ising Models
Time: Thursday, January 23, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Devrath Iyer
Abstract: Many promising computing substrates, including quantum computers, oscillator-based computers, and P-computers, solve constrained combinatorial optimization problems by minimizing energy functions called Ising models. Because Ising solvers explore an unconstrained search space, Ising models for many popular optimization problems must include penalty terms to raise the energy of infeasible solutions that would otherwise appear optimal. We observe that for some problems, Ising solvers spend the majority of computation time exploring these invalid states and often never find a feasible solution. We introduce the Encoded Ising Model (E-I model), an extension to the Ising model that uses a digital encoding circuit to vastly reduce the proportion of time a solver spends exploring invalid states. We present FUSE, a software framework that enables the description of such functions and automatically lowers them to a P-computer. Our formulation reduces the number of iterations to a solution by a factor of 7.2-52,000 and achieves up to 100.0% higher estimated success probability relative to baseline formulations.
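As a toy illustration of why penalty terms leave a solver wandering through infeasible states (the costs, penalty weight, and one-hot constraint below are made up for this example, and 0/1 QUBO-style variables are used instead of ±1 spins for readability):

    import itertools

    # Toy energy function for picking exactly one of three options.
    # Illustrative only: the weights and penalty strength are invented.
    costs = [3.0, 1.0, 2.0]      # objective: prefer the cheapest option
    penalty = 4.0                 # weight of the one-hot constraint penalty

    def energy(s):
        objective = sum(c * x for c, x in zip(costs, s))
        # Penalty term raises the energy of states violating sum(s) == 1.
        constraint = penalty * (sum(s) - 1) ** 2
        return objective + constraint

    states = list(itertools.product([0, 1], repeat=3))
    feasible = [s for s in states if sum(s) == 1]
    print("fraction of states that are feasible:", len(feasible) / len(states))
    for s in sorted(states, key=energy):
        print(s, energy(s))
    # Only 3 of the 8 states are feasible; an unconstrained solver also
    # visits the other 5, which is the overhead an encoded (E-I)
    # formulation aims to eliminate.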
Food:
1/30: Compiling Recurrences over Dense and Sparse Arrays
Time: Thursday, January 30, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Shiv Sundram
Abstract: We present a framework for compiling recurrence equations into native code for problems across linear algebra, bioinformatics, and graph analysis. In our framework, users specify a system of recurrences, the types of data structures that store inputs and outputs, and scheduling commands for optimization. Our compiler then lowers these specifications into native code that respects the dependencies in the recurrence equations. Our compiler can generate code over both sparse and dense data structures, and determines whether the recurrence system is solvable with the provided scheduling primitives. We evaluate the performance and correctness of the generated code on several recurrences, from domains as diverse as dense and sparse matrix solvers, dynamic programming, graph problems, and sparse tensor algebra. We demonstrate that the generated code performs competitively with hand-optimized implementations in libraries. However, these handwritten libraries target specific recurrences, specific data structures, and specific optimizations. Our system, on the other hand, automatically generates implementations from recurrences, data formats, and schedules, giving it more generality than library approaches.
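For a concrete, if simple, example of the kind of recurrence such a system takes as input, the edit-distance recurrence below is transcribed directly from its defining equations; the loop nest is one schedule that respects its dependencies, which is the sort of code the compiler described above would generate over dense or sparse formats.

    # Edit distance as a recurrence:
    #   D[i][0] = i,  D[0][j] = j
    #   D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1,
    #                 D[i-1][j-1] + (a[i-1] != b[j-1]))
    # The loop nest below is one valid schedule: it visits cells in an
    # order that satisfies the dependencies of the recurrence.
    def edit_distance(a: str, b: str) -> int:
        m, n = len(a), len(b)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i
        for j in range(n + 1):
            D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i - 1][j] + 1,
                              D[i][j - 1] + 1,
                              D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return D[m][n]

    print(edit_distance("kitten", "sitting"))  # 3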
Food:
2/6: Energy-Efficient ML Using Hyperdimensional Computing
Time: Thursday, February 6, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Chaeyoung Lee
Abstract: We present HyperCam, an energy-efficient image classification pipeline that enables computer vision tasks onboard low-power IoT camera systems. HyperCam leverages hyperdimensional computing to perform training and inference efficiently on low-power microcontrollers. We implement a low-power wireless camera platform using off-the-shelf hardware and demonstrate that HyperCam achieves an accuracy of 93.60%, 84.06%, 92.98%, and 72.79% on MNIST, Fashion-MNIST, Face Detection, and Face Identification tasks, respectively, while significantly outperforming other classifiers in resource efficiency. Specifically, it delivers inference latency of 0.08-0.27s while using 42.91-63.00KB of flash memory and 22.25KB of RAM at peak. Among machine learning classifiers such as SVM, XGBoost, MicroNets, MobileNetV3, and MCUNetV3, HyperCam is the only one that achieves competitive accuracy while maintaining a competitive memory footprint and an inference latency that meet the resource requirements of low-power camera systems.
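A minimal sketch of the hyperdimensional-computing idea behind this pipeline (the random-projection encoder, dimensionality, and toy data below are illustrative and not HyperCam's actual design): features are encoded as high-dimensional bipolar vectors, class prototypes are bundled sums of training encodings, and inference is a nearest-prototype lookup.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 10_000           # hypervector dimensionality
    N_FEATURES = 64      # toy feature size (not HyperCam's real encoder)

    # Random bipolar projection shared by training and inference.
    projection = rng.choice([-1, 1], size=(N_FEATURES, D))

    def encode(x):
        # Project a feature vector into a bipolar hypervector.
        return np.sign(x @ projection)

    def train(xs, ys, num_classes):
        # Class prototypes: bundle (sum) the encodings of each class's examples.
        prototypes = np.zeros((num_classes, D))
        for x, y in zip(xs, ys):
            prototypes[y] += encode(x)
        return np.sign(prototypes)

    def classify(prototypes, x):
        # Inference: nearest prototype by dot-product similarity.
        return int(np.argmax(prototypes @ encode(x)))

    # Toy data: two Gaussian blobs standing in for image features.
    xs = np.concatenate([rng.normal(0.0, 1.0, (50, N_FEATURES)),
                         rng.normal(2.0, 1.0, (50, N_FEATURES))])
    ys = np.array([0] * 50 + [1] * 50)
    protos = train(xs, ys, num_classes=2)
    print(classify(protos, rng.normal(2.0, 1.0, N_FEATURES)))  # expect 1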
Food:
2/13: Benchmarking Code Reasoning Capabilities of Large Language Models for Semantic Equivalence Checking
Time: Thursday, February 13, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Anjiang Wei
Abstract: While large language models (LLMs) excel in code generation and repair, their ability to perform deep semantic reasoning remains underexplored. In this work, we assess LLMs’ capabilities in equivalence checking, a fundamental problem in programming languages that asks whether two programs produce the same output for all inputs. We introduce EquiBench, a dataset of 2400 program pairs across multiple languages, including Python, C, CUDA, and x86-64 assembly, covering six equivalence categories. These pairs are systematically generated using compiler transformations, static analysis, and formal verification to ensure semantic correctness, so that solving them requires deep semantic reasoning rather than reliance on superficial syntactic variations. Experiments show that state-of-the-art LLMs, including OpenAI o3-mini, achieve only 78.0% accuracy, a moderate improvement over random guessing (50%) but far from demonstrating robust semantic reasoning. EquiBench establishes a challenging benchmark for advancing deep program understanding in LLMs.
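To make the task concrete, here is a toy pair in the spirit of the benchmark (not taken from EquiBench itself): the two functions are syntactically different but agree on every non-negative integer input, and deciding that is exactly the kind of semantic judgment the models are asked to make.

    # Two syntactically different programs: are they semantically equivalent?
    def sum_to_n_loop(n: int) -> int:
        total = 0
        for i in range(n + 1):
            total += i
        return total

    def sum_to_n_closed_form(n: int) -> int:
        return n * (n + 1) // 2

    # A quick sanity check over a few inputs; a real equivalence judgment
    # must hold for *all* inputs, which is what makes the problem hard.
    assert all(sum_to_n_loop(n) == sum_to_n_closed_form(n) for n in range(100))
    print("agree on tested inputs")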
Food:
2/20: Optimizing DNN Training with SP-ization
Time: Thursday, February 20, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Colin Unger
Abstract: Abstract redacted due to in-progress work.
Food:
2/27: Letting users write evaluators for Fix, while guaranteeing correctness
Time: Thursday, February 27, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Keith Winstein
Abstract: Fix is a new architecture for 'serverless' computing that Yuhan Deng and Akshay Srivatsan are leading in our group. It's based on an operating system that understands the computational relationships between data, and the dataflow of computations. In this system, each task is a fine-grained block of deterministic machine code run in a hermetic environment of deterministically addressed data-dependencies. The OS guarantees that all computations are reproducible (even if they try hard not to be). In this system, user code can't load files from disk, can't talk over the network, and won't necessarily decide when to memoize a computation: these are the responsibility of the runtime evaluator and scheduler.
What I'd like to talk about is our efforts to let users write their own evaluators/schedulers, based on their understanding of a particular computing job -- but without letting an incorrect evaluator ever produce an incorrect answer. There's some similarity to Halide schedulers and Legion mappers here. We're trying to 'operationalize' the operational semantics of the OS abstractions by letting users write evaluators that effectively prove that a given evaluation is correct. We'd like to prove the soundness of the module that checks these proofs -- e.g., it should be impossible for the user to write an evaluator that 'proves' that two different byte strings are equivalent. I would welcome any feedback from this group, or interest in helping us write this module in a way that admits a mechanized proof of soundness.
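The flavor of the runtime's guarantees can be sketched as follows; this is an illustrative analogy in Python, not Fix's actual interface, and the memo table, hashing scheme, and run_task helper are invented for the example. Because task bodies are deterministic and identified by the content of their code and inputs, the runtime is free to memoize them or re-run them anywhere and obtain the same answer.

    import hashlib
    import json

    # Illustrative sketch (not Fix's actual interface): a computation is
    # named by the hashes of its code and inputs, so a runtime can safely
    # memoize it and reproduce it on any machine.
    memo_table = {}

    def content_address(*parts) -> str:
        return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

    def run_task(task_name: str, fn, *inputs):
        # The task is identified by its name and the addresses of its inputs,
        # never by wall-clock time, the filesystem, or the network.
        key = content_address(task_name, [content_address(x) for x in inputs])
        if key not in memo_table:
            memo_table[key] = fn(*inputs)   # deterministic, hermetic body
        return memo_table[key]

    print(run_task("word_count", lambda s: len(s.split()), "the quick brown fox"))
    print(run_task("word_count", lambda s: len(s.split()), "the quick brown fox"))  # memoized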
Food:
3/6: User-extensible and Productive Programming of Specialized Hardware
Time: Thursday, March 6, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Yuka Ikarashi
Abstract: As single-core performance has reached its limit, exploiting the peak performance of heterogeneous accelerators and specialized instructions has become crucial in many applications. Compilers struggle to keep pace with the diverse and rapidly evolving hardware targets, and automatic optimization often fails to guarantee state-of-the-art performance. Consequently, high-performance libraries are still commonly coded and optimized by hand, at great expense, in low-level C and assembly. User-schedulable languages (USLs) have been proposed to address this challenge by decoupling algorithms from scheduling. I will share our work on Exo, a USL based on the principle of exocompilation, which externalizes hardware-specific code generation and the implementation of scheduling libraries to user code, decoupled from the compiler. Additionally, I will discuss other projects that borrow ideas from USLs and the lessons we have learned from the industry adoption of Exo.
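For readers unfamiliar with user-schedulable languages, here is a toy illustration of the algorithm/schedule split (plain Python, not Exo's actual API): the two matmul variants compute the same thing, and a USL's job is to let the tiled version be derived from the simple one via scheduling directives rather than rewritten by hand.

    import numpy as np

    # Not Exo's actual API: a toy illustration of separating an algorithm
    # from its schedule. The "algorithm" states what is computed; a
    # "schedule" rewrites it (here, loop tiling) without changing meaning.

    def matmul_algorithm(A, B, C):
        n, k, m = A.shape[0], A.shape[1], B.shape[1]
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    C[i, j] += A[i, p] * B[p, j]

    def matmul_tiled(A, B, C, tile=32):
        # One possible "scheduled" version: same arithmetic, reordered for
        # locality. A USL derives this rewrite from scheduling directives
        # instead of requiring the programmer to write it by hand.
        n, k, m = A.shape[0], A.shape[1], B.shape[1]
        for ii in range(0, n, tile):
            for jj in range(0, m, tile):
                for pp in range(0, k, tile):
                    for i in range(ii, min(ii + tile, n)):
                        for j in range(jj, min(jj + tile, m)):
                            for p in range(pp, min(pp + tile, k)):
                                C[i, j] += A[i, p] * B[p, j]

    A = np.random.rand(48, 48); B = np.random.rand(48, 48)
    C1 = np.zeros((48, 48)); C2 = np.zeros((48, 48))
    matmul_algorithm(A, B, C1)
    matmul_tiled(A, B, C2)
    print(np.allclose(C1, C2))  # the schedule preserves the algorithm's meaning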
Food:
3/13: TBD
Time: Thursday, March 13, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Shadaj Laddad
Abstract: TBD
Food:
3/20: Early Termination for Hyperdimensional Computing Using Inferential Statistics
Time: Thursday, March 20, 2025, 12 noon - 1pm
Location: CoDa E401
Speaker: Pu (Luke) Yi
Abstract: Hyperdimensional Computing (HDC) is a brain-inspired, lightweight computing paradigm that has shown great potential for inference on the edge and on emerging hardware technologies, achieving state-of-the-art accuracy on certain classification tasks. HDC classifiers are inherently error resilient and support early termination of inference to approximate classification results. Practitioners have developed heuristic methods to terminate inference early for individual inputs, reducing the computation required for inference at the cost of accuracy. These techniques lack statistical guarantees and may unacceptably degrade classification accuracy or terminate inference later than is needed to obtain an accurate result. We present Omen, the first dynamic HDC optimizer that uses inferential statistics to terminate inference early while providing accuracy guarantees. To realize Omen, we develop a statistical view of HDC that reframes HD computations as statistical sampling and testing tasks, enabling the use of statistical tests. We evaluate Omen on 19 benchmark instantiations of four classification tasks. Omen is computationally efficient, delivering up to 7.21–12.18× inference speed-ups over an unoptimized baseline while incurring only a 0.0–0.7% drop in accuracy. Omen outperforms heuristic methods, achieving an additional 0.04–5.85× inference speed-up over the unoptimized baseline relative to heuristic methods, while maintaining higher or comparable accuracy.
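A rough sketch of the statistical view described above (not Omen's exact test; the chunk size, bound multiplier, and toy data are invented): the per-dimension contributions to each class score are treated as samples, and inference stops once a simple confidence bound separates the leading class from the runner-up.

    import numpy as np

    rng = np.random.default_rng(1)
    D = 10_000                                          # hypervector dimensionality
    prototypes = rng.choice([-1, 1], size=(3, D))       # toy class hypervectors
    query = np.sign(prototypes[2] + 0.3 * rng.normal(size=D)).astype(int)

    # Sketch of statistics-based early termination (not Omen's exact test):
    # per-dimension products query[d] * prototype[c, d] are treated as
    # samples; stop once a crude confidence bound separates the leader
    # from the runner-up.
    def classify_early(query, prototypes, chunk=500, z=4.0):
        seen = 0
        sums = np.zeros(len(prototypes))
        sums_sq = np.zeros(len(prototypes))
        while seen < query.shape[0]:
            lo, hi = seen, min(seen + chunk, query.shape[0])
            contrib = prototypes[:, lo:hi] * query[lo:hi]
            sums += contrib.sum(axis=1)
            sums_sq += (contrib ** 2).sum(axis=1)
            seen = hi
            means = sums / seen
            stderr = np.sqrt(np.maximum(sums_sq / seen - means ** 2, 1e-12) / seen)
            order = np.argsort(means)
            best, second = order[-1], order[-2]
            # Terminate early if the leader's lower bound clears the runner-up's upper bound.
            if means[best] - z * stderr[best] > means[second] + z * stderr[second]:
                break
        return int(np.argmax(means)), seen

    label, dims_used = classify_early(query, prototypes)
    print(label, dims_used, "of", D, "dimensions examined")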
Food: