What is your favorite work related to MLSys in 2023?

Research papers, open-source projects, SaaS, and PaaS products are all acceptable.

My pick is continuous batching. It pushes batched execution to its ceiling and changed my earlier belief that CUDA kernel performance is everything; there is also plenty of room for tricks in scheduling.


Insights into Advanced Language Models and Optimizations

My favorite is vLLM, whose contributions go beyond paged attention and continuous batching. Its scheduling logic is also quite stunning and gave me a new understanding of OOM handling. vLLM also greatly aids the evaluation work within our group.

My second favorite is one of the works on linear attention, RWKV5. It primarily adds multiple attention heads on top of RWKV, a compromise toward multi-head self-attention, and brings a noticeable performance improvement.
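To make the "linear" part concrete, here is a minimal multi-head linear-attention recurrence in NumPy. This is an illustrative sketch only, not RWKV5's actual time-mix formulas (which add learned per-channel decay and bonus terms); every name and shape here is hypothetical:

```python
import numpy as np

# Minimal multi-head linear attention: each head keeps a fixed O(d*d) state
# instead of a KV cache that grows with sequence length, so decode cost per
# step is constant. NOT RWKV5's exact recurrence; a generic sketch.
rng = np.random.default_rng(0)
heads, d = 4, 8
state = np.zeros((heads, d, d))        # per-head recurrent state

def step(q, k, v):
    """q, k, v: (heads, d). Returns per-head outputs, shape (heads, d)."""
    global state
    outs = np.einsum("hd,hde->he", q, state)          # read the state
    state = state + np.einsum("hd,he->hde", k, v)     # rank-1 update: k outer v
    return outs

for _ in range(10):                    # cost per step does not depend on t
    y = step(*rng.standard_normal((3, heads, d)))
print(y.shape)
```

The appeal is exactly this trade: a growing softmax-attention KV cache is replaced by a constant-size state that is updated in place each step.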

I also had the privilege of being fully involved in developing RWKV5's CUDA kernel, implementing the RWKV5 Hugging Face model and the WorldTokenizer, and getting it deployed on Mac and mobile through mlc-llm.

Interest in Kernel Fusion and Chimera Research

I have been following and studying kernel fusion, which covers both compute-intensive and memory-intensive scenarios; both types have seen considerable research in the past. This year, a new study on fusion compilation specifically targeting compute-intensive scenarios caught my attention: "Chimera," published at HPCA by a Ph.D. candidate from Peking University. If you're interested, check it out.

Key LLM Inference Research Topics in 2023

In MLSys research in 2023, one of the most prominent topics was undoubtedly LLM inference, which saw numerous remarkable works. LLM training largely reuses the distributed training methods of earlier transformer-based models, so attention shifted to LLM inference, where differences in scale, memory pressure, and computational demands made optimization more interesting.

Continuous Batching

Continuous batching, introduced in the OSDI'22 paper "Orca," is a significant contribution. The term became widely known through a 2023 blog post, "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency." It improves hardware utilization during LLM serving by batching at the granularity of individual iterations rather than whole requests.
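The iteration-level scheduling idea can be sketched as follows. This is a hypothetical toy scheduler, not Orca's implementation: `decode_step` stands in for one model forward pass, and all names are illustrative:

```python
from collections import deque

def decode_step(batch):
    # Stand-in for one forward pass: each running request emits one token.
    for req in batch:
        req["generated"] += 1

def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit new requests as soon as slots free up. This is the key
        # difference from static batching, which waits for the whole
        # batch to drain before forming the next one.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        decode_step(running)
        still = []
        for req in running:
            if req["generated"] >= req["target_len"]:
                finished.append(req)     # retire immediately, freeing a slot
            else:
                still.append(req)
        running = still
    return finished

reqs = [{"id": i, "generated": 0, "target_len": n} for i, n in enumerate([2, 5, 3, 7, 1])]
done = continuous_batching(reqs)
print([r["id"] for r in done])           # requests finish out of arrival order
```

Because short requests retire as soon as they are done, the GPU is never idling on padded slots while one long request holds the batch open.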


The "vLLM" approach, presented in an SOSP'23 paper with accompanying code, comes from Ion Stoica's LMSYS group. It introduced paged attention, which applies virtual-memory-style paging to the KV cache, together with an attention kernel for non-contiguous sequences. This addresses the memory fragmentation caused by the dynamic growth of the KV cache during LLM decoding, increasing batch capacity and throughput.
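The bookkeeping behind paged attention can be sketched as a block allocator. This is a toy of the block-table idea only, not vLLM's actual implementation, and it omits the attention math entirely:

```python
BLOCK_SIZE = 4  # tokens per physical KV block (vLLM uses e.g. 16)

class KVBlockAllocator:
    """Maps each sequence's logically contiguous KV cache onto scattered
    physical blocks, so memory is grabbed on demand with no large
    contiguous reservation (the OS-paging analogy)."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> [physical block ids]
        self.seq_lens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:              # current block full: map a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free_seq(self, seq_id):
        # Returning blocks to the pool is O(1); no compaction needed.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = KVBlockAllocator(num_blocks=8)
for _ in range(6):                           # sequence 0 writes 6 tokens
    alloc.append_token(0)
print(len(alloc.block_tables[0]))            # 6 tokens at block size 4 -> 2 blocks
```

At most one partially filled block per sequence is wasted, instead of a whole pre-reserved max-length buffer.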

Speculative Inference

Google's speculative inference, described in an ICML'23 paper, uses a smaller draft model to quickly generate multiple tokens; the original model then validates them in a single pass, breaking the limit of one token per LLM decode step. This trades some redundant computation for better hardware utilization and faster generation. LMSYS's "Lookahead Decoding," introduced through a blog post and accompanying code, similarly speeds up token generation via parallel sampling and validation.
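The draft-then-verify loop can be shown with a greedy-acceptance toy. This sketch makes heavy assumptions: both "models" are deterministic next-token functions, so verification is exact prefix matching, whereas real speculative decoding verifies with rejection sampling over distributions and does the check in one batched forward pass:

```python
def speculative_decode(draft_fn, target_fn, ctx, k, num_tokens):
    out = list(ctx)
    while len(out) - len(ctx) < num_tokens:
        # 1) the cheap draft model proposes k tokens autoregressively
        proposal, tmp = [], list(out)
        for _ in range(k):
            t = draft_fn(tmp)
            proposal.append(t)
            tmp.append(t)
        # 2) the target model checks the proposals, keeping the longest
        #    agreeing prefix plus its own correction token on a mismatch
        accepted, tmp = [], list(out)
        for t in proposal:
            expect = target_fn(tmp)
            if expect != t:
                accepted.append(expect)   # correction from the target model
                break
            accepted.append(t)
            tmp.append(t)
        out.extend(accepted)              # up to k tokens per target "pass"
    return out[len(ctx):][:num_tokens]

target = lambda s: (s[-1] + 1) % 10                               # toy "large" model
draft = lambda s: (s[-1] + 2) % 10 if s[-1] == 3 else target(s)   # sometimes wrong
print(speculative_decode(draft, target, [0], k=4, num_tokens=8))
# [1, 2, 3, 4, 5, 6, 7, 8], identical to decoding with the target alone
```

The output is provably the same as decoding with the target model alone; the speedup comes from the target model evaluating several positions per pass instead of one.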

Flash-Decoding and Flash-Decoding++

"Flash-Decoding" and "Flash-Decoding++" address the low parallelism of the attention computation during LLM decode when the batch size is small. Flash-Decoding combines the ideas of Flash-Attention-v2 with standard parallel reduction: the key/value sequence is split into chunks processed in parallel, whose partial results are then merged. Flash-Decoding++ goes further by using a unified max value, removing the cross-chunk dependency along the key/value sequence dimension and improving on Flash-Decoding's parallel reduction. Additionally, Luis Ceze's "FlashInfer" project provides highly efficient kernel implementations for LLM inference.
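The split-and-merge reduction at the heart of Flash-Decoding can be illustrated in NumPy. This is a single-head toy with no actual GPU parallelism; the point is the merge of per-chunk softmax statistics, and all function names are mine:

```python
import numpy as np

def attention_ref(q, K, V):
    """Plain single-query attention, for comparison."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def flash_decoding(q, K, V, num_chunks=4):
    # Split the KV sequence dimension into chunks (parallel on a GPU),
    # keep each chunk's local max and exp-sum, then merge exactly.
    outs, maxes, sums = [], [], []
    for Kc, Vc in zip(np.array_split(K, num_chunks), np.array_split(V, num_chunks)):
        s = Kc @ q
        m = s.max()
        e = np.exp(s - m)
        outs.append(e @ Vc)       # unnormalized partial output
        maxes.append(m)
        sums.append(e.sum())
    # log-sum-exp style merge: rescale every chunk to the global max
    g = max(maxes)
    scale = [np.exp(m - g) for m in maxes]
    denom = sum(z * c for z, c in zip(sums, scale))
    numer = sum(o * c for o, c in zip(outs, scale))
    return numer / denom

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((64, 8))
V = rng.standard_normal((64, 8))
print(np.allclose(flash_decoding(q, K, V), attention_ref(q, K, V)))  # True
```

Because the per-chunk statistics merge exactly, the chunked version is bit-for-bit a softmax attention, just with the sequence-length loop broken into independent pieces.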

Several works focused on reducing memory and computational demands by “compressing” models and leveraging dynamic sparsity for LLM localization deployment, including Tri Dao’s “Deja Vu” (ICML'23 paper and code), IPADS’s “PowerInfer” (paper and code), and Apple’s “LLM in a Flash” (paper). Another noteworthy area is low-bit dynamic quantization in LLM, exemplified by Luis Ceze’s “Atom” (paper and code).

For a comprehensive view of how these works balance various system elements, refer to the attached figure. [Insert System Element Balancing Figure Here]

MLSys Innovations of 2023: Paged Attention and More

My understanding of MLSys work is not that deep, and I initially hesitated to respond. But after receiving an invitation from the esteemed @方佳瑞, I decided to share my perspective.

Due to my limited knowledge, I may not fully grasp the intricacies of the latest technologies in this field. Therefore, I’ll provide insights from a user’s standpoint.

In 2023, one innovation that truly impressed me was “Paged Attention.” Drawing inspiration from operating systems, this concept felt like a breath of fresh air in the MLSys landscape. I must commend the high productivity of the LMSYS team, as their previous works, such as Vicuna and LLM Judge, have been quite impressive, along with their validations of long-context models.

Around the latter half of 2023, I decided to delve into inference acceleration. When I came across the concept of KV Cache, I found it to be a clever design that effectively leveraged the Decoder’s characteristics to speed up inference. I even created a graphical representation:

{Link to Graph: Large Model Inference Acceleration: Understanding KV Cache}
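The mechanism that graph illustrates can be sketched in a few lines of NumPy. This is a toy single-head decoder step with made-up shapes, only meant to show why caching K and V pays off:

```python
import numpy as np

# Why a KV cache helps: past tokens' keys and values never change during
# decoding, so each step only projects the NEWEST token and appends,
# instead of re-projecting the whole prefix every time.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def decode_step(x):
    """x: embedding of the newest token, shape (d,)."""
    global K_cache, V_cache
    q = x @ Wq
    K_cache = np.vstack([K_cache, x @ Wk])   # O(1) new projection work per step
    V_cache = np.vstack([V_cache, x @ Wv])
    s = K_cache @ q / np.sqrt(d)             # attend over all cached tokens
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V_cache

for t in range(5):
    out = decode_step(rng.standard_normal(d))
print(K_cache.shape)                         # one cached K row per token
```

The "resource-intensive" part is visible right in the code: `K_cache` and `V_cache` grow one row per generated token, per layer, per head, which is exactly the memory pressure paged attention later tackles.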

At that time, I thought KV Cache was promising, albeit a bit resource-intensive in terms of GPU memory usage. However, it didn’t take long for Paged Attention to emerge.

After reading the Paged Attention paper, I was struck by its elegance. It felt remarkably refined. It even made me wonder if, in the future, GPU clusters would come bundled with an operating system, allowing algorithm engineers to focus on developing software or apps at a higher level—a new era for AI developers (with a chuckle).

If AGI truly becomes a reality, these MLSys algorithms will operate silently, much like the algorithms in today’s operating systems, guarding upper-level applications. I salute the MLSys experts who make this possible.

Top Picks: Papers and Projects

Thanks for the invitation!

When it comes to papers, I'd like to nominate "ZeRO++" and "FP8-LM." I've spent a significant amount of time since the summer researching FP8 training. There isn't much published engineering experience in this area, and these papers are straightforward yet valuable:

  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
  • FP8-LM: Training FP8 Large Language Models
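As a toy illustration of the communication-quantization idea behind ZeRO++ (cast tensors to a low bit-width before collective communication, dequantize after), here is a single-process NumPy sketch. The real system fuses blocked quantization into the NCCL pipeline and is far more involved; everything below is my own simplification:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization with one fp32 scale per tensor."""
    scale = np.abs(x).max() / 127.0
    scale = scale if scale > 0 else 1.0
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Pretend this gradient shard is about to be all-gathered/reduce-scattered:
grad = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
q, s = quantize_int8(grad)           # what actually goes over the wire
restored = dequantize(q, s)          # what the receiver reconstructs
print(q.nbytes / grad.nbytes)        # 0.25: int8 payload is 4x smaller than fp32
```

The accuracy story in both ZeRO++ and FP8-LM is about keeping this round-trip error small enough (blocked scales, loss scaling, etc.) that training quality is unaffected while communication volume drops severalfold.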

As for projects, my personal favorite has to be “ColossalChat,” which I was involved in writing. After all, we’ve encountered quite a few bugs along the way!

{ColossalChat on GitHub}

I recently had the opportunity to delve into some work related to accelerating large model inference, so I thought I’d share some insights.

As mentioned by others, DejaVu and Apple's LLM in a Flash are two recent, representative works on large model inference acceleration. With large models increasingly deployed on the edge, I believe their approaches have promising applications. Moreover, both are fairly straightforward, with no complex algorithmic logic, making them easy to engineer; they are likely to become classics of edge-side LLM inference acceleration.

The llama.cpp Project

The llama.cpp project is my personal favorite. It began as a project led by a prominent figure, GG (Georgi Gerganov), to deploy large models on Apple Silicon. With contributions from many experts, llama.cpp has expanded its support to more model types and hardware environments, and it has become invaluable learning material for anyone working on model deployment and optimization.

llama.cpp is built on top of the ggml framework. In ggml, GG provides a model inference framework compatible with both CNN and Transformer models, with the primary focus on Transformer LLMs. One of ggml's most significant contributions is its model quantization methods; quantization has become an essential step in deploying large models. Another ggml-based work is whisper.cpp, itself a GitHub star magnet. With the groundwork of ggml and whisper.cpp in place, GG's remarkable feat of finishing the first version of llama.cpp in a single night looks a little less astonishing and sleep-depriving.
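The per-block scaling idea behind ggml's quantization formats can be sketched like this. This is an approximation in the spirit of Q8_0 (one fp scale per block of 32 weights); the real format packs scales and int8 weights into C structs and differs in detail:

```python
import numpy as np

QK = 32  # weights per quantization block, as in ggml's Q8_0

def quantize_q8_0(w):
    """Block-wise symmetric int8: each 32-weight block shares one scale."""
    blocks = w.reshape(-1, QK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                       # avoid div-by-zero on empty blocks
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_q8_0(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.default_rng(0).standard_normal(128).astype(np.float32)
q, s = quantize_q8_0(w)
err = np.abs(dequantize_q8_0(q, s) - w).max()
print(bool(err < 0.05))  # True: round-off stays within the 8-bit step size
```

Per-block scales are what make the scheme robust to outliers: one large weight only inflates the scale of its own 32-weight block, not the whole tensor.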

As for PowerInfer and LLM in a Flash, I believe more corresponding frameworks will emerge to meet the requirements of edge-side deployment and testing.