
Large language models (LLMs) have rapidly advanced as they’ve proven to be extremely effective tools. However, as demand for them grows, training LLMs on GPUs can slow adoption due to challenges like memory limits, throughput, and deep learning framework overhead. To help resolve this across the ML community, we introduced Liger-Kernel, a new open-source library designed to enhance GPU efficiency for training LLMs.
Liger-Kernel’s efficient Triton kernels provide a simple solution for improving performance and resource optimization. The library, available on GitHub, can improve training throughput by 20% and reduce memory usage by 60% with just a single line of code for popular models like Llama, Gemma, and Qwen.
Since its initial release in August 2024, Liger-Kernel has grown rapidly across the community, accumulating 3,000+ stars and 200k+ downloads. We have also integrated with mainstream training frameworks, including Axolotl, LLaMa-Factory, SFTTrainer, Hugging Face Trainer, and SWIFT, and support distributed training frameworks such as PyTorch FSDP and Microsoft DeepSpeed.
In this blog post, we’ll talk about the problems that Liger-Kernel aims to solve, such as the high per-operation and GPU memory I/O overhead, as well as how we encapsulated the efficient kernels in Liger-Kernel into APIs that are easy to use and adaptable. We will also provide a summary of LinkedIn’s current LLM training infrastructure stack and the benchmarking results that demonstrate the effectiveness of Liger-Kernel.
What are the inefficiencies in LLM training?
Scaling LLM training is susceptible to efficiency bottlenecks and heavily depends on the stability of the compute infrastructure. Host/device memory management and latency-bandwidth trade-offs for tensor operations are central to these efficiency issues. Despite recent advancements in distributed training hardware and software usability, optimizing the training process remains a highly specialized and complex endeavor. It requires not only a deep understanding of LLM algorithms and hardware architectures, but also significant time and financial investments.
There are two types of performance bottleneck we are particularly interested in addressing here:
- Extensive GPU memory access overhead
- Per-operation overhead

GPU memory access overhead
GPUs have a hierarchical memory architecture, composed of a large but slow high-bandwidth memory (HBM) and fast but limited-size shared memory (SRAM). A GPU’s streaming multiprocessor (SM) processing units can only directly access SRAM. As a result, every GPU kernel launch requires an HBM->SRAM load and an SRAM->HBM write, which incurs significant overhead and erodes the benefit of the GPU’s fast compute capacity. This is particularly costly for simple kernels with low arithmetic intensity (FLOPs per byte of memory access), such as element-wise ops and reduction ops.
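To make “low arithmetic intensity” concrete, here is a rough back-of-the-envelope calculation (illustrative numbers, not from the original post) for a single element-wise add on fp16 tensors:

```python
# Arithmetic intensity of z = x + y on fp16 tensors (a sketch, not a benchmark).
n = 4096 * 4096                        # number of elements
bytes_per_elem = 2                     # fp16

flops = n                              # one add per element
bytes_moved = 3 * n * bytes_per_elem   # read x, read y, write z (HBM traffic)

arithmetic_intensity = flops / bytes_moved
print(f"{arithmetic_intensity:.3f} FLOP/byte")  # ~0.167 -> heavily memory-bound
```

At roughly 0.17 FLOP per byte, the kernel spends almost all of its time waiting on HBM rather than computing, which is exactly the regime where avoiding redundant memory round trips pays off.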
Per-operation overhead
In deep learning frameworks that use eager execution (such as PyTorch without torch.compile and TensorFlow 2 eager mode), operations are dispatched one by one: the model code is executed line by line, and even though the GPU kernels themselves run asynchronously, the framework incurs per-operation CPU overhead. In addition, during training, the output activations of every operation have to be stored for the backward pass, which adds a significant memory footprint and prevents us from fully leveraging the GPU’s parallel computing power by increasing the problem size (batch size, etc.).
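The memory side of this is easy to see in eager PyTorch. The following sketch (illustrative, not from the original post; it needs a GPU) chains a few element-wise ops, and every full-size intermediate stays alive until backward:

```python
import torch

x = torch.randn(4096, 4096, device="cuda", requires_grad=True)
w = torch.randn(4096, device="cuda", requires_grad=True)

# Each line dispatches its own kernel from Python (per-operation CPU overhead),
# and each full-size intermediate is kept alive by autograd for the backward
# pass, even though the Python name `h` is rebound.
h = x * w                           # intermediate 1
h = torch.nn.functional.silu(h)     # intermediate 2
h = h * x                           # intermediate 3
loss = h.sum()

print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB held before backward")
loss.backward()
```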
How we built Liger-Kernel
In designing Liger-Kernel, we built upon well-established techniques used in successful approaches, such as FlashAttention and torch.compile, leveraging tried-and-true optimizations that have been proven effective in advancing GPU performance. While these methods are widely adopted, we aimed to push the boundaries further through careful and innovative design.
Enhancing the kernel
The cornerstone of Liger-Kernel’s design is operator fusion: combining several standalone GPU kernels into one to avoid the per-operation time and memory overhead of step-by-step execution mentioned earlier. Simple examples include combining multiple element-wise ops, or fusing an activation computation with the operations around it; these are supported by most out-of-the-box model compilers. With more careful algorithm design and manual backprop derivation, we can collapse an entire computation, which might involve more than five standalone operator calls in eager execution, into a single kernel.
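As a deliberately tiny illustration of fusion (a sketch, not a Liger-Kernel kernel), the Triton code below computes y = relu(x + b) for two same-shaped tensors in a single kernel, so the intermediate x + b never round-trips through HBM:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, b_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(b_ptr + offsets, mask=mask)
    # Both the add and the relu happen in registers/SRAM;
    # no intermediate tensor is written back to HBM.
    y = tl.maximum(x + b, 0.0)
    tl.store(y_ptr + offsets, y, mask=mask)

def fused_add_relu(x: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, b, y, n, BLOCK_SIZE=1024)
    return y
```

In eager mode the same computation would launch two kernels and materialize the intermediate sum in HBM; the fused version reads each input once and writes the output once.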
More advanced optimizations can also be implemented with operator fusion, such as chunked/blockwise computation, the backbone of many memory-efficient algorithms, including FlashAttention and Ring Attention. In Liger-Kernel, we implemented various chunked losses that avoid materializing the full logits, leading to a meaningful reduction in GPU memory footprint. This is particularly important for model families with huge vocabularies, such as Llama and Qwen.
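To illustrate the chunking idea, here is a simplified PyTorch sketch (not Liger-Kernel’s actual implementation; names and shapes are illustrative). It computes the logits one chunk of tokens at a time, so the full [num_tokens, vocab_size] matrix is never materialized in the forward pass; the real kernels also handle the backward pass chunk by chunk, which this sketch omits:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, targets, chunk_size=1024):
    """hidden: [N, D], lm_head_weight: [V, D], targets: [N]; returns mean loss."""
    n_tokens = hidden.shape[0]
    total_loss = hidden.new_zeros(())
    for start in range(0, n_tokens, chunk_size):
        h = hidden[start:start + chunk_size]   # [chunk, D]
        logits = h @ lm_head_weight.T          # [chunk, V] -- never the full [N, V]
        total_loss = total_loss + F.cross_entropy(
            logits, targets[start:start + chunk_size], reduction="sum"
        )
    return total_loss / n_tokens
```

With a vocabulary of 128K and a few thousand tokens per batch, the full logits tensor alone can run into gigabytes, which is why chunking it away matters so much for these model families.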
Triton-based kernels
We chose OpenAI’s Triton as the programming language for implementing our kernels. Triton, a Python-based language and compiler for high-performance GPU kernels, makes it easier to optimize deep learning operations without dealing with low-level GPU programming. Its tile-based abstraction lets users operate on tiles of data rather than individual threads as in CUDA, which unlocks a variety of underlying optimizations while keeping the interface user-friendly. Triton’s JIT compilation also keeps libraries and tools built on it lightweight and portable, especially on the training front, where JIT compilation time is negligible compared to the lifetime of the training process.
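To give a feel for the tile-based model, the sketch below (again illustrative, not a Liger-Kernel kernel, and assuming each row fits in a single tile) implements a row-wise softmax: each program instance loads a whole row as a tile and reduces over it with tl.max and tl.sum, instead of coordinating individual threads:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, y_ptr, n_cols, stride, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load one entire row as a tile; out-of-range lanes get -inf.
    x = tl.load(x_ptr + row * stride + cols, mask=mask, other=float("-inf"))
    x = x - tl.max(x, axis=0)        # tile-wide reduction, no manual thread sync
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(y_ptr + row * stride + cols, y, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](x, y, n_cols, x.stride(0), BLOCK_SIZE=BLOCK_SIZE)
    return y
```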
API interface
The guiding principle behind Liger’s API design is to be the least disruptive to users’ existing codebases while providing the flexibility needed for various levels of customization. Depending on the level of customization required, there are three ways to apply Liger kernels:
1. Using AutoLigerKernelForCausalLM:
The simplest way to leverage Liger kernels is through the AutoLigerKernelForCausalLM class, which requires no model-specific patching imports. If the model type is supported, Liger automatically patches the modeling code.
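Usage looks roughly like the following (the import path follows Liger-Kernel’s documentation; the model path is a placeholder):

```python
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Drop-in for Hugging Face's AutoModelForCausalLM: if the model type is
# supported, Liger patches the modeling code with its kernels automatically.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
```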