Pushing Open-Source LLM Inference to Its Limits

How We Achieved 5.4x Cheaper Inference than Together AI at AlpineX

Large language models (LLMs) are revolutionizing industries, but their computational demands can be a significant barrier to adoption. At AlpineX, we've been working to optimize LLM inference, making these powerful models more accessible and cost-effective.

By pushing the boundaries of open-source inference libraries like vLLM and TensorRT-LLM, we've cut LLM inference costs dramatically: our optimized stack offers over a 5x cost reduction compared to competitors like Together AI.

This level of cost reduction is only possible through economies of scale. By aggregating multiple customer requests into a single batch, we can optimize resource utilization and significantly reduce costs. This is particularly beneficial for smaller-scale users who might otherwise waste resources on underutilized hardware.

In this blog post, we'll delve into the technical aspects of our optimization process, sharing insights into how these techniques can be applied to any LLM project.

Experiment Setup

Open-Source LLM Inference Libraries

When it comes to open-source LLM inference, several powerful libraries are available, such as vLLM, TensorRT-LLM, ExLlama, and llama.cpp. Each offers unique strengths, and the best choice depends on the specific use case. For example, llama.cpp is ideal for smaller models on CPUs, while vLLM and TensorRT-LLM excel at handling the large batch sizes common in production environments.

Given our focus on maximizing throughput for large language models, we opted to experiment with TensorRT-LLM and vLLM. This blog post covers our experiments with vLLM; we will document the TensorRT-LLM experiments in a future post.

Testing Methodology

To simulate a real-world load, we sent 1000 requests from the ShareGPT dataset to the models at a rate of 32 requests per second. Request arrivals were exponentially distributed to mimic the bursty, unpredictable demand we see in production environments, and we capped the context length at 8192 tokens, a common range in practical applications. We aimed to optimize model throughput, although we acknowledge there are other relevant metrics such as Time to First Token and Inter-Token Latency. We also made sure to use the latest available version of each library, so that our numbers reflect the optimizations the NVIDIA and vLLM teams have released over the past couple of months.
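
vLLM's repository includes a serving benchmark script for exactly this kind of test, but the core idea is simple enough to sketch. Below is a minimal, hypothetical load generator against an OpenAI-compatible endpoint; the URL, model id, and prompt list are placeholders, and the exponential inter-arrival times produce the bursty (Poisson) traffic pattern described above.

```python
import asyncio
import random

import aiohttp  # any async HTTP client works

API_URL = "http://localhost:8000/v1/completions"    # placeholder endpoint
MODEL = "meta-llama/Llama-3.1-70B-Instruct"         # placeholder model id
REQUEST_RATE = 32.0                                  # requests per second
NUM_REQUESTS = 1000


async def send_request(session: aiohttp.ClientSession, prompt: str) -> dict:
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 256}
    async with session.post(API_URL, json=payload) as resp:
        return await resp.json()


async def run(prompts: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        tasks = []
        for prompt in prompts[:NUM_REQUESTS]:
            tasks.append(asyncio.create_task(send_request(session, prompt)))
            # Exponential inter-arrival times give a Poisson arrival process,
            # i.e. the bursty, unpredictable load pattern described above.
            await asyncio.sleep(random.expovariate(REQUEST_RATE))
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    # In practice the prompts come from the ShareGPT dataset.
    asyncio.run(run(["Hello, how are you?"] * NUM_REQUESTS))
```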

Baseline Performance

For testing, we chose the Llama-3.1 70B model with fp8 quantization. We decided against 16-bit precision because the quality gap between the two is minimal and fp8 is currently the industry standard. For our baseline, we ran the model on one H100 with the default vLLM parameters and saw a throughput of 1980 tokens per second (input and output combined), which is respectable but far from ideal. With a little tuning, such as raising the GPU memory utilization to 95% and enabling chunked prefill, we pushed this to 2039 tokens per second. We used this as the baseline for benchmarking.
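
For reference, here is a rough sketch of that tuned single-GPU baseline using vLLM's offline LLM API. The model identifier is illustrative (a pre-quantized fp8 checkpoint could be used instead of on-the-fly quantization), and exact argument names may shift slightly between vLLM versions.

```python
from vllm import LLM, SamplingParams

# Tuned single-H100 baseline: fp8 weights, 95% GPU memory utilization,
# chunked prefill enabled, context capped at 8192 tokens.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative; an fp8 checkpoint also works
    quantization="fp8",
    gpu_memory_utilization=0.95,
    enable_chunked_prefill=True,
    max_model_len=8192,
)

outputs = llm.generate(
    ["Explain KV caching in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```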

Optimization Strategies

GPU Memory Bottleneck

Our first observation was that the 70B model's weights occupy almost 67GB of the H100's 80GB of memory (roughly 70 billion parameters at one byte each in fp8). This left very little space for the KV cache, which limits how many sequences can be batched together; with small batches, decoding becomes bound by GPU memory bandwidth, which ultimately lowered the throughput.

Our solution was to switch to 2x H100 clusters connected with NVLink. In this setup, the model weights are sharded across both GPUs, leaving over 90GB for the KV cache. In theory, this setup should outperform two separate H100s each running the model independently, since the weights only need to be stored once; the cost is some inter-GPU communication overhead, which NVLink keeps to a minimum.
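
In vLLM terms, this means sharding the model with tensor parallelism across the NVLink-connected pair. A minimal sketch, with the model id again illustrative:

```python
from vllm import LLM

# Shard the fp8 70B model across the two NVLink-connected H100s.
# The weights are stored once across the pair, leaving ~90 GB for the KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative
    quantization="fp8",
    tensor_parallel_size=2,        # split each weight matrix across both GPUs
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)
```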

The results matched our intuition. Without any other optimization, the 2xH100 cluster delivered 4705 tokens per second, or 2352 tokens per second per GPU, an immediate 15% improvement over our previous baseline.


Attention Kernel Replacement: Flashinfer with fp8 KV Cache

Next, we looked at vLLM's attention backend. By default, vLLM uses FlashAttention (flash_attn), but we found that swapping it for FlashInfer made a real difference. FlashInfer's optimized implementation, combined with an fp8 KV cache that shrinks the cache's memory footprint, allowed even higher throughput.
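
In vLLM, the attention backend is selected through an environment variable and the KV-cache data type through an engine argument. A sketch of the combination we are describing, with the model id illustrative and flag names subject to change across versions:

```python
import os

# The attention backend must be chosen before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative
    quantization="fp8",
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",          # roughly halves the KV-cache footprint vs. fp16
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)
```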

The Hidden Bottleneck: CPU Overheads

During our experiments and code profiling, we discovered that CPU overhead is very noticeable in vLLM. While output token generation happens on the GPU, the CPU still plays a crucial role in serving requests, scheduling, and tokenizing prompts.

The vLLM team arrived at the same conclusion: their profiling revealed that 33% of the total execution time was tied up in the HTTP API server, while another 29% went to scheduling. That means over 60% of the total execution time was spent on CPU-bound tasks, leaving the GPU waiting for the CPU to finish.

In v0.6.0, the vLLM team introduced multi-step scheduling, which is disabled by default. After enabling it, we observed another major improvement over our previous numbers. For the 70B model, 16 scheduler steps struck the right balance between the CPU and the GPU in our setup, though the optimal value will vary with the hardware used. This change brought a noticeable reduction in idle GPU time, which gave us another substantial boost in efficiency. vLLM also offers an async tokenizer pool; after enabling it with a pool size of 4, throughput reached 3322 tokens per second per GPU, already a 1.67x improvement over the initial 1980 tokens per second.
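
Both knobs are exposed as engine arguments. A sketch of the settings described above, with the caveat that the right values depend on your hardware and that the tokenizer pool may require Ray depending on the configured pool type:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative
    quantization="fp8",
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
    num_scheduler_steps=16,   # multi-step scheduling: 16 GPU steps per scheduling pass
    tokenizer_pool_size=4,    # async tokenizer workers take tokenization off the critical path
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)
```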


Playing with Block Sizes, Max Number of Sequences, and Prefill Chunking

After resolving the CPU bottleneck, we explored a few more tweaks based on recommendations from vLLM’s documentation. Larger block sizes were suggested as a potential way to improve throughput, though we didn’t see much difference in practice.

Prefill chunking, however, made a noticeable difference. Chunked prefill splits long prompt prefills into pieces and batches them alongside ongoing decode requests, effectively letting the scheduler prioritize decoding. We set the maximum number of batched tokens to 16384, which led to another performance improvement. Unfortunately, the FlashInfer kernel doesn't currently support prefill chunking, but even with the default FlashAttention kernel we reached 3401 tokens per second per GPU, a whopping 1.72x improvement over the baseline.
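
The relevant engine arguments are enable_chunked_prefill and max_num_batched_tokens. A sketch of this final configuration; whether chunked prefill can be combined with multi-step scheduling depends on the vLLM version, so treat the exact combination as illustrative:

```python
from vllm import LLM

# Final configuration: default FlashAttention backend (FlashInfer does not
# support chunked prefill) with a 16384-token budget per scheduler step.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative
    quantization="fp8",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=8192,
    enable_chunked_prefill=True,
    max_num_batched_tokens=16384,  # token budget shared by prefill chunks and decodes
    tokenizer_pool_size=4,
    num_scheduler_steps=16,        # keep only if your vLLM version supports it with chunked prefill
)
```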


Further Optimization

Here are a few additional ideas that could improve performance even further:

  • Recompiling vLLM specifically for H100 GPUs instead of using the default image, which is compiled to support a wide range of GPU architectures
  • Using FlashAttention 3 kernels for the attention backend
  • Using H200 GPUs, which have a higher memory bandwidth

Summary

  • Large GPU clusters with large batch sizes make the model inference cheaper per query
  • Attention implementation and KV Cache data type matter
  • The CPU is also an important bottleneck and must be managed well with multi-step scheduling, async tokenizer pools, and prefill chunking

Final Thoughts

Optimizing open-source LLM inference to this extent isn’t a one-size-fits-all process. Each adjustment, whether switching attention kernels, freeing up GPU memory for the KV cache, or offloading work from the CPU, added up to a significant gain. The key lesson we learned? While the GPU might be doing the heavy lifting, maximizing CPU efficiency and strategically managing memory can unlock substantial performance improvements.

We’re thrilled with the outcome and excited to see how further optimizations in frameworks like vLLM will continue to expand what’s possible in open-source LLM inference. If you’re looking to push your models to new performance heights, we hope our journey offers a helpful blueprint.