Effectively benchmarking OCI Compute Shapes for LLM inference serving

Introduction

The rapid evolution of generative AI models is nothing short of remarkable, with new and more powerful models emerging on an almost weekly basis. Since the release of OpenAI’s GPT-2 in 2019, the landscape has exploded, with more than 50 models introduced to date (Figure 1). While some models continue to push boundaries in terms of sheer parameter count, others prioritize efficiency and versatility. Increasingly, model families are diversifying, offering multiple variants tailored to specific use cases and hardware platforms. Take Meta’s Llama model family as an example. Its 3.2 release offers both a compact one-billion-parameter LLM and a three-billion-parameter variant, and it also includes two vision language models (VLMs) with 11 billion and 90 billion parameters.

A plot showing different large language models over time, split by vendor. Time is on the x-axis, and each LLM is represented by a circle whose size indicates its number of parameters.

Figure 1: The relative size of LLMs over time, split by vendor

In parallel with the explosion of model architectures, the broader AI software ecosystem has matured significantly. Not long ago, deploying AI models for inference required a non-trivial effort to assemble the necessary software stack. Today, however, a rich collection of tools and libraries has emerged, streamlining the process. These tools abstract away many of the complexities, enabling seamless integration of models into applications and freeing developers to focus on solving business problems rather than wrangling infrastructure.

This newfound efficiency is a boon for innovation—but it also introduces a critical question: What kind of compute resources are required to serve users with an AI-powered service?

Selecting the right hardware, in particular GPU platforms such as those offered by Oracle Cloud Infrastructure (OCI) GPU instances, involves more than surface-level estimates like GPU memory requirements. While such first-order approximations provide a starting point, the best way to make an informed decision is to benchmark the target model directly. With the tools and frameworks available today, running benchmarks has become a straightforward process, enabling teams to gather actionable performance data and make decisions aligned with their specific workloads and use cases.
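
To illustrate what such a first-order estimate looks like, the short sketch below computes the approximate GPU memory needed to hold the weights of a three-billion-parameter model in 16-bit precision, plus a rough allowance for KV cache and runtime overhead. The overhead factor is an illustrative assumption, not a measured value.

```python
# Back-of-envelope GPU memory estimate for hosting an LLM.
# The overhead factor is an illustrative assumption, not a measurement.

params_billions = 3.0   # Llama 3.2 3B
bytes_per_param = 2     # FP16/BF16 weights
weights_gb = params_billions * 1e9 * bytes_per_param / 1e9

# Rough allowance for KV cache, activations, and framework overhead;
# real usage depends on context length, batch size, and the serving engine.
overhead_factor = 1.5

estimated_gb = weights_gb * overhead_factor
print(f"Weights: ~{weights_gb:.0f} GB, with overhead: ~{estimated_gb:.1f} GB")
# An NVIDIA A10 provides 24 GB of GPU memory, so a 3B model fits comfortably.
```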

Setting Up a Simple LLM Inference Benchmarking System on OCI

While you could develop your own inference benchmarking utility, several existing tools are well-suited for this purpose. For instance, fw-ai/benchmark is a straightforward solution for running and analyzing AI model benchmarks. Another option is NVIDIA Dynamo Perf Analyzer, which offers robust performance evaluation features for generative AI models.

In this section, we will focus on Ray LLMPerf. This tool facilitates benchmarking model performance across a variety of backends and APIs, such as OpenAI, Anthropic, Hugging Face, and VertexAI. Ray LLMPerf is flexible enough to compare different models on a specific backend, or it can evaluate the performance of a particular model you are serving through an inference server.

For the example presented here, we are interested in analyzing the performance characteristics of the Llama 3.2 3B model and evaluating which OCI Compute shape could offer the desired performance. The OCI Compute service offers various virtual machine (VM) and bare metal (BM) hardware configurations (which are called “shapes”). Since a three-billion-parameter large language model should fit comfortably in the memory of an NVIDIA A10 GPU, we’ll use the VM.GPU.A10.1 shape for the experiment. To serve the model, we will use vLLM, a popular open-source inference server known for its ability to handle thousands of concurrent requests efficiently and to support a wide range of large language models. Please refer to this tutorial for details on how to set up the benchmark system and how to run the benchmarks.
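
As a quick sanity check of such a deployment (the tutorial linked above is the authoritative setup reference), the sketch below sends a single request to a vLLM server exposing its OpenAI-compatible API for Llama 3.2 3B Instruct. The host, port, and API key are placeholder assumptions and should match your own vLLM configuration.

```python
# Minimal smoke test against a vLLM OpenAI-compatible endpoint.
# Assumes vLLM is already serving meta-llama/Llama-3.2-3B-Instruct on the
# VM.GPU.A10.1 instance; host, port, and key below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed vLLM default port
    api_key="EMPTY",                      # vLLM accepts a dummy key unless one is configured
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "In one sentence, what is OCI?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

If this returns a sensible answer, the inference server is ready to be benchmarked.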

Benchmarking LLM Inference

After setting up the benchmark system, the next step is to define a benchmark scenario and execute the benchmarks. Given a particular application scenario and a chosen large language model, we want to understand the performance characteristics of the target system when it handles concurrent inference requests.

Table 1 summarizes our experiment parameters. Here we are interested in measuring the performance of a simple query-answer chatbot, which should support up to 32 concurrent requests. Two important parameters to consider are the number of input and output tokens, which represent the query and answer sizes. Given that this is a query-answer chatbot, we define the number of input tokens to be normally distributed with a mean of 200 tokens (standard deviation 40), while the number of output tokens is normally distributed with a mean of 100 tokens (standard deviation 10).

Parameter            Value
output tokens        N(100, 10)
input tokens         N(200, 40)
use case             chat
model                Meta Llama 3.2 3B Instruct
concurrent requests  1 – 32

Table 1: Benchmark scenario parameters
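
To make these parameters concrete, the sketch below shows one way to drive a concurrency sweep with LLMPerf’s token_benchmark_ray.py against the vLLM endpoint started earlier. The script path, flag names, and environment variables reflect the LLMPerf repository at the time of writing and should be verified against your checkout; the endpoint URL is an assumption.

```python
# Sketch: sweep concurrency levels with LLMPerf using the Table 1 parameters.
# Verify flag names and environment variables against your LLMPerf checkout;
# the endpoint URL assumes the vLLM server from the previous step.
import os
import subprocess

os.environ["OPENAI_API_BASE"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "EMPTY"

for concurrency in [1, 2, 4, 8, 16, 32]:
    subprocess.run(
        [
            "python", "token_benchmark_ray.py",
            "--model", "meta-llama/Llama-3.2-3B-Instruct",
            "--llm-api", "openai",
            "--mean-input-tokens", "200",
            "--stddev-input-tokens", "40",
            "--mean-output-tokens", "100",
            "--stddev-output-tokens", "10",
            "--num-concurrent-requests", str(concurrency),
            "--max-num-completed-requests", str(concurrency * 10),
            "--results-dir", f"results/concurrency_{concurrency}",
        ],
        check=True,
    )
```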

Given this scenario, we can now run repeated inference queries and measure the performance figures we are interested in (a short sketch of how these metrics can be computed from per-request records follows the list):

  • Per-query token throughput in tokens per second (TPS): The rate at which the model, and the system as a whole, responds to a single query. A higher value indicates better performance; a responsive system offering a reasonable user experience should provide at least 10 – 15 tokens/s.
  • System token throughput in tokens per second: Since our system can process requests in parallel, we are interested in the maximum concurrency that the system is able to sustain, which is indicated by the system token throughput. Total token throughput continues to scale with increased concurrency only as long as the system can sustain the same level of per-request performance.
  • Time-to-first-token (TTFT) in seconds: The latency between issuing a query and the first token being returned. A smaller value indicates better responsiveness. For real-time inference we can typically tolerate only a few seconds of TTFT latency, while in batch scenarios TTFT matters less since there is no real-time user interaction.
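
To make these definitions concrete, here is a small sketch that computes the three metrics from a list of per-request records. The field names are hypothetical; map them to whatever your benchmarking tool actually records.

```python
# Illustrative computation of the three metrics from per-request records.
# Field names are hypothetical; adapt them to your benchmark tool's output.
from dataclasses import dataclass

@dataclass
class RequestRecord:
    ttft_s: float          # time to first token, in seconds
    e2e_latency_s: float   # total request latency, in seconds
    output_tokens: int     # tokens generated for this request

def summarize(records: list[RequestRecord], wall_time_s: float) -> dict:
    per_query_tps = [r.output_tokens / r.e2e_latency_s for r in records]
    total_output_tokens = sum(r.output_tokens for r in records)
    return {
        # Per-query throughput: how fast each individual user sees tokens.
        "mean_per_query_tps": sum(per_query_tps) / len(per_query_tps),
        # System throughput: total tokens generated per second of wall time.
        "system_tps": total_output_tokens / wall_time_s,
        # Responsiveness: average time to first token across requests.
        "mean_ttft_s": sum(r.ttft_s for r in records) / len(records),
    }
```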

Analyzing the Results

Running the benchmarks we discussed above will take a while, as the run iterates through increasing concurrency levels. The results can be seen in Figure 2, where both token throughput and TTFT are plotted.

The figure shows two line plots side by side. In both, the x-axis is the total system throughput in tokens per second; the y-axes show the per-query tokens per second and the TTFT, respectively. Each plotted point represents an increasing number of concurrent requests.
Figure 2: Token Throughput and TTFT for the chatbot scenario outlined in Table 1

The two plots in Figure 2 illustrate the primary performance metrics discussed above (a short sketch of how such plots can be produced follows the list):

  • Left Plot: System throughput (tokens/s) vs. per-query throughput (tokens/s).
  • Right Plot: System throughput (tokens/s) vs. time to first token (TTFT) latency.
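
As a rough sketch, assuming the per-concurrency summaries produced by the helper above (or extracted from your benchmark tool's result files), the two panels can be reproduced with matplotlib:

```python
# Sketch: plot system throughput vs. per-query throughput and vs. TTFT.
# Fill `summaries` from your own benchmark runs; no real results are shown here.
import matplotlib.pyplot as plt

summaries: dict[int, dict] = {
    # concurrency: {"system_tps": ..., "mean_per_query_tps": ..., "mean_ttft_s": ...}
}

levels = sorted(summaries)
x = [summaries[c]["system_tps"] for c in levels]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(x, [summaries[c]["mean_per_query_tps"] for c in levels], marker="o")
ax1.set_xlabel("System throughput (tokens/s)")
ax1.set_ylabel("Per-query throughput (tokens/s)")

ax2.plot(x, [summaries[c]["mean_ttft_s"] for c in levels], marker="o")
ax2.set_xlabel("System throughput (tokens/s)")
ax2.set_ylabel("Time to first token (s)")

fig.tight_layout()
fig.savefig("figure2.png")
```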

Several observations can be drawn from these results:

  • Infrastructure Suitability of the VM.GPU.A10.1 Shape: The selected OCI shape demonstrates strong performance for serving Llama 3.2 3B under the defined use case. Both tokens/s throughput and TTFT remain within acceptable ranges across all concurrency levels tested.
  • System Scalability: The left plot shows a linear increase in per-query throughput as system throughput rises, suggesting that the system has not yet reached saturation. This implies that the chosen GPU shape may effectively handle additional concurrency.
  • Maximum Concurrency: At the highest tested concurrency level (32), the system maintains throughput above 40 tokens/s, with a time to first token (TTFT) latency of less than 500 milliseconds. While actual performance will vary based on input and output token sizes, these results provide a reliable baseline for expectations in real-world scenarios.

These findings indicate that the VM.GPU.A10.1 shape is not only well-suited for this workload but also offers room for scaling to higher levels of concurrency if needed. This provides confidence in its ability to serve inference workloads efficiently while maintaining low latency.

Conclusion

Effectively benchmarking large language models like Llama 3.2 3B on Oracle Cloud Infrastructure’s VM.GPU.A10.1 shape demonstrates not only the feasibility of deploying high-performance AI inference services in the cloud but also the importance of a well-structured benchmarking process. With tools like Ray LLMPerf, ML and cloud engineers can gain critical insights into system throughput, scalability, and latency, enabling informed decisions about infrastructure optimization. The promising results, highlighted by robust throughput and minimal latency even at high concurrency levels, underscore the potential of OCI GPU shapes to support scalable AI applications. As generative AI continues to evolve, leveraging such benchmarking practices will be essential for designing efficient, responsive AI-driven solutions that meet diverse user needs.
