Understanding the math behind LLM inference is crucial knowledge for anyone working in LLMOps. The high price of the GPUs used for LLM inference puts GPU utilization optimization at the top of our priority list, so I'll go through the memory utilization of LLM inference step by step, and we'll see whether deploying Falcon 7B on 70% of an A10G is any different from deploying it on 100% of the same GPU, or even on an H100.
First, we need to know how much memory is required to load any LLM. The general rule is:
inference_size = 2 * n_parameters  # GB, assuming 2 bytes per parameter (fp16)
# n_parameters: number of parameters in billions
So, 14 GB is the minimum GPU memory required to deploy Falcon 7B.
Scenario: we will deploy Falcon 7B on 100% of an A10G
gpu_capacity = 24 # GB
Model parameters:
precision = 2      # bytes per parameter (fp16)
n_layers = 32      # hidden layers
d_model = 4544     # hidden size
n_parameters = 7   # 7 billion parameters
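Putting these numbers together, here is a minimal sketch (assuming fp16 weights and the rule of thumb above) of how much memory the weights take and what is left over for the KV cache:

inference_size = precision * n_parameters           # 2 * 7 = 14 GB to load the weights
free_size_in_gpu = gpu_capacity - inference_size    # 24 - 14 = 10 GB left for the KV cache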
The concept of KV Cache
What makes transformers powerful is their ability to predict the next word based on the context.
The KV cache, short for Key-Value cache, is a caching mechanism used in Transformer models; it is primarily employed in the context of attention mechanisms.
In attention mechanisms, when processing sequential data such as text, the model attends to different parts of the input sequence to generate the output. This involves computing attention scores between the position being generated and every previous position in the sequence; the scores are based on the similarity (a dot product or another measure) between queries and keys.
In the context of caching, the key-value pairs are the encoded representations of the tokens processed so far. They are stored in a cache for faster retrieval during subsequent decoding steps: instead of recomputing the keys and values for the whole sequence at every step, the model looks up the previously computed ones in the KV cache, saving compute and speeding up inference.
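To make the mechanics concrete, here is a toy, single-head sketch of cached decoding in plain NumPy. The projection matrices are random and the dimensions are tiny; this is only meant to illustrate that each decoding step computes keys/values for the new token and reuses the cached ones:

import numpy as np

toy_d = 8  # tiny hidden size just for illustration (Falcon 7B uses 4544)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((toy_d, toy_d)) for _ in range(3))

k_cache, v_cache = [], []  # one cached key/value vector per token seen so far

def attend(x_new):
    """Process one new token embedding, reusing the cached keys/values."""
    q = x_new @ W_q
    # Only the new token's key/value are computed; older ones come from the cache
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(toy_d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V  # attention output for the new token

# Feed tokens one by one; each step only does the new token's K/V projections
for _ in range(5):
    out = attend(rng.standard_normal(toy_d))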
So, the 10 GB of GPU memory left free after the 14 GB used to load the model can be used for the KV cache. With a simple formula, we can calculate the cache size required based on the model configuration and the context size:
def calc_kv_cache_size(seq_size: int, batch_size: int) -> float:
    """
    Calculate the KV cache size in GB for the provided sequence length and batch size (number of requests).
    :param seq_size: context tokens + answer tokens
    :param batch_size: number of concurrent requests
    """
    # 2 -> one key and one value vector per token, per layer, per hidden dimension
    return 2 * precision * n_layers * d_model * seq_size * batch_size / (1024 ** 3)
seq_size is the number of tokens in the context (input) plus the number of tokens in the output generated by the model.
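To see where the formula comes from: for every token in the sequence, the model stores one key and one value vector (the leading 2) of d_model values at the given precision, for each of the n_layers layers. A quick back-of-the-envelope check for Falcon 7B at fp16:

bytes_per_token = 2 * precision * n_layers * d_model  # keys + values, per layer, per hidden dim
# 2 * 2 * 32 * 4544 = 581,632 bytes, i.e. roughly 0.55 MB of KV cache per token in flight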
For example: we want to serve 100 requests at the same time, and the expected sequence size is 1000 tokens.
required_size = calc_kv_cache_size(seq_size=1000, batch_size=100)
# 54.168701171875 GB
As we can see, we need around 54 GB free in GPU memory to serve 100 users at the same time when the expected sequence size is 1000 tokens, which is not available since we only have 10 GB free.
So, we need to find a way to calculate the number of users we can serve based on the free memory available and the expected sequence size.
def calc_max_concurrent_requests(seq_size: int) -> float:
    """
    Calculate the max batch size based on the provided seq_size and the free GPU memory
    left after loading the model (free_size_in_gpu, in GB).
    :param seq_size: context tokens + answer tokens
    """
    return (free_size_in_gpu * (1024 ** 3)) / (2 * precision * n_layers * d_model * seq_size)
With this function we can do exactly that. For example: we want to know the maximum number of concurrent users the model can handle within the free GPU memory, for a sequence size of 1000 tokens:
number_of_users = int(calc_max_concurrent_requests(seq_size=1000))
# 18
So, we can serve 18 concurrent users with our deployment configuration, if the expected sequence size is 1000 tokens.
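As a rough sanity check, plugging 18 requests back into the cache-size formula from above shows the result fits in the available memory:

required_size = calc_kv_cache_size(seq_size=1000, batch_size=18)
# ~9.75 GB, which fits within the 10 GB we have free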
In the end, all these calculations are back-of-the-envelope estimates, so we should expect slightly different results in real life. However, they give us an initial overview of what we are dealing with.