Understanding the math behind LLM inference is crucial knowledge for anyone working in LLMOps. The high price of the GPUs used for LLM inference puts GPU utilization optimization at the top of our priority list, so I'll go through the memory utilization of LLM inference step by step, and we'll see whether deploying Falcon 7B on 70% of an A10G is any different from deploying it on 100% of the same GPU, or even on an H100.
First, we need to know how much memory is required to load any LLM. The general rule is:
inference_size = 2 * n_parameters  # GB, assuming 2 bytes per parameter (fp16)
# n_parameters: number of parameters in billions
So, 14 GB is the minimum GPU memory required to deploy Falcon 7B.
Scenario: we will deploy Falcon 7B on 100% of an A10G
gpu_capacity = 24 # GB
Model parameters:
precision = 2      # bytes per parameter (fp16)
n_layers = 32      # hidden layers
d_model = 4544     # hidden size
n_parameters = 7   # 7 billion parameters
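Putting these numbers together, here is a minimal sketch (assuming fp16 weights and the rule of thumb above) of how much memory the weights take and what is left over for the KV cache:

inference_size = precision * n_parameters           # 2 * 7 = 14 GB to load the weights
free_size_in_gpu = gpu_capacity - inference_size    # 24 - 14 = 10 GB left for the KV cache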
The concept of KV Cache
What makes transformers powerful is their ability to predict the next word based on the context.
The KV cache, short for Key-Value cache, is a caching mechanism used in Transformer models; it is primarily employed in the context of attention mechanisms.
In attention mechanisms, when processing sequential data such as text, the model attends to different parts of the input sequence to generate the output. This involves computing attention scores between the position being generated and every previous position in the sequence; the scores are based on the similarity (a dot product or another measure) between queries and keys.
In the context of caching, the key-value pairs are the encoded representations of the tokens processed so far. They are stored in a cache for faster retrieval during subsequent decoding steps: instead of recomputing the keys and values for the whole sequence at every step, the model looks up the previously computed ones in the KV cache, saving compute and speeding up inference.
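To make the mechanics concrete, here is a toy, single-head sketch of cached decoding in plain NumPy. The projection matrices are random and the dimensions are tiny; this is only meant to illustrate that each decoding step computes keys/values for the new token and reuses the cached ones:

import numpy as np

toy_d = 8  # tiny hidden size just for illustration (Falcon 7B uses 4544)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((toy_d, toy_d)) for _ in range(3))

k_cache, v_cache = [], []  # one cached key/value vector per token seen so far

def attend(x_new):
    """Process one new token embedding, reusing the cached keys/values."""
    q = x_new @ W_q
    # Only the new token's key/value are computed; older ones come from the cache
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(toy_d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V  # attention output for the new token

# Feed tokens one by one; each step only does the new token's K/V projections
for _ in range(5):
    out = attend(rng.standard_normal(toy_d))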
So, the 10 GB of GPU memory left free after the 14 GB used to load the model can be used for the KV cache. With a simple formula, we can calculate the cache size required based on the model configuration and the context size:
def calc_kv_cache_size(seq_size: int, batch_size: int) -> float:
    """
    Calculate the KV cache size in GB for the provided sequence length and batch size (number of requests).
    :param seq_size: context tokens + answer tokens
    :param batch_size: number of concurrent requests
    """
    # 2 -> one key and one value vector per token, per layer, per hidden dimension
    return 2 * precision * n_layers * d_model * seq_size * batch_size / (1024 ** 3)
seq_size is the number of tokens in the context (input) plus the number of tokens in the output generated by the model.
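To see where the formula comes from: for every token in the sequence, the model stores one key and one value vector (the leading 2) of d_model values at the given precision, for each of the n_layers layers. A quick back-of-the-envelope check for Falcon 7B at fp16:

bytes_per_token = 2 * precision * n_layers * d_model  # keys + values, per layer, per hidden dim
# 2 * 2 * 32 * 4544 = 581,632 bytes, i.e. roughly 0.55 MB of KV cache per token in flight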
For example: we want to serve 100 requests at the same time, and the expected sequence size is 1000 tokens.
required_size = calc_kv_cache_size(seq_size=1000, batch_size=100)
# 54.168701171875 GB
As we can see, we need around 54 GB free in GPU memory to serve 100 users at the same time when the expected sequence size is 1000 tokens, which is not available since we only have 10 GB free.
So, we need to find a way to calculate the number of users we can serve based on the free memory available and the expected sequence size.
def calc_max_concurrent_requests(seq_size: int) -> float:
    """
    Calculate the max batch size based on the provided seq_size and the free GPU memory
    left after loading the model (free_size_in_gpu, in GB).
    :param seq_size: context tokens + answer tokens
    """
    return (free_size_in_gpu * (1024 ** 3)) / (2 * precision * n_layers * d_model * seq_size)
With this function we can do exactly that. For example: we want to know the maximum number of concurrent users the model can handle within the free GPU memory, for a sequence size of 1000 tokens:
number_of_users = int(calc_max_concurrent_requests(seq_size=1000))
# 18
So, we can serve 18 concurrent users with our deployment configuration, if the expected sequence size is 1000 tokens.
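As a rough sanity check, plugging 18 requests back into the cache-size formula from above shows the result fits in the available memory:

required_size = calc_kv_cache_size(seq_size=1000, batch_size=18)
# ~9.75 GB, which fits within the 10 GB we have free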
In the end, all these calculations are back-of-the-envelope estimates, so we should expect slightly different results in real life. However, they give us an initial overview of what we are dealing with.