OI Performance Benchmark Technical Review

To evaluate the inference capabilities of a large language model (LLM), we focus on two key metrics: latency and throughput. Latency measures the time it takes for an LLM to generate a response to a user’s prompt. It is a critical indicator of a language model’s speed and significantly impacts a user’s perception of […]
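
A minimal sketch of how these two metrics are typically measured, assuming a hypothetical streaming client: `generate_stream` and `fake_stream` are illustrative stand-ins, not APIs from the article, so substitute your own client (OpenAI, vLLM, etc.).

```python
import time

def measure_latency_and_throughput(generate_stream, prompt):
    """Time a streaming generation call.

    `generate_stream` is a hypothetical callable that yields tokens
    one at a time; swap in your real client's streaming API.
    """
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in generate_stream(prompt):
        if first_token_time is None:
            # Time to first token, often reported separately from total latency
            first_token_time = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start              # end-to-end latency
    throughput = n_tokens / total if total > 0 else 0.0  # tokens per second
    return first_token_time, total, throughput

# Toy stand-in that "generates" one token every 10 ms
def fake_stream(prompt):
    for tok in prompt.split():
        time.sleep(0.01)
        yield tok

ttft, latency, tps = measure_latency_and_throughput(fake_stream, "the quick brown fox jumps")
print(f"TTFT: {ttft:.3f}s  latency: {latency:.3f}s  throughput: {tps:.1f} tok/s")
```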

Decoding LLM Inference Math: Your Step-by-Step Guide

Understanding the math behind LLM inference is crucial knowledge for anyone working in LLMOps. The high cost of the GPUs used for LLM inference puts GPU utilization optimization at the top of our priority list, so I’ll go through the process of memory utilization for the LLM […]
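
As a first-order sketch of the memory math in question, the two standard terms are weight memory (parameters × bytes per parameter) and KV-cache memory (2 tensors × layers × heads × head dimension per cached token). The Llama-2-7B-style configuration below is an illustrative assumption, not a figure taken from the article.

```python
def weights_memory_gb(n_params, bytes_per_param=2):
    """Memory to hold the model weights (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """KV cache: 2 (K and V) x layers x heads x head_dim bytes per token."""
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / 1e9

# Illustrative Llama-2-7B-style config: 32 layers, 32 heads, head_dim 128, FP16
print(f"weights:  {weights_memory_gb(7e9):.1f} GB")                          # ~14 GB
print(f"KV cache: {kv_cache_gb(32, 32, 128, seq_len=4096, batch_size=1):.2f} GB")  # ~2.15 GB
```

Note that the KV cache grows linearly with both sequence length and batch size, which is why it, rather than the fixed weight footprint, usually dominates GPU memory planning at high throughput.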