OI Performance Benchmark Technical Review

To evaluate the inference capabilities of a large language model (LLM), we focus on two key metrics: latency and throughput. Latency measures the time it takes for an LLM to generate a response to a user’s prompt. It is a critical indicator of a language model’s speed and significantly impacts a user’s perception of […]
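To make the two metrics concrete, here is a minimal Python sketch (not taken from the article) that measures time-to-first-token latency and tokens-per-second throughput over a streamed response; the `fake_stream` generator is a hypothetical stand-in for a real streaming LLM client.

```python
import time
from typing import Iterable, Tuple

def measure_latency_and_throughput(token_stream: Iterable[str]) -> Tuple[float, float]:
    """Return (time-to-first-token in seconds, tokens per second) for one streamed response."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # latency: time until the first token arrives
        n_tokens += 1
    end = time.perf_counter()
    ttft = (first_token_time or end) - start
    throughput = n_tokens / (end - start) if end > start else 0.0
    return ttft, throughput

if __name__ == "__main__":
    # Hypothetical stand-in for a streaming LLM client; replace with your real token stream.
    def fake_stream():
        for tok in ["Hello", ",", " world", "!"]:
            time.sleep(0.05)  # simulate per-token generation delay
            yield tok

    ttft, tps = measure_latency_and_throughput(fake_stream())
    print(f"time to first token: {ttft:.3f}s, throughput: {tps:.1f} tokens/s")
```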

Models Auto-Scaling with Kubernetes Event-Driven Autoscaling (KEDA) Experiment

This experiment demonstrates an autoscaling approach built on KEDA, Prometheus, and the Prometheus Pushgateway: when the application receives requests, it records active request counts (or active “user requests”) and pushes these metrics to the Prometheus Pushgateway. Prometheus (deployed via the kube-prometheus-stack) then scrapes these metrics, and KEDA uses a Prometheus-based trigger to scale the number of pods according to the load, as in the sketch below. […]
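The application-side half of that pipeline can be sketched in a few lines of Python with the `prometheus_client` library; the metric name, job name, and Pushgateway address below are assumptions for illustration, not values from the experiment.

```python
# Push the current active-request count to the Prometheus Pushgateway so that
# Prometheus can scrape it and KEDA's Prometheus trigger can scale on it.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY_URL = "pushgateway:9091"  # assumed in-cluster service address

registry = CollectorRegistry()
active_requests = Gauge(
    "app_active_requests",
    "Number of requests currently being processed",
    registry=registry,
)

def report_active_requests(count: int) -> None:
    """Record the active-request count and push it to the Pushgateway."""
    active_requests.set(count)
    push_to_gateway(PUSHGATEWAY_URL, job="llm-inference-app", registry=registry)

# Example: called from request middleware whenever a request starts or finishes.
report_active_requests(12)
```

On the cluster side, a KEDA ScaledObject with a Prometheus trigger would query this metric and adjust the pod count accordingly.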

Decoding LLM Inference Math: Your Step-by-Step Guide

Understanding the math behind LLM inference is crucial knowledge for everyone working in LLMOps. The high price of the GPUs used for LLM inference puts GPU utilization optimization at the top of our priority list, so I’ll go through the process of memory utilization for the LLM […]
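As a rough illustration of the kind of arithmetic involved, here is a small Python sketch using assumed Llama-2-7B-like dimensions (7B parameters, 32 layers, hidden size 4096, FP16), not figures from the article, that estimates weight memory and KV-cache memory from the standard formulas.

```python
# Back-of-the-envelope memory estimates for LLM inference (assumed model dimensions).
BYTES_PER_VALUE = 2          # FP16
N_PARAMS = 7e9               # assumed: 7B-parameter model
N_LAYERS = 32                # assumed
HIDDEN_SIZE = 4096           # assumed

def weight_memory_gb() -> float:
    # Weights: one stored value per parameter.
    return N_PARAMS * BYTES_PER_VALUE / 1e9

def kv_cache_memory_gb(batch_size: int, seq_len: int) -> float:
    # KV cache: 2 (K and V) * layers * hidden_size values per token.
    per_token_bytes = 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE
    return batch_size * seq_len * per_token_bytes / 1e9

print(f"weights: ~{weight_memory_gb():.1f} GB")                              # ~14 GB
print(f"KV cache (batch=8, seq=4096): ~{kv_cache_memory_gb(8, 4096):.1f} GB")  # ~17 GB
```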