Infrastructure

Inference

Model inference

The process of running a trained AI model on new inputs to generate predictions or responses — as opposed to training.

Inference is what happens when you actually use a trained AI model: you feed it an input, and it produces an output. Everything before that — collecting data, training, evaluating — is preparation. Inference is the production runtime.

**Training vs inference**

*Training*: adjust model weights based on data and a loss function. Requires large GPU clusters, runs for days or weeks, and happens once (or periodically). *Inference*: use the fixed weights to process a new input. Runs continuously in production and requires far less compute per request than training, but still needs significant hardware at scale.
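The split above can be sketched with a toy one-parameter linear model. Everything here (the model, data, learning rate) is illustrative; the point is that weight updates happen only during training, while inference is a forward pass with frozen weights.

```python
# Training vs inference for a 1-parameter linear model y = w * x.
# Illustrative sketch only; real models have billions of weights.

def train(data, lr=0.01, steps=200):
    """Training: repeatedly adjust the weight to reduce squared error."""
    w = 0.0
    for _ in range(steps):
        for x, y in data:
            pred = w * x
            grad = 2 * (pred - y) * x   # d(loss)/dw for loss = (pred - y)^2
            w -= lr * grad              # weight update: happens only here
    return w

def infer(w, x):
    """Inference: a forward pass with fixed weights, no updates."""
    return w * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
w = train(data)        # expensive, done once
print(infer(w, 10.0))  # cheap, done per request; converges near 20.0
```

In production the `train` step ran weeks ago on a GPU cluster; only `infer` runs when a user sends a request.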

**Hardware for inference**

LLMs require GPUs (Nvidia H100/H200) or specialized AI accelerators (Google TPUs, AWS Inferentia/Trainium) because each forward pass performs billions of floating-point multiply-accumulates. A 70B-parameter model at 16-bit precision needs roughly 140 GB for the weights alone, so it spans multiple high-VRAM GPUs even at inference.
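The VRAM requirement follows directly from parameter count times bytes per parameter. A minimal back-of-the-envelope sketch (ignoring KV-cache, activations, and framework overhead, which add more on top):

```python
# Rough VRAM needed just to hold model weights, by precision.
# Ignores KV-cache, activations, and runtime overhead.

def weight_memory_gb(params_billion, bytes_per_param):
    # params_billion * 1e9 params * bytes each, converted back to GB
    return params_billion * bytes_per_param

for bytes_per_param, fmt in [(2, "FP16"), (1, "INT8"), (0.5, "INT4")]:
    gb = weight_memory_gb(70, bytes_per_param)
    print(f"70B @ {fmt}: ~{gb:.0f} GB")  # FP16 -> ~140 GB: multiple 80 GB GPUs
```

At FP16 a 70B model exceeds any single 80 GB GPU, which is why quantization (below) or multi-GPU sharding is needed.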

**Latency vs throughput trade-off**

*Latency*: Time for a request to produce its output, usually split into time to first token (TTFT) and time per subsequent token. Critical for interactive chat (users feel every millisecond before the response starts streaming). *Throughput*: Total tokens generated per second across many concurrent requests. Critical for batch jobs and serving cost.
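A simple model makes the trade-off concrete. The TTFT, per-token time, and batch size below are made-up illustrative numbers, and the model assumes per-token time stays flat as the batch grows (in practice it degrades past some batch size):

```python
# Toy latency/throughput model; all timing numbers are illustrative.

def request_latency(ttft_s, time_per_token_s, n_tokens):
    """Perceived latency: time to first token + time for the remaining tokens."""
    return ttft_s + time_per_token_s * (n_tokens - 1)

def batch_throughput(time_per_token_s, batch_size):
    """Aggregate tokens/sec, assuming per-token time is flat across the batch."""
    return batch_size / time_per_token_s

print(request_latency(0.3, 0.02, 500))  # one 500-token chat reply: ~10.3 s
print(batch_throughput(0.02, 32))       # 32 concurrent streams: ~1600 tok/s
```

Batching raises aggregate throughput without helping any single user's latency, which is why interactive and batch workloads are often served with different configurations.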

Optimizations target different points on this trade-off. The KV-cache avoids recomputing attention over already-generated tokens; batching amortizes GPU work across concurrent requests to raise throughput; speculative decoding cuts per-request latency by drafting several tokens with a small model and verifying them in one pass. Quantization (reducing weight precision from FP16 to INT8 or INT4) reduces memory use and speeds up inference with modest quality loss.
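The core idea of quantization can be shown in a few lines. This is a minimal symmetric absmax-scaling sketch in plain Python, illustrating why precision can drop with only small reconstruction error; production schemes (per-channel scales, outlier handling) are more involved:

```python
# Symmetric INT8 quantization via absmax scaling; illustrative only.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0  # map largest weight to +/-127
    q = [round(w / scale) for w in weights]       # each value now fits in 1 byte
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.01]                # stand-in FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err)  # rounding error is bounded by scale / 2
```

Each weight shrinks from 4 bytes (FP32) or 2 bytes (FP16) to 1 byte, and the worst-case error per weight is half a quantization step.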

**Cost of inference**

Running GPT-4 via the OpenAI API costs roughly $0.01–$0.03 per 1,000 tokens. Self-hosting a model on your own GPUs shifts cost from per-call variable to fixed infrastructure. The break-even depends on volume — typically self-hosting makes sense at millions of tokens per day.
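The break-even point is simple arithmetic. The prices below are illustrative assumptions (a mid-range API rate and a hypothetical daily GPU rental cost), not current quotes:

```python
# Break-even sketch: per-token API pricing vs fixed self-hosted GPU cost.
# Both prices are illustrative assumptions, not current quotes.

API_COST_PER_1K_TOKENS = 0.02  # dollars; midpoint of the $0.01-$0.03 range
GPU_COST_PER_DAY = 70.0        # dollars; e.g. renting one high-end GPU server

def api_cost_per_day(tokens_per_day):
    return tokens_per_day / 1000 * API_COST_PER_1K_TOKENS

break_even = GPU_COST_PER_DAY / API_COST_PER_1K_TOKENS * 1000
print(f"break-even: {break_even:,.0f} tokens/day")  # 3,500,000 tokens/day
```

Under these assumptions the fixed GPU pays for itself at 3.5 million tokens per day, consistent with the "millions of tokens per day" rule of thumb; real break-evens shift with GPU utilization, model size, and engineering overhead.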

**Inference providers**

OpenAI, Anthropic, Google (Vertex AI), AWS Bedrock, Groq, Together AI, and others offer managed inference. Groq and Cerebras specialize in high-speed inference via custom hardware.