Glossary

Multi-LoRA inference

Serving many LoRA adapters concurrently on a single base model, with the runtime swapping the right adapter in per request rather than loading separate fine-tuned copies.

Layer: Runtime (also: Training, Weights). Also known as: multi-lora, multi-lora serving, lora serving, multi-adapter inference.

A serving pattern where a single base model in GPU memory is paired with many small LoRA adapters that the runtime swaps in per request. LoRA is a parameter-efficient fine-tuning method that injects small low-rank adapter matrices into a frozen base model, training a tiny fraction of the weights instead of the full model. The base weights load once; each adapter is a set of low-rank matrix pairs, one pair per adapted weight matrix (often a few tens of MB in total), so serving N fine-tuned variants costs roughly the memory of one base model plus N small deltas, instead of N full model copies.
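The memory claim above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the model shape (7B parameters, hidden size 4096, 32 layers, two adapted projection matrices per layer, rank 16, 2 bytes per parameter) is an assumed example, not a figure from any particular deployment.

```python
def lora_adapter_bytes(d_in, d_out, rank, matrices_per_layer, n_layers,
                       bytes_per_param=2):
    """Size of one LoRA adapter (hypothetical helper).

    Each adapted d_out x d_in weight gets a pair of low-rank matrices:
    A (rank x d_in) and B (d_out x rank), i.e. rank * (d_in + d_out) params.
    """
    params_per_matrix = rank * (d_in + d_out)
    return params_per_matrix * matrices_per_layer * n_layers * bytes_per_param


def serving_bytes(base_params, n_variants, adapter_bytes, bytes_per_param=2):
    """Return (shared-base cost, full-copy cost) in bytes for n_variants."""
    base_bytes = base_params * bytes_per_param
    shared = base_bytes + n_variants * adapter_bytes  # one base + N deltas
    copies = n_variants * base_bytes                  # N full model copies
    return shared, copies


# Assumed example shape; adjust to the model actually being served.
adapter = lora_adapter_bytes(d_in=4096, d_out=4096, rank=16,
                             matrices_per_layer=2, n_layers=32)
shared, copies = serving_bytes(base_params=7_000_000_000,
                               n_variants=100, adapter_bytes=adapter)
print(f"adapter: {adapter / 2**20:.1f} MiB")
print(f"shared base + 100 adapters: {shared / 1e9:.1f} GB")
print(f"100 full copies: {copies / 1e9:.1f} GB")
```

Under these assumptions one adapter is 16 MiB, and 100 variants cost about 15.7 GB of weights on a shared base versus 1,400 GB as full copies. KV cache and activation memory are not counted here.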

The mechanism matters for two production patterns. Tenant-specific fine-tunes: an API serves one base model plus one LoRA per customer, routing each request to the right adapter; the alternative (a separate full model per tenant) does not fit in any reasonable GPU budget once tenants pass single digits. Task-specific adapters: one base, multiple adapters specialized for code, summarization, or domain knowledge, selected per call.
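Both patterns reduce to the same step: map each incoming request to an adapter name before it reaches the runtime. A minimal sketch of that lookup follows. The `Request` and `AdapterRouter` types, the adapter names, and the rule that a tenant adapter takes precedence over a task adapter are all assumptions made for illustration, not part of any specific runtime's API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Request:
    prompt: str
    tenant_id: Optional[str] = None  # set for tenant-specific fine-tunes
    task: Optional[str] = None       # set for task-specific adapters


class AdapterRouter:
    """Hypothetical router: picks the adapter name for a request."""

    def __init__(self, tenant_adapters: Dict[str, str],
                 task_adapters: Dict[str, str]):
        self.tenant_adapters = tenant_adapters
        self.task_adapters = task_adapters

    def resolve(self, request: Request) -> Optional[str]:
        # Assumed precedence: a tenant's own fine-tune wins over a task adapter.
        if request.tenant_id in self.tenant_adapters:
            return self.tenant_adapters[request.tenant_id]
        if request.task in self.task_adapters:
            return self.task_adapters[request.task]
        return None  # no adapter: serve with the plain base model


router = AdapterRouter(
    tenant_adapters={"acme": "acme-support-v2"},
    task_adapters={"code": "code-lora", "summarize": "summarize-lora"},
)
print(router.resolve(Request("Fix this bug", tenant_id="acme", task="code")))
print(router.resolve(Request("Summarize this", task="summarize")))
print(router.resolve(Request("Hello")))
```

The resolved name would then be passed along with the request to whichever runtime is serving the base model; returning `None` keeps a fallback path to the unadapted base.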

vLLM (an open-source inference engine introduced by UC Berkeley in 2023, built around PagedAttention), SGLang (an open inference engine from the LMSYS team featuring RadixAttention for prefix sharing), and TensorRT-LLM all support multi-LoRA inference. The S-LoRA paper (Sheng et al., 2023) showed a serving system handling thousands of concurrent adapters with throughput within a small factor of plain-base serving when batched well; the technique is now standard in open production runtimes. The tradeoff is small: per-token decode latency rises by a single-digit percentage because each step does a few extra small matmuls against the active adapter.
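The "few extra small matmuls" can be made concrete with a toy forward pass. For one adapted weight W, a LoRA adapter adds scale * B(Ax) to the base output Wx, where A and B are the low-rank pair. The sketch below uses plain Python lists so it runs anywhere; the function names are hypothetical, and the per-request loop stands in for the batched kernels a production runtime would use.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, x, adapter=None):
    """Base output W @ x, plus the low-rank delta if an adapter is active."""
    y = matvec(W, x)                 # shared base matmul, same for every request
    if adapter is None:
        return y
    A, B, scale = adapter            # A: rank x d_in, B: d_out x rank
    delta = matvec(B, matvec(A, x))  # two small matmuls through the rank bottleneck
    return [yi + scale * di for yi, di in zip(y, delta)]


def batched_forward(W, batch, adapters):
    """Each request in the batch names its own adapter (or None for base)."""
    return [lora_forward(W, x, adapters.get(name)) for x, name in batch]


W = [[1, 0], [0, 1]]                            # toy 2x2 base weight
adapters = {"tenant-a": ([[1, 1]], [[1], [0]], 0.5)}  # rank-1 adapter
batch = [([1, 2], "tenant-a"), ([1, 2], None)]
print(batched_forward(W, batch, adapters))
```

The base weight W is never modified, which is what lets requests for different adapters share one batch: only the small delta term differs per request.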
