QLoRA

A fine-tuning method that combines 4-bit quantization of the frozen base model with LoRA adapters, making large-model fine-tuning fit on a single consumer GPU.

Categories: Training, Weights

An extension of LoRA that stores the frozen base model in 4-bit precision using NF4 (NormalFloat4, a quantization scheme tuned for normally distributed weights) and double-quantizes the quantization constants themselves to save more memory. LoRA adapters kept in higher precision (16-bit in the original paper) sit alongside the quantized base and are the only weights updated during training; the base weights stay frozen.
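The block-wise 4-bit idea can be sketched in plain Python. This is an illustration under stated assumptions, not the NF4 implementation: the codebook below is a simplified stand-in built from evenly spaced normal quantiles (the real NF4 table is constructed differently and includes an exact zero), the block is 8 weights long for readability, and the names `make_codebook`, `quantize_block`, and `dequantize_block` are hypothetical. Double quantization, which would compress the per-block `absmax` constants in turn, is not shown.

```python
from statistics import NormalDist


def make_codebook(levels=16):
    """Build `levels` values from standard-normal quantiles, scaled to [-1, 1].

    Simplified stand-in for illustration; not the exact NF4 table.
    """
    normal = NormalDist()
    quantiles = [normal.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    peak = max(abs(q) for q in quantiles)
    return [q / peak for q in quantiles]


def quantize_block(block, codebook):
    """Return (4-bit codes, absmax scale) for one block of weights."""
    # Fall back to 1.0 so an all-zero block does not divide by zero.
    absmax = max(abs(w) for w in block) or 1.0
    codes = []
    for w in block:
        scaled = w / absmax  # now in [-1, 1]
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - scaled))
        codes.append(nearest)
    return codes, absmax


def dequantize_block(codes, absmax, codebook):
    """Map codes back to approximate weights using the stored scale."""
    return [codebook[c] * absmax for c in codes]


codebook = make_codebook()
weights = [0.02, -0.31, 0.11, 0.47, -0.05, 0.0, 0.26, -0.18]
codes, absmax = quantize_block(weights, codebook)
restored = dequantize_block(codes, absmax, codebook)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:2d} -> {r:+.3f}")
```

Each weight is stored as a code in the range 0 to 15 plus one shared scale per block, which is where the memory saving comes from; the scale is the "quantization constant" that double quantization compresses further.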

The combined memory footprint is small enough that 65B-parameter base models fit on a single 48GB GPU and 7B models fit on a 24GB consumer card. The original paper reported that QLoRA matches 16-bit full fine-tuning quality on its evaluation benchmarks, and that its best QLoRA-tuned models came close to ChatGPT on the Vicuna benchmark.
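A rough back-of-the-envelope check of those figures, counting base weights only. The block sizes are assumptions taken from the setup described in the original paper (64 weights per block, 8-bit first-level constants, and one 32-bit second-level constant per 256 blocks); adapters, activations, and optimizer state are deliberately left out, so real usage is higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the base weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


BLOCK_SIZE = 64            # weights sharing one quantization constant (assumed)
SECOND_LEVEL_BLOCK = 256   # constants sharing one 32-bit second-level constant (assumed)

# 4 bits per weight, plus the amortized cost of the double-quantized constants.
nf4_bits = 4 + 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

for name, n_params in [("65B", 65e9), ("7B", 7e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, nf4_bits)
    print(f"{name}: 16-bit {fp16:.1f} GB, 4-bit with double quantization {nf4:.1f} GB")
```

Under these assumptions the 65B base drops from well above 48GB in 16-bit to comfortably below it in 4-bit, which leaves headroom on a 48GB card for the adapters and training state.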

In practice QLoRA is the default for hobbyist fine-tuning. Unsloth, Axolotl, and TRL all expose it as a configuration flag. The trade-off versus plain LoRA is slower training (dequantizing the base weights during each forward and backward pass adds overhead) in exchange for fitting bigger base models on smaller hardware.
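For orientation, here is roughly what such a configuration looks like when written directly against Hugging Face `transformers` and `peft`, assuming those libraries plus `torch` and `bitsandbytes` are installed. It is a sketch, not the setup any of the tools above necessarily generate: the wrapper function `build_qlora_configs` is hypothetical, the rank, alpha, and dropout values are placeholders, and the `target_modules` names assume a Llama-style architecture.

```python
def build_qlora_configs():
    """Return (quantization config, LoRA config) for a QLoRA-style run.

    Imports are local so the sketch can be read and loaded without the
    libraries installed; calling the function requires them.
    """
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store the frozen base in 4-bit
        bnb_4bit_quant_type="nf4",              # NF4 rather than plain FP4
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    )

    lora_config = LoraConfig(
        r=16,                                   # adapter rank (placeholder value)
        lora_alpha=32,                          # scaling factor (placeholder value)
        lora_dropout=0.05,                      # placeholder value
        target_modules=["q_proj", "v_proj"],    # assumes Llama-style module names
        task_type="CAUSAL_LM",
    )
    return quant_config, lora_config
```

The quantization config would be passed as `quantization_config` when loading the base model with `from_pretrained`, and the LoRA config applied with `peft.get_peft_model`. The `bnb_4bit_compute_dtype` setting is where the dequantization overhead mentioned above comes in: weights are stored in 4-bit but expanded to 16-bit each time they are used.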
