Every Concept You Need Before Fine-Tuning an LLM

A practitioner’s reference — LoRA, QLoRA, batch size, loss curves, and output formats explained. This is the concepts companion to I Fine-Tuned Gemma 4 to Detect Code Vulnerabilities — Here’s What Happened.

Most engineering teams use LLMs through APIs — prompt in, response out. The models themselves are a black box. Fine-tuning opens that box: instead of crafting better prompts, you adjust the model’s weights directly. I recently ran my first fine-tuning experiment and spent more time understanding the concepts than writing the code. This post is the reference guide I wish existed when I started.

What is fine-tuning, and why not just prompt better?

Zero-shot prompting means giving an LLM instructions, with no worked examples, and hoping it follows them. It works surprisingly well for general tasks. But when you need a model to perform one specific task consistently (same format, same decision boundary, every time), fine-tuning has an edge.

You show the model thousands of input/output examples, and it adjusts its internal weights to reproduce that pattern. The result is a smaller, specialized model that does one thing reliably, versus a large general model that needs careful prompting and still varies.

What’s SFT?

SFT stands for Supervised Fine-Tuning. “Supervised” means you provide the right answers — input/output pairs. The model sees a code snippet (input) and the correct verdict like “VULNERABLE — CWE-120” (output), repeated thousands of times. It adjusts its weights to predict similar outputs for similar inputs.

This is different from RLHF (reinforcement learning from human feedback), where the model gets a score for how good its answer was, or unsupervised pre-training where the model just reads text with no labels. The SFTTrainer from the TRL (Transformer Reinforcement Learning) library — HuggingFace’s toolkit for fine-tuning and aligning LLMs — handles the mechanics: tokenization, masking user messages so the model only learns to predict assistant responses, and running the training loop.

When to use SFT vs alternatives: SFT is the right choice when you have labeled data (input/output pairs) and want the model to produce structured, explainable responses. If you only needed a binary score without explanations, a classification head on top of the model would be simpler — though in my experiment I chose SFT because I wanted the model to also produce reasoning and CWE classifications alongside the verdict, not just a bare label. If you wanted to refine response quality after SFT, DPO (Direct Preference Optimization) takes pairs of good/bad responses and teaches the model to prefer the better one — that’s the SFT → DPO pipeline most production models use. RLHF goes further with a full reward model and reinforcement learning, but that’s overkill unless “good” is subjective and hard to label. For most fine-tuning projects, SFT is where you start.

What are LoRA and QLoRA?

Full fine-tuning updates all of a model’s parameters. For a model like Gemma 4 E4B, that’s 8 billion numbers, and you’d need roughly 100GB of GPU memory. The breakdown: 16GB for weights in 16-bit. 16GB for gradients: one value per weight that tells the optimizer which direction and how steeply the loss changes with respect to that weight (the optimizer then decides how far to actually move). 64GB for Adam optimizer states (it tracks momentum and variance for every weight, both in 32-bit). That is 96GB before activations. Even with memory-efficient optimizers, you’re looking at 60GB minimum. Not practical on most hardware.
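The breakdown above is plain arithmetic, so it can be checked in a few lines. This is a back-of-the-envelope sketch: the function name is made up for illustration, it uses decimal gigabytes (1GB = 10^9 bytes), and it leaves activations out entirely.

```python
def full_finetune_memory_gb(n_params: int) -> dict:
    """Rough memory estimate for full fine-tuning with Adam, activations excluded."""
    gb = 1e9  # decimal gigabytes, matching the round numbers in the text
    weights = n_params * 2 / gb        # 16-bit weights: 2 bytes each
    gradients = n_params * 2 / gb      # one 16-bit gradient per weight
    optimizer = n_params * 2 * 4 / gb  # Adam: momentum + variance, 32-bit each
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_memory_gb(8_000_000_000)
for name, size in estimate.items():
    print(f"{name:>10}: {size:.0f} GB")
```

The optimizer line dominates because Adam keeps two 32-bit values for every trainable weight, which is the cost LoRA attacks by shrinking the number of trainable weights.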

LoRA (Low-Rank Adaptation) takes a different approach. You freeze the entire base model and inject small trainable matrices into specific layers. These are called adapters. Think of it as: you have a textbook (the base model). Instead of rewriting every page, you add sticky notes to the pages that matter. The textbook stays the same; the sticky notes customize it for your task.

In my experiment, I trained 42 million parameters out of 8 billion, just 0.53% of the model. Where does that number come from? For each target layer, LoRA adds two small matrices instead of updating the full weight matrix. Say an attention layer has a weight matrix of size 3072 × 3072 (~9.4 million parameters). LoRA leaves that matrix frozen and trains two tiny matrices alongside it:

Original weight:  3072 × 3072 = 9,437,184 parameters
LoRA adapter A:   16 × 3072   = 49,152 parameters
LoRA adapter B:   3072 × 16   = 49,152 parameters
Total per module: 98,304 parameters (vs 9.4 million)

The 16 is the LoRA rank, our chosen adapter size. Multiply across 7 target modules (q, k, v, o, gate, up, down) per layer, across all 42 transformer layers, and you get ~42 million trainable parameters. (The 3072 × 3072 example is a simplification: if every module had that shape the total would be about 29 million, so the real count depends on each module’s actual dimensions.) Increase the rank to 32 and it doubles to ~84M. Drop to rank 8 and it halves to ~21M. The rank is your dial between “learn more” and “use less memory.”
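The per-module count and the effect of rank can be reproduced directly. This is a sketch under one loud assumption: every target module is treated as 3072 × 3072, which is the illustrative shape used above and not the model’s real layout. The function name is hypothetical.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is (rank x d_in), B is (d_out x rank)
    return rank * d_in + d_out * rank


per_module = lora_params(3072, 3072, rank=16)
print(per_module)  # 98304, matching the per-module total above

# Simplifying assumption: all 7 target modules in all 42 layers are 3072 x 3072.
# Real models use other shapes for some projections, so this is only a rough figure.
simplified_total = per_module * 7 * 42
print(f"{simplified_total:,}")  # 28,901,376

# Parameter count scales linearly with rank.
for rank in (8, 16, 32):
    print(rank, lora_params(3072, 3072, rank))
```

Under the square-matrix assumption the total lands near 29 million. The gap to the ~42 million reported for the real run would come from modules whose true dimensions differ, which this sketch does not model.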

Because the base model is frozen, gradients and optimizer states are only computed for the adapter — 42 million parameters, not 8 billion. That’s why the memory drops dramatically.

The frozen base model still sits in GPU memory though. With standard LoRA, it stays at full 16-bit precision: that’s ~16GB just for weights you’re not even changing. That’s where QLoRA comes in.

QLoRA is LoRA with exactly one change: compress the frozen base model to 4-bit when loading it into memory. The adapters, the training loop, the gradients, the optimizer — all identical to LoRA. The only difference is how much space the frozen base occupies in VRAM. In code, it’s a single flag: load_in_4bit=True. Set it to False and you’re doing standard LoRA. Set it to True and you’re doing QLoRA.

But that one flag triggers more than simple compression. Under the hood, the bitsandbytes library applies several innovations from the QLoRA paper (Dettmers et al., 2023). The key one: it uses a quantization scheme called NF4 (4-bit NormalFloat) that’s specifically designed for neural network weights. Instead of spacing the 4-bit quantization levels uniformly (which loses a lot), it places them where the weight values are most dense. This preserves 95–98% of model quality despite the 4x compression.

To be explicit: the adapters you’re training still run in full 16-bit precision; only the frozen base gets compressed. The base weights shrink from ~16GB to ~2.5GB, and the total setup fits comfortably on a single GPU. Everything else (LoRA rank, target modules, learning rate, batch size, gradient flow) stays the same.

Here’s the memory contrast:

	Full fine-tuning	LoRA	QLoRA
Base weights	16GB (8B × 16-bit)	16GB (8B × 16-bit, frozen)	2.5GB (8B × 4-bit, frozen)
Gradients	16GB (8B params)	84MB (42M adapter params)	84MB (42M adapter params)
Optimizer states	64GB (8B × 2 × 32-bit)	336MB (42M × 2 × 32-bit)	336MB (42M × 2 × 32-bit)
Total (+ activations)	~100GB	~18GB	~10GB

Notice that LoRA and QLoRA have identical adapter sizes, gradient sizes, and optimizer sizes. The only row that changes is base weights — 16GB vs 2.5GB. That’s the entire difference.

Naive 4-bit compression would lose meaningful quality. NF4 is what makes QLoRA work — it’s the reason that one flag doesn’t tank your results.

Why load the model in 4-bit? Won’t that hurt accuracy?

A model’s weights are just numbers — millions of them. Each number can be stored at different precisions:

16-bit: higher precision, roughly three significant digits, like 3.14. Takes more space.
4-bit: only 16 possible values per weight, so the same number might be stored as roughly 3. Takes 4x less space.

The intuition says 4-bit should be much worse. And with naive rounding, it would be. But NF4 (the smart compression method used by QLoRA) places quantization levels where the weight values actually cluster rather than spacing them evenly. That’s why the research shows 95–98% quality retention.
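To make “places quantization levels where the weights cluster” concrete, here is a simplified illustration using only the standard library. It is not the actual NF4 codebook from bitsandbytes (which is constructed differently, for example so that zero is exactly representable). It only shows the underlying idea: if weights are roughly normally distributed, levels taken from normal quantiles bunch up near zero. Both function names are made up.

```python
from statistics import NormalDist


def quantile_levels(n_levels: int = 16) -> list:
    """Place n_levels at evenly spaced quantiles of a standard normal, scaled to [-1, 1]."""
    normal = NormalDist()
    raw = [normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def uniform_levels(n_levels: int = 16) -> list:
    """Evenly spaced levels over [-1, 1], as naive rounding would use."""
    step = 2 / (n_levels - 1)
    return [-1 + i * step for i in range(n_levels)]


nf_like = quantile_levels()
uniform = uniform_levels()

# Gap between the two levels closest to zero, versus the two outermost levels.
center_gap = nf_like[8] - nf_like[7]
edge_gap = nf_like[15] - nf_like[14]
print(f"quantile-based: center gap {center_gap:.3f}, edge gap {edge_gap:.3f}")
print(f"uniform:        every gap  {uniform[1] - uniform[0]:.3f}")
```

The quantile-based levels are packed more tightly than the uniform grid near zero, where most weights live, and more loosely in the tails, where few do.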

The other key insight: we’re not training those compressed weights. They’re frozen. The LoRA adapters running on top are in full 16-bit precision and can actually compensate for the small precision loss in the base. So by the time training is done, the fine-tuned model often performs nearly identically to one trained from a 16-bit base.

If the base model is in 4-bit, how does training happen in 16-bit?

This is the most common confusion about QLoRA. The answer: the 4-bit is a storage format, not a compute format. The math always happens in 16-bit.

Here’s what happens in a single forward pass through one layer:

Input → [Base layer weights: stored in 4-bit, dequantized to 16-bit on the fly]
       → output_base (16-bit)

Input → [LoRA adapter weights: stored and computed in 16-bit]
       → output_adapter (16-bit)

Final output = output_base + output_adapter

The base weights sit in GPU memory compressed to 4-bit. But when the model needs to do actual matrix multiplication, bitsandbytes dequantizes them to 16-bit temporarily for that one computation, then discards the 16-bit version. The 4-bit copy stays in memory as the permanent stored format — the 16-bit version only exists for a split second during the calculation.

The LoRA adapter is a separate pair of small matrices that runs entirely in 16-bit. Its output gets added to the base layer’s output. During backpropagation, gradients still pass back through the frozen base layers to reach the adapters in earlier layers, but weight gradients are computed and stored only for the adapters (because the base is frozen), so 16-bit precision is maintained end-to-end for everything that’s actually learning.

So it’s not “training 4-bit weights in 16-bit.” It’s:

Storing base weights in 4-bit (saves memory)
Computing with them in 16-bit (dequantize on the fly, preserves quality)
Training only the adapter, which was always 16-bit

The 4-bit is purely a storage compression. The math always happens in 16-bit. That’s why NF4 is designed the way it is — optimized for dequantizing back to 16-bit with minimal information loss.
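The store-in-4-bit, compute-in-higher-precision flow can be sketched in plain Python. Everything here is a toy stand-in: the 16-entry level table is a uniform grid and not the real NF4 codebook, the helper names are invented, Python floats stand in for 16-bit values, and the LoRA scaling factor is omitted for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dequantize(codes, levels, scale):
    """Turn stored 4-bit codes back into ordinary floats for the computation."""
    return [[levels[c] * scale for c in row] for row in codes]


def lora_layer(x, codes, levels, scale, lora_a, lora_b):
    base_weights = dequantize(codes, levels, scale)     # temporary, discarded after use
    output_base = matvec(base_weights, x)
    output_adapter = matvec(lora_b, matvec(lora_a, x))  # B(Ax), rank-sized bottleneck
    return [b + a for b, a in zip(output_base, output_adapter)]


# Toy sizes: a 2x2 base layer, a rank-1 adapter, and a made-up 16-entry level table.
levels = [i / 15 * 2 - 1 for i in range(16)]  # stand-in for the real 4-bit codebook
codes = [[0, 15], [15, 0]]                    # each entry is an integer in 0..15
scale = 0.5                                   # per-block scaling factor
lora_a = [[0.3, -0.2]]                        # (rank x d_in), randomly initialised in practice
lora_b = [[0.0], [0.0]]                       # (d_out x rank), starts at zero

x = [1.0, 2.0]
result = lora_layer(x, codes, levels, scale, lora_a, lora_b)
print(result)
```

Because B starts at zero, which is the standard LoRA initialisation, the layer’s output before any training equals the base output alone. Training then moves A and B while the stored codes never change.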

What does gradient checkpointing do?

During training, the GPU remembers the output of every layer so it can calculate gradients during backpropagation (the “learning” pass). For a model with dozens of layers, that eats a ton of VRAM — often more than the model weights themselves.

Gradient checkpointing says: “Don’t remember everything. Throw away most intermediate outputs, and recompute them when needed during backpropagation.” You trade compute time (recalculating) for memory savings (not storing it all).
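The trade can be shown on a toy network of scalar layers. This is a minimal sketch of the idea, not how PyTorch or Unsloth implement it: each “layer” is tanh(w · x), the function names are invented, and “memory” is simply the number of stored layer inputs.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of tanh(w * x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x):
    """Standard backprop: keep every layer input, then walk backwards."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, saved in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, saved)
    return grad, len(inputs)


def grad_checkpointed(weights, x, every):
    """Keep only every `every`-th input; recompute the rest segment by segment."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad = 1.0
    peak_stored = 0
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        # Recompute this segment's inputs from its checkpoint.
        inputs, h = [], checkpoints[start]
        for w in segment:
            inputs.append(h)
            h = layer(w, h)
        # The first recomputed input is the checkpoint itself, so count it once.
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs) - 1)
        for w, saved in zip(reversed(segment), reversed(inputs)):
            grad *= layer_grad(w, saved)
    return grad, peak_stored


weights = [0.8 + 0.05 * i for i in range(16)]
full_grad, full_stored = grad_store_all(weights, 0.5)
ckpt_grad, ckpt_stored = grad_checkpointed(weights, 0.5, every=4)
print(full_stored, ckpt_stored)  # 16 7
```

Both versions produce the same gradient. The checkpointed one holds fewer values at any moment and pays for it by running each segment’s forward pass a second time.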

Libraries like Unsloth offer custom implementations (use_gradient_checkpointing="unsloth") that are smarter about which layers to save versus recompute, saving more memory with less speed penalty than PyTorch’s default.

The three memory tricks work together:

4-bit loading: shrinks model weights (16GB → 2.5GB)
Gradient checkpointing: shrinks stored activations
LoRA: only trains ~0.5% of parameters, so optimizer states are tiny

All three combined make it possible to fine-tune a multi-billion parameter model on a single GPU.

What is learning rate?

The learning rate controls how big a step the optimizer takes on each weight update. After the model processes a batch and computes gradients, the learning rate determines how far the weights actually move in the direction those gradients suggest.

Too high and the model overshoots — loss jumps around erratically instead of decreasing. Too low and the model barely moves — loss flatlines even though the model hasn’t converged. A common default for LoRA fine-tuning is 2e-4 (0.0002), which works well as a starting point. If your loss is oscillating wildly, try halving it. If your loss isn’t moving, try doubling it.
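The three regimes show up even on the simplest possible problem. This toy sketch runs plain gradient descent on loss(w) = w², so the learning rates are only meaningful for this example and say nothing about good values for LoRA; the function name is made up.

```python
def descend(lr, steps=20, w=1.0):
    """Gradient descent on loss(w) = w**2, whose gradient is 2 * w."""
    history = [w * w]
    for _ in range(steps):
        w = w - lr * 2 * w
        history.append(w * w)
    return history


for lr in (1.1, 0.3, 0.001):
    losses = descend(lr)
    print(f"lr={lr}: start {losses[0]:.3f}, end {losses[-1]:.3g}")
```

With the largest rate every step overshoots the minimum and the loss grows, with the smallest the loss has barely moved after 20 steps, and the middle value converges quickly.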

What are batch size and gradient accumulation?

Batch size = how many samples the GPU processes at once. Each sample sits in VRAM simultaneously. Bigger batch = more VRAM usage but faster training.

Gradient accumulation = how many batches to stack up before updating the weights. With grad_accum=8, the GPU processes 8 mini-batches one at a time, adds up the gradients, then makes one combined weight update.

The math: batch_size × grad_accum = effective batch size

Both of these give an effective batch of 8, but use memory differently:

batch_size=8, grad_accum=1 — fast (8 samples in parallel) but needs more VRAM
batch_size=1, grad_accum=8 — slow (1 sample at a time, 8 sequential passes) but uses minimal VRAM

The model learns essentially the same thing either way: as long as the loss is averaged over the full effective batch, the weight updates are mathematically equivalent. You’re trading speed for memory.
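The equivalence can be checked on a one-parameter regression. This is a sketch with made-up data and a hypothetical helper. It assumes every sample contributes equally to the loss; with variable-length sequences and token-averaged losses the normalisation needs more care.

```python
def mean_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
           (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5

# batch_size=8, grad_accum=1: one pass over all 8 samples.
big_batch_grad = mean_gradient(w, samples)

# batch_size=1, grad_accum=8: 8 passes, each scaled by 1/8 and summed.
accum_steps = 8
accumulated_grad = 0.0
for sample in samples:
    accumulated_grad += mean_gradient(w, [sample]) / accum_steps

print(big_batch_grad, accumulated_grad)
```

The two numbers agree up to floating-point rounding. The division by accum_steps is what makes the accumulated sum an average and not a gradient eight times too large.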

What do training loss and validation loss mean?

Both measure how surprised the model is by the correct answer. The model reads an input, predicts the next token in the expected response, and the loss reflects how wrong those predictions are. Lower = better.

Training loss: measured on data the model is learning from. It will usually keep going down as training continues, because the model can memorize these examples.
Validation loss: measured on data the model has never trained on. This is the reality check.

The relationship matters:

Both going down — the model is learning and generalizing. Good.
Training going down, validation going up — overfitting. The model is memorizing rather than learning patterns.
Both stuck — the model isn’t learning. Learning rate may be too low.

Always watch the validation loss to decide when to stop training. Don’t trust epoch count defaults from tutorials — your data and model will tell you the right answer.

Why does the loss oscillate step-to-step? If you look at the raw (unsmoothed) loss curve, it won’t decrease in a clean line; it zigzags. This is normal. Some of it is ordinary sampling variance, because every batch is a different small slice of the data, and label noise in your dataset amplifies it. Each training batch samples a different mix of correctly and incorrectly labeled data. A batch that happens to contain mostly clean, correctly labeled examples gives the model a consistent gradient signal, and loss drops. The next batch might contain several mislabeled samples, producing contradictory gradients, and loss spikes. With a dataset like DiverseVul (~60% label accuracy for the vulnerable class), these contradictions happen frequently, and the zigzag is pronounced.

Three things control how spiky the curve looks. Batch size: smaller batches sample fewer examples per step, so the label noise ratio varies more between batches — more oscillation. Learning rate: higher values amplify the effect of noisy gradients, making each spike bigger. Data quality: the noisier the labels, the more batches disagree with each other on what the model should learn. Increasing batch size smooths the curve cosmetically, but doesn’t fix the underlying problem — the model is still receiving contradictory supervision from mislabeled data.
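Dashboards usually show a smoothed version of the raw curve. One common way to produce it is an exponential moving average, sketched here on invented loss values. The series is synthetic (a slow downward trend with alternating spikes) and both helper names are hypothetical.

```python
def ema(values, alpha=0.1):
    """Exponential moving average; smaller alpha means heavier smoothing."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def biggest_jump(series):
    """Largest absolute change between consecutive points."""
    return max(abs(b - a) for a, b in zip(series, series[1:]))


# Made-up per-step losses: a slow downward trend with alternating spikes.
raw = [2.0 - 0.02 * step + (0.3 if step % 2 else -0.3) for step in range(50)]
smooth = ema(raw)

print(f"largest step-to-step jump: raw {biggest_jump(raw):.2f}, "
      f"smoothed {biggest_jump(smooth):.2f}")
```

Like a bigger batch, smoothing only changes how the curve looks. The underlying per-step losses, and whatever label noise drives them, are untouched.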

The validation loss plateau is the real signal here. When it flatlines while training loss keeps dropping, the model has learned everything the clean labels can teach. Further training just memorizes the noise — which is why the growing gap between training and validation loss is the clearest sign to stop.
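A stopping rule based on that plateau can be as simple as “stop once validation loss has not improved for a couple of evaluations.” The sketch below uses invented loss values and a hypothetical function name; training frameworks ship their own early-stopping callbacks, and this only illustrates the logic.

```python
def best_stopping_point(val_losses, patience=2):
    """Return the index of the best validation loss, stopping once it has
    failed to improve for `patience` consecutive evaluations."""
    best_index = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_index]:
            best_index = i
        elif i - best_index >= patience:
            break
    return best_index


# Made-up numbers for illustration: validation improves, plateaus, then drifts up.
val_losses = [1.95, 1.50, 1.30, 1.28, 1.29, 1.33, 1.40]

stop_at = best_stopping_point(val_losses)
print(stop_at, val_losses[stop_at])  # 3 1.28
```

The checkpoint saved at the returned index is the one to keep. Anything later has started to fit noise.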

What are epochs?

One epoch = the model sees every training sample once. Multiple epochs mean the model sees the same data repeatedly — each pass reinforces what it learned and helps it pick up patterns it missed the first time.

Whether you need multiple epochs depends on the dataset. A small, clean dataset might benefit from 5–10 epochs. For a large or noisy dataset, one pass is often enough.
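Epochs, batch size, and gradient accumulation together determine how many weight updates a run performs. A rough calculation follows, with hypothetical dataset sizes and a made-up function name. Frameworks differ in how they treat a final partial batch, so take the result as an estimate.

```python
import math


def optimizer_steps(n_samples, batch_size, grad_accum, epochs):
    """Number of weight updates: samples grouped into effective batches, times epochs."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return steps_per_epoch * epochs


# Hypothetical dataset sizes, chosen only to show the scale difference.
small_run = optimizer_steps(n_samples=2_000, batch_size=1, grad_accum=8, epochs=5)
large_run = optimizer_steps(n_samples=100_000, batch_size=1, grad_accum=8, epochs=1)
print(small_run, large_run)  # 1250 12500
```

Even at five epochs the small dataset yields a tenth of the updates the large one gets in a single pass, which is part of why small datasets are the ones that tend to need repeats.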

What formats does a fine-tuned model produce?

LoRA adapter (~80–170MB): just the trained adapter weights. The size depends on save precision: ~84MB at 16-bit, ~168MB at 32-bit. To use this, you load the base model and attach the adapter on top. You can swap adapters at runtime: train one for vulnerability detection, another for code review, another for documentation. Same base model, different skills. One 8GB base + three small adapters is much cheaper than three separate full models.

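from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it")  # frozen base model
model.load_adapter("my-vuln-detector-lora")  # attach the LoRA adapter (requires peft)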

Merged model (~8GB) — base model + adapter baked together into one set of files. You need this as a clean starting point for converting to other formats. Why save in 16-bit when you loaded in 4-bit? Because the 4-bit was a temporary memory trick for training. The original model exists in 16-bit on HuggingFace — the merge retrieves those original full-precision weights and combines them with your 16-bit adapter. You’re not upscaling 4-bit back to 16-bit; you’re going back to the source and folding in what the adapter learned.

GGUF (~2.5GB quantized) — a single-file format created by the llama.cpp project, used by Ollama, LM Studio, and llama.cpp for running models locally without Python or PyTorch.

Can you keep the adapter separate or must you merge?

For Python/HuggingFace use: keep them separate. You get adapter swapping, smaller files, and flexibility. Only merge when the next step requires it — specifically GGUF conversion, which needs a complete model.

Think of it as two ecosystems:

	SafeTensors (HuggingFace)	GGUF (llama.cpp)
Swap LoRA adapters at runtime	Yes	No — baked in
Run in Ollama / LM Studio	No	Yes
Run without Python	No	Yes
Multiple skills, one base model	Yes	Need separate GGUF per skill

Practical tip: save to Google Drive

If you’re training on Google Colab, mount Drive at the start and write outputs there. Colab sessions die without warning — free tier disconnects after 30–90 minutes of inactivity, and even paid tiers have session limits. I lost a full training run before learning this.

from google.colab import drive
drive.mount('/content/drive')

This is the concepts reference for “The Security Engineer’s Practical Guide to LLMs.” Read the experiment: I Fine-Tuned Gemma 4 to Detect Code Vulnerabilities — Here’s What Happened.

Geo Joy