Geo Joy

I Fine-Tuned Gemma 4 to Detect Code Vulnerabilities

2026-05-02T00:00:00+00:00

One GPU, one epoch, three evaluation surprises, and recall that jumped from 4% to 51%. If you want the concepts behind the decisions (LoRA, QLoRA, NF4, batch size, loss curves), read the companion reference: Every Concept You Need Before Fine-Tuning an LLM.

I work with LLMs daily through APIs and orchestration pipelines. But there’s a difference between using models and understanding what happens inside them. I wanted to get hands-on with the training process itself — so I picked a domain I know well (code security), grabbed a public dataset, and fine-tuned Google’s Gemma 4 E4B on a Colab A100 over a weekend. Code vulnerability detection is the vehicle here, not the destination — every technique applies to any domain. That said, there’s a practical angle: a fine-tuned local model can analyze code without sending it to a cloud API. For teams working on proprietary codebases, air-gapped environments, or regulated industries where code cannot leave the network, a local model — even a modest one — fills a niche that commercial cloud scanners can’t.

The setup

Model: Google’s Gemma 4 E4B — a dense model with 8 billion total parameters and ~4.5 billion effective parameters during inference. The “E” stands for “Effective” — the model uses Per-Layer Embeddings (PLE), where large embedding lookup tables add to the total parameter count but aren’t used in the forward computation, so the effective compute footprint is much smaller than the total (source: Google model card). Instruction-tuned and multimodal (text, vision, audio).

Dataset: DiverseVul — ~330,000 C/C++ functions labeled as vulnerable or safe, spanning 150 CWE categories.

Tool: Unsloth — handles QLoRA loading, optimized training, and GGUF export.

Hardware: Google Colab with an A100 GPU (40GB VRAM). I initially tried a free-tier T4 (16GB) but hit out-of-memory errors during training even with QLoRA and batch size of 1. The A100’s 40GB gives comfortable headroom for QLoRA fine-tuning and supports bf16 precision (more numerically stable than the T4’s fp16).

The dataset

DiverseVul is extracted from vulnerability-fixing commits on GitHub, from projects like the Linux kernel, OpenSSL, FFmpeg, and ImageMagick. Each function is labeled vulnerable (1) or safe (0). Note: the dataset is C/C++ only, a different profile from typical vibe-coded JavaScript/Python/TypeScript apps, but the fine-tuning process is identical regardless of language.

Two properties matter:

It’s heavily imbalanced. ~95% safe, ~5% vulnerable. Training on this raw teaches the model to always say “SAFE” and achieve 95% accuracy while catching nothing. Fix: balanced sampling — I took 3,000 vulnerable and 3,000 safe functions for training, 500 + 500 for validation.

import random
from datasets import load_dataset

raw = load_dataset("bstee615/diversevul")
# Keep functions of moderate length (30 to 3200 characters), split by label
vuln = [r for r in raw["train"] if r["target"] == 1 and 30 < len(r["func"]) < 3200]
safe = [r for r in raw["train"] if r["target"] == 0 and 30 < len(r["func"]) < 3200]
train_balanced = random.sample(vuln, 3000) + random.sample(safe, 3000)

The labels are noisy. The DiverseVul authors themselves report 60% label accuracy for vulnerable functions, measured by manually verifying a random sample of 50 (Table 8, DiverseVul paper, RAID 2023). The main sources of error: vulnerabilities spread across multiple functions, and non-vulnerable functions changed in the same commit as the fix. This puts a hard ceiling on achievable performance. For a learning experiment, this is acceptable. For production, you’d invest heavily in label quality first.

Each sample is formatted as a Gemma 4 chat conversation for SFT:

text = (
    f"system\n{SYSTEM}\n"
    f"user\n{user_msg}\n"
    f"model\n{reply}\n"
)

Training

CONFIG = dict(
    model       = "google/gemma-4-E4B-it",
    max_seq_len = 512,
    lora_rank   = 16,
    epochs      = 1,
    batch_size  = 8,
    grad_accum  = 1,           # effective batch = 8
    lr          = 2e-4,
    samples_per_class = 3000,  # 3k vuln + 3k safe = 6k total
)

LoRA adapters targeted all attention and MLP layers (q/k/v/o projections, gate/up/down projections). After loading:

GPU: NVIDIA A100-SXM4-40GB
VRAM after model load: ~3.2 / 40.0 GB
Trainable: 42,401,792 / 8,038,558,240 (0.53%)

Training completed in approximately 1 hour 45 minutes on the A100 for one epoch.

Fine-tuning loss curve. Training loss (blue) drops sharply from ~9.5 to ~1.3. Validation loss (orange) plateaus at ~2.3.

Training loss dropped sharply from ~9.5 to ~1.3 in the first 100 steps. (A starting loss of ~9.5 is higher than typical text models — this is normal for Gemma 4’s multimodal architecture with its large vocabulary. The model hasn’t seen our task format before, so early predictions are essentially random across the full token space.) It continued declining gradually after that.

Validation loss dropped to ~2.3 and plateaued completely. Additional training steps reduced training loss but didn’t improve generalization. I had originally configured 3 epochs, but the validation curve made the decision clear: stop at 1 epoch. The model absorbed the clean, obvious patterns quickly. Further training was fitting the noisy labels, not learning new patterns.

Unsloth training progress — step-by-step loss showing the plateau during epoch 1.

Three output formats saved to Google Drive — LoRA adapter, merged SafeTensors, and GGUF.


Evaluation: three iterations to honest numbers

Evaluating this model correctly turned out to be harder than training it.

The accuracy trap

First run on 200 random test samples: 94.5% accuracy. Impressive — until you check the distribution. 195 safe, 5 vulnerable. The raw test set mirrors the original dataset’s 95/5 imbalance. The model said “SAFE” almost every time and scored well by default.

Lesson: always evaluate on a balanced test set. Accuracy on imbalanced data is meaningless.
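To make the trap concrete, here is a minimal sketch of a hypothetical "model" that answers SAFE for every input, scored on the same 195 safe / 5 vulnerable split. The always-SAFE predictor is an illustration only, not the fine-tuned model from this post.

```python
# Hypothetical baseline: always answer SAFE, no matter what the code looks like.
labels = [0] * 195 + [1] * 5          # 0 = safe, 1 = vulnerable
predictions = [0] * len(labels)       # every prediction is SAFE

correct = sum(p == y for p, y in zip(predictions, labels))
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

accuracy = correct / len(labels)      # share of all answers that match the label
recall = true_positives / sum(labels) # share of vulnerable samples that were caught

print(f"accuracy={accuracy:.1%} recall={recall:.1%}")
```

On this split the do-nothing baseline reaches 97.5% accuracy with 0% recall, which is why a 94.5% score on the same distribution says very little.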

The prompt echo

Balanced evaluation (100 vulnerable + 100 safe): 52.5% accuracy, 7% recall. Something was clearly wrong. I looked at the actual model outputs:

CWE: CWE-416
Model said: SAFE and a brief reason.

CWE: CWE-20, CWE-787
Model said: SAFE and a brief reason.

The model wasn’t analyzing code — it was echoing the prompt. The training data used the phrase “Reply with VULNERABLE or SAFE and a brief reason.” At inference time, the model encountered this substring and completed the most probable next tokens — which were the rest of the training template. This is a generation artifact: the model had learned the task, but the decoding followed a memorized path instead of producing new analysis.

The fix was simple: change the prompt wording at inference so it couldn’t trigger the memorized completion. Same model, same weights, different question:

# Triggered memorized template completion
"Reply with VULNERABLE or SAFE and a brief reason."

# Fixed — new wording, model produces actual analysis
"Is it VULNERABLE or SAFE? Explain your reasoning."

The model immediately started producing real analysis:

CWE: unknown
Model: This function is VULNERABLE. The function uses fork() to
execute a command in a child process...

CWE: CWE-190
Model: VULNERABLE. The function TIFFReadRawStrip1 is vulnerable
to a buffer overflow when reading raw data from a TIFF file...

Lesson: fine-tuning teaches a conversational pattern, not just a task. The inference prompt must align with — but not exactly match — the training format. If the prompt contains a substring from training targets, the model may complete the template rather than reason about the input.
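One way to catch this failure mode automatically is to flag outputs that repeat a run of words from the prompt. The sketch below is an assumption about how such a check could look; the function name `looks_like_prompt_echo`, the four-word threshold, and the surrounding prompt text are all hypothetical and are not part of the original evaluation code.

```python
def looks_like_prompt_echo(prompt: str, output: str, min_words: int = 4) -> bool:
    """Return True if the output contains a run of at least `min_words`
    consecutive words that also appears verbatim in the prompt."""
    prompt_norm = " ".join(prompt.lower().split())
    words = output.lower().split()
    for start in range(len(words) - min_words + 1):
        window = " ".join(words[start:start + min_words])
        if window in prompt_norm:
            return True
    return False

# Hypothetical prompt that ends with the instruction used in the training data.
prompt = "Analyze this function. Reply with VULNERABLE or SAFE and a brief reason."
echoed = "SAFE and a brief reason."
analysis = "VULNERABLE. The function copies user input without a bounds check."

print(looks_like_prompt_echo(prompt, echoed))    # True: the output repeats the prompt
print(looks_like_prompt_echo(prompt, analysis))  # False: the output is new text
```

Running a check like this over a sample of outputs before computing metrics would have surfaced the echo without manual inspection.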

The real numbers

Balanced evaluation, 200 samples (100 vulnerable + 100 safe), corrected prompt, with random.seed(42) for reproducibility. Both the fine-tuned and zero-shot models were evaluated with the identical prompt and the same 200 samples for a fair comparison:

Metric	Fine-tuned	Zero-shot (no training)	Delta
Accuracy	61.0%	45.5%	+15.5 pp
Precision	63.7%	23.5%	+40.2 pp
Recall	51.0%	4.0%	+47.0 pp
F1	0.567	0.068	+0.499

The base Gemma 4 E4B caught 4 out of 100 vulnerabilities zero-shot: it almost never returned a VULNERABLE verdict. The fine-tuned version caught 51, bringing recall from near-zero to about half. Not perfect, but a clear signal that the fine-tuning worked, especially given the noisy labels in the training data.
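The four metrics follow directly from the confusion counts. The post reports rates rather than raw counts, so the counts in this sketch are inferred from those rates on 100 vulnerable + 100 safe samples and should be read as an assumption.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive (vulnerable) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Counts inferred from the reported percentages (assumption, not logged values).
fine_tuned = binary_metrics(tp=51, fp=29, fn=49, tn=71)
zero_shot = binary_metrics(tp=4, fp=13, fn=96, tn=87)

for name, m in (("fine-tuned", fine_tuned), ("zero-shot", zero_shot)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these counts the sketch reproduces the reported figures: F1 of 0.567 for the fine-tuned model and 0.068 for the zero-shot baseline.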

What did fine-tuning actually change?

Here’s what’s counterintuitive: we didn’t teach Gemma 4 about vulnerabilities. It already knew. The model was pre-trained on code, security advisories, CWE descriptions, and countless discussions about buffer overflows and injection attacks. The zero-shot baseline proved this — it sometimes gave detailed, correct explanations of why code was dangerous.

But it only caught 4 out of 100 vulnerabilities in our eval. Why?

Because our eval looked for the word “VULNERABLE” in the response. The base model would say things like “this code has potential security implications that warrant further review” — technically correct analysis, but our parser reads that as SAFE because it doesn’t contain the keyword. A smarter parser that also caught phrases like “security flaw” or “dangerous” would have narrowed the gap — but the inconsistency and lack of structured verdicts would remain. The model knew the answer but expressed it in a way our system couldn’t reliably use.
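A keyword parser of the kind described above could look like the sketch below. It is not the exact parser used in the experiment: the function name `parse_verdict` is hypothetical, and unlike the original (which counted every non-keyword answer as SAFE) it returns a third UNCLEAR bucket so hedged answers can be counted separately.

```python
import re

def parse_verdict(response: str) -> str:
    """Map a free-text model response to 'VULNERABLE', 'SAFE', or 'UNCLEAR'."""
    text = response.upper()
    # Treat an explicit negation such as "not vulnerable" as a SAFE verdict.
    if re.search(r"\bNOT\s+VULNERABLE\b", text):
        return "SAFE"
    if re.search(r"\bVULNERABLE\b", text):
        return "VULNERABLE"
    if re.search(r"\bSAFE\b", text):
        return "SAFE"
    # No keyword found: the model hedged or answered in another format.
    return "UNCLEAR"

print(parse_verdict("VULNERABLE. The function reads past the end of the buffer."))
print(parse_verdict("This code has potential security implications that warrant further review."))
print(parse_verdict("This function is not vulnerable."))
```

Counting UNCLEAR responses separately shows how much of the zero-shot gap comes from format rather than from missing knowledge.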

Fine-tuning was essentially response format alignment — teaching the model to package what it already knew into the structured output we needed:

Lead with a verdict — always say VULNERABLE or SAFE first, not a hedged paragraph
Be consistent — same format every time, not sometimes three paragraphs and sometimes one word
Commit to a decision — no “this could potentially be problematic” — yes or no

Think of it as a senior security consultant who knows everything about vulnerabilities but has never used your team’s reporting template. They can write a brilliant analysis, but they can’t fill in the “Severity: HIGH/MEDIUM/LOW” field consistently. Fine-tuning taught the consultant to use the template.

This is an important insight for anyone considering fine-tuning: if the base model already understands your domain, you may not need thousands of examples to teach it new knowledge. You need enough examples to teach it your expected response structure. In our case, one epoch was sufficient — the model learned the format fast, because the underlying knowledge was already there.

What it catches and what it misses

Running the fine-tuned model against 200 vulnerable test samples grouped by CWE reveals a clear pattern. A caveat: sample sizes per CWE are small (some have only 4 samples), so these recall numbers are indicative of trends, not statistically robust benchmarks.

Strong performers (>60% recall):

CWE	Description	Caught	Total	Recall
CWE-310	Cryptographic issues	3	4	75.0%
CWE-20	Input validation	12	17	70.6%
CWE-200	Information exposure	4	6	66.7%
CWE-787	Out-of-bounds write	16	25	64.0%

Weak spots (<35% recall):

CWE	Description	Caught	Total	Recall
CWE-415	Double free	0	4	0.0%
CWE-401	Memory leak	1	4	25.0%
CWE-399	Resource management	1	4	25.0%
CWE-416	Use after free	4	12	33.3%
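Per-CWE recall like the tables above can be computed by grouping vulnerable samples by CWE. This is a minimal sketch with hypothetical toy data; it assumes that a sample tagged with several CWEs counts toward each of them, which may differ from how the original breakdown was produced.

```python
from collections import defaultdict

def recall_by_cwe(samples):
    """samples: iterable of (cwe_ids, was_caught) pairs for vulnerable functions.
    Returns {cwe: (caught, total, recall)}."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for cwe_ids, was_caught in samples:
        for cwe in cwe_ids:               # a multi-CWE sample counts once per CWE
            total[cwe] += 1
            caught[cwe] += int(was_caught)
    return {cwe: (caught[cwe], total[cwe], caught[cwe] / total[cwe]) for cwe in total}

# Hypothetical toy data, not the real evaluation results.
samples = [
    (["CWE-787"], True),
    (["CWE-787", "CWE-20"], True),
    (["CWE-416"], False),
    (["CWE-416"], True),
]
table = recall_by_cwe(samples)

for cwe, (hit, n, rec) in sorted(table.items()):
    print(f"{cwe}\t{hit}\t{n}\t{rec:.1%}")
```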

The model catches vulnerabilities with obvious, localized code signatures — unchecked inputs, buffer writes without bounds checking, weak crypto usage. These are patterns where a single line or function call is the red flag.

Where it struggles is with state-tracking bugs — double frees, use-after-free, memory leaks. These vulnerabilities require understanding execution flow across multiple lines: memory was allocated here, freed there, and then accessed again somewhere else. A model looking at a single function in isolation has limited ability to track that kind of stateful reasoning.

Fine-tuning taught the model to recognize vulnerability signatures, not to perform deep program analysis. True flow-sensitive analysis would likely require either a much larger model, a multi-file context approach, or combining the LLM with static analysis tools — for example, using Semgrep or CodeQL to identify candidate functions, then the LLM to classify and explain. That hybrid approach is worth exploring in a future post.

Key takeaways

Watch the validation loss, not the training loss. Training loss will keep dropping as the model memorizes its training set. Validation loss tells you when to stop. Mine plateaued halfway through epoch 1.

Evaluation is harder than training. My reported accuracy changed from 94.5% to 52.5% to 61% across three iterations. Each time, the problem was measurement, not the model.

Prompt alignment matters more than you’d expect. The model learned fine — but the inference prompt triggered a memorized template completion instead of actual analysis. Changing the prompt wording fixed it instantly, with no retraining.

Data quality is the ceiling. With ~60% label accuracy (DiverseVul, RAID 2023), no training configuration will produce great results. For production, invest in labels first. For learning, noisy data teaches you the process just as well.

Practical note: if you’re training on Google Colab, save to Google Drive early and often. I lost a full training run when the session disconnected. Mount Drive at the start and set your output directory there.

The outputs

The fine-tuned model is saved in three formats: a LoRA adapter (~160MB), a merged 16-bit SafeTensors model (~8GB), and a GGUF Q4_K_M file (~2.5GB). The evaluation in this post was done on the SafeTensors LoRA checkpoint. The GGUF version hasn’t been evaluated yet — that’s the focus of the next post.

What’s next

In the next post, I’ll take the GGUF file and benchmark different quantization levels — Q4 vs Q5 vs Q8 — measuring what you lose when you shrink a model from 8GB to 2.5GB. Does Q4 still catch the buffer overflows that Q8 catches? Where exactly is the quality cliff?

The code for the full experiment: https://github.com/Geo-Joy/llm-vuln-detector

This is Part 1 of “The Security Engineer’s Practical Guide to LLMs.” Concepts reference: Every Concept You Need Before Fine-Tuning an LLM. Next: What you lose when you shrink a model 4x.

Every Concept You Need Before Fine-Tuning an LLM

2026-05-01T00:00:00+00:00

A practitioner’s reference — LoRA, QLoRA, batch size, loss curves, and output formats explained. This is the concepts companion to I Fine-Tuned Gemma 4 to Detect Code Vulnerabilities — Here’s What Happened.

Most engineering teams use LLMs through APIs — prompt in, response out. The models themselves are a black box. Fine-tuning opens that box: instead of crafting better prompts, you adjust the model’s weights directly. I recently ran my first fine-tuning experiment and spent more time understanding the concepts than writing the code. This post is the reference guide I wish existed when I started.

What is fine-tuning, and why not just prompt better?

Zero-shot prompting means giving an LLM instructions and hoping it follows them. It works surprisingly well for general tasks. But when you need a model to perform one specific task consistently — same format, same decision boundary, every time — fine-tuning has an edge.

You show the model thousands of input/output examples, and it adjusts its internal weights to reproduce that pattern. The result is a smaller, specialized model that does one thing reliably, versus a large general model that needs careful prompting and still varies.

What’s SFT?

SFT stands for Supervised Fine-Tuning. “Supervised” means you provide the right answers — input/output pairs. The model sees a code snippet (input) and the correct verdict like “VULNERABLE — CWE-120” (output), repeated thousands of times. It adjusts its weights to predict similar outputs for similar inputs.

This is different from RLHF (reinforcement learning from human feedback), where the model gets a score for how good its answer was, or unsupervised pre-training where the model just reads text with no labels. The SFTTrainer from the TRL (Transformer Reinforcement Learning) library — HuggingFace’s toolkit for fine-tuning and aligning LLMs — handles the mechanics: tokenization, masking user messages so the model only learns to predict assistant responses, and running the training loop.

When to use SFT vs alternatives: SFT is the right choice when you have labeled data (input/output pairs) and want the model to produce structured, explainable responses. If you only needed a binary score without explanations, a classification head on top of the model would be simpler — though in my experiment I chose SFT because I wanted the model to also produce reasoning and CWE classifications alongside the verdict, not just a bare label. If you wanted to refine response quality after SFT, DPO (Direct Preference Optimization) takes pairs of good/bad responses and teaches the model to prefer the better one — that’s the SFT → DPO pipeline most production models use. RLHF goes further with a full reward model and reinforcement learning, but that’s overkill unless “good” is subjective and hard to label. For most fine-tuning projects, SFT is where you start.

What are LoRA and QLoRA?

Full fine-tuning updates all of a model’s parameters. For a model like Gemma 4 E4B, that’s 8 billion numbers, and you’d need roughly 100GB of GPU memory. The breakdown: 16GB for weights in 16-bit. 16GB for gradients (one value per weight that tells the optimizer which direction, and how steeply, the loss changes with respect to that weight; the optimizer then decides how far to actually move). 64GB for Adam optimizer states (it tracks momentum and variance for every weight, both in 32-bit). That is already 96GB before activations. Even with memory-efficient optimizers, you’re looking at 60GB minimum. Not practical on most hardware.

LoRA (Low-Rank Adaptation) takes a different approach. You freeze the entire base model and inject small trainable matrices into specific layers. These are called adapters. Think of it as: you have a textbook (the base model). Instead of rewriting every page, you add sticky notes to the pages that matter. The textbook stays the same; the sticky notes customize it for your task.

In my experiment, I trained 42 million parameters out of 8 billion, just 0.53% of the model. Where does that number come from? For each target layer, LoRA adds two small matrices instead of updating the full weight matrix. Say an attention layer has a weight matrix of size 3072 × 3072 (~9.4 million parameters). LoRA leaves that matrix frozen and trains two tiny matrices alongside it:

Original weight:  3072 × 3072 = 9,437,184 parameters
LoRA adapter A:   16 × 3072   = 49,152 parameters
LoRA adapter B:   3072 × 16   = 49,152 parameters
Total per module: 98,304 parameters (vs 9.4 million)

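The 16 is the LoRA rank, our chosen adapter size. Multiply across 7 target modules (q, k, v, o, gate, up, down) per layer and all 42 transformer layers, and this simplified square-matrix example gives ~29 million parameters (98,304 × 7 × 42). The real projection matrices are not all 3072 × 3072 (MLP projections are typically wider than attention projections), which is how the actual count reaches ~42 million. Adapter size scales linearly with rank: increase the rank to 32 and it doubles to ~84M; drop to rank 8 and it halves to ~21M. The rank is your dial between “learn more” and “use less memory.”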
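The adapter arithmetic can be checked in a few lines. This sketch uses the simplified 3072 × 3072 shape from the example for every module, which is an assumption; real projection shapes differ, so the simplified total (about 28.9 million) does not equal the 42,401,792 trainable parameters reported in the training log.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    A has shape (rank, d_in), B has shape (d_out, rank)."""
    return rank * d_in + d_out * rank

per_module = lora_params(3072, 3072, rank=16)

# Simplifying assumption: all 7 target modules in all 42 layers are 3072 x 3072.
simplified_total = per_module * 7 * 42

print(per_module)                      # parameters added per module
print(simplified_total)                # total under the square-matrix assumption
print(lora_params(3072, 3072, 32))     # doubling the rank doubles the count
```

The linear scaling with rank holds regardless of the exact matrix shapes, because both adapter matrices have the rank as one of their dimensions.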

Because the base model is frozen, gradients and optimizer states are only computed for the adapter — 42 million parameters, not 8 billion. That’s why the memory drops dramatically.

The frozen base model still sits in GPU memory though. With standard LoRA, it stays at full 16-bit precision: that’s ~16GB just for weights you’re not even changing. That’s where QLoRA comes in.

QLoRA is LoRA with exactly one change: compress the frozen base model to 4-bit when loading it into memory. The adapters, the training loop, the gradients, the optimizer — all identical to LoRA. The only difference is how much space the frozen base occupies in VRAM. In code, it’s a single flag: load_in_4bit=True. Set it to False and you’re doing standard LoRA. Set it to True and you’re doing QLoRA.

But that one flag triggers more than simple compression. Under the hood, the bitsandbytes library applies several innovations from the QLoRA paper (Dettmers et al., 2023). The key one: it uses a smart compression method called NF4 that’s specifically designed for neural network weights — instead of rounding numbers uniformly (which loses a lot), it places the 4-bit quantization levels where the weight values are most dense. This preserves 95–98% of model quality despite the 4x compression.

To be explicit: the adapters you’re training still run in full 16-bit precision; only the frozen base gets compressed. On paper, the base weights shrink from ~16GB to ~4GB (8 billion parameters at 4 bits each, plus a small overhead for quantization constants), and the total setup fits comfortably on a single GPU. Measured footprints can differ from this arithmetic: the training log in the companion post showed ~3.2GB of VRAM in use after model load. Everything else (LoRA rank, target modules, learning rate, batch size, gradient flow) stays the same.

Here’s the memory contrast:

	Full fine-tuning	LoRA	QLoRA
Base weights	16GB (8B × 16-bit)	16GB (8B × 16-bit, frozen)	~4GB (8B × 4-bit, frozen)
Gradients	16GB (8B params)	84MB (42M adapter params)	84MB (42M adapter params)
Optimizer states	64GB (8B × 2 × 32-bit)	336MB (42M × 2 × 32-bit)	336MB (42M × 2 × 32-bit)
Total (+ activations)	~100GB	~18GB	~10GB

Notice that LoRA and QLoRA have identical adapter sizes, gradient sizes, and optimizer sizes. The only row that changes is base weights: 16GB vs ~4GB. That’s the entire difference.
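The rows of the comparison are plain bytes-per-parameter arithmetic, sketched below. The function name `training_memory_gb` is hypothetical, activations and quantization overhead are ignored, and the results are on-paper estimates that will not exactly match measured VRAM.

```python
GB = 1e9  # decimal gigabytes, matching the rounded figures in the text

def training_memory_gb(total_params, trainable_params, base_bits,
                       grad_bits=16, adam_bits=32):
    """Rough memory estimate: stored weights, gradients for trainable
    parameters, and two Adam states (momentum, variance) per trainable parameter."""
    weights = total_params * base_bits / 8
    gradients = trainable_params * grad_bits / 8
    optimizer = trainable_params * 2 * adam_bits / 8
    return {
        "weights": weights / GB,
        "gradients": gradients / GB,
        "optimizer": optimizer / GB,
    }

full = training_memory_gb(8e9, 8e9, base_bits=16)    # everything trainable
lora = training_memory_gb(8e9, 42e6, base_bits=16)   # frozen 16-bit base
qlora = training_memory_gb(8e9, 42e6, base_bits=4)   # frozen 4-bit base

for name, est in (("full", full), ("lora", lora), ("qlora", qlora)):
    print(name, {k: round(v, 3) for k, v in est.items()})
```

Only the `weights` entry differs between the LoRA and QLoRA estimates, since the trainable adapter is the same in both.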

Naive 4-bit compression would lose meaningful quality. NF4 is what makes QLoRA work — it’s the reason that one flag doesn’t tank your results.

Why load the model in 4-bit? Won’t that hurt accuracy?

A model’s weights are just numbers, billions of them. Each number can be stored at different precisions:

16-bit: roughly 3 significant decimal digits, so π is stored as about 3.1406. Takes more space.
4-bit: only 16 distinct values per weight, so each weight snaps to the nearest of 16 levels. Takes 4x less space.

The intuition says 4-bit should be much worse. And with naive rounding, it would be. But NF4 (the smart compression method used by QLoRA) places quantization levels where the weight values actually cluster rather than spacing them evenly. That’s why the research shows 95–98% quality retention.

The other key insight: we’re not training those compressed weights. They’re frozen. The LoRA adapters running on top are in full 16-bit precision and can actually compensate for the small precision loss in the base. So by the time training is done, the fine-tuned model often performs nearly identically to one trained from a 16-bit base.

If the base model is in 4-bit, how does training happen in 16-bit?

This is the most common confusion about QLoRA. The answer: the 4-bit is a storage format, not a compute format. The math always happens in 16-bit.

Here’s what happens in a single forward pass through one layer:

Input → [Base layer weights: stored in 4-bit, dequantized to 16-bit on the fly]
       → output_base (16-bit)

Input → [LoRA adapter weights: stored and computed in 16-bit]
       → output_adapter (16-bit)

Final output = output_base + output_adapter

The base weights sit in GPU memory compressed to 4-bit. But when the model needs to do actual matrix multiplication, bitsandbytes dequantizes them to 16-bit temporarily for that one computation, then discards the 16-bit version. The 4-bit copy stays in memory as the permanent stored format — the 16-bit version only exists for a split second during the calculation.

The LoRA adapter is a separate pair of small matrices that runs entirely in 16-bit. Its output gets added to the base layer’s output. During backpropagation, only the adapter weights receive updates (the base is frozen); gradients still pass through the frozen base layers to reach adapters earlier in the network, but the base weights themselves never change. So 16-bit precision is maintained for everything that’s actually learning.

So it’s not “training 4-bit weights in 16-bit.” It’s:

Storing base weights in 4-bit (saves memory)
Computing with them in 16-bit (dequantize on the fly, preserves quality)
Training only the adapter, which was always 16-bit

The 4-bit is purely a storage compression. The math always happens in 16-bit. That’s why NF4 is designed the way it is — optimized for dequantizing back to 16-bit with minimal information loss.
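The store-compressed, compute-in-higher-precision idea can be sketched in pure Python. This is a toy under stated assumptions: it uses simple absmax rounding to 4-bit integer codes rather than the NF4 scheme, Python floats stand in for 16-bit math, and all function names are hypothetical. It is not how bitsandbytes is implemented.

```python
def quantize_4bit(row):
    """Store a row of weights as integer codes in [-7, 7] plus one scale.
    Plain absmax rounding, NOT NF4."""
    scale = max(abs(w) for w in row) / 7 or 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in row]
    return codes, scale

def dequantize(codes, scale):
    """Rebuild approximate weights from the stored codes."""
    return [c * scale for c in codes]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def qlora_forward(stored_rows, lora_a, lora_b, x):
    # Base path: each row is dequantized only for this multiplication.
    base_out = [sum(w * xi for w, xi in zip(dequantize(codes, scale), x))
                for codes, scale in stored_rows]
    # Adapter path: A is (rank x d_in), B is (d_out x rank), never quantized.
    adapter_out = matvec(lora_b, matvec(lora_a, x))
    return [b + a for b, a in zip(base_out, adapter_out)]

weights = [[0.7, -0.3], [0.21, 0.05]]            # hypothetical frozen base layer
stored = [quantize_4bit(row) for row in weights]  # what stays in memory

x = [1.0, 2.0]
lora_a = [[0.5, -0.5]]
# B starts at zero in LoRA, so the adapter contributes nothing before training.
y_init = qlora_forward(stored, lora_a, [[0.0], [0.0]], x)
# After some (hypothetical) training, B is non-zero and shifts the output.
y_trained = qlora_forward(stored, lora_a, [[1.0], [0.0]], x)

print(stored)
print(y_init, y_trained)
```

The second row shows the cost of coarse storage: 0.05 is stored as code 2 and comes back as 0.06, and the adapter is the only part that can learn to offset such errors.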

What does gradient checkpointing do?

During training, the GPU remembers the output of every layer so it can calculate gradients during backpropagation (the “learning” pass). For a model with dozens of layers, that eats a ton of VRAM — often more than the model weights themselves.

Gradient checkpointing says: “Don’t remember everything. Throw away most intermediate outputs, and recompute them when needed during backpropagation.” You trade compute time (recalculating) for memory savings (not storing it all).

Libraries like Unsloth offer custom implementations (use_gradient_checkpointing="unsloth") that are smarter about which layers to save versus recompute, saving more memory with less speed penalty than PyTorch’s default.

The three memory tricks work together:

4-bit loading: shrinks the frozen base weights to roughly a quarter of their 16-bit size
Gradient checkpointing: shrinks stored activations
LoRA: trains only ~0.5% of parameters, so gradients and optimizer states are tiny

All three combined make it possible to fine-tune a multi-billion parameter model on a single GPU.

What is learning rate?

The learning rate controls how big a step the optimizer takes on each weight update. After the model processes a batch and computes gradients, the learning rate determines how far the weights actually move in the direction those gradients suggest.

Too high and the model overshoots — loss jumps around erratically instead of decreasing. Too low and the model barely moves — loss flatlines even though the model hasn’t converged. A common default for LoRA fine-tuning is 2e-4 (0.0002), which works well as a starting point. If your loss is oscillating wildly, try halving it. If your loss isn’t moving, try doubling it.
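A one-parameter toy problem shows all three regimes. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0; the learning rates are chosen for this toy function only and say nothing about the right value for LoRA training.

```python
def descend(lr, steps=20, w=1.0):
    """Plain gradient descent on f(w) = w**2, starting from w = 1.0."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w -= lr * grad        # the learning rate scales the step
    return w

too_high = descend(lr=1.1)      # each step overshoots and |w| grows
reasonable = descend(lr=0.3)    # converges quickly toward 0
too_low = descend(lr=0.001)     # barely moves from the starting point

print(too_high, reasonable, too_low)
```

Each update multiplies w by (1 - 2·lr), so a rate above 1.0 flips the sign and grows the error on every step, while a very small rate leaves w almost where it started.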

What are batch size and gradient accumulation?

Batch size = how many samples the GPU processes at once. Each sample sits in VRAM simultaneously. Bigger batch = more VRAM usage but faster training.

Gradient accumulation = how many batches to stack up before updating the weights. With grad_accum=8, the GPU processes 8 mini-batches one at a time, adds up the gradients, then makes one combined weight update.

The math: batch_size × grad_accum = effective batch size

Both of these give an effective batch of 8, but use memory differently:

batch_size=8, grad_accum=1 — fast (8 samples in parallel) but needs more VRAM
batch_size=1, grad_accum=8 — slow (1 sample at a time, 8 sequential passes) but uses minimal VRAM

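The model learns essentially the same thing either way: when the loss is averaged consistently across micro-batches, the accumulated gradient matches the full-batch gradient up to floating-point error. You’re trading speed for memory.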
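The equivalence can be checked on a tiny example. This sketch uses a one-weight linear model with mean squared error and hypothetical data; it assumes equally sized micro-batches and a loss averaged per sample. With token-level losses and variable-length sequences, frameworks have to normalize carefully to get the same result.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical data: 8 samples from y = 3x + 1.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
w = 0.5

# batch_size=8, grad_accum=1: one pass over all 8 samples.
full_batch_grad = mse_grad(w, data)

# batch_size=1, grad_accum=8: 8 sequential passes, gradients averaged.
accum_steps = 8
accumulated_grad = 0.0
for i in range(accum_steps):
    micro_batch = data[i:i + 1]
    accumulated_grad += mse_grad(w, micro_batch) / accum_steps

print(full_batch_grad, accumulated_grad)
```

Dividing each micro-batch gradient by the number of accumulation steps is what keeps the combined update on the same scale as the single large batch.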

What do training loss and validation loss mean?

Both measure how surprised the model is by the correct answer. The model reads an input, predicts the next token in the expected response, and the loss reflects how wrong those predictions are. Lower = better.

Training loss: measured on data the model is learning from. It usually keeps going down for as long as you train, because the model can memorize these examples.
Validation loss: measured on data the model has never trained on. This is the reality check.
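For next-token prediction, the loss on one token is the negative log of the probability the model assigned to the correct token. The sketch below converts between the two; the 100,000-token vocabulary is a hypothetical round number, not the actual Gemma vocabulary size.

```python
import math

def token_loss(prob_of_correct_token: float) -> float:
    """Cross-entropy loss for a single token, in nats."""
    return -math.log(prob_of_correct_token)

def typical_probability(mean_loss: float) -> float:
    """Geometric-mean probability of the correct token implied by a mean loss."""
    return math.exp(-mean_loss)

def uniform_guess_loss(vocab_size: int) -> float:
    """Loss of a model that spreads probability evenly over the vocabulary."""
    return math.log(vocab_size)

print(token_loss(0.5))                 # a coin-flip guess between two tokens
print(typical_probability(2.3))        # what a loss of 2.3 implies per token
print(uniform_guess_loss(100_000))     # hypothetical 100k-token vocabulary
```

Read this way, a loss of about 2.3 corresponds to giving the correct token roughly a 10% probability on average, and a loss of about 1.3 to roughly 27%.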

The relationship matters:

Both going down — the model is learning and generalizing. Good.
Training going down, validation going up — overfitting. The model is memorizing rather than learning patterns.
Both stuck — the model isn’t learning. Learning rate may be too low.

Always watch the validation loss to decide when to stop training. Don’t trust epoch count defaults from tutorials — your data and model will tell you the right answer.

Why does the loss oscillate step-to-step? If you look at the raw (unsmoothed) loss curve, it won’t decrease in a clean line; it zigzags. This is normal. Some of it is plain sampling variance (every batch is a different random subset, so the loss would wobble even with perfect labels), and label noise amplifies it. Each training batch samples a different mix of correctly and incorrectly labeled data. A batch that happens to contain mostly clean, correctly labeled examples gives the model a consistent gradient signal, and loss drops. The next batch might contain several mislabeled samples, producing contradictory gradients, and loss spikes. With a dataset like DiverseVul (~60% label accuracy for the vulnerable class), these contradictions happen frequently, and the zigzag is pronounced.

Three things control how spiky the curve looks. Batch size: smaller batches sample fewer examples per step, so the label noise ratio varies more between batches — more oscillation. Learning rate: higher values amplify the effect of noisy gradients, making each spike bigger. Data quality: the noisier the labels, the more batches disagree with each other on what the model should learn. Increasing batch size smooths the curve cosmetically, but doesn’t fix the underlying problem — the model is still receiving contradictory supervision from mislabeled data.

The validation loss plateau is the real signal here. When it flatlines while training loss keeps dropping, the model has learned everything the clean labels can teach. Further training just memorizes the noise — which is why the growing gap between training and validation loss is the clearest sign to stop.
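The stopping rule can be written as a small patience check. This is a sketch with hypothetical names, thresholds, and loss values, not the logic used in the experiment (where the decision was made by reading the curve); the transformers library also ships an `EarlyStoppingCallback` that applies a similar idea.

```python
def should_stop(val_losses, patience=3, min_delta=0.01):
    """Return True if the last `patience` evaluations did not improve on the
    earlier best validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False                       # not enough history yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Hypothetical validation-loss histories.
plateaued = [4.1, 3.0, 2.5, 2.3, 2.31, 2.3, 2.32]
still_improving = [4.1, 3.0, 2.5, 2.1]

print(should_stop(plateaued))        # True: three evaluations without progress
print(should_stop(still_improving))  # False: the latest values keep improving
```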

What are epochs?

One epoch = the model sees every training sample once. Multiple epochs mean the model sees the same data repeatedly — each pass reinforces what it learned and helps it pick up patterns it missed the first time.

Whether you need multiple epochs depends on the dataset. A small, clean dataset might benefit from 5–10 epochs. A large or noisy dataset — one pass is often enough.

What formats does a fine-tuned model produce?

LoRA adapter (~80–160MB) — just the trained adapter weights. The size depends on save precision: ~84MB at 16-bit, ~168MB at 32-bit. To use this, you load the base model and attach the adapter on top. You can swap adapters at runtime — train one for vulnerability detection, another for code review, another for documentation. Same base model, different skills. One 8GB base + three small adapters is much cheaper than three separate full models.

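# Sketch with transformers + peft installed; the Auto class for this checkpoint is an assumption
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it")
model.load_adapter("my-vuln-detector-lora")  # attach the trained LoRA weights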

Merged model (~8GB) — base model + adapter baked together into one set of files. You need this as a clean starting point for converting to other formats. Why save in 16-bit when you loaded in 4-bit? Because the 4-bit was a temporary memory trick for training. The original model exists in 16-bit on HuggingFace — the merge retrieves those original full-precision weights and combines them with your 16-bit adapter. You’re not upscaling 4-bit back to 16-bit; you’re going back to the source and folding in what the adapter learned.

GGUF (~2.5GB quantized) — a single-file format created by the llama.cpp project, used by Ollama, LM Studio, and llama.cpp for running models locally without Python or PyTorch.

Can you keep the adapter separate or must you merge?

For Python/HuggingFace use: keep them separate. You get adapter swapping, smaller files, and flexibility. Only merge when the next step requires it — specifically GGUF conversion, which needs a complete model.

Think of it as two ecosystems:

	SafeTensors (HuggingFace)	GGUF (llama.cpp)
Swap LoRA adapters at runtime	Yes	No — baked in
Run in Ollama / LM Studio	No	Yes
Run without Python	No	Yes
Multiple skills, one base model	Yes	Need separate GGUF per skill

Practical tip: save to Google Drive

If you’re training on Google Colab, mount Drive at the start and write outputs there. Colab sessions die without warning — free tier disconnects after 30–90 minutes of inactivity, and even paid tiers have session limits. I lost a full training run before learning this.

from google.colab import drive
drive.mount('/content/drive')

This is the concepts reference for “The Security Engineer’s Practical Guide to LLMs.” Read the experiment: I Fine-Tuned Gemma 4 to Detect Code Vulnerabilities — Here’s What Happened.

Your Brain on Scams: What the Experiment Actually Found

2026-04-13T00:00:00+00:00

Part 2 of 2 — The results. Part 1 covered the theory and experiment design.

TL;DR: I ran scam messages through TRIBE v2 — Meta’s brain-encoding model — via two paths: raw text (language encoder) and rendered screenshot (visual encoder). The language encoder predicts stronger prefrontal activation for scam text vs legitimate text, consistent across all four scam types and both English and Japanese. The visual encoder predicts lower visual cortex activation for scam screenshots than for the legitimate baseline — the scam UI doesn’t stand out visually. And the visual encoder’s brain maps are near-identical across English and Japanese (r = 0.98–0.99), while the language encoder’s maps vary more by language (r = 0.59–0.91). These are computational predictions, not real brain measurements — but the patterns are consistent enough to be worth taking seriously.

When Meta released TRIBE v2, I kept thinking about what it could mean for scam detection. This is me finally running that experiment — a personal research project, not a peer-reviewed study. Treat the findings as hypotheses worth questioning, not conclusions worth citing. If something here raises a doubt or suggests a better experiment, the comments are open.

Last time I set up an experiment using TRIBE v2 — Meta’s brain-encoding model — to predict what the human cortex might activate when processing a scam message versus a legitimate one. To be precise: TRIBE v2 doesn’t measure brains. It predicts group-average fMRI activation patterns based on a model trained on 451 hours of real fMRI data. Think of it as a computational proxy — useful for hypothesis generation at scale, not a substitute for putting people in a scanner.

Two input paths: feed the raw text through the language stack (LLaMA 3.2-3B + Wav2Vec-BERT), or feed a rendered screenshot of the same message through the visual encoder (V-JEPA2 ViT-Giant). Important distinction: these are different encoders seeing fundamentally different representations of the same content. Path A processes words. Path B processes pixels — it never reads the text inside the image. Two different questions about the same stimulus. I promised results. Here they are.

The Setup (Fast Version)

Five message types: a legitimate shipping notification as baseline, plus four scams — phishing (“Your Amazon account has been compromised”), investment (“500% returns guaranteed”), fake shop (“90% OFF Ray-Ban sunglasses”), and pyramid scheme (“Earn $5,000/month passive income”). Each rendered as both a plain text file and a realistic UI screenshot (WhatsApp chat bubble for SMS-style scams, social post frame for the others).

The 10 rendered stimuli: 5 message types × 2 languages. Each processed as both raw text and screenshot.
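The counting is easy to get wrong, so here is a small sketch of the experiment grid. The category names follow the article; the short labels themselves are illustrative and do not correspond to any file or ID scheme in the actual notebook:

```python
from itertools import product

# Illustrative labels for the five message types, two languages, two input paths
MESSAGE_TYPES = ["legitimate", "phishing", "investment", "fake_shop", "pyramid"]
LANGUAGES = ["en", "ja"]
PATHS = ["text", "screenshot"]

# 5 message types x 2 languages = 10 stimuli
stimuli = list(product(MESSAGE_TYPES, LANGUAGES))

# Each stimulus goes through both input paths
runs = [(msg, lang, path) for (msg, lang) in stimuli for path in PATHS]

print(len(stimuli), "stimuli,", len(runs), "inference runs")
```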

Each pair ran through TRIBE v2’s dual-path inference on Colab Pro (A100 40GB). The model outputs a predicted fMRI activation surface on the fsaverage5 cortical mesh (a standard 3D brain surface model used across neuroscience research) — roughly 20,000 cortical vertices plus ~8,800 subcortical voxels (deep brain structures). For region-of-interest analysis I attempted seven pre-defined regions: dlPFC, ACC, insula, visual cortex, TPJ, amygdala, and nucleus accumbens. The first five are cortical and extracted cleanly via the Destrieux surface atlas (a standard brain region map that parcellates the cortex into named areas). Amygdala and nucleus accumbens are subcortical — their values came out near-zero across all conditions, which is either a genuine finding or a TRIBE v2 coverage limitation (the model was trained primarily on cortical fMRI). More on that in caveats. Then ran the whole corpus again in Japanese to test cross-lingual generalization.
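For readers who have not done ROI analysis before, the core operation is just a masked average: take the per-vertex predictions, keep the vertices whose atlas label belongs to the region, and average them. A minimal sketch under that assumption; the six-vertex arrays and the label IDs are made up, whereas the real inputs would be the ~20,000-vertex prediction and the Destrieux label array:

```python
from statistics import fmean

def roi_mean(activation, labels, roi_label_ids):
    """Average predicted activation over vertices whose atlas label is in roi_label_ids."""
    values = [a for a, lab in zip(activation, labels) if lab in roi_label_ids]
    if not values:
        raise ValueError("no vertices carry the requested labels")
    return fmean(values)

# Toy data: 6 vertices, hypothetical label IDs 1, 2, 3
activation = [0.10, 0.20, -0.05, 0.00, 0.30, -0.10]
labels     = [1,    1,    2,     2,    3,    3]

single_region = roi_mean(activation, labels, {1})     # one atlas parcel
merged_region = roi_mean(activation, labels, {2, 3})  # an ROI built from two parcels
print(round(single_region, 4), round(merged_region, 4))
```

Passing a set of label IDs matters because a functional ROI such as dlPFC typically spans more than one atlas parcel.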

What the Text Path Showed

The cleanest finding from the text path: dlPFC lights up for every scam type, without exception.

A note on terminology: Part 1 referred to “prefrontal cortex” and “ventromedial prefrontal cortex (vmPFC)” when predicting fake shop activation. The actual ROI extracted here is the dorsolateral prefrontal cortex (dlPFC) — a different subdivision. dlPFC handles working memory, goal maintenance, and conflict resolution. vmPFC handles value computation and reward evaluation. They’re neighbours, not synonyms. The experiment measured dlPFC; vmPFC was not separately extracted. That distinction matters for interpreting what “prefrontal activation” means in this context.

Message Type	dlPFC	ACC	Insula	Visual Cortex	TPJ
Phishing (text)	+0.053	+0.047	+0.012	+0.007	+0.028
Investment (text)	+0.061	−0.007	−0.005	−0.106	+0.047
Fake Shop (text)	+0.074	+0.023	+0.023	−0.036	+0.039
Pyramid Scheme (text)	+0.024	−0.013	−0.006	−0.096	+0.000

The dorsolateral prefrontal cortex is your rational evaluation engine: working memory, goal maintenance, conflict resolution. For every scam type in the text path, the model predicts a higher value for dlPFC than for any other ROI. Fake shop gets the highest dlPFC response at 0.074, followed by investment at 0.061, then phishing at 0.053, with pyramid scheme lowest at 0.024. With a single run per condition these values can’t be tested for significance, but the ordering is consistent with the hypothesis that high-manipulation text forces cognitive engagement.
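The claim can be checked directly against the table above. This sketch copies the text-path values into a dictionary (the short keys are my own labels) and asks which ROI is highest for each scam type:

```python
# Text-path ROI values copied from the table above (EN corpus)
TEXT_PATH = {
    "phishing":   {"dlPFC": 0.053, "ACC": 0.047,  "Insula": 0.012,  "Visual": 0.007,  "TPJ": 0.028},
    "investment": {"dlPFC": 0.061, "ACC": -0.007, "Insula": -0.005, "Visual": -0.106, "TPJ": 0.047},
    "fake_shop":  {"dlPFC": 0.074, "ACC": 0.023,  "Insula": 0.023,  "Visual": -0.036, "TPJ": 0.039},
    "pyramid":    {"dlPFC": 0.024, "ACC": -0.013, "Insula": -0.006, "Visual": -0.096, "TPJ": 0.000},
}

# Highest-valued ROI for each scam type
top_roi = {scam: max(rois, key=rois.get) for scam, rois in TEXT_PATH.items()}

# Scam types ordered by dlPFC response, strongest first
dlpfc_ranking = sorted(TEXT_PATH, key=lambda s: TEXT_PATH[s]["dlPFC"], reverse=True)

print(top_roi)
print(dlpfc_ranking)
```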

The ACC (anterior cingulate cortex — conflict monitoring, urgency) co-activates with dlPFC for phishing (+0.047) and fake shop (+0.023), but goes slightly negative for investment and pyramid scheme. That’s interesting: the urgency framing in phishing and flash sale language triggers both conflict monitoring and rational evaluation simultaneously, which is exactly what makes them effective. Your brain notices the conflict and tries to reason through it — that’s the manipulation working as intended.

TPJ (temporo-parietal junction — theory of mind, social cognition) activates specifically for investment (+0.047) and fake shop (+0.039). The pyramid scheme TPJ is flat at 0.000. I expected pyramid to show the strongest TPJ signal given its explicit social network framing, but the model disagrees — or rather, predicts that the brain doesn’t engage social cognition for it. Make of that what you will.

Figure 9: Predicted brain activation — text path, EN corpus. All four scam types. Warmer colours = higher predicted activation.

Figure 10: Mean activation per brain region — text path (blue) vs screenshot path (orange), EN corpus.

What the Screenshot Path Showed (The Surprise)

I expected the screenshot path to add to the text path signal — stack visual trust cues on top of semantic manipulation. That’s not what happened.

Message Type	dlPFC	ACC	Insula	Visual Cortex	TPJ
Phishing (screenshot)	−0.038	−0.050	−0.016	−0.136	−0.045
Investment (screenshot)	−0.033	−0.049	−0.014	−0.143	−0.046
Fake Shop (screenshot)	−0.035	−0.049	−0.015	−0.148	−0.049
Pyramid Scheme (screenshot)	−0.006	+0.005	+0.045	−0.093	+0.030

The screenshot path suppresses activation for three of the four scam types. dlPFC goes negative (−0.006 to −0.038). ACC goes negative for phishing, investment, and fake shop (−0.049 to −0.050) but flips slightly positive for pyramid (+0.005). And the visual cortex — the region you’d most expect to fire when processing a visual — gets hit the hardest across all conditions: −0.093 to −0.148.

That’s the counterintuitive result: showing the brain a WhatsApp screenshot reduces visual cortex activation relative to baseline.
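"Relative to baseline" is doing a lot of work in that sentence. A minimal sketch of what it means, assuming the reported values are scam-minus-legitimate differences as the glossary's differential activation map entry describes; the four-vertex maps are toy numbers, not experiment output:

```python
def differential_map(scam, baseline):
    """Per-vertex difference: positive means more predicted activation for the scam."""
    if len(scam) != len(baseline):
        raise ValueError("maps must cover the same vertices")
    return [s - b for s, b in zip(scam, baseline)]

# Toy maps with 4 vertices; real maps have ~20,000
baseline = [0.20, 0.10, 0.05, 0.30]
scam     = [0.05, 0.12, 0.01, 0.10]

diff = differential_map(scam, baseline)

# Share of vertices where the scam is predicted to activate less than the baseline
suppressed_share = sum(d < 0 for d in diff) / len(diff)
print(suppressed_share)
```

A negative value therefore does not mean the region is "off"; it means the legitimate message is predicted to drive it more.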

Figure 11: Predicted brain activation — screenshot path, EN corpus. Note the broad suppression (cooler colours) vs Figure 9.

Figure 12: Same phishing message — text path (left) vs screenshot path (right). Note dlPFC activation on left, broad suppression on right.

The one exception is the pyramid scheme insula response: +0.045, the only positive insula value in the screenshot path, and the largest insula value in the entire EN dataset. The insula encodes visceral risk signals — disgust, gut-level wrongness. Something about the visual presentation of the pyramid pitch specifically triggers that signal. The other scam types don’t. Whether that’s the particular visual structure I used for the rendering or something genuinely specific to multi-level recruitment imagery, I can’t say from n=1. But it’s the sharpest single anomaly in the data.

This directly contradicts what I expected in Part 1 — that pyramid scheme messages would show the most ambiguous signature, closest to legitimate. For the text path, that holds: pyramid scheme does show the lowest dlPFC response (+0.024, vs +0.074 for fake shop). But visually, it’s the most distinctive condition in the entire dataset. The prediction was half right: the words look almost legitimate; the visual presentation doesn’t.

Figure 13: Pyramid scheme screenshot (left) vs phishing screenshot (right). Insula activation visible in pyramid condition only.

Why the Visual Encoder Predicts Less Activation for Scam Screenshots

Worth being precise about what this result actually means before interpreting it.

Path B feeds a screenshot to V-JEPA2 — a video understanding model. V-JEPA2 processes pixels, not text. The words inside the WhatsApp bubble are never linguistically decoded in this path. The visual encoder is comparing: what does a scam screenshot look like versus what does a legitimate shipping notification screenshot look like — purely as visual patterns.

The result: TRIBE v2 predicts lower visual cortex activation for the scam screenshots than for the legitimate baseline. Not higher — lower. The scam UI, rendered in a familiar messaging interface, doesn’t produce a visually distinctive or novel pattern relative to a normal message. V-JEPA2 sees something that looks visually routine.

One interpretation: scam designers who wrap their content in standard UI templates (WhatsApp bubbles, SMS notification frames) are, deliberately or not, producing visual stimuli that a visual processing system treats as unremarkable. There’s no visual novelty for the encoder to flag. Whether this translates to reduced human attention is a hypothesis the data suggests but doesn’t prove — that would require actual eye-tracking or fMRI studies with real participants.

What the text path shows in contrast: the same scam content, stripped of UI context, predicted to drive dlPFC engagement. The words alone carry the manipulative signal. The UI wrapping, at least visually, does not add to it — it obscures it.

ROI	Phishing (text)	Investment (text)	Fake Shop (text)	Pyramid (text)	Phishing (screenshot)	Investment (screenshot)	Fake Shop (screenshot)	Pyramid (screenshot)
dlPFC	+0.053	+0.061	+0.074	+0.024	−0.038	−0.033	−0.035	−0.006
ACC	+0.047	−0.007	+0.023	−0.013	−0.050	−0.049	−0.049	+0.005
Insula	+0.012	−0.005	+0.023	−0.006	−0.016	−0.014	−0.015	+0.045
Visual Cortex	+0.007	−0.106	−0.036	−0.096	−0.136	−0.143	−0.148	−0.093
TPJ	+0.028	+0.047	+0.039	0.000	−0.045	−0.046	−0.049	+0.030

Figure 14: ROI activation table — EN corpus. Bold = highest activation (text path) and strongest suppression (screenshot path).
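Two of the claims in this section can be read straight off the table: which scam types are negative in every ROI on the screenshot path, and how large the text-versus-screenshot gap in dlPFC is. A short sketch using the published values (dictionary keys are my own labels):

```python
# Screenshot-path ROI values from the table above (EN corpus)
SCREENSHOT = {
    "phishing":   {"dlPFC": -0.038, "ACC": -0.050, "Insula": -0.016, "Visual": -0.136, "TPJ": -0.045},
    "investment": {"dlPFC": -0.033, "ACC": -0.049, "Insula": -0.014, "Visual": -0.143, "TPJ": -0.046},
    "fake_shop":  {"dlPFC": -0.035, "ACC": -0.049, "Insula": -0.015, "Visual": -0.148, "TPJ": -0.049},
    "pyramid":    {"dlPFC": -0.006, "ACC": 0.005,  "Insula": 0.045,  "Visual": -0.093, "TPJ": 0.030},
}
# Text-path dlPFC values from the same table
TEXT_DLPFC = {"phishing": 0.053, "investment": 0.061, "fake_shop": 0.074, "pyramid": 0.024}

# Scam types whose screenshot values are negative in all five ROIs
fully_suppressed = [s for s, rois in SCREENSHOT.items() if all(v < 0 for v in rois.values())]

# Text minus screenshot, per scam type, for dlPFC
dlpfc_gap = {s: round(TEXT_DLPFC[s] - SCREENSHOT[s]["dlPFC"], 3) for s in TEXT_DLPFC}

# Range of visual cortex values on the screenshot path
visual = [rois["Visual"] for rois in SCREENSHOT.values()]

print(fully_suppressed)
print(dlpfc_gap)
print(min(visual), max(visual))
```

Pyramid scheme is the only type left out of `fully_suppressed`, which matches the "three of the four" wording earlier in the section.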

The Cross-Language Finding (The Most Actionable Result)

TRIBE v2 claims zero-shot cross-lingual generalization (the ability to work in languages it was never explicitly trained on). The experiment tests that claim with an adversarial use case: do Japanese scam texts produce similar brain maps to English ones?

Message Type	Text Path r	Screenshot Path r
Phishing	0.604	0.994
Investment	0.911	0.998
Fake Shop	0.592	0.995
Pyramid Scheme	0.610	0.983

The text path cross-language correlation is moderate — 0.592 to 0.911. The screenshot path is near-perfect: 0.983 to 0.998 across all four scam types.
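For reference, here is a sketch of how an r value like these can be computed, assuming it is a Pearson correlation taken across vertices of the English and Japanese maps for the same scam type. That is my reading of the metric; the toy maps below are invented to show the two extremes:

```python
from math import sqrt

def pearson_r(map_a, map_b):
    """Pearson correlation between two activation maps defined on the same vertices."""
    if len(map_a) != len(map_b):
        raise ValueError("maps must have the same number of vertices")
    n = len(map_a)
    mean_a = sum(map_a) / n
    mean_b = sum(map_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant map")
    return cov / sqrt(var_a * var_b)

en_map = [0.1, 0.2, 0.3, 0.4]
ja_same_shape = [2 * v for v in en_map]   # same spatial pattern, larger amplitude
ja_reversed = list(reversed(en_map))      # opposite spatial pattern

r_same = pearson_r(en_map, ja_same_shape)
r_opposite = pearson_r(en_map, ja_reversed)
print(round(r_same, 3), round(r_opposite, 3))
```

Note that r is insensitive to overall amplitude: two maps can correlate almost perfectly while one is twice as strong, so a high r says the spatial pattern matches, not that the magnitudes do.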

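This makes sense structurally. Visual UI patterns (WhatsApp chat bubbles, sale banners, notification frames) are language-agnostic by design, and the visual encoder processes pixels rather than reading the characters inside them. The same visual template that works in English works in Japanese, Arabic, and Hindi because the UI conventions are global. The model’s predicted response to familiar UI structure is therefore nearly identical across the two languages; whether real brains respond as uniformly is untested here.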

Text is different. Japanese and English activate overlapping but distinct language processing networks. The semantic content of “URGENT: your account has been compromised” in English versus “緊急：アカウントが侵害されました” in Japanese produces correlated but not identical predicted activation patterns — hence r ≈ 0.60 for phishing and fake shop.

One notable number: the Japanese phishing text path produces a dlPFC activation of 0.124 — more than double the English equivalent at 0.053. That’s the highest single dlPFC value in the entire experiment. Japanese phishing text triggers the strongest predicted prefrontal engagement of any condition tested. Whether that reflects something specific to Japanese-language urgency framing or a TRIBE v2 artifact from its training data distribution, I don’t know. But it’s worth flagging.

Figure 15: Phishing text path — EN (left) vs JA (right). r = 0.604. Divergence visible in left-hemisphere language regions.

Figure 16: Phishing screenshot path — EN (left) vs JA (right). r = 0.994. Near-identical suppression pattern across both languages.

What This Means for Scam Detection

Four practical implications:

1. Text and visual signals carry different information — and current detectors only read one. NLP-based scam filters catch urgency words, too-good-to-be-true patterns, spoofed sender names. They operate on semantic content. What this experiment suggests — and it’s a hypothesis, not a proof — is that the visual encoding of a message carries a separate signal: how visually distinctive or routine the presentation looks. A scam wrapped in a standard UI template may be visually indistinguishable from a legitimate message even when the text is clearly manipulative. Detection systems that only analyse text are not seeing what the visual encoder sees.

2. Visual trust signals are a language-universal attack surface. The r ≈ 0.99 cross-language correlation on the screenshot path tells you that a scam template designed in one language ports to any other with near-zero friction. The visual attack is already global. Defending against it needs to be global too, which means UI-fingerprinting and brand impersonation detection that operates on visual structure, not just text content.

3. dlPFC suppression may be the key neural signature to look for. If the goal is to build models that predict susceptibility rather than just flag known patterns, the variable to track is probably prefrontal engagement — not amygdala activation (which, notably, showed near-zero values throughout this experiment). Fear isn’t the primary mechanism TRIBE v2 predicts. Cognitive load suppression is.

4. Audio-delivered scams may be the most dangerous channel, and this experiment accidentally suggests why. Path A is not purely a “text” path. TRIBE v2 converts the input text to speech via TTS before processing it through the language and audio encoders. That means Path A is actually predicting how the brain responds to a spoken version of the message, and its predicted prefrontal engagement exceeds Path B’s for every scam type (dlPFC is positive in all four text-path rows and negative in all four screenshot-path rows). This is directionally consistent with what scam researchers observe in the field: voice-based scams (vishing calls, WhatsApp audio notes, robocalls) tend to have higher victim conversion rates than text-based ones. The experiment’s Path A is synthetic speech with no emotional tone; a real scammer’s voice adds urgency, fear, and social pressure on top. If neutral TTS already predicts stronger cognitive engagement than a visual screenshot, real audio scams likely widen that gap further. Detection systems that don’t analyse audio are missing what may be the highest-impact channel.

Caveats

This is an in-silico (computer simulation) experiment. TRIBE v2 is a model trained to predict group-average fMRI responses from a specific population under controlled conditions. It is not measuring real brain activity — it’s a proxy that correlates reasonably well with measured fMRI data in validation studies, but “reasonably well” is not “ground truth.”

The corpus is synthetic. I wrote these messages for the experiment; they are not drawn from real scam campaigns. Real scams are evolved and optimized; synthetic examples may under- or over-represent specific manipulation patterns.

The n is small: five message types, two languages, one model run. No statistical significance testing is meaningful here. The cross-language correlations and ROI values are observations, not generalizable findings. They suggest hypotheses worth testing properly.

The visual encoder is a video model running on still images. V-JEPA2 ViT-Giant (Meta’s video AI) was designed for video clips with motion and temporal dynamics. The screenshot path feeds it the same static frame repeated 16 times — a workaround, not an ideal input. A static image encoder like DINOv2 would be more appropriate for screenshots. That said, swapping it isn’t possible without retraining TRIBE v2 from scratch, since the whole model learned to map V-JEPA2 features to brain activations. Worth noting: a vision-language model (one that can actually read text inside images) would have been more powerful for screenshots, but would collapse the clean separation between Path A and Path B — the two paths would both “know” the words, and the comparison would lose its meaning.
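The repeated-frame workaround is simple to picture in code. This is a sketch of the idea only, using nested lists for a tiny grayscale image; the `still_to_clip` helper is hypothetical and the real pipeline's tensor layout and preprocessing are not shown in the article:

```python
def still_to_clip(frame, num_frames=16):
    """Build a clip by repeating one frame; each copy is independent of the others."""
    return [[row[:] for row in frame] for _ in range(num_frames)]

# Toy 2x2 grayscale "screenshot"
frame = [[0, 255], [128, 64]]
clip = still_to_clip(frame)

# Every frame is identical, so frame-to-frame change is zero everywhere
max_change = max(
    abs(a - b)
    for prev, nxt in zip(clip, clip[1:])
    for row_prev, row_next in zip(prev, nxt)
    for a, b in zip(row_prev, row_next)
)
print(len(clip), max_change)
```

The zero in `max_change` is the limitation in miniature: a video model gets a clip with no motion or temporal signal to work with.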

What TRIBE v2 does well: provide a computational proxy for neural processing that can be applied at scale, without recruiting human subjects, and with consistent methodology across languages and modalities. That’s genuinely useful for hypothesis generation — which is what this experiment is.

What’s Next

The next step is rendering higher-fidelity screenshot mockups — the current ones are functional but basic. A more realistic WhatsApp UI with sender photos, read receipts, and conversation history context might shift the visual cortex suppression values meaningfully. I want to test whether increasing visual authenticity increases suppression (more familiar = less attention) or decreases it (more complex scene = more visual processing load).

I’m also planning to run the ROI extraction against TRIBE v2’s subcortical predictions — the near-zero amygdala and nucleus accumbens values in this experiment could be a true finding (scams don’t primarily operate through limbic fear/reward) or a limitation of the model’s cortical focus. Worth separating those two explanations.

The YouTube video covering this series is in progress. The experiment notebook — corpus creation, inference pipeline, visualization, ROI analysis — will be open-sourced once cleaned up. Drop a comment or reach out if you want early access.

When Meta released TRIBE v2, I kept asking myself: can a brain-encoding AI tell scam messages apart from legitimate ones? I finally ran the experiment. It turned into a two-part series, a Colab notebook, and more follow-up questions than answers — which is exactly what I was hoping for. If you’re a neuroscientist, an ML researcher, or someone who works in fraud detection and see something worth challenging here — I’d genuinely like to hear it.

Part 1 opened with: “What if you could watch, in real time, what a scam message does to someone’s brain?”

The honest answer after running this experiment: you can’t watch — not yet, not with this. What you can do is run a computational proxy that predicts what a population-average brain might do, and look for patterns in those predictions. The patterns were there. They weren’t always the patterns I expected. dlPFC, not amygdala. Suppression, not amplification. Near-perfect visual universality across languages.

That’s worth something. Not proof. A starting point.

Glossary

fMRI (functional Magnetic Resonance Imaging) — A brain scanning technique that measures blood oxygen levels as a proxy for neural activity. When neurons fire, they demand more oxygen, and fMRI detects the resulting change in blood flow. It produces 3D maps of which brain regions are active at a given moment — but it’s slow (one scan every 1–2 seconds) and expensive.

Brain encoding model — A machine learning model trained to predict fMRI brain activity from a stimulus (text, audio, or video). Instead of putting a person in a scanner, you feed the stimulus to the model and it estimates what the brain would do. TRIBE v2 is this kind of model — trained on 451 hours of real fMRI data, then used to make predictions on new inputs.

Brain activation map / fMRI activation map — A visualization showing which parts of the brain are predicted to be more or less active in response to a specific stimulus. Warmer colours (red/yellow) = more activation. Cooler colours (blue) = less activation or suppression relative to baseline. In this experiment, all maps are predicted, not measured.

fsaverage5 cortical mesh — A standardized 3D model of the human brain surface used in neuroscience to compare data across individuals. “fsaverage” is an average brain; “5” refers to the resolution level (~20,484 surface points). TRIBE v2 outputs predictions at each of these ~20,000 points, which is how you get a full brain map.

Region of interest (ROI) — A specific brain area you’ve decided to measure in advance because you have a hypothesis about it. Rather than sifting through all 20,000+ brain points, you define ROIs (e.g., “prefrontal cortex”) and compute the average activation there. This experiment uses seven ROIs: dlPFC, ACC, insula, visual cortex, TPJ, amygdala, and nucleus accumbens. The first five are cortical; amygdala and nucleus accumbens are subcortical and came out near-zero in TRIBE v2’s predictions.

Hemodynamic response — The blood flow change that follows neural activity, which is what fMRI actually detects. It peaks about 5–6 seconds after the neuron fires, which is why TRIBE v2 offsets its predictions by 5 seconds — to account for this lag between “neuron fires” and “scanner detects it.”

Group-average prediction — TRIBE v2 was trained on data from 25 subjects. Its output is a prediction of how the average brain across those subjects would respond — not any individual’s brain. Individual brains vary significantly; the group average smooths this out and is often more reliable than any single subject’s scan.

dlPFC (dorsolateral prefrontal cortex) — The brain’s cognitive control engine. Handles working memory, goal maintenance, and conflict resolution — the mental work of evaluating something that doesn’t add up. When dlPFC fires hard, it means the brain is working to assess a situation critically. Top-activated ROI for all four scam types in the text path of this experiment.

ACC (anterior cingulate cortex) — A brain region involved in detecting conflict between competing responses and processing urgency signals. If something feels wrong but you’re being pushed to act fast, the ACC is firing. Co-activates with dlPFC for phishing and fake shop text, but goes negative in the screenshot path for three of four scam types.

Insula — A brain region deep in the cortex associated with interoception (sensing internal body states), disgust, and visceral risk signals. When something triggers a “gut feeling” of wrongness, the insula is often involved. In this experiment, the pyramid scheme screenshot produced the only positive insula response in the screenshot path (+0.045) — the sharpest single anomaly in the dataset.

Visual cortex — The primary region at the back of the brain that processes visual information — shapes, colours, motion, spatial layout. Expected to activate strongly for visual stimuli. Counterintuitively, it suppressed in the screenshot path for all scam types (−0.093 to −0.148), suggesting familiar UI templates don’t produce visually distinctive patterns.

TPJ (temporo-parietal junction) — A brain region involved in theory of mind — the ability to model other people’s intentions and perspectives. Relevant for social manipulation (does the sender want something from me?). Shows up positively for investment (+0.047) and fake shop (+0.039) in the text path, but is flat for pyramid scheme.

Amygdala — A subcortical structure (deep in the brain, below the cortex) strongly associated with fear, threat detection, and emotional learning. Expected to activate for phishing — but near-zero throughout this experiment. TRIBE v2 was trained primarily on cortical (surface) data, so its subcortical predictions are unreliable. Fear may not be the primary cognitive mechanism here — or the model simply can’t measure it.

Nucleus accumbens — A subcortical structure central to reward anticipation and dopamine-driven motivation. Expected to activate for investment scams. Like the amygdala, came out near-zero — same TRIBE v2 coverage caveat applies.

Text path vs screenshot path — The two input routes in this experiment. The text path feeds the raw message words to TRIBE v2’s language encoders (LLaMA + Wav2Vec-BERT), which process meaning. The screenshot path feeds a rendered image of the message to the visual encoder (V-JEPA2), which processes pixels — it never reads the words inside the image. Two different questions about the same message, answered separately.

Differential activation map — A brain map showing the difference between a scam condition and the legitimate baseline. Instead of “how does the brain respond to phishing?”, it shows “how does the brain respond to phishing differently than to a normal shipping notification?” Positive values = more activation for the scam; negative values = less.

Cross-language correlation (r) — A measure of how similar two brain maps are to each other, ranging from −1 (opposite) to +1 (identical). Compares English vs Japanese versions of the same scam type. Screenshot path r = 0.983–0.998 (near-identical). Text path r = 0.592–0.911 (correlated but language-specific differences visible). The high screenshot correlation reflects that visual UI patterns are globally uniform regardless of language.

This experiment is independent personal research, unaffiliated with the author’s employer. TRIBE v2 is used under CC BY-NC 4.0.

What Does a Scam Message Do to Your Brain? I Used Meta’s AI to Find Out

2026-04-09T00:00:00+00:00

Part 1 of 2 — The theory and the experiment design. Part 2 shows what actually happened.

When Meta released TRIBE v2 in March 2026, I couldn’t stop thinking about what it could mean for scam detection. This is me finally running the experiment. It’s a personal research project, not a peer-reviewed study — the goal is to ask interesting questions with a new tool and share what comes out. Some findings will hold up under scrutiny; others will invite challenge. Both outcomes are useful. If something here sparks a question, a doubt, or a better experiment — that’s exactly the point.

What if you could watch, in real time, what a scam message does to someone’s brain? Not metaphorically. Not “it activates fear.” I mean a high-resolution map of 29,000 brain regions lighting up as someone reads “Your account has been compromised — verify immediately” — and then see a completely different pattern when they see that same message pop up as a WhatsApp notification on their phone.

That’s now possible. Meta FAIR released TRIBE v2 in March 2026 — a foundation model that takes text, audio, or video as input and predicts how a human brain would respond to it, outputting full fMRI-resolution brain activation maps. It’s designed for neuroscience research: running virtual brain experiments without putting anyone in a scanner.

But I work in scam detection. And the moment I saw this model, I had two questions: do scam messages produce a measurably different brain signature than legitimate ones? And does a scam hack your brain through what it says — or through how it looks?

If the answer is yes, that changes how we think about detecting scams entirely.

What TRIBE v2 actually does

TRIBE v2 is a brain encoding model. You feed it a stimulus — a video clip, an audio recording, or a text passage — and it predicts how the average human brain would respond, across approximately 20,484 cortical surface points (the brain’s outer layer) and 8,802 subcortical voxels (deep brain structures below the cortex).

The architecture is a three-stage pipeline. Three frozen foundation models handle feature extraction: LLaMA 3.2-3B (Meta’s language AI) processes text, V-JEPA2 ViT-Giant (Meta’s video and image AI) processes video and images, and Wav2Vec-BERT 2.0 (an audio understanding AI) processes audio. Each modality’s features get compressed into a shared 384-dimensional space, concatenated into a 1,152-dimensional multimodal time series, and fed into a Transformer encoder with 8 layers and 8 attention heads operating over a 100-second context window. A final prediction head maps these representations onto the brain surface.
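The dimension arithmetic in that paragraph can be sketched in a few lines. This shows only the shape bookkeeping for one timestep: zeros stand in for real encoder features, the modality order is an assumption, and nothing here reflects how TRIBE v2 actually implements its projection layers:

```python
EMBED_DIM = 384                           # shared per-modality dimension from the text
MODALITIES = ("text", "video", "audio")   # concatenation order is assumed

def fuse_timestep(features):
    """Concatenate the per-modality vectors for a single timestep."""
    fused = []
    for name in MODALITIES:
        vec = features[name]
        if len(vec) != EMBED_DIM:
            raise ValueError(f"{name} features must have {EMBED_DIM} dimensions")
        fused.extend(vec)
    return fused

# Placeholder features: one 384-dim vector per modality
step = {name: [0.0] * EMBED_DIM for name in MODALITIES}
fused = fuse_timestep(step)
print(len(fused))
```

Three modalities at 384 dimensions each give the 1,152-dimensional vector the Transformer encoder receives at every timestep.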

Figure: TRIBE v2 architecture overview. Text, audio, and video inputs are processed by specialized encoders (LLaMA, Wav2Vec-BERT, V-JEPA2), fused into a shared representation, and transformed into predicted fMRI brain activation maps. Source: Meta AI Research.

The model was trained on 451.6 hours of fMRI data from 25 subjects. Its predictions of group-averaged brain responses are often more accurate than any individual subject’s actual fMRI recording. When researchers applied Independent Component Analysis (a technique for finding hidden structure in data) to the model’s final layer, it had independently discovered five canonical functional brain networks — without being told they exist.

Figure: TRIBE v2 prediction accuracy across the cortical surface. The model achieves strong correlation with actual fMRI data across most brain regions. Source: Meta AI Research.

The code and weights are open-source on GitHub and HuggingFace under CC BY-NC 4.0.

The neuroscience of deception: why this matters for scams

Here’s the foundational insight: lying is neurologically expensive.

Decades of fMRI research — most notably by Daniel Langleben at UPenn — shows that deception activates the brain very differently from truthful communication. Truth-telling is the brain’s default mode. It requires one cognitive operation: recall and report. Deception demands four simultaneous processes running in parallel:

Suppress the truthful response (prefrontal cortex)
Construct a false narrative (dorsolateral prefrontal cortex)
Monitor internal consistency — does this lie contradict my earlier lies? (anterior cingulate cortex)
Predict the listener’s response — will they buy it? (temporo-parietal junction)

This asymmetry is measurable, and it leaves fingerprints in the text itself. Studies published in Nature Scientific Reports show that deceptive text contains fewer self-references, more negative emotion words, reduced verifiable details, increased hedging, and inconsistent sentiment patterns. NLP algorithms (text analysis software) trained on these features achieve 77% detection accuracy — far exceeding trained human experts at 59%.

But here’s what gets interesting for scam detection specifically. Scams aren’t just deceptive. They’re engineered to hijack specific neural circuits:

Phishing messages target the amygdala (threat detection) and anterior cingulate (urgency/conflict monitoring) — “Your account has been compromised” triggers fear before your prefrontal cortex can apply rational evaluation.
Investment scams target the nucleus accumbens (reward anticipation) — “500% returns guaranteed” activates the same dopaminergic pathways (dopamine reward circuits) as gambling.
Fake shops exploit the prefrontal cortex (value computation, cognitive evaluation) — “90% OFF today only” creates a perceived value gap that overrides skepticism. The specific prefrontal subdivision — vmPFC (value) vs dlPFC (conflict resolution) — is something the experiment will disambiguate.
Pyramid schemes are the hardest to detect because they mimic legitimate business opportunity language — the brain activation pattern may be genuinely close to how you’d process a real business proposition.

If TRIBE v2 can predict these differential activation patterns from text and from the visual presentation of the message, we have something no scam detection system currently uses: a measure of how hard a message is trying to hack your brain — and through which channel.

Figure: Which brain regions respond most to each modality in TRIBE v2. Red = video-dominant, Green = audio-dominant, Blue = text-dominant. Note how language processing areas (blue) are distinct from visual cortex (red). This separation enables our text-vs-screenshot experiment. Source: Meta AI Research.

Figure 3: The seven brain regions tracked in this experiment, shown on a schematic lateral view. Blue = dlPFC (cognitive load). Green = ACC (urgency/conflict). Orange = insula (visceral risk). Purple = visual cortex. Pink = TPJ (social cognition). Grey = amygdala and nucleus accumbens (subcortical — near-zero in TRIBE v2’s cortex-focused predictions).

The experiment: two paths, one brain

Here’s the interesting design choice. In the real world, scam messages reach victims through two channels simultaneously: the words (semantic content) and the visual presentation (a WhatsApp bubble, an SMS notification, a social media post with a scam image). TRIBE v2’s multimodal architecture lets us separate these and ask: does a scam hack your brain through what it says, or through how it looks?

I’m going to run the same scam messages through TRIBE v2 twice — via two different input paths — and compare the brain maps.

Path A — The text path (semantic processing). Feed the raw scam message as text. TRIBE v2 auto-converts text to speech via TTS (text-to-speech), runs WhisperX (a speech timing tool) to get word-level timestamps, then processes it through LLaMA 3.2-3B (language features) and Wav2Vec-BERT (audio features). This predicts how the brain would process the meaning of the message — the semantic manipulation, the emotional trigger words, the urgency framing.

Path B — The screenshot path (visual processing). Feed a realistic screenshot of the same message — rendered as it would actually appear in WhatsApp, an SMS inbox, or a social media feed. TRIBE v2 processes this through V-JEPA2 ViT-Giant (visual features). Important: V-JEPA2 processes pixels, not text — the words inside the image are never linguistically decoded in this path. This predicts how the brain would respond to the visual presentation of the message — the UI patterns, the notification styling, the visual structure that scammers exploit.

The comparison is the story. If both paths light up emotional regions, scammers are hitting you from two directions at once. If the text path shows stronger amygdala activation but the screenshot path shows stronger visual cortex activity, it means the words do the emotional manipulation while the visual framing provides the camouflage of legitimacy. That’s a fundamentally different attack surface.

Figure 1: The dual-path design. Path A (blue) feeds raw text through language encoders — LLaMA + Wav2Vec-BERT. Path B (orange) feeds a rendered screenshot through the visual encoder — V-JEPA2 ViT-Giant, which processes pixels only and never reads the text inside the image. Both paths output predicted brain activation maps; the comparison between them is the experiment.

Input corpus:

A set of synthetic scam messages across four categories plus legitimate baselines, each prepared as both raw text and rendered screenshots. Every message is written in both English and Japanese, because TRIBE v2 claims zero-shot cross-lingual generalization and I want to test whether scam brain signatures are language-universal.

Type	Message
Legitimate	“Your package has been shipped. Expected delivery: Thursday.”
Phishing	“URGENT: Your Amazon account has been compromised. Verify your identity immediately or your account will be permanently locked.”
Investment	“Exclusive crypto opportunity — 500% returns guaranteed in 30 days. Only 12 spots remaining. Act now.”
Fake Shop	“FLASH SALE: 90% OFF authentic Ray-Ban sunglasses! Today only. Free worldwide shipping.”
Pyramid Scheme	“Join our financial freedom network. Earn $5,000/month passive income by helping 3 friends discover the same opportunity.”

For the screenshot path, each message gets rendered in a realistic messaging UI — WhatsApp-style chat bubbles, SMS notification layouts, social media post frames. The visual context matters: the same text in a WhatsApp bubble versus a plain email triggers different levels of trust and urgency.

Figure 2: The 10 rendered stimuli — 5 message types × 2 languages (EN + JA). Each processed as both raw text (Path A) and screenshot (Path B), giving 20 inference runs total.
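
The run count in the caption follows directly from the design grid. As a small sketch (the label strings and the `build_run_grid` helper are illustrative names, not part of TRIBE v2), the full set of inference runs can be enumerated like this:

```python
from itertools import product

# Labels are illustrative; they mirror the five message types in the table above.
MESSAGE_TYPES = ["legitimate", "phishing", "investment", "fake_shop", "pyramid_scheme"]
LANGUAGES = ["en", "ja"]
PATHS = ["text", "screenshot"]


def build_run_grid():
    """Return one dict per inference run: every type x language x path combination."""
    return [
        {"type": msg_type, "lang": lang, "path": path}
        for msg_type, lang, path in product(MESSAGE_TYPES, LANGUAGES, PATHS)
    ]


runs = build_run_grid()
# A stimulus is a (type, language) pair; each one is processed through both paths.
stimuli = {(run["type"], run["lang"]) for run in runs}
print(len(stimuli), "stimuli,", len(runs), "inference runs")
# prints: 10 stimuli, 20 inference runs
```

Keeping the grid explicit makes it easy to check that no condition is skipped before spending GPU time.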

Process:

Text path: Feed each message as raw text → TRIBE v2 auto-generates speech, extracts features via LLaMA + Wav2Vec-BERT → predict brain activation map
Screenshot path: Feed a rendered screenshot of the same message → TRIBE v2 extracts features via V-JEPA2 (static image repeated across frames to simulate video input) → predict brain activation map
Generate differential activation maps: scam minus legitimate baseline, separately for each path
Compare activation in regions of interest: amygdala (fear), nucleus accumbens (reward), dorsolateral prefrontal cortex (cognitive load), anterior cingulate (conflict/urgency), insula (disgust/risk), temporo-parietal junction (social cognition), visual cortex (visual processing)
Cross-path comparison: overlay text-path and screenshot-path brain maps to identify which modality drives which neural response
Repeat with Japanese translations to test cross-lingual consistency
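
The differential-map, ROI, and cross-path steps above reduce to a few array operations. The sketch below is a minimal illustration in NumPy, assuming each predicted brain map is a 1-D array with one value per cortical vertex and each ROI is a boolean mask over those vertices. The function names, the mask layout, and the synthetic data are all hypothetical stand-ins; none of them come from TRIBE v2's API or from real predictions.

```python
import numpy as np


def differential_map(scam_map, baseline_map):
    """Scam minus legitimate baseline, vertex by vertex."""
    scam = np.asarray(scam_map, dtype=float)
    base = np.asarray(baseline_map, dtype=float)
    if scam.shape != base.shape:
        raise ValueError("maps must have the same shape")
    return scam - base


def roi_means(activation_map, roi_masks):
    """Average activation inside each ROI's boolean vertex mask."""
    return {name: float(activation_map[mask].mean()) for name, mask in roi_masks.items()}


def dominant_path(text_roi, screenshot_roi):
    """For each ROI, name the path with the larger absolute differential."""
    return {
        name: "text" if abs(text_roi[name]) >= abs(screenshot_roi[name]) else "screenshot"
        for name in text_roi
    }


# Synthetic stand-ins for model output (assumption: 1,000 vertices, two toy ROIs).
rng = np.random.default_rng(0)
n_vertices = 1000
vertex_ids = np.arange(n_vertices)
roi_masks = {"dlPFC": vertex_ids < 100, "visual_cortex": vertex_ids >= 900}

# Each path gets its own legitimate baseline, as in the process list.
text_legit = rng.normal(size=n_vertices)
shot_legit = rng.normal(size=n_vertices)
text_scam = text_legit.copy()
text_scam[roi_masks["dlPFC"]] += 0.5          # toy effect injected on the text path
shot_scam = shot_legit.copy()
shot_scam[roi_masks["visual_cortex"]] += 0.3  # toy effect injected on the screenshot path

text_roi = roi_means(differential_map(text_scam, text_legit), roi_masks)
shot_roi = roi_means(differential_map(shot_scam, shot_legit), roi_masks)
print(dominant_path(text_roi, shot_roi))
# prints: {'dlPFC': 'text', 'visual_cortex': 'screenshot'}
```

Subtracting the baseline within each path, rather than across paths, keeps modality-specific offsets out of the comparison.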

What I expect to find:

The text path should show stronger predicted activation in language-processing and emotional regions — temporal cortex, amygdala, insula, prefrontal cortex. This is where the semantic manipulation is predicted to live: the fear, urgency, and reward signals that bypass rational evaluation. Whether TRIBE v2 captures subcortical regions like the amygdala depends on the model’s training coverage — cortex-focused models may not predict limbic responses reliably, which would itself be a finding worth reporting.

The screenshot path should show heavier visual cortex activation but also, and this is the interesting hypothesis, some emotional activation from the visual trust cues that scammers exploit. A message rendered in a WhatsApp bubble with a verified-looking profile picture should predict different brain responses than the same text in a suspicious-looking email. If TRIBE v2 picks up this visual-trust signal, it would support what scam researchers have known anecdotally: presentation matters as much as content.

Pyramid scheme messages should show the most ambiguous signature across both paths — closest to legitimate — which would explain why both humans and AI classifiers struggle most with this category.

And if the cross-lingual comparison shows similar brain signatures for the same scam translated into Japanese, that’s evidence that scam detection could use brain-signature features as language-agnostic signals.

Technical setup: TRIBE v2’s full encoder stack (LLaMA 3.2-3B + V-JEPA2 Giant + Wav2Vec-BERT) needs roughly 25GB of VRAM. I’ll be running this on Google Colab Pro with an A100 GPU (40GB), which handles all three encoders loaded simultaneously with room to spare. One note on the screenshot path: V-JEPA2 expects video frames, so the static screenshot gets repeated across the temporal dimension to simulate video input.
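
Turning a still image into a pseudo-video is a single array operation. Here is a minimal sketch with NumPy; the `still_to_clip` helper, the 16-frame clip length, and the 224×224 resolution are assumptions for illustration, not values taken from TRIBE v2 or V-JEPA2.

```python
import numpy as np


def still_to_clip(image, n_frames):
    """Repeat one (H, W, C) image along a new leading time axis -> (T, H, W, C)."""
    image = np.asarray(image)
    if image.ndim != 3:
        raise ValueError("expected an (H, W, C) image")
    return np.repeat(image[np.newaxis, ...], n_frames, axis=0)


# Placeholder screenshot; a real run would load the rendered PNG instead.
screenshot = np.zeros((224, 224, 3), dtype=np.uint8)
clip = still_to_clip(screenshot, n_frames=16)
print(clip.shape)  # (16, 224, 224, 3)
```

Because every frame is identical, the clip carries no motion information, so any predicted response on this path reflects spatial layout only.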

Figure: TRIBE v2 predicts brain responses across diverse regions. Solid lines show actual fMRI BOLD signals from a human subject watching a video; dashed lines show TRIBE v2’s predictions. The model captures temporal dynamics with high correlation (r = 0.77–0.85). Source: Meta AI Research.

What this could mean for scam detection

If the experiment works, the implications go beyond an interesting visualization.

Manipulation potency scoring. Current scam detectors produce a binary output: scam or not. A brain-predictive model could add a dimension: how dangerous is this scam? A message that predicts strong prefrontal engagement — the brain working hard to evaluate something that doesn’t feel right — may be more insidious than one that triggers raw fear. Whether the primary signal turns out to be cortical (prefrontal, cingulate) or subcortical (amygdala, nucleus accumbens) depends on what the model actually predicts. Part 2 will show which regions actually light up.

Adversarial red-teaming. If you can predict which message variations produce the strongest brain hijacking response, you can generate the most dangerous possible scam variants and test whether your detection system catches them. Traditional adversarial testing mutates text randomly. This mutates text toward maximum predicted neural exploitation — a far more realistic threat model.

Verdict justification. Instead of telling a user “this is likely a scam,” imagine: “This message is designed to trigger your fear response while creating artificial time pressure to bypass your critical thinking.” That’s a fundamentally different user experience — you’re not just warning them, you’re vaccinating them against the technique.

Cross-language early warning. If a scam template predicts high emotional hijacking in English but low activation in Japanese, it likely won’t be effective (or prevalent) in Japan — and vice versa. This could predict which scam types will emerge in which markets before they appear in the training data.

Coming in Part 2

I’ll run the actual experiment — both paths — share the brain activation maps side by side, and find out whether the theory holds. Does a phishing message light up different brain regions than a shipping notification? Does a WhatsApp screenshot trigger different neural responses than the raw text? Is there a universal neural signature of a scam that works across languages and modalities? And does the pyramid scheme really look like a legitimate message to your brain?

The code, Colab notebook, and all visualizations will be open-sourced.


Glossary

Region of interest (ROI) — A specific brain area you’ve decided to measure in advance because you have a hypothesis about it. Rather than sifting through all 20,000+ brain points, you define ROIs (e.g., “prefrontal cortex”) and compute the average activation there. This experiment tracks seven ROIs: dlPFC, ACC, insula, visual cortex, TPJ, amygdala, and nucleus accumbens. The first five are cortical and extracted cleanly; amygdala and nucleus accumbens are subcortical and came out near-zero in TRIBE v2’s predictions (either a genuine finding or a model coverage limitation).

dlPFC (dorsolateral prefrontal cortex) — The brain’s cognitive control engine. Handles working memory, goal maintenance, and conflict resolution — the mental work of evaluating something that doesn’t add up. When dlPFC fires hard, it means the brain is working to assess a situation critically. In this experiment, it’s the top-activated region for all four scam types via text, suggesting scam messages force cognitive engagement.

ACC (anterior cingulate cortex) — A brain region involved in detecting conflict between competing responses and processing urgency signals. If something feels wrong but you’re being pushed to act fast, the ACC is firing. It sits at the intersection of emotion and cognition.

Visual cortex — The primary region at the back of the brain that processes visual information: shapes, colours, motion, spatial layout. Expected to activate strongly for visual stimuli. Notably, it was suppressed in the screenshot path for all scam types, suggesting familiar UI templates don’t produce visually distinctive patterns.

TPJ (temporo-parietal junction) — A brain region involved in theory of mind — the ability to model other people’s intentions and perspectives. Relevant for social manipulation (does the sender want something from me?). Shows up in investment and fake shop conditions in the text path.

Amygdala — A subcortical structure (deep in the brain, below the cortex) strongly associated with fear, threat detection, and emotional learning. Conventional wisdom says phishing messages “activate fear” — but in this experiment, amygdala values were near-zero. TRIBE v2 was trained primarily on cortical (surface) data, so its subcortical predictions may not be reliable.

Nucleus accumbens — A subcortical structure central to reward anticipation and dopamine-driven motivation. Expected to activate for investment scams (“500% returns”). Like the amygdala, came out near-zero here — same TRIBE v2 coverage caveat applies.

Text path vs screenshot path — The two input routes in this experiment. The text path feeds the raw message words to TRIBE v2’s language encoders (LLaMA + Wav2Vec-BERT), which process meaning. The screenshot path feeds a rendered image of the message to the visual encoder (V-JEPA2), which processes pixels — it never reads the words inside the image. They answer different questions about the same message.

Cross-language correlation (r) — A measure of how similar two brain maps are to each other, ranging from −1 (opposite) to +1 (identical). In this experiment, it compares English vs Japanese versions of the same scam type. Screenshot path r = 0.983–0.998 (near-identical). Text path r = 0.592–0.911 (correlated but with meaningful differences). The high screenshot correlation reflects that UI visual patterns are globally uniform.
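
For readers who want to compute this kind of r themselves, a minimal sketch follows. It assumes r is the Pearson correlation between two flattened brain maps of equal size (a standard choice, though the text does not name the exact statistic), and the `map_correlation` helper and toy data are hypothetical.

```python
import numpy as np


def map_correlation(map_a, map_b):
    """Pearson r between two brain maps, compared vertex by vertex."""
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("maps must have the same number of vertices")
    return float(np.corrcoef(a, b)[0, 1])


# Toy maps: a stand-in "English" map and a noisy copy standing in for "Japanese".
rng = np.random.default_rng(42)
english_map = rng.normal(size=5000)
japanese_map = english_map + 0.1 * rng.normal(size=5000)

r_same = map_correlation(english_map, english_map)       # identical maps
r_opposite = map_correlation(english_map, -english_map)  # sign-flipped maps
r_noisy = map_correlation(english_map, japanese_map)     # similar but not identical
```

Identical maps give r = 1, sign-flipped maps give r = −1, and the noisy copy lands just below 1, which is the regime the screenshot-path values fall in.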

TRIBE v2 is used under its CC BY-NC 4.0 license for non-commercial research. TRIBE v2 figures courtesy of Meta AI Research, from “A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience” (2025).