Show & Tell: TAF Agent v0.7 — 14 browser-only diagnostics for transformer LLMs (anti-bullshit pack)

karlexmarin · May 7, 2026, 7:08am

# Show & Tell: TAF Agent v0.7 — 14 browser-only diagnostics for transformer LLMs (anti-bullshit pack)

> **TL;DR** — A free, no-signup, browser-only tool that calls bullshit on common LLM-eval lies: misleading `max_position_embeddings`, silent chat-template halving in lm-eval-harness, hidden Chatbot Arena CIs, MMLU contamination priors, model-specific quant cliffs, and the NIAH-vs-reasoning gap.

>

> **Live**: TAF Agent - a Hugging Face Space by karlexmarin

> **Source**: GitHub - karlesmarin/tafagent: Transformer LLM diagnostic in your browser. Free, unlimited, auditable. · GitHub

> **Paper**: [Marin 2026 — Predicting How Transformers Attend]( Predicting How Transformers Attend Analytic Power-Law Theory, Phase Transitions, and Practical Compression Tools )

-–

## What it is

I built a single static HTML+JS page that ships **14 diagnostic modes** for transformer LLMs. The premise is simple: a lot of the things the community routinely complains about — leaderboard contamination, model-card lies, framework drift, quantization cliffs — are diagnosable from **metadata alone** (`config.json`, `tokenizer_config.json`, published vote counts), without spinning up a GPU or running inference.

Everything runs in your browser. Your inputs never leave the tab. There is no server, no signup, no telemetry. The Python tools that some modes use run via **Pyodide**; the math is deterministic.

It’s available in **EN / ES / FR / ZH** (685 i18n keys, parity-checked).

-–

## What’s new in v0.7 — the anti-bullshit pack

After surveying public HF Forum threads, GitHub issues, arxiv papers, and Reddit posts, I picked **10 community pain points** and shipped browser-only solutions for **8 of them**. (The remaining 2 — VRAM-formal-bound and pre-fine-tune forgetting forecast — are the v0.8 roadmap.)

### Unmask — does `max_position_embeddings` lie?

Paste an HF model id. The tool reads `config.json` and tells you whether the declared context is honest, **inflated** (SWA window restricts effective range), **severely inflated** (Mistral-7B-v0.1 declares 32k but attends ~4-8k), or **YaRN-extended** (factor + original-pe).

Pre-flight verdicts on real public models:

- `mistralai/Mistral-7B-Instruct-v0.3` → **HONEST** 32k (v0.3 dropped SWA; v0.1 was the SWA-confused release)

- `microsoft/Phi-3-mini-4k-instruct` → **INFLATED** (sliding-window=2047, hidden in config)

- `deepseek-ai/DeepSeek-V2.5` → **YARN-EXTENDED** (factor=40×, 4k → 163k)

### Chat-template Sniffer

Detects which template family (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / DeepSeek / Alpaca / custom / none) by reading `tokenizer_config.json` and gives you the **exact CLI flag** for `lm-evaluation-harness`, `vLLM serve`, and `transformers`. This solves [lm-eval-harness issue #1841]( Inconsistent evaluation results with Chat Template · Issue #1841 · EleutherAI/lm-evaluation-harness · GitHub ) — the one where forgetting `–apply_chat_template` silently halves multi-turn accuracy.

To my knowledge, **no other public tool diffs the apply path** and gives per-framework commands.

### Arena-Elo CI Reconstructor

[The Leaderboard Illusion]( [2504.20879] The Leaderboard Illusion ) (Apr 2025) diagnosed Chatbot Arena gaming and pointed out that public CIs are stripped. Paste a CSV of pairwise votes (`model_a, model_b, winner`) and the tool runs Bradley-Terry MLE + 200-iteration bootstrap and tells you which model pairs are **statistically tied** (CIs overlap). Has a “Load sample” button with synthetic 6-model data so you can see it work without hunting raw battle logs.

### Contamination Prior

Built-in DB of 20 popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, MMLU-Pro, GPQA, AIME 2024, BBH, MUSR, RULER, etc.). Enter your model’s training cutoff date — get a **Bayesian prior of contamination** per benchmark based on time gap, corpus inclusion, and known leak history. Llama-3.1 (cutoff 2023-12) on MMLU returns ~97% prior. Same model on AIME 2024 returns ~5%.

This complements GPU-bound detectors like CoDeC and Min-K% — it’s the **pre-flight risk score**, not a post-hoc detection.

### Quant-regime Classifier

Predicts γ-shift and ΔPPL for 10 quantization schemes (FP8, int8, GGUF Q8_0/Q5_K_M/Q4_K_M/Q3_K_M/Q2_K, AWQ, GPTQ, NF4) on a per-model basis. Architecture-aware: small d_head + aggressive GQA increases sensitivity; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4).

Pre-flight on `mistralai/Mistral-7B-Instruct-v0.3`:

- AWQ → mild (γ-shift +0.023, ΔPPL ~0.01)

- NF4 → **CLIFF** (γ-shift +0.081, ΔPPL ~0.06)

Recommends a switch when it detects a cliff.

### Cross-framework Drift Bound

Same model, different scores on different setups. Paste both with `(framework, dtype, batch, chat-template applied?)`. Tool predicts the **maximum drift admissible from numerical noise** (additive: dtype-pair penalty + framework kernel diff + batch-ratio + 0.3-pt non-determinism floor). If observed gap exceeds it → real bug, usually chat-template mismatch (most common) or KV-cache layout. References: [arxiv 2506.09501]( [2506.09501] Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference ) on FP32-on-the-fly reproducibility.

### NIAH → Reasoning Gap Predictor

[RULER paper]( [2404.06654] RULER: What's the Real Context Size of Your Long-Context Language Models? ) showed long-context models often pass needle-retrieval but fail multi-hop reasoning at the same context. The HELMET work confirmed synthetic NIAH doesn’t predict downstream. This mode predicts **both pass rates** from architecture (γ_Padé + d_horizon + arch pressure), reports the gap, and finds your model’s “safe reasoning context” where reasoning stays ≥ 65%.

Pre-flight on `meta-llama/Llama-3.1-8B-Instruct` (claimed 128k, RoPE θ=500k, GQA 8/32):

- @ 8k: NIAH 100% · Reason 94% → **ROBUST**

- @ 64k: NIAH 100% · Reason 94% → **ROBUST**

- @ 128k: NIAH 98% · Reason 92% → **ROBUST**

- Curve drops past T_train (4× extrapolation = ~30% NIAH penalty)

-–

## Formal verification

There’s a companion repo at GitHub - karlesmarin/lean-taf: Lean 4 + Mathlib formalization of TAF algebraic identities + Cv Hagedorn erratum (Marin 2026) · GitHub with **37 theorems machine-proven in Lean 4 + Mathlib4** (1973 build jobs). Identities like `β·χ = −1` (Anti-Ising closure), `D-SAGE-1` quadratic, Padé z-substitution. Each badge in the TAF Card links to the source line. Includes one substantive **finding** — a factor-2 inconsistency in the paper’s own V/β formula tables (formally proved in `V_derivative_ne_RG_beta`).

Anyone can clone + `lake build` to re-verify in ~5 seconds after Mathlib cache fetch.

-–

## Honest limits

- **It predicts; it doesn’t measure.** Verdicts are heuristics calibrated against published RULER / Grootendorst / arxiv data. For ground truth you still need a GPU.

- **Some modes use sample data** (Arena CI’s bundled 6-model fixture) because raw Arena battle logs aren’t always public.

- **Quantization predictor** is calibrated to publicly-reported PPL drops; novel architectures may sit outside the band.

- **Contamination prior** is a Bayesian prior, not a detector — pair with CoDeC/Min-K%/PaCoST when you have GPU access.

I would rather call out limitations honestly than oversell. If the tool is wrong about your model, please tell me — refutations are taken as seriously as confirmations.

-–

## Why I built this

A lot of v0.7 came from one observation: there’s a paper trail of community frustration about each of these issues, but the existing solutions (RULER, CoDeC, Min-K%, LayerCast, HELMET) are all **GPU-bound research artifacts**, not tools you reach for at 11 PM when you’re trying to decide whether to buy compute for a model. A browser-only “predict before you spend” layer felt missing.

That said — TAF Agent doesn’t replace any of those tools. It’s the pre-flight check before you bring out the heavy artillery.

-–

## How you can help

- **Falsify a verdict.** Run the tool, then run RULER / lm-eval / your downstream task. If we disagree with reality on a specific model, [open an issue]( Issues · karlesmarin/tafagent · GitHub ) with the model id + your numbers — that’s gold for calibration.

- **Suggest a benchmark for the contamination DB.** If a benchmark you care about isn’t in the 20 we cover, add it.

- **Translate.** EN/ES/FR/ZH covered; PRs welcome for more.

Built by one independent researcher with no funding, no team, and no GPUs beyond a single consumer card. The work itself belongs to the commons that made it possible.

-– Carles Marin

-–

*If you find a real bug, email me or open an issue — I treat refutations as gifts.*

Topic		Replies	Views
AI LLM model bias Intermediate	0	168	January 16, 2024
Dropping CDM — the metric that finally tells you when CoT actually works. One file, works on every model Beginners	3	32	May 23, 2026
Say goodbye to manual testing of your LLM-based apps – automate with EvalMy.AI beta! 🚀 Research	0	95	October 29, 2024
Open AI Box – Universal LLM introspection: injection points & dimension roles in any model Show and Tell	0	16	March 11, 2026
Measuring Hallucinations in LLMs Models	0	270	November 6, 2023

Show & Tell: TAF Agent v0.7 — 14 browser-only diagnostics for transformer LLMs (anti-bullshit pack)

Related topics