Token Chaser

27 lab notes

LLM Testing | July 1, 2026

Ornith 35B vs Qwen 3.6 35B | Head to Head Battle

Ornith 35B Q6 vs Qwen 3.6 35B Q6. Two Qwen-family MoE models. Same prompts. Same setup. Same pressure.

In this battle, I’ve got both models building a futuristic street-racing car OS, then expanding it into a deeper race-control interface, and finally pushing it into a live race simulator UI to see who actually holds up when complexity starts stacking. I’m not really interested in model-card chest puffing. I want to see who plans better, who keeps the design cleaner, who follows instructions, and who starts making bad decisions once the prompts get harder.

Setup:
- Ornith 35B vs Qwen 3.6 35B-A3B
- llama.cpp + llama-swap
- OpenCode side by side
- local 2x RTX 3090 rig

2 models | 2 files

LLM Testing | June 24, 2026

Qwythos 9B vs Qwen3.5 9B | Local Coding Head-to-Head

Today I’m putting Qwen3.5 9B up against Qwythos 9B, a Qwen3.5-based Mythos fine tune with a much larger context window. Both models are running locally, and I’m giving them the same web coding challenge:

Phase 1: Recreate an iPhone-style home screen UI using HTML, CSS, and JavaScript
Phase 2: Make Phone, Messages, and Music work, then add a playable Arcade game
Phase 3: Build a fancy Qwen3.7 launch website inside the phone’s Safari/browser app

This is not a scientific benchmark. It’s a real-world coding test to see which small local model can create a better-looking UI, keep features working, and hold the project together as the challenge gets more complicated. Qwen3.5 9B vs Qwythos 9B. Regular Qwen versus the Mythos fine tune. Let’s see which one handles the challenge better.

More local AI tests and projects: https://tokenchaser.net

#LocalLLM #Qwen #Qwen3 #Qwythos #WebCoding #AIcoding #OpenSourceAI #LocalAI #LLM #TokenChaser

2 models | 2 files

LLM Testing | June 18, 2026

I Made Fusion and Qwen3.6 27B Build the Same Web App

I put OpenRouter Fusion and Qwen3.6 27B head-to-head and gave them the exact same prompt: build the same web app from scratch. Same goal. Same constraints. Same phased build. Very different results. In this video, I compare how a multi-model AI committee stacks up against a single 27B model when the task is actual software delivery, not just talking about code. The project was a real web app built in phases on fresh Linux VPSes, with each model responsible for turning the prompt into something usable. This wasn’t about benchmark scores or cherry-picked one-liners. I wanted to see which one could actually plan, build, adapt, and ship. For all prompts, code outputs, and info about the video, visit: https://tokenchaser.net Drop a comment with which models you want to see go head-to-head next.

2 models

LLM Testing | June 15, 2026

Qwopus3.6-27B Coder vs Qwen3.6-27B | New Local King?

Two local 27B coding models enter. One leaves looking like the better builder. In this head-to-head, I put Qwopus3.6-27B-Coder-MTP against Qwen3.6-27B-MTP and had them build the same app across multiple rounds on separate fresh VPSes. The challenge:

Phase 1: build a shared LAN whiteboard
Phase 2: add drag-and-drop image support
Phase 3: add toggleable real-time chat
Final phase: add shapes, emojis, text, and more advanced whiteboard features

This wasn’t a one-prompt toy test. Both models had to keep the same project alive, extend it phase by phase, and not fall apart as the app got more complicated.

2 models

LLM Testing | June 12, 2026

I Made Fable 5 and Qwen3.6 27B Build the same Web App

Today I put Qwen3.6 27B and Claude Fable 5 head-to-head with the same challenge: build a real LLM benchmark dashboard on a fresh local VPS. The goal wasn’t to make a fake demo or a pretty mockup. I wanted a working product that could connect to my llama-swap endpoints, load models, run benchmark prompts, save results, and compare historical runs with real charts, stats, and benchmark data.

Both models had to:
- work from a fresh VPS
- install whatever they needed
- expose the dashboard on port 80
- build something that actually works
- turn it into a tool I could keep using later

If you’re into local AI, llama.cpp, llama-swap, coding agents, and real-world model battles, this is exactly the kind of chaos you’re here for.

2 models

LLM Testing | June 10, 2026

Claude Fable 5 vs GPT 5.5 | Head to Head Coding Battle

I put Claude Fable 5 and GPT-5.5 into a real head-to-head coding battle to see which AI could build the better project. Both models got the exact same phased challenge and had to keep building on top of the same app as new features were added. This wasn’t just about writing code fast. It was about design, usability, creativity, polish, and which model could actually hold everything together as the project kept getting bigger. The phases were simple:

Phase 1: a Windows-style desktop UI
Phase 2: a working browser and physics sandbox
Phase 3: a pseudo-3D racing game
Phase 4: more desktop apps, features, and polish

2 models | 2 files

LLM Testing | June 9, 2026

Gemma4 12B vs Gemma4 12B QAT | Real Coding Under Pressure

In this video, I put Gemma4 12B head-to-head against Gemma4 12B QAT to see which one performs better in real coding tasks. Both models went through a few live tasks to fill up their context window and see how they handled the pressure:

- VPS Dashboard Setup
- Add a Tower Defense Game to a Remote Dashboard
- Chat Client Build from Server/API Docs

This wasn’t a clean benchmark. It was more about seeing how each model handled a real VPS workflow, live setup, routing, UI changes, and multi-step build tasks under the same conditions. Gemma4 12B and Gemma4 12B QAT were both running locally through my setup and tested on the same prompts. The goal is to see which model follows instructions better, builds the cleaner UI, handles frontend and system tasks more reliably, and stays more usable as the context window gets packed.

2 models

LLM Testing | June 9, 2026

Gemma4 12B vs Gemma4 12B QAT | UI Build + Debug Battle

In this video, I put Gemma4 12B head-to-head against Gemma4 12B QAT to see which one performs better in real coding tasks. Both models get the same single-file HTML, CSS, and JavaScript tests:

- iOS UI Clone
- Weather Dashboard
- Broken Code Debug + Repair Challenge

This wasn’t just a clean code generation test. After the build rounds, both models had to deal with damaged UI code, runtime issues, broken layouts, and debugging pressure to see which one could actually recover and repair the project. Gemma4 12B and Gemma4 12B QAT are both running locally through my setup, tested on the same prompts under the same conditions. The goal is to see which model follows instructions better, builds the cleaner UI, handles frontend tasks more reliably, and recovers better when the codebase starts falling apart.

2 models | 4 files

LLM Testing | June 7, 2026

MiMo-v2.5 vs Qwen3.6 27B — This Battle Was Closer Than I Expected

In this video, I put Qwen3.6 27B Q8 head-to-head against MiMo-v2.5 to see if local AI can keep up with the frontier model. Both models get the same single-file HTML, CSS, and JavaScript tests:

- iPhone Replica with Mini Game
- Ragdoll Physics Simulator
- 3D Orbital Earth Explorer

Qwen3.6 27B Q8 is running locally on my GPU setup, while MiMo-v2.5 is tested on the same prompts to see how well it can keep up. The goal is to see which model follows instructions better, builds the cleaner UI, creates the more functional project, and handles interactive coding tasks without completely falling apart.

2 models | 7 files

LLM Testing | June 6, 2026

MiniMax-M3 vs Qwen3.6 27B | Local vs Cloud Head to Head

In this video, I put MiniMax-M3 head-to-head against Qwen3.6 27B to see how a cloud model handles coding challenges against a local model running on my own GPUs. This one had a little bit of everything: a realistic iPhone-style UI with a game and social media app, a ragdoll physics simulator, a ridiculous/funny web design challenge, and a fake Qwen3.7 30B open model release page. The point was to see which model could actually make something usable, polished, and not completely cursed.

As always, there is no perfect “fair” in these head-to-heads. I am just running the models, testing the outputs, and having fun seeing what breaks first.

2 models | 7 files

LLM Testing | June 5, 2026

Gemma4 12B vs Qwen3.5 9B | Local Head to Head

In this video, I put Gemma 4 12B IT head-to-head against Qwen3.5 9B to see how these smaller local models handle real browser-based coding prompts. Both models get the same single-file HTML, CSS, and JavaScript tests:

1. iPhone Replica
2. Top-Down Car Game
3. Live Weather Dashboard

The goal is to see which model follows instructions better, builds the cleaner UI, creates the more functional project, and handles interactive coding tasks without completely falling apart. These are lighter prompts than some of the bigger model tests, but they still cover UI design, game logic, JavaScript interaction, API handling, layout, and overall polish.

#gemma4 #qwen #localAI #llm #homelab #headtohead

Lab Notes

Ornith 35B vs Qwen 3.6 35B | Head to Head Battle

Qwythos 9B vs Qwen3.5 9B | Local Coding Head-to-Head

I Made Fusion and Qwen3.6 27B Build the Same Web App

Qwopus3.6-27B Coder vs Qwen3.6-27B | New Local King?

I Made Fable 5 and Qwen3.6 27B Build the same Web App

Claude Fable 5 vs GPT 5.5 | Head to Head Coding Battle

Gemma4 12B vs Gemma4 12B QAT | Real Coding Under Pressure

Gemma4 12B vs Gemma4 12B QAT | UI Build + Debug Battle

MiMo-v2.5 vs Qwen3.6 27B — This Battle Was Closer Than I Expected

MiniMax-M3 vs Qwen3.6 27B | Local vs Cloud Head to Head

Gemma4 12B vs Qwen3.5 9B | Local Head to Head

Qwopus 27B vs Claude Opus 4.8 | VPS Sabotage Challenge

Qwen3.6 27B vs Qwen3.7 Max | Head to Head

Qwen3.6 27B vs Step3.7 Flash | Local AI vs OpenRouter

Qwen3.6 27B MTP vs GPT-5.5 | Local AI Head-to-Head

Qwopus3.6 35B A3B MTP vs 27B MTP | Local AI Head-to-Head

Claude Opus 4.8 vs Qwen3.6 27B: Can Local AI Keep Up?

Qwen3.6 27B Q8 vs Claude Opus 4.6 | Head-to-Head

Qwen3 Coder Next vs Claude Opus 4.6 | Head-to-Head

Qwopus3.6 27B MTP vs Claude Opus 4.6 | Local vs Cloud Head-to-Head

Qwen3.6 27B vs Nemotron Super 3 120B | Head-to-Head

Qwen3.6 27B vs Heretic NEO Code 27B on RTX 3090s | Head-to-Head

Qwopus3.6 27B vs Qwen3.6 27B on RTX 3090s | Head-to-Head

Qwen3.6 27B vs 27B MTP on RTX 3090s | Head-to-Head

Gemma4 31B vs Qwen3.6 27B | Local Head-to-Head

Qwen3.6 27B Q8 vs Claude Sonnet 4.6 | Head-to-Head

Qwen3.6 27B vs 35B Unsloth on RTX 3090s | Head-to-Head