Token Chaser
AI · Homelabs · Benchmarks
17 lab notes
Gemma4 12B vs Qwen3.5 9B | Local Head to Head
LLM Testing · June 5, 2026

In this video, I put Gemma 4 12B IT head-to-head against Qwen3.5 9B to see how these smaller local models handle real browser-based coding prompts. Both models get the same single-file HTML, CSS, and JavaScript tests:
1. iPhone Replica
2. Top-Down Car Game
3. Live Weather Dashboard
The goal is to see which model follows instructions better, builds the cleaner UI, creates the more functional project, and handles interactive coding tasks without completely falling apart. These are lighter prompts than some of the bigger model tests, but they still cover UI design, game logic, JavaScript interaction, API handling, layout, and overall polish.
#gemma4 #qwen #localAI #llm #homelab #headtohead

2 models · 6 files
Qwopus 27B vs Claude Opus 4.8 | VPS Sabotage Challenge
LLM Testing · June 3, 2026

In this video, I put Qwopus 27B up against Claude Opus 4.8 in a different kind of head-to-head test. Instead of just having both models build a single browser app, I gave each one a clean Ubuntu VPS with root access and had them deploy a full web project from scratch. They had to SSH in, install Nginx, set up a site on port 80, build a homepage with system info, create a server dashboard, and make a playable browser game.
Then things got a little more interesting. After both models finished their builds, I had them connect to each other’s VPS and sabotage the opponent’s dashboard in a controlled way. After that, each model had to troubleshoot and repair its own broken site without using backups, hints, or sabotage notes.
This test is meant to see how well each model can handle real-world-ish server setup, coding, deployment, debugging, and fixing something it didn’t originally break. As always, this is not a perfect scientific benchmark. It’s just a practical head-to-head to see which model handles the challenge better.

2 models
Qwen3.6 27B vs Qwen3.7 Max | Head to Head
LLM Testing · June 3, 2026

In this video, I put Qwen3.6 27B Q8XL head-to-head against Qwen3.7 Max to see how the local model compares against the newer cloud model. Both models get the same single-file HTML, CSS, and JavaScript prompts:
1. iPhone Mockup
2. Halo 2-Style FPS
3. Weather Dashboard
Qwen3.6 27B Q8XL is running locally on my GPU setup, while Qwen3.7 Max is running in the cloud. The goal is to see which model follows instructions better, builds the cleaner UI, creates the more functional project, and handles complex browser-based coding prompts without falling apart.

2 models · 6 files
Qwen3.6 27B vs Step3.7 Flash | Local AI vs OpenRouter
LLM Testing · June 2, 2026

In this video, I put Qwen3.6 27B Q6 running locally head-to-head against Step3.7 Flash through OpenRouter to see how a local model compares against a fast API model on real coding tasks. Both models were given the same prompts so we could compare speed, coding ability, creativity, interactivity, UI polish, physics, and how well each one handled single-file browser projects using HTML, CSS, and JavaScript. In this test, the models build:
• A realistic iPhone-style mobile app UI
• Sled Sketch, a Line Rider-style physics game with preset tracks
• VoltNode Hosting, a modern web hosting company website
This is not meant to be a perfect scientific benchmark. It is more of a real-world coding head-to-head to see how these models actually behave when building the same projects side by side. Let me know which model you think won and what matchup I should test next.
#Qwen #Qwen36 #StepAI #OpenRouter #LocalAI #LLM #AICoding #OpenSourceAI #RTX3090 #TokenChaser

2 models · 6 files
Qwen3.6 27B MTP vs GPT-5.5 | Local AI Head-to-Head
LLM Testing · June 1, 2026

In this video, I put Qwen3.6 27B Q8 head-to-head against GPT-5.5 to see how a local model running on consumer hardware compares against one of the strongest frontier coding models. Both models were given the same prompts at the same time so we could compare speed, coding ability, creativity, interactivity, UI polish, game logic, physics, and how well each one handled single-file HTML projects. In this test, the models build:
• Sled Sketch, a Line Rider-style physics sandbox
• Cubicle Chaos
• Smart City Emergency Command Dashboard
This is not meant to be a perfect scientific benchmark. It is more of a real-world coding head-to-head to see how these models actually behave when given the same tasks side by side. Let me know which model you think won and what matchup I should test next.
#Qwen #Qwen36 #GPT55 #LocalAI #LLM #AICoding #OpenSourceAI #RTX3090 #TokenChaser

2 models · 6 files
Qwopus3.6 35B A3B MTP vs 27B MTP | Local AI Head-to-Head
LLM Testing · May 30, 2026

In this video, I put Qwopus3.6 35B A3B MTP head-to-head against Qwopus3.6 27B MTP to see how the larger A3B MTP version compares against the smaller 27B MTP model. Both models were run locally and given the same prompts at the same time so we could compare speed, coding ability, creativity, interactivity, UI polish, and how well each one handled single-file HTML projects. In this test, the models build:
• AI Server Fleet Control Panel
• Galactic Colony Survival Game
• Monster Truck Mayhem
This is not meant to be a perfect scientific benchmark. It is more of a real-world local AI coding test to see how these models actually behave when given the same tasks side by side. Let me know which model you think won and what matchup I should test next.

2 models · 6 files
Claude Opus 4.8 vs Qwen3.6 27B: Can Local AI Keep Up?
LLM Testing · May 29, 2026

Claude Opus 4.8 just dropped, so I put it head-to-head against Qwen3.6 27B running locally to see how the newest frontier coding model compares against a local AI model on real coding tasks. Both models were given the same prompts at the same time so we could compare speed, coding ability, creativity, interactivity, UI polish, and how well each one handled more advanced single-file HTML projects. In this test, the models build:
• AI Model Testing Command Center
• Planet Chaos Simulator
• Midnight Mall FPS Survival Game
This is not meant to be a perfect scientific benchmark. It is more of a real-world coding head-to-head to see how these models actually behave when given the same tasks side by side. Let me know which model you think won and what matchup I should test next.
#Claude #ClaudeOpus #Opus48 #Qwen #Qwen36 #LocalAI #LLM #AICoding #OpenSourceAI #RTX3090 #TokenChaser

2 models · 6 files
Qwen3.6 27B Q8 vs Claude Opus 4.6 | Head-to-Head
LLM Testing · May 28, 2026

In this video, I put Qwen3.6 27B Q8XL Unsloth head-to-head against Claude Opus 4.6 to see how a local model running on consumer GPUs compares against one of the strongest frontier coding models. Both models were given the same prompts at the same time so we could compare speed, coding ability, creativity, interactivity, UI polish, and how well each one handled more advanced single-file HTML projects. In this test, the models build:
• Smart Home Energy Command Center
• Agent Automation Studio
• Tiny Planet Terraformer Game
This is not meant to be a perfect scientific benchmark. It is more of a real-world coding head-to-head to see how these models actually behave when given the same tasks side by side. Let me know which model you think won and what matchup I should test next.
#Qwen #Qwen36 #Claude #ClaudeOpus #LocalAI #LLM #AICoding #OpenSourceAI #RTX3090 #Unsloth #TokenChaser

2 models · 6 files
Qwen3 Coder Next vs Claude Opus 4.6 | Head-to-Head
LLM Testing · May 27, 2026

In this video, I put Qwen3 Coder Next head-to-head against Claude Opus 4.6 to see how a local/open model compares against one of the strongest frontier coding models. Both models were given the same prompts at the same time so we could compare speed, coding ability, creativity, interactivity, UI polish, and how well each one handled more advanced single-file HTML projects. In this test, the models build:
• AI SaaS Admin Dashboard
• Visual Workflow Builder
• Cyberpunk Delivery Arcade Game
This is not meant to be a perfect scientific benchmark. It is more of a real-world coding head-to-head to see how these models actually behave when given the same tasks side by side. Let me know which model you think won and what matchup I should test next.
#Qwen #QwenCoder #Claude #ClaudeOpus #LocalAI #LLM #AICoding #OpenSourceAI #TokenChaser #CodingAI

2 models · 6 files
Qwopus3.6 27B MTP vs Claude Opus 4.6 | Local vs Cloud Head-to-Head
LLM Testing · May 26, 2026

In this video, I’m testing Qwopus3.6 27B MTP running locally against Claude Opus 4.6 in a head-to-head AI coding challenge. Both models are given the same single-file HTML prompts to see how they handle UI design, JavaScript logic, interactivity, polish, and overall project quality. The prompts in this video:
1. Personal finance dashboard
2. Smart home automation/control dashboard
3. Zombie survival game
The goal is to see whether a local 27B MTP model can keep up with a much larger cloud model like Claude Opus 4.6 when building real browser-based projects from scratch. If you like local AI, LLM coding tests, GPUs, homelab setups, and seeing what these models can actually build, subscribe for more.
#LocalAI #Qwopus #ClaudeOpus #LLM #AICoding #RTX3090 #Homelab #TokenChaser

2 models · 6 files
Qwen3.6 27B vs Nemotron Super 3 120B | Head-to-Head
LLM Testing · May 24, 2026

In this video, I’m testing local Qwen3.6 27B against Nemotron Super 3 120B running on AWS to see how a smaller local model compares against a much larger cloud model. Both models are given the same single-file HTML coding prompts, and I’m comparing speed, UI design, instruction following, functionality, polish, and how well each project actually works. The prompts in this video:
1. AI model benchmark dashboard
2. Visual website builder
3. Mini arcade basketball game
This is a local AI vs cloud AI coding head-to-head to see whether the bigger 120B model has a clear advantage, or if local Qwen3.6 27B can still keep up. If you like local AI, LLM coding tests, GPUs, homelab setups, and seeing what these models can actually build, subscribe for more.
#LocalAI #Qwen #Nemotron #LLM #AICoding #AWS #RTX3090 #Homelab #TokenChaser

2 models · 6 files
Qwen3.6 27B vs Heretic NEO Code 27B on RTX 3090s | Head-to-Head
LLM Testing · May 24, 2026

In this video, I’m testing the default Qwen3.6 27B against the Heretic NEO Code 27B fine-tune to see which model performs better in local AI coding tasks. Both models are running locally on RTX 3090s, and I gave them the same prompts at the same time to compare speed, project quality, UI design, functionality, and how well they follow instructions. The prompts in this video:
1. Interactive weather dashboard
2. Restaurant ordering app
3. Mini social media app
This test is focused on single-file HTML projects, so each model has to handle the HTML, CSS, and JavaScript in one file while trying to create a polished, fully working result. If you like local AI, LLM coding tests, GPUs, homelab setups, and seeing what these models can actually build, subscribe for more.
#LocalAI #Qwen #LLM #AICoding #RTX3090 #Homelab #TokenChaser

2 models · 6 files
Qwopus3.6 27B vs Qwen3.6 27B on RTX 3090s | Head-to-Head
LLM Testing · May 23, 2026

In this video, I’m testing Qwopus3.6 27B v2 against the default Qwen3.6 27B to see if the Claude Opus-style reasoning fine-tune actually helps with coding. Both models are running locally on my AI server using the same Q6_K quant, so the goal is to compare the model behavior as fairly as possible. I gave them the same single-file HTML prompts and tested how well they handled planning, UI design, game logic, and finishing usable browser projects. The prompts in this video:
1. Simple tower defense game
2. Xbox 360-inspired dashboard with a mini game
3. Top-down driving game
Qwopus is basically Qwen3.6 27B under the hood, but fine-tuned with Claude Opus-style reasoning data. I wanted to see if that makes it better at structured coding tasks compared to the default model. If you like local AI, LLM coding tests, GPUs, homelab setups, and seeing what these models can actually build, subscribe for more.
#LocalAI #Qwen #Qwopus #LLM #AICoding #3090 #Homelab #TokenChaser

2 models · 6 files
Qwen3.6 27B vs 27B MTP on RTX 3090s | Head-to-Head
LLM Testing · May 23, 2026

In this video, I put Qwen3.6 27B and Qwen3.6 27B MTP head to head on RTX 3090s to see how the regular model compares against the MTP version. Both models were run locally on my AI server using separate 3090 GPUs, and I tested them with the exact same single-file HTML prompts at the same time to compare speed, quality, creativity, interactivity, and how well each model followed instructions. Prompts included:
• Interactive lava lamp
• Mini aquarium simulator
• Tower defense game
This is not a perfect scientific benchmark, but it is a real-world side-by-side test to see if the MTP version actually feels faster or better when building creative, interactive browser-based projects from scratch.
Models tested:
• Qwen3.6 27B
• Qwen3.6 27B MTP
Hardware:
• RTX 3090 GPUs
• Local AI server setup
• Running both models side by side
Let me know which model you think won and what matchup I should test next.
#Qwen #Qwen36 #LocalAI #LLM #AIModels #OpenSourceAI #CodingAI #RTX3090 #Homelab #TokenChaser #MTP

2 models · 6 files
Gemma4 31B vs Qwen3.6 27B | Local Head-to-Head
LLM Testing · May 21, 2026

In this video, I put Gemma 4 31B head to head against Qwen3.6 27B to see which model performs better on real interactive coding tasks. Both models were tested with the same 3 single-file HTML prompts:
• iOS-style web UI with apps, a game, theme switching, and a surprise feature
• Full Tetris game with sound effects, scoring, leaderboard, and AI autoplay
• Interactive cyberpunk stock trading dashboard with simulated live data
The goal was to compare speed, design quality, instruction following, interactivity, game logic, UI polish, and how usable the final outputs actually were. This is not a perfect scientific benchmark. It is a real-world side-by-side test to see how these models perform when building creative browser projects from scratch.
Models tested:
• Gemma 4 31B
• Qwen3.6 27B
Let me know which model you think won and what matchup I should test next.
#Gemma4 #Qwen #Qwen36 #LocalAI #LLM #AIModels #OpenSourceAI #CodingAI #RTX3090 #Homelab #TokenChaser

2 models · 6 files
Qwen3.6 27B Q8 vs Claude Sonnet 4.6 | Head-to-Head
LLM Testing · May 21, 2026

In this video, I put Qwen3.6 27B Q8 head to head against Claude Sonnet 4.6 to see how a local open model compares against one of the big cloud models. Both models were tested with the same 3 single-file HTML coding prompts:
• Tiny ant colony simulator
• Neon drone arena game
• Top-down neon racing game
The goal was to compare how each model handled real interactive coding tasks, including game logic, movement, animations, UI polish, physics, responsiveness, and whether the final output was actually fun to use. This is not meant to be a perfect scientific benchmark. It is a real-world side-by-side test to see how these models perform when asked to build creative, interactive browser projects from scratch.
Models tested:
• Qwen3.6 27B Q8
• Claude Sonnet 4.6

2 models · 6 files
Qwen3.6 27B vs 35B Unsloth on RTX 3090s | Head-to-Head
LLM Testing · May 20, 2026

In this video, I put the Unsloth Qwen3.6 27B and Qwen3.6 35B-A3B models head to head on RTX 3090s to see how they handle the exact same coding prompts at the same time. Both models were run locally on my AI server using separate 3090 GPUs, and I tested them with single-file HTML prompts to compare speed, quality, creativity, and how well each model followed instructions. Prompts included:
• Interactive cloth simulator
• Liquid slime/blob mouse effect
• Neon glowing hex grid honeycomb with particles
This is not a perfect scientific benchmark, but it is a real-world side-by-side test to see which model feels better for local coding, creative UI generation, and browser-based interactive projects.
Models tested:
• Unsloth Qwen3.6 27B
• Unsloth Qwen3.6 35B-A3B
Hardware:
• RTX 3090 GPUs
• Local AI server setup
• Running both models side by side

2 models · 6 files