LOCAL MODELS / REAL GPUS

TokenChaser

Notes from local model runs on consumer GPUs: prompts, generated code, raw outputs, and live HTML files from the videos.


17 lab notes

5K subscribers

Full outputs

Latest Lab Notes


LLM Testing | June 5, 2026

Gemma4 12B vs Qwen3.5 9B | Local Head to Head

In this video, I put Gemma 4 12B IT head-to-head against Qwen3.5 9B to see how these smaller local models handle real browser-based coding prompts. Both models get the same single-file HTML, CSS, and JavaScript tests:

1. iPhone Replica
2. Top-Down Car Game
3. Live Weather Dashboard

The goal is to see which model follows instructions better, builds the cleaner UI, creates the more functional project, and handles interactive coding tasks without completely falling apart. These are lighter prompts than some of the bigger model tests, but they still cover UI design, game logic, JavaScript interaction, API handling, layout, and overall polish.

#gemma4 #qwen #localAI #llm #homelab #headtohead

2 models | 6 files

LLM Testing | June 3, 2026

Qwopus 27B vs Claude Opus 4.8 | VPS Sabotage Challenge

In this video, I put Qwopus 27B up against Claude Opus 4.8 in a different kind of head-to-head test. Instead of just having both models build a single browser app, I gave each one a clean Ubuntu VPS with root access and had them deploy a full web project from scratch. They had to SSH in, install Nginx, set up a site on port 80, build a homepage with system info, create a server dashboard, and make a playable browser game.

Then things got a little more interesting. After both models finished their builds, I had them connect to each other’s VPS and sabotage the opponent’s dashboard in a controlled way. After that, each model had to troubleshoot and repair its own broken site without using backups, hints, or sabotage notes.

This test is meant to see how well each model can handle real-world-ish server setup, coding, deployment, debugging, and fixing something it didn’t originally break. As always, this is not a perfect scientific benchmark. It’s just a practical head-to-head to see which model handles the challenge better.

2 models

LLM Testing | June 3, 2026

Qwen3.6 27B vs Qwen3.7 Max | Head to Head

In this video, I put Qwen3.6 27B Q8XL head-to-head against Qwen3.7 Max to see how the local model compares against the newer cloud model. Both models get the same single-file HTML, CSS, and JavaScript prompts:

1. iPhone Mockup
2. Halo 2-Style FPS
3. Weather Dashboard

Qwen3.6 27B Q8XL is running locally on my GPU setup, while Qwen3.7 Max is running in the cloud. The goal is to see which model follows instructions better, builds the cleaner UI, creates the more functional project, and handles complex browser-based coding prompts without falling apart.

2 models | 6 files

Resources


QuantEval

Independent benchmarking and quantization analysis of open-weights AI models.

OpenCode

Open-source AI coding agent for terminal, IDE, and desktop workflows. Useful for testing local and cloud models on real coding tasks.

llama.cpp

Run LLMs locally with GGUF models. Great for testing quantized models, CPU/GPU inference, and squeezing performance out of consumer hardware.

Hugging Face Models

Browse, download, and compare open models, fine-tunes, datasets, GGUF uploads, model cards, and community releases.

Ollama

Simple way to run local models with quick installs, easy model pulls, and a local API for apps and tools.