Neural net CAPTCHA solver. MobileNetV2 + OpenCV. Built for the hell of it.
Takes a reCAPTCHA image grid, splits it into cells, classifies each cell with a MobileNetV2 ImageNet classifier, and tells you which ones match the prompt. Also works live on Playwright pages.
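The pipeline above can be sketched in a few lines. This is a minimal illustration, not the repo's actual code: the grid split is plain array slicing, and the MobileNetV2 classifier is stubbed as a `classify` callable (cell → list of labels), since loading pretrained weights is out of scope here.

```python
import numpy as np

def split_grid(image, rows=3, cols=3):
    """Split a reCAPTCHA grid image (H x W x 3 array) into rows*cols cells,
    row-major, matching the on-page tile order."""
    h, w = image.shape[:2]
    ch, cw = h // rows, w // cols
    return [image[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            for r in range(rows) for c in range(cols)]

def matching_cells(image, prompt_labels, classify):
    """Return indices of cells whose predicted labels overlap the prompt.
    `classify` stands in for MobileNetV2 top-k inference on one cell."""
    return [i for i, cell in enumerate(split_grid(image))
            if set(classify(cell)) & set(prompt_labels)]

# usage: a blank 300x300 grid splits into nine 100x100 cells
img = np.zeros((300, 300, 3), dtype=np.uint8)
cells = split_grid(img)
print(len(cells), cells[0].shape)  # 9 (100, 100, 3)
```

In the live Playwright case, `image` would come from a screenshot of the grid element, and the matched indices map back to tiles to click.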
I went all in and wrote a custom FP4 GEMM kernel on top of CUTLASS 3.8. Along the way I discovered FP4 doesn’t actually help training - no backward pass. But what came out of it is something I haven’t seen anywhere else for consumer Blackwell: a standalone FP4 GEMM library with a pre-quantized weight cache that hits 85-129 TFLOPS on the Spark.
Quantize weights once at model load, only quantize activations on the fly per call. Integrated into a full transformer (GPT-OSS-4.2B, 24 layers, 288 GEMM calls per forward pass), it runs 1.3-2.3x faster than BF16 at inference-relevant batch sizes with 4x memory savings. Tested on both 4.2B and 20B models - the 20B drops from 43.4 GB to 4.0 GB with FP4 weights (10.8x compression). No dependency on vLLM, TRT-LLM, or sglang - just a library you can call from any Python code.
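The weight-cache pattern is the core idea: pay the weight quantization cost once at load, and only quantize activations per call. Here is a hypothetical numpy sketch of that pattern; `FP4Linear` and `quantize_fp4` are illustrative names, the real library dispatches to a CUTLASS FP4 GEMM on the GPU rather than simulating it on the CPU. The magnitude table is the standard FP4 E2M1 set.

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1)
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x):
    """Per-tensor FP4 simulation: scale so the largest magnitude maps to 6,
    then snap every value to the nearest E2M1 level."""
    scale = float(np.abs(x).max()) / 6.0
    if scale == 0.0:
        scale = 1.0
    mag = np.abs(x) / scale
    idx = np.abs(mag[..., None] - FP4_LEVELS).argmin(axis=-1)
    return np.sign(x) * FP4_LEVELS[idx], scale

class FP4Linear:
    """Weight-cache pattern: quantize weights once at model load,
    quantize activations on the fly per call."""
    def __init__(self, weight):
        self.wq, self.ws = quantize_fp4(weight)   # once, at load

    def __call__(self, x):
        xq, xs = quantize_fp4(x)                  # every call
        # fold both scales back into the output to dequantize
        return (xq @ self.wq.T) * (xs * self.ws)

rng = np.random.default_rng(0)
layer = FP4Linear(rng.standard_normal((8, 16)))   # [out, in]
y = layer(rng.standard_normal((2, 16)))           # [batch, in] -> [batch, out]
```

The per-call cost is only the activation quantization; the expensive weight pass never repeats, which is why the pattern pays off across 288 GEMM calls per forward pass.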
Full source is open at https://github.com/VincentKaufmann/fp4-cuda-kernel: custom FP4 GEMM kernel for DGX Spark / RTX 50 Series (SM120/SM121), 143 TFLOPS, 5-9x faster than BF16, built on CUTLASS 3.8.
Why This Library Exists
No existing path gives you hardware FP4 on SM121 as a standalone library.
GPT-OSS-20B-Vision: First community VLM for GPT-OSS, trained on a single DGX Spark
A couple weeks ago I shipped an MCP server (noapi-google-search-mcp) and people in the community challenged me to do something harder - build a VLM. So I bought a DGX Spark, flew to Dubai, and built the first vision-language model for GPT-OSS from a hotel room. Just a Spark, hotel WiFi and stubbornness.
This is an early proof of concept at 22% of the planned training run - shipped to show what's possible and to find compute partners to finish the job.
What it does: Adds vision to GPT-OSS-20B. Takes an image + text prompt, generates coherent descriptions. Identifies objects, scenes, spatial relationships. Vision was trained directly into the model through QLoRA adaptation - the LLM learned to see, not just pass through visual tokens. All original text capabilities are fully preserved. Hallucinations present - expected at this training stage.
How it works: A SigLIP vision encoder feeds into the 20B MoE language model through a method I call PseudoDeepStack - extracting visual features from multiple encoder depths instead of just the final layer. Richer visual representations at zero additional inference cost.
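The multi-depth idea can be sketched as follows. This is a hypothetical illustration, not the released code: the tap layers, dimensions, and the single-projection fusion are assumptions. The point it demonstrates is that the token count never grows - only the channel dimension widens before one projection - which is why the richer representation costs nothing extra at inference.

```python
import numpy as np

def pseudo_deep_stack(hidden_states, tap_layers, proj):
    """Sketch of multi-depth feature extraction: concatenate vision
    features from several encoder depths along the channel axis,
    then project once into the LLM embedding space."""
    taps = [hidden_states[i] for i in tap_layers]   # each [tokens, d_vis]
    fused = np.concatenate(taps, axis=-1)           # [tokens, d_vis * n_taps]
    return fused @ proj                             # [tokens, d_llm]

# dummy encoder with made-up sizes (assumptions, not SigLIP's real dims)
rng = np.random.default_rng(0)
d_vis, d_llm, n_layers, n_tokens = 64, 128, 12, 196
hs = [rng.standard_normal((n_tokens, d_vis)) for _ in range(n_layers)]
taps = (3, 6, 9, 11)
proj = rng.standard_normal((d_vis * len(taps), d_llm))

vision_tokens = pseudo_deep_stack(hs, taps, proj)
print(vision_tokens.shape)  # (196, 128)
```

A final-layer-only projector would feed the same 196 tokens to the LLM; stacking depths changes what each token carries, not how many tokens the model must process.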
Key finding: Projector-only training (the standard approach for dense VLMs) fails completely on MoE architectures. The expert routing can't handle visual tokens it's never seen. QLoRA adaptation solves this.
The setup: Single NVIDIA DGX Spark GB10, hotel room in Dubai, Domino's pizza. No cluster, no team. ~3.5 days of training to this checkpoint.
What's next: Finishing training with new hyperparameters based on what we learned from this run, scaling to GPT-OSS-120B (same projector works - shared hidden dimensions), benchmarking. Need compute to get there.
The coolest part: it's one pip install to give your local model the ability to see, run Google searches, and use News, Shopping, Scholar, Maps, Finance, Weather, Flights, Hotels, Translate, Images, Trends, and more.
https://github.com/VincentKaufmann/captcha-solver-ai
Work in progress; it's getting way better in the next release.