Neural net CAPTCHA solver. MobileNetV2 + OpenCV. Built for the hell of it.
Takes a reCAPTCHA image grid, splits it into cells, classifies each cell with a MobileNetV2 ImageNet classifier, and tells you which ones match the prompt. Also works live on Playwright pages.
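The pipeline above can be sketched in a few lines. This is a minimal illustration, not the repo's actual code: the grid split is plain array slicing, and the MobileNetV2 classifier is stubbed as a `classify` callable (cell → list of labels), since loading pretrained weights is out of scope here.

```python
import numpy as np

def split_grid(image, rows=3, cols=3):
    """Split a reCAPTCHA grid image (H x W x 3 array) into rows*cols cells,
    row-major, matching the on-page tile order."""
    h, w = image.shape[:2]
    ch, cw = h // rows, w // cols
    return [image[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            for r in range(rows) for c in range(cols)]

def matching_cells(image, prompt_labels, classify):
    """Return indices of cells whose predicted labels overlap the prompt.
    `classify` stands in for MobileNetV2 top-k inference on one cell."""
    return [i for i, cell in enumerate(split_grid(image))
            if set(classify(cell)) & set(prompt_labels)]

# usage: a blank 300x300 grid splits into nine 100x100 cells
img = np.zeros((300, 300, 3), dtype=np.uint8)
cells = split_grid(img)
print(len(cells), cells[0].shape)  # 9 (100, 100, 3)
```

In the live Playwright case, `image` would come from a screenshot of the grid element, and the matched indices map back to tiles to click.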
I went all in and wrote a custom FP4 GEMM kernel on top of CUTLASS 3.8. Along the way I discovered FP4 doesn’t actually help training - no backward pass. But what came out of it is something I haven’t seen anywhere else for consumer Blackwell: a standalone FP4 GEMM library with a pre-quantized weight cache that hits 85-129 TFLOPS on the Spark.
Quantize weights once at model load, only quantize activations on the fly per call. Integrated into a full transformer (GPT-OSS-4.2B, 24 layers, 288 GEMM calls per forward pass), it runs 1.3-2.3x faster than BF16 at inference-relevant batch sizes with 4x memory savings. Tested on both 4.2B and 20B models - the 20B drops from 43.4 GB to 4.0 GB with FP4 weights (10.8x compression). No dependency on vLLM, TRT-LLM, or sglang - just a library you can call from any Python code.
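The weight-cache pattern is the core idea: pay the weight quantization cost once at load, and only quantize activations per call. Here is a hypothetical numpy sketch of that pattern; `FP4Linear` and `quantize_fp4` are illustrative names, the real library dispatches to a CUTLASS FP4 GEMM on the GPU rather than simulating it on the CPU. The magnitude table is the standard FP4 E2M1 set.

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1)
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x):
    """Per-tensor FP4 simulation: scale so the largest magnitude maps to 6,
    then snap every value to the nearest E2M1 level."""
    scale = float(np.abs(x).max()) / 6.0
    if scale == 0.0:
        scale = 1.0
    mag = np.abs(x) / scale
    idx = np.abs(mag[..., None] - FP4_LEVELS).argmin(axis=-1)
    return np.sign(x) * FP4_LEVELS[idx], scale

class FP4Linear:
    """Weight-cache pattern: quantize weights once at model load,
    quantize activations on the fly per call."""
    def __init__(self, weight):
        self.wq, self.ws = quantize_fp4(weight)   # once, at load

    def __call__(self, x):
        xq, xs = quantize_fp4(x)                  # every call
        # fold both scales back into the output to dequantize
        return (xq @ self.wq.T) * (xs * self.ws)

rng = np.random.default_rng(0)
layer = FP4Linear(rng.standard_normal((8, 16)))   # [out, in]
y = layer(rng.standard_normal((2, 16)))           # [batch, in] -> [batch, out]
```

The per-call cost is only the activation quantization; the expensive weight pass never repeats, which is why the pattern pays off across 288 GEMM calls per forward pass.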
Full source is open at https://github.com/VincentKaufmann/fp4-cuda-kernel: custom FP4 GEMM kernel for DGX Spark / RTX 50 Series (SM120/SM121), 143 TFLOPS, 5-9x faster than BF16, built on CUTLASS 3.8.
Why This Library Exists
No existing path gives you hardware FP4 on SM121 as a standalone library.
GPT-OSS-20B-Vision: First community VLM for GPT-OSS, trained on a single DGX Spark
A couple weeks ago I shipped an MCP server (noapi-google-search-mcp) and people in the community challenged me to do something harder - build a VLM. So I bought a DGX Spark, flew to Dubai, and built the first vision-language model for GPT-OSS from a hotel room. Just a Spark, hotel WiFi and stubbornness.
This is an early proof of concept at 22% of the planned training run - shipped to show what's possible and to find compute partners to finish the job.
What it does: Adds vision to GPT-OSS-20B. Takes an image + text prompt, generates coherent descriptions. Identifies objects, scenes, spatial relationships. Vision was trained directly into the model through QLoRA adaptation - the LLM learned to see, not just pass through visual tokens. All original text capabilities are fully preserved. Hallucinations present - expected at this training stage.
How it works: A SigLIP vision encoder feeds into the 20B MoE language model through a method I call PseudoDeepStack - extracting visual features from multiple encoder depths instead of just the final layer. Richer visual representations at zero additional inference cost.
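The multi-depth idea can be sketched as follows. This is a hypothetical illustration, not the released code: the tap layers, dimensions, and the single-projection fusion are assumptions. The point it demonstrates is that the token count never grows - only the channel dimension widens before one projection - which is why the richer representation costs nothing extra at inference.

```python
import numpy as np

def pseudo_deep_stack(hidden_states, tap_layers, proj):
    """Sketch of multi-depth feature extraction: concatenate vision
    features from several encoder depths along the channel axis,
    then project once into the LLM embedding space."""
    taps = [hidden_states[i] for i in tap_layers]   # each [tokens, d_vis]
    fused = np.concatenate(taps, axis=-1)           # [tokens, d_vis * n_taps]
    return fused @ proj                             # [tokens, d_llm]

# dummy encoder with made-up sizes (assumptions, not SigLIP's real dims)
rng = np.random.default_rng(0)
d_vis, d_llm, n_layers, n_tokens = 64, 128, 12, 196
hs = [rng.standard_normal((n_tokens, d_vis)) for _ in range(n_layers)]
taps = (3, 6, 9, 11)
proj = rng.standard_normal((d_vis * len(taps), d_llm))

vision_tokens = pseudo_deep_stack(hs, taps, proj)
print(vision_tokens.shape)  # (196, 128)
```

A final-layer-only projector would feed the same 196 tokens to the LLM; stacking depths changes what each token carries, not how many tokens the model must process.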
Key finding: Projector-only training (the standard approach for dense VLMs) fails completely on MoE architectures. The expert routing can't handle visual tokens it's never seen. QLoRA adaptation solves this.
The setup: Single NVIDIA DGX Spark GB10, hotel room in Dubai, Domino's pizza. No cluster, no team. ~3.5 days of training to this checkpoint.
What's next: Finishing training with new hyperparameters based on what we learned from this run, scaling to GPT-OSS-120B (same projector works - shared hidden dimensions), benchmarking. Need compute to get there.
The coolest part: it's one pip install to give your local model the ability to see, run Google searches, and use News, Shopping, Scholar, Maps, Finance, Weather, Flights, Hotels, Translate, Images, Trends, and more.
https://github.com/VincentKaufmann/captcha-solver-ai
Work in progress; it's getting way better in the next release.