Interestingly, MLC's WebLLM runs much better on my iGPU than my dGPU once the model size exceeds the available dGPU VRAM. Llama 2 7B runs faster on the dGPU, but when I switch to Llama 2 13B my iGPU suddenly outperforms it. I think it's because the iGPU effectively utilizes shared system memory?
MLC has no GPU offloading. I assume the driver itself is spilling VRAM over into system RAM, and this is essentially a silent failure mode, because the driver's spillover path is extremely slow.
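A back-of-the-envelope check of why 13B crosses the VRAM line while 7B doesn't. The numbers here are illustrative assumptions, not measurements: a hypothetical 6 GiB dGPU, and ~4.5 effective bits per weight for a 4-bit quant once scales/zero-points are included.

```python
# Rough estimate of weight memory for Llama 2 models at common
# quantization levels, to see when a model would spill past dGPU VRAM.
# All figures are approximations: weights only, no KV cache/activations.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

VRAM_GIB = 6.0  # hypothetical dGPU with 6 GiB of VRAM

for name, params in [("Llama 2 7B", 7.0), ("Llama 2 13B", 13.0)]:
    # ~4.5 bits/weight for a 4-bit quant with overhead, 16 for fp16
    for bits in (4.5, 16):
        gib = weight_gib(params, bits)
        verdict = "fits" if gib < VRAM_GIB else "spills to system RAM"
        print(f"{name} @ {bits:>4}-bit: {gib:5.1f} GiB -> {verdict}")
```

Under these assumptions, 7B quantized (~3.7 GiB) fits comfortably, while 13B quantized (~6.8 GiB) already overflows the card before you even count the KV cache, which would explain the sudden slowdown on the dGPU.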