Interestingly, MLC's WebLLM runs much better on my iGPU than my dGPU once the model size exceeds the available dGPU VRAM. Llama 2 7B runs faster on the dGPU, but when I switch to Llama 2 13B my iGPU suddenly outperforms it. I think it's because the iGPU effectively utilizes shared system memory?
MLC has no GPU offloading. I assume the driver itself is spilling VRAM over into system RAM, and this is essentially a silent failure mode, because the driver's spillover path is extremely slow.
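A back-of-the-envelope check of why 13B crosses the VRAM line while 7B doesn't. The numbers here are illustrative assumptions, not measurements: a hypothetical 6 GiB dGPU, and ~4.5 effective bits per weight for a 4-bit quant once scales/zero-points are included.

```python
# Rough estimate of weight memory for Llama 2 models at common
# quantization levels, to see when a model would spill past dGPU VRAM.
# All figures are approximations: weights only, no KV cache/activations.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

VRAM_GIB = 6.0  # hypothetical dGPU with 6 GiB of VRAM

for name, params in [("Llama 2 7B", 7.0), ("Llama 2 13B", 13.0)]:
    # ~4.5 bits/weight for a 4-bit quant with overhead, 16 for fp16
    for bits in (4.5, 16):
        gib = weight_gib(params, bits)
        verdict = "fits" if gib < VRAM_GIB else "spills to system RAM"
        print(f"{name} @ {bits:>4}-bit: {gib:5.1f} GiB -> {verdict}")
```

Under these assumptions, 7B quantized (~3.7 GiB) fits comfortably, while 13B quantized (~6.8 GiB) already overflows the card before you even count the KV cache, which would explain the sudden slowdown on the dGPU.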