OK, so this is a case of bad measurement and comparison.
If you bothered to look at the llama.cpp output, you would see this line:
llama_model_load_internal: offloaded 32/35 layers to GPU
The "-ngl 32" flag means that only 32 of the model's 35 layers run on the GPU, which causes a huge slowdown: the GPU has to sync with the CPU, which then computes the last 3 layers.
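To see why a few CPU layers hurt so much, here is a rough per-token latency sketch. All numbers are illustrative assumptions, not measurements from this thread; in particular the 7x CPU-vs-GPU per-layer cost is hypothetical:

```python
# Rough per-token latency model for partial GPU offload.
# All numbers are illustrative assumptions, not measurements from the thread.
total_layers = 35
gpu_layers = 32
cpu_slowdown = 7.0   # hypothetical: a CPU layer costs ~7x a GPU layer

gpu_time = 1.0  # arbitrary time units per GPU layer
full_gpu = total_layers * gpu_time
partial = gpu_layers * gpu_time + (total_layers - gpu_layers) * gpu_time * cpu_slowdown

# Throughput is inversely proportional to per-token time.
speedup = partial / full_gpu
print(f"full offload ≈ {speedup:.2f}x faster than 32/35 offload")
```

With that assumed 7x CPU penalty, the model lands at roughly a 1.5x gain from full offload, in the same ballpark as the ~55% speedup reported below.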
On my 7900 XTX, I get a 55% speedup on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.
>The "-ngl 32" flag means that only 32 of the model's 35 layers run on the GPU, which causes a huge slowdown: the GPU has to sync with the CPU, which then computes the last 3 layers.
Thanks for the updated run configuration. We misunderstood what llama.cpp counts as "layers": we took it in the traditional sense of learned-parameter decoder layers (as Hugging Face models do), and by that count the LLaMA 7B model has 32 layers.
>On my 7900 XTX, I get a 55% speedup on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.
On my 4090 I now get 128 t/s for Q5_1 and 116 t/s for Q6_K, so these are in the same ballpark as mk600's 125 t/s at batch=1. It's not surprising that different inference runtimes approach the same speed as they become memory-bound at similar model sizes.
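A quick sanity check on the memory-bound claim: at batch=1, every weight is read once per token, so throughput is capped at roughly memory bandwidth divided by model size. The bandwidth and bits-per-weight figures below are approximate assumptions, not numbers from this thread:

```python
# Back-of-the-envelope memory-bandwidth ceiling for batch-1 decoding.
# All figures are approximate assumptions, not measurements from the thread.
bandwidth_gb_s = 1008      # RTX 4090 spec-sheet bandwidth, ~1008 GB/s
params_b = 7.0             # LLaMA 7B parameter count, in billions
bits_per_weight = 5.5      # roughly Q5_1's effective bits per weight

model_gb = params_b * bits_per_weight / 8    # ~4.8 GB of weights
ceiling_tok_s = bandwidth_gb_s / model_gb    # every weight read once per token

print(f"model ≈ {model_gb:.1f} GB, bandwidth ceiling ≈ {ceiling_tok_s:.0f} tok/s")
```

Under those assumptions the ceiling comes out near 210 tok/s, so the measured ~125-130 t/s figures sit well inside it, consistent with different runtimes converging as they push toward the same bandwidth limit.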
Such sloppy errors in measurement and comparison (from people who are supposedly experts?), and cageyness about answering technical questions, remind me of the era of cryptocurrency scams.