> … you can expect the speed to half when going from 4k to 16k long prompt …
> … it did slow down somewhat (from 25T/s to 18T/s) for very long context …
Depending on the hardware configuration (VRAM size, CPU and system RAM speed) and llama.cpp parameter settings, a bigger context prompt slows the T/s number significantly, but not by orders of magnitude.
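To see why longer prompts hurt total latency without changing the T/s "by orders of magnitude", here's a back-of-the-envelope sketch. The prefill and decode speeds below are hypothetical placeholders, not measurements from any particular setup; the shape of the result is the point.

```python
# Rough model of total response time vs. prompt length.
# pp_speed (prompt processing / prefill) and tg_speed (token
# generation / decode) are assumed numbers -- measure your own.
pp_speed = 200.0   # prompt tokens/s
tg_speed = 20.0    # generated tokens/s

def response_time(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds until the full response is generated, assuming the
    whole prompt must be processed (cold KV cache)."""
    return prompt_tokens / pp_speed + output_tokens / tg_speed

for ctx in (4_000, 16_000):
    print(f"{ctx:>6}-token prompt: {response_time(ctx, 500):.0f}s total")
```

With these placeholder speeds, going from a 4k to a 16k prompt roughly doubles the total wait for a 500-token answer, even though the generation T/s itself hasn't changed.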
Bottom line: gpt-oss 120B on a small GPU is not the right setup for chat use cases.
People can read at a rate of around 10 tokens/sec. So anything faster than that is pretty good, but it depends on how wordy the response is (including chain of thought) and whether you'll be reading it all verbatim or just skimming.
Reading while words are flying by is really distracting. I believe it was mentioned at some point that 50t/s feels comfortable and ChatGPT aims for that (no source, sorry).
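To put those T/s figures in reading-speed terms, here's a quick conversion. The 0.75 words/token figure is a common rule of thumb for English text, not an exact constant:

```python
# Convert generation speed (tokens/s) to an approximate reading
# speed (words per minute). 0.75 words/token is a rough heuristic
# for English; actual ratios vary by tokenizer and content.
WORDS_PER_TOKEN = 0.75

def tps_to_wpm(tps: float) -> float:
    return tps * WORDS_PER_TOKEN * 60

for tps in (10, 18, 25, 50):
    print(f"{tps:>3} T/s ~ {tps_to_wpm(tps):.0f} words/min")
```

By this estimate, 10 T/s is already around 450 words/min, i.e. faster than most people read carefully, which is why the numbers in this thread are fine for verbatim reading but feel slow for skimming.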
I'm not really timing it as I just use these models via open webui, nvim and a few things I've made like a discord bot, everything going via ollama.
But for comparison, it is generating tokens about 1.5 times as fast as gemma 3 27B qat or mistral-small 2506 q4.
Prompt processing/context, however, seems to happen at about 1/4 the speed of those models.
To make "excellent" a bit more concrete: I can't really notice any difference in speed between oss-120b, once the context is processed, and Claude Opus 4 via the API.
I've found threads online that suggest that running gpt-oss-20b on ollama is slow for some reason. I'm running the 20b model via LM Studio on a 2021 M1 and I'm consistently getting around 50-60 T/s.
Pro tip: disable the title generation feature or set it to another model on another system.
After every chat, open webui sends everything to llama.cpp again, wrapped in a prompt to generate the summary, and this wipes out the KV cache, forcing you to reprocess the entire context on your next message.
This will get rid of the long prompt processing times if you're having long back-and-forth chats with it.
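The cost of that cache wipe is easy to estimate: every turn, the full accumulated context has to be prefilled again. A minimal sketch, with an assumed prompt-processing speed (not a measurement of any real setup):

```python
# Extra latency per chat turn when the KV cache has been evicted
# (e.g. overwritten by a title-generation request) and the whole
# context must be reprocessed. pp_speed is an assumed placeholder.
pp_speed = 150.0  # prompt tokens/s

def reprocess_penalty(context_tokens: int) -> float:
    """Extra seconds per turn spent re-prefilling an already-seen
    context instead of reusing the cached KV state."""
    return context_tokens / pp_speed

for ctx in (2_000, 8_000, 32_000):
    print(f"{ctx:>6}-token chat: +{reprocess_penalty(ctx):.0f}s per turn")
```

The penalty grows linearly with conversation length, which is why routing title generation to another model (or disabling it) matters much more for long back-and-forth chats than for short ones.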
How many tokens is excellent? How many is super slow? How many is non-filled context?