Notice that games are often able to render each frame in 5 ms, which in practice means running a short program for every pixel you see on a 4K screen. Modern computers can do a huge amount of computation in 5 ms (on the order of 10^10 flops over 10^8 bytes). If puny kilobytes of audio data cannot be processed in 5 ms, something is terribly wrong.
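The scale mismatch can be sketched with rough numbers. The per-pixel op count and audio buffer size below are illustrative assumptions, not measurements:

```python
# Back-of-envelope comparison of a 5 ms graphics frame vs a 5 ms audio buffer.
# All constants here are assumptions chosen for illustration.

PIXELS_4K = 3840 * 2160      # ~8.3 million pixels per frame
OPS_PER_PIXEL = 1000         # assumed shader work per pixel

gpu_ops_per_frame = PIXELS_4K * OPS_PER_PIXEL  # on the order of 10^10 ops

# A typical 256-sample audio buffer at 48 kHz has roughly the same deadline.
SAMPLE_RATE = 48_000
BUFFER_SAMPLES = 256
audio_deadline_s = BUFFER_SAMPLES / SAMPLE_RATE  # ~5.3 ms
audio_bytes = BUFFER_SAMPLES * 2 * 4             # stereo, 32-bit float: ~2 KB

print(f"GPU work per 5 ms frame: ~{gpu_ops_per_frame:.1e} ops")
print(f"Audio buffer deadline:   {audio_deadline_s * 1000:.1f} ms "
      f"for {audio_bytes} bytes")
```

In other words, the audio deadline asks for a few kilobytes of work in the same window in which the GPU pushes billions of operations.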
Games do a very impressive amount of work for graphics, but there's a huge difference: a dropped or late graphics frame every now and then is not a big deal.
An audio glitch is very annoying by comparison, especially if the application is a live musical instrument or something like that. Even the choppy rocket motor sounds of Kerbal Space Program (caused by garbage collector pauses) are infuriating.
It's kind of the difference between soft and hard real-time systems. Most audio applications don't strictly qualify as hard real time (where missing a deadline is as bad as a total failure), but missing a deadline is still much worse than in graphics.