Yeah, using the GPU makes a huge difference. It's also incredibly finicky because hardware and browser support vary so much.
Basically I represent an n-qubit superposition as a 2^(n/2) x 2^(n/2) texture with each pixel being an amplitude (with red=real, blue=imaginary components). Then I use fragment shaders to operate on all the pixels in parallel when applying a gate.
For example, here's the GLSL in the main method of the "UniversalNot" gate's shader [1]:
(The UniversalNot gate isn't possible in reality, but it's easy to implement in the simulator. So I have it implemented, but hidden away. You have to manually tweak the URL to contain a "__unstable__UniversalNot" gate to use it. But then you can use it for FTL communication shenanigans. [2])
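To make the texture layout concrete, here's a rough CPU-side sketch in Python of the same idea. It is not the simulator's actual shader code. It assumes an even qubit count, a row-major mapping from pixel (x, y) to state index, and an ordinary NOT (Pauli-X) gate as the example; the names make_zero_state and apply_x are made up for illustration. The inner loop body plays the role of a fragment shader: each output pixel is computed independently from the input texture, which is what lets the GPU do them all in parallel.

```python
def make_zero_state(n_qubits):
    """Build a 2^(n/2) x 2^(n/2) grid of (real, imag) pixels for the all-zeros state.

    Assumption: n_qubits is even, so the grid is square.
    """
    if n_qubits % 2 != 0:
        raise ValueError("this sketch only handles an even number of qubits")
    side = 2 ** (n_qubits // 2)
    texture = [[(0.0, 0.0) for _ in range(side)] for _ in range(side)]
    texture[0][0] = (1.0, 0.0)  # amplitude 1 on state index 0
    return texture


def apply_x(texture, target):
    """Apply a NOT gate to qubit `target`, one output pixel at a time."""
    side = len(texture)
    out = [[(0.0, 0.0)] * side for _ in range(side)]
    for y in range(side):
        for x in range(side):
            # Assumed row-major mapping from pixel position to state index.
            index = y * side + x
            # A NOT gate swaps amplitudes between states that differ
            # only in the target bit, so read from the partner pixel.
            src = index ^ (1 << target)
            out[y][x] = texture[src // side][src % side]
    return out


state = make_zero_state(4)   # 4 qubits -> 4x4 texture
flipped = apply_x(state, 2)  # amplitude moves from index 0 to index 4
```

With 4 qubits and a row-major layout, index 4 sits at row 1, column 0, so that's where the amplitude ends up after the flip. A real shader would do the same lookup with texture coordinates instead of list indexing.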
That's an interesting strategy. I'm going to have to dig deeper when I get a chance. You said you're also using WebGL for performance?
PS: Those continuous gates are especially cool!