I just tried the demo on the homepage and I don’t know what kind of sorcery this is but it’s blowing my mind.

I input a bunch of completely made-up words (Quastral Syncing, Zarnix Meshing, HIBAX, Bilxer), used them in a sentence, and the model zero-shotted perfect speech recognition!

It’s so counterintuitive to me that this would work. I would have bet that you have to provide at least one audio sample for the model to recognize a word it was never trained on.

Being given a word only as text and then recognizing it in audio must be an emergent property.