More

toebee · on April 22, 2025

You're absolutely right. We used Jordan's Whisper-D, and he was generous enough to offer some guidance along the way.

It's also a valid criticism that we haven’t yet audited the dataset for existing list of tags. That’s something we’ll be improving soon.

As for Dia’s architecture, we largely followed existing models to build the 1.6B version. Since we only started learning about speech AI three months ago, we chose not to innovate too aggressively early on. That said, we're planning to introduce MoE and Sliding Window Attention in our larger models, so we're excited to push the frontier in future iterations.

kamranjon · on April 22, 2025

I’m curious what differentiates it from Parakeet? I was listening to some of the demos on the parakeet announcement and they sound very similar to your examples - are they trained on the same data? Are there benefits to using Dia over Parakeet?

yencabulator · on April 23, 2025

Well, this for one, about Parakeet:

> We plan to release our fine-tuned whisper models and possibly the generative model (and/or future improved versions). The generative model would have to be released under a non-commercial license due to our datasets.

https://jordandarefsky.com/blog/2024/parakeet/

toebee · on April 22, 2025

Thank you for the kind words! We only support English at the moment.. Hope to add more languages in the future.

toebee · on April 22, 2025

Thanks you!! We personally used Quickpod and Runpod the most. But you can try it now on HF Spaces without spinning up GPUs yourself!

https://huggingface.co/spaces/nari-labs/Dia-1.6B

toebee · on April 22, 2025

Thanks for the interest! We also enjoyed using E5-F2 :) You can try it now on HF Spaces: https://huggingface.co/spaces/nari-labs/Dia-1.6B

toebee · on April 22, 2025

Thank you so much for the kind words :) We only support English at the moment, hopefully can do more languages in the future. We are planning to release a technical report on some of the details, so stay tuned for that!

bavell · on April 22, 2025

I'd also love to peek behind the curtains, if only to satisfy my own curiosity. Looking forward to the technical report, well done!

toebee · on April 22, 2025

We will try to make it work, but not sure if will be an easy task. For now, you can try with https://huggingface.co/spaces/nari-labs/Dia-1.6B

toebee · on April 22, 2025

We're adding guides for Zero-shot voice cloning. You can try it using the second example on Gradio: https://huggingface.co/spaces/nari-labs/Dia-1.6B

youssefabdelm · on April 22, 2025

Will give it a shot but I feel like fine-tuning will be more reliable, any way to do that?

toebee · on April 22, 2025

We just clarified in the README, sorry for the confusion ;(

Note that the model was not fine-tuned on a specific voice. Hence, you will get different voices every time you run the model. You can keep speaker consistency by either adding an audio prompt (a guide coming VERY soon - try it with the second example on Gradio or HF Space for now), or fixing the seed.

toebee · on April 22, 2025

Only two voices at the moment... We will need to upgrade the dataset to make that happen, and are considering that as one of the next steps.

toebee · on April 22, 2025

https://huggingface.co/spaces/nari-labs/Dia-1.6B we fixed it!