Comparison between Dia-1.6B (ours), ElevenLabs Studio, and Sesame CSM-1B, plus some fun examples (including audio prompt use).
Input script
[S1] Dia is an open weights text to dialogue model.
[S2] You get full control over scripts and voices.
[S1] Wow. Amazing. (laughs)
[S2] Try it now on Git hub or Hugging Face.
Note that the ElevenLabs and Sesame models cannot render laughter tags as speech, so for those models we replace (laughs) with haha. Also, Dia is not fine-tuned on a specific voice. It will generate a random voice on each run unless you add an audio prompt or fix the seed.
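The (laughs) substitution mentioned in the note can be sketched as a small preprocessing step applied to the script before it is sent to a model that does not support nonverbal tags. This is a minimal illustration, not the code used to produce the comparison: the function name replace_nonverbal_tags and the TAG_FALLBACKS table are hypothetical, and only the (laughs) to haha mapping comes from the note itself.

```python
import re

# Hypothetical fallback table: nonverbal tag -> plain-text substitute.
# Only the (laughs) -> haha entry is taken from the note above.
TAG_FALLBACKS = {"(laughs)": "haha"}


def replace_nonverbal_tags(script: str, fallbacks: dict[str, str] = TAG_FALLBACKS) -> str:
    """Replace each known nonverbal tag in the script with its substitute."""
    # Build one pattern that matches any of the known tags literally.
    pattern = re.compile("|".join(re.escape(tag) for tag in fallbacks))
    return pattern.sub(lambda match: fallbacks[match.group(0)], script)


line = "[S1] Wow. Amazing. (laughs)"
print(replace_nonverbal_tags(line))  # [S1] Wow. Amazing. haha
```

Speaker tags such as [S1] and any text without a known nonverbal tag pass through unchanged, so the same script can be reused across models with only this one substitution applied.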
Input script
[S1] Hey. how are you doing?
[S2] Pretty good. Pretty good. What about you?
[S1] I'm great. So happy to be speaking to you.
[S2] Me too. This is some cool stuff. Huh?
[S1] Yeah. I have been reading more about speech generation.
[S2] Yeah.
[S1] And it really seems like context is important.
[S2] Definitely.