Comparison of Dia-1.6B (ours), ElevenLabs Studio, and Sesame CSM-1B, plus fun examples (including audio prompt use).

Standard Usage


Input script

[S1] Dia is an open weights text to dialogue model. 
[S2] You get full control over scripts and voices. 
[S1] Wow. Amazing. (laughs) 
[S2] Try it now on Git hub or Hugging Face.

Dia-1.6B (ours)

audio (8).wav

ElevenLabs Studio

ElevenLabs_Untitled_Project (1).mp3

Sesame CSM-1B

full_conversation (3).wav

Note that the ElevenLabs and Sesame models cannot render laughter tags as actual laughter, so for those two models we replace (laughs) with haha. Also, Dia is not fine-tuned on a specific voice. It will generate random voices unless you add an audio prompt or fix the seed.
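
The (laughs) substitution mentioned in the note can be sketched as a small preprocessing step applied to a script before it is sent to a model without nonverbal tag support. This is only an illustration, not the code used to produce the samples here: the function name `replace_nonverbal_tags` and the `TAG_FALLBACKS` table are hypothetical, and the only mapping taken from this page is (laughs) to haha.

```python
# Hypothetical fallback table: nonverbal tag -> spoken stand-in.
# Only the "(laughs)" -> "haha" pair comes from the note above.
TAG_FALLBACKS = {"(laughs)": "haha"}


def replace_nonverbal_tags(script: str, fallbacks: dict) -> str:
    """Return the script with each known tag swapped for its spoken stand-in.

    Speaker tags such as [S1] and all other text are left untouched.
    """
    for tag, spoken in fallbacks.items():
        script = script.replace(tag, spoken)
    return script


# One line from the first input script on this page.
line = "[S1] Wow. Amazing. (laughs)"
converted = replace_nonverbal_tags(line, TAG_FALLBACKS)
print(converted)
```

Keeping the mapping in a table makes it easy to add other tags later, but any extra entries would be assumptions about what a given model can pronounce naturally.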

Input script

[S1] Hey. how are you doing?  
[S2] Pretty good. Pretty good. What about you? 
[S1] I'm great. So happy to be speaking to you.  
[S2] Me too. This is some cool stuff. Huh?  
[S1] Yeah. I have been reading more about speech generation. 
[S2] Yeah. 
[S1] And it really seems like context is important. 
[S2] Definitely.

Dia-1.6B (ours)

audio (32).wav

ElevenLabs Studio

ElevenLabs_Untitled_Project (2).mp3

Sesame Website Example