Dia-1.6B TTS: Best Text-to-Dialogue Generation Model


Looking for the right text-to-speech model? The 1.6-billion-parameter model Dia might be the one for you. You may be surprised to hear that this model was created by two undergraduates with zero funding! In this article, you'll learn about the model, how to access and use it, and also see the results to really understand what this model is capable of. Before using the model, it's worth getting acquainted with it.

What is Dia-1.6B?

Models trained to take text as input and produce natural speech as output are called text-to-speech models. The Dia-1.6B model, developed by Nari Labs, belongs to this family. It is an interesting model capable of generating realistic dialogue from a transcript. It's also worth noting that the model can produce non-verbal cues like laughing, sneezing, whistling, etc. Exciting, isn't it?

How to Access Dia-1.6B?

There are two ways in which we can access the Dia-1.6B model:

  1. Using the Hugging Face API with Google Colab
  2. Using Hugging Face Spaces

The first requires getting an API key and then integrating it into Google Colab with code. The latter is no-code and lets us use Dia-1.6B interactively.

1. Using Hugging Face and Colab

The model is available on Hugging Face and can be run with the 10 GB of VRAM provided by the T4 GPU in a Google Colab notebook. We'll demonstrate this with a mini dialogue.

Before we begin, let's get our Hugging Face access token, which will be required to run the code. Go to https://huggingface.co/settings/tokens and generate a key, in case you don't have one already.

Make sure to enable the following permissions:

Open a new notebook in Google Colab and add this key in the secrets (Name should be HF_Token):


Note: Switch to the T4 GPU to run this notebook. Only then will you be able to use the 10 GB of VRAM required for running this model.
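If you want to read the stored secret from code, Colab exposes it through `google.colab.userdata`. Here is a minimal sketch of a helper that falls back to an environment variable outside Colab; the `HF_TOKEN` variable name and the helper itself are my own convention, not something the Dia repository requires:

```python
import os

def get_hf_token():
    """Fetch the Hugging Face token from Colab secrets or the environment."""
    try:
        # Available only inside a Colab runtime; reads the secret named HF_Token
        from google.colab import userdata
        return userdata.get("HF_Token")
    except ImportError:
        # Outside Colab, fall back to an environment variable (assumed name)
        return os.environ.get("HF_TOKEN")
```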

Let's now get our hands on the model:

  1. First, clone Dia's Git repository:
!git clone https://github.com/nari-labs/dia.git
  2. Install the local package:
!pip install ./dia
  3. Install the soundfile audio library:
!pip install soundfile

After running the commands above, restart the session before proceeding.

  4. After the installations, let's do the required imports and initialize the model:
import soundfile as sf

from dia.model import Dia

import IPython.display as ipd

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
  5. Initialize the text for the text-to-speech conversion:
text = "[S1] This is how Dia sounds. (laughs) [S2] Don't laugh too much. [S1] (clears throat) Do share your thoughts on the model."
  6. Run inference on the model:
output = model.generate(text)

sampling_rate = 44100  # Dia uses a 44.1 kHz sampling rate

output_file = "dia_sample.mp3"

sf.write(output_file, output, sampling_rate)  # Save the audio

ipd.Audio(output_file)  # Play the audio in the notebook
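The sampling rate passed to `sf.write` must match the model's 44.1 kHz output, or playback will be pitch- and speed-shifted. As a library-free illustration of that point, here is a sketch that writes one second of a 440 Hz test tone at 44.1 kHz using only the standard-library `wave` module (the tone is made up for the demo; it is not Dia output):

```python
import math
import struct
import wave

SAMPLING_RATE = 44100  # Dia outputs audio at 44.1 kHz

def write_tone(path, freq=440.0, seconds=1.0):
    """Write a mono 16-bit sine tone at the given rate so it plays at the right pitch."""
    n = int(SAMPLING_RATE * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)               # mono
        wf.setsampwidth(2)               # 16-bit samples
        wf.setframerate(SAMPLING_RATE)   # must match the generation rate
        frames = b"".join(
            struct.pack(
                "<h",
                int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / SAMPLING_RATE)),
            )
            for i in range(n)
        )
        wf.writeframes(frames)

write_tone("tone_44100.wav")
```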

Output:

The speech is very human-like, and the model does a great job with non-verbal communication. It's worth noting that the results aren't reproducible, as there are no templates for the voices.

Note: You can try fixing the model's seed to reproduce the results.

2. Using Hugging Face Spaces

Let's try to clone a voice using the model via Hugging Face Spaces. Here we have the option to use the model directly through the web interface: https://huggingface.co/spaces/nari-labs/Dia-1.6B

Here you can pass the input text, and additionally you can use the 'Audio Prompt' to replicate a voice. I passed the audio we generated in the previous section.

The following text was passed as input:

[S1] Dia is an open weights text to dialogue model.
[S2] You get full control over scripts and voices.
[S1] Wow. Amazing. (laughs)
[S2] Try it now on GitHub or Hugging Face.

I'll let you be the judge: do you feel that the model has successfully captured and replicated the earlier voices?

Note: I got a few errors while generating the speech using Hugging Face Spaces; try changing the input text or audio prompt to get the model to work.

Things to Remember While Using Dia-1.6B

Here are a few things that you should keep in mind while using Dia-1.6B:

  • The model is not fine-tuned on a specific voice, so you'll get a different voice on every run. You can try fixing the model's seed to reproduce the results.
  • Dia uses a 44.1 kHz sampling rate.
  • After installing the libraries, make sure to restart the Colab notebook.
  • I got a few errors while generating the speech using Hugging Face Spaces; try changing the input text or audio prompt to get the model to work.
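Since Dia is a PyTorch model, fixing the seed in practice means seeding PyTorch's RNG (and CUDA's, if you're on GPU) before calling `generate`. A hedged sketch; whether this alone makes Dia's sampling fully deterministic is not guaranteed:

```python
import torch

def set_seed(seed: int = 42):
    """Seed PyTorch's CPU and CUDA RNGs so random sampling is repeatable."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

# Re-seeding before each run should yield identical random draws
set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)
print(torch.equal(a, b))  # → True
```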

Conclusion

The model's results are very promising, especially when we see what it can do compared to the competition. The model's biggest strength is its support for a wide range of non-verbal communication. The model has a distinct tone and the speech feels natural, but on the other hand, since it's not fine-tuned on specific voices, it may not be easy to reproduce a particular voice. Like any other generative AI tool, this model should be used responsibly.

Frequently Asked Questions

Q1. Can we use only two speakers in the conversation?

A. No, you can use multiple speakers, but you need to tag them in the prompt as [S1], [S2], [S3]…
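For instance, a small helper can assemble a multi-speaker transcript with the [S1], [S2], … tags; the helper name and structure are my own, not part of Dia's API:

```python
def build_transcript(lines):
    """Turn (speaker_number, text) pairs into a Dia-style tagged transcript."""
    return " ".join(f"[S{n}] {t}" for n, t in lines)

text = build_transcript([
    (1, "Welcome to the show."),
    (2, "Glad to be here."),
    (3, "Me too! (laughs)"),
])
print(text)
# → [S1] Welcome to the show. [S2] Glad to be here. [S3] Me too! (laughs)
```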

Q2. Is Dia-1.6B a paid model?

A. No, it's a completely free-to-use model available on Hugging Face.

Passionate about technology and innovation, a graduate of Vellore Institute of Technology. Currently working as a Data Science Trainee, focusing on Data Science. Deeply interested in Deep Learning and Generative AI, eager to explore cutting-edge techniques to solve complex problems and create impactful solutions.
