Piper voices of retro speech synthesizers

Welcome to the post everyone has been asking for! This is the one where you can find download links for various Piper (link to repo) voices. These were generated using Google Colab. Many of these voices used about an hour of audio data from each synthesizer, and you will find a last-modified date next to each one, so you can revisit this page to see whether a voice has been updated. Voice updates will improve quality over time, as more training data is generated.

I will not create Piper versions of voices which are still licensed. Doing so can infringe on the rights of those still selling the product by directly undermining their ability to sell the voice on their terms. Not cool. I am interested in creating copies of synthesizers which are no longer in development or cannot be run on modern platforms. For these, Piper can provide a new lease on life, although the nature of neural voices means that some qualities of the original speech system are bound to be lost. However, it gets close to the original in many respects.

How to obtain Piper

Piper is available as an NVDA add-on called Sonata from developer Mush42 (opens release page in a new window).

It is also available for the Raspberry Pi. If you are into Linux things, you can hack together a Speech Dispatcher module that will let you run Piper there as well; a minimal sketch of the idea follows.
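If you go the Speech Dispatcher route, the core of the job is just piping text into the piper binary and playing the raw audio it returns, which you can then wrap in a generic output module. Here is a minimal sketch of that pipeline, assuming you have the piper binary and a downloaded voice on hand; the model path and voice name are placeholders, not a required layout:

    # Minimal sketch: speak a line with Piper and play the raw audio with aplay.
    # The model path and voice name below are placeholders.
    echo 'Hello from Piper.' | \
      piper --model ~/piper-voices/en_US-lessac-medium.onnx --output-raw | \
      aplay -r 22050 -f S16_LE -t raw -

The aplay sample rate has to match the voice model; medium-quality Piper voices run at 22,050 Hz.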

Training your own voice in Colab, and a shoutout

If you want me to train any other old speech system, high-quality samples of that voice are welcome. Use the contact form on this site, or if you’re a friend on social media, get in touch!

If you wish to train your own voice, I highly recommend a read through Zachary Bennoui’s post on training with Piper, which was huge for me at the time. It will get you started on the basic notebooks, as the main Piper training notebook can still be copied and used just the same. Very little in that post has changed, although be aware that paid Colab subscriptions will do better, especially if you can get the Pro tier with a 24-hour runtime.

I also wish to shout out the developer of the Sonata voices for helping me get the streaming variants to work. This person is a brilliant developer. Here’s a Colab notebook file for exporting your last.ckpt file to your drive. Please note that this creates a folder called model_final in the root of your drive. The notebook will eventually be updated with headings and more options, but for now it’s basic enough to work. You will also need to update the configuration JSON file for your voice model, setting the streaming boolean to true like so:
"num_symbols": 256,
"num_speakers": 1,
"speaker_id_map": {},
"piper_version": "1.0.0",
"streaming": true,
"key": "en_US-voice+RT-medium"
}

(Obviously, replace the voice name with the name of your model.) In order for Sonata to recognize the voice as a “Fast” variant, you must include +RT in the folder name.
Check out Mush42’s Piper fork repo, too, if you’re a curious enough nerd. If you can contribute, I highly encourage keeping up to date with the project through the branches in that repo.
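If you’d rather not use the notebook, the upstream piper_train package can also export a checkpoint to ONNX from the command line. This is only a rough sketch with placeholder filenames; the streaming export specifically is what the notebook and Mush42’s fork handle:

    # Rough sketch (placeholder paths): export a trained checkpoint to ONNX
    # using the upstream piper_train package, then pair it with its config.
    python3 -m piper_train.export_onnx last.ckpt voice.onnx
    cp config.json voice.onnx.json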

Pre-training steps: How do I create a good dataset?

A lot of you kind readers have asked for some sort of step list on how to create a good dataset for pre-training. This is the step before you load the notebook. You will need Whisper installed. There are many Git repos of Whisper for Windows, and this links to a standalone one. If you use the large-v2 model, you will have better results, although the model itself can be a few gigabytes to download and store. The medium model may do OK at transcribing, but I have noticed it struggle with unclear (or non-native) English speakers.

    Steps for getting this done:

  1. Create your dataset. Whether human or robot, you need good-quality recordings. You don’t need to splice these yourself, since Whisper will help us, but you do need a way to get at least 1 to 2 hours of good data, more if possible. Fine-tuning should be doable on a single hour of audio and still get you a fairly close match to the original voice.
  2. Once you have your dataset, you will need to use Whisper to transcribe it. Use the command below after placing Whisper on your PATH. First, open a command prompt by typing cmd into the Run dialog (Windows+R), or open a terminal on the Mac and adjust the command accordingly.
    whisper-faster XXX.mp3 --model large-v2 --output_format vtt --output_dir .
    This will save a VTT file with the results in the directory you were in when you ran the command. Replace XXX.mp3 with the path to your file. If you are running Whisper from a local repo, be aware that it will place the transcript in that folder instead.
  3. Now we get to the fun part: splitting. For now, I have created a Bash splitting script (which will soon move to Git, I’m sure) that does the hard work for you: it reads the VTT file and the MP3, then splits the audio into files prefixed with the word “data” and a number. A rough sketch of the approach appears after this list. Here’s the syntax for the command:

    ./split_audio.sh xxx.mp3 xxx.vtt xxx

    Fill in the MP3 and VTT filenames. The last xxx is a number you specify to begin counting from, which is useful when adding further audio to a dataset later on and continuing its numbering.
    Warning for Windows users! This is a Bash (Linux) script. You will need something like the Windows Subsystem for Linux, plus an installed copy of FFmpeg, for it to work. Mac users, be sure to install FFmpeg for macOS (opens in a new window) before using the script.
  4. The script automatically adds 100 ms of silence and saves a file called transcript.txt. You must review transcript.txt yourself after each run, because it gets overwritten! Do not split multiple files in succession, or you will lose your transcripts. Manually copy out the lines, review them, and paste them above new chunks of transcript data as you go. This is intentional: you should be reviewing the transcribed text and checking how the audio matches up to it.
  5. Review: This can take hours. Your task is to listen to many of the files and check whether any words bleed over that are not in the transcribed text. This is called alignment. Focus on quality over quantity. VITS is particularly sensitive to misalignments and can quickly learn noise from bad data. If your model is turning out poorly, revisit this step and put in the hours. Still easier than us humans manually splicing large audio files, though, isn’t it? And then transcribing those fragments by hand? No way.
  6. At the end of all this, you should have hundreds of small audio chunks. Zip them up, upload the archive to your drive, and use it in the pre-training steps.
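To give you an idea of what the splitting step is doing under the hood, here is a rough, hypothetical sketch of the approach rather than the actual split_audio.sh. It assumes FFmpeg is installed, that Whisper’s VTT cues are a timing line such as 00:00:01.000 --> 00:00:04.200 followed by a line of text, and it writes transcript.txt in a simple id|text format, which may differ from what the real script produces:

    #!/usr/bin/env bash
    # Hypothetical splitter sketch (not the real split_audio.sh).
    # Usage: ./split_sketch.sh input.mp3 input.vtt 1
    set -euo pipefail
    audio="$1"; vtt="$2"; n="${3:-1}"
    start=""
    while IFS= read -r line; do
      if [[ "$line" == *" --> "* ]]; then
        # Cue timing line, e.g. "00:00:01.000 --> 00:00:04.200"
        start="${line%% --> *}"
        end="${line##* --> }"
      elif [[ -n "$line" && -n "$start" ]]; then
        # First text line of the cue: cut that span, pad 100 ms of leading
        # silence, and downmix/resample to 22.05 kHz mono.
        ffmpeg -loglevel error -y -i "$audio" -ss "$start" -to "$end" \
          -af "adelay=100:all=1" -ar 22050 -ac 1 "data${n}.wav"
        printf '%s|%s\n' "data${n}" "$line" >> transcript.txt
        n=$((n + 1))
        start=""
      fi
    done < "$vtt"

The resample to 22,050 Hz mono in the sketch matches what Piper’s medium-quality voices train on.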

Bonus: If you have a dataset you wish to add 100 ms of silence to, use this Bash script that automates adding silence to the start of each file in a folder, which can be useful for aligning your data.
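As a rough idea of what that script does, here is a minimal, hypothetical sketch (not the linked script), assuming FFmpeg is on your PATH and your clips are WAV files; it writes the padded copies to a separate folder so the originals stay untouched:

    # Hypothetical sketch: prepend 100 ms of silence to every WAV file in the
    # current folder, writing results to ./padded (requires FFmpeg).
    mkdir -p padded
    for f in *.wav; do
      ffmpeg -loglevel error -y -i "$f" -af "adelay=100:all=1" "padded/$f"
    done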

Voices list

To install a voice, download the files below and import them into your environment. If you are using NVDA, you can do this from the Sonata voice manager in the NVDA menu and load the local file.

You might wonder: why the Alex voice? This was a hard decision, as it is technically still available. However, I am not here to make commercial income from Apple’s voice, and proving direct harm to sales is harder when a voice is not licensed to run on other platforms in the first place. For now, there’s an Alex voice. Should Apple send me a cease and desist over the matter, it will be revisited swiftly.

Happy Pipering!