Welcome to the post everyone has been asking for! This is where you can find download links for various Piper (link to repo) voices. Over time, it has also grown to include regular NVDA add-ons for some classic voices that are no longer actively licensed. The Piper voices here were trained using Google Colab. Most of them used about an hour of audio data from each synthesizer, and each entry lists its last-modified date, so you can revisit this page to check whether a voice has been updated. Voice quality will improve over time as more training data is generated.
Retro speech synths are listed under the first heading. All of them date from before roughly the year 2000. More retro synthesizers may be added later, as I work with others from the community to create them as NVDA add-ons.
I will not create Piper versions of voices that are still licensed. Doing so can infringe on the rights of those still selling the product by directly undermining their ability to sell the voice on their terms. Not cool. I am interested in recreating synthesizers that are no longer in development or cannot run on modern platforms. For these, Piper can provide a new lease on life, although because these are neural voices, some qualities of the original speech system are bound to be lost. Still, they get close to the original in many respects.
Retro voices!
Below are a few add-ons I’ve modified to work with NVDA 2024.4. Find them in the sub-headings.
BestSpeech (Keynote Gold)
Thanks to the hard work of Rommix0, we now have a repository of BestSpeech (Keynote Gold) as a standalone Windows application.
To be clear, this was extracted from an old abandonware program called the Amazing Writing Machine, and it is therefore considered no longer licensed. Berkeley Speech Technology produced it in 1994, and the source for the product is no longer available. Thus, the standalone speech application is open source, but the DLL is not.
Shoutouts to Quin (The Quinbox), Mason, and others in the community who came together to support the development of this driver. You all are truly awesome people to work with on fixing bugs, and without you it would not have happened.
I have developed Rommix0’s work into an NVDA add-on, available to download here, which will allow you to use this DLL inside NVDA 2024.4.
- A few notes
- No sound switching! The driver does support it, but converting the MME sound device name from a string to the integer ('ulong') the DLL expects is very difficult. As a result, the add-on uses the sound device set in the Sound Mapper, which is unfortunate for those who want NVDA on another sound device. (See the sketch after this list for roughly what a name-to-index lookup involves.)
- Thread race conditions: Sometimes the synthesizer dumps memory back as audio, so you will hear static along with earlier chunks of speech and memory. It used to crash NVDA; these days it continues on the corrupted thread where we couldn’t initiate BSTShutup, then stops after about a minute. We implemented a thread manager to ensure text chunks don’t get sent to the DLL too early and break it. The TTS lock and thread manager have helped somewhat in avoiding the static.
- Realtek sound cards: If you use Realtek, disabling sound enhancements is recommended. This helps ensure the DLL’s audio doesn’t get over-buffered and slow.
- Unicode characters: Text containing Unicode characters outside the supported code page will crash the reading, so we try to convert it to Windows-1250 first.
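For the curious, here is a rough idea of what that name-to-index lookup would involve. This is only a hypothetical sketch using ctypes against the Win32 waveOut API, not code from the add-on; note that the API truncates device names to 31 characters, which is part of why matching NVDA’s device name reliably is so fiddly.

# Hypothetical sketch: turning an output device name into the integer MME index
# a DLL like BestSpeech expects. Not taken from the actual add-on.
import ctypes
from ctypes import wintypes

MAXPNAMELEN = 32  # the API truncates names to 31 characters plus a terminator

class WAVEOUTCAPSW(ctypes.Structure):
    _fields_ = [
        ("wMid", wintypes.WORD),
        ("wPid", wintypes.WORD),
        ("vDriverVersion", wintypes.UINT),
        ("szPname", wintypes.WCHAR * MAXPNAMELEN),
        ("dwFormats", wintypes.DWORD),
        ("wChannels", wintypes.WORD),
        ("wReserved1", wintypes.WORD),
        ("dwSupport", wintypes.DWORD),
    ]

winmm = ctypes.windll.winmm

def mme_index_for(name):
    """Return the waveOut device index matching name, or -1 (WAVE_MAPPER)."""
    caps = WAVEOUTCAPSW()
    for index in range(winmm.waveOutGetNumDevs()):
        winmm.waveOutGetDevCapsW(index, ctypes.byref(caps), ctypes.sizeof(caps))
        # The stored name may be longer than the truncated szPname, so match on prefix.
        if caps.szPname and name.startswith(caps.szPname):
            return index
    return -1  # WAVE_MAPPER: let Windows use the default device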
Eloquence 3.3
Eloquence 3.3 is available as an NVDA driver. This driver was not originally developed by me; it was simply modernized to work on NVDA 2024.4. It has proper indexing and can correctly pipe audio through NVWave. This version of Eloquence was released in 1998, just two years after development began, and was used in JAWS 3.2. For a 26-year-old synthesizer, it does pretty well.
To fix the issue with the left parenthesis symbol “(“, you can add a voice dictionary entry for it that spells out the word, with a trailing space if you wish. This is a widespread bug in older Eloquence versions. Shorten pauses is supported, but use it with caution: the “p`1” tags can get pronounced by the speech within symbols, and periods or other punctuation may get announced more often. P2 and P3 tags appear to work more reliably, so they may get added to allow slight pause shortening.
Note that this will add a new synthesizer to your NVDA list: Old Eloquence. It does not replace your IBMTTS or Eloquence drivers. I tried to ensure no conflicts occur, and we unload the DLL when you switch to another Eloquence synthesizer.
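If you’re curious how that unloading works, here is a minimal, hypothetical sketch of the idea; the eciDelete call and the handle argument are placeholders for whatever cleanup the real driver performs, not the actual add-on code.

# Hypothetical sketch: release the engine and unload its DLL so another
# Eloquence-based synthesizer can load its own copy afterwards.
import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.FreeLibrary.argtypes = [wintypes.HMODULE]

def unload_synth_dll(dll, engine_handle):
    dll.eciDelete(engine_handle)       # placeholder: whatever cleanup the engine needs
    kernel32.FreeLibrary(dll._handle)  # unload the module from the NVDA process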
Eloquence 4.7 Synthesizer for NVDA
An Eloquence 4.7 Add-on is available for NVDA which uses the Eloquence version from JAWS 3.3 to 3.7. Again, this is far older than the 6.1 driver, and Eloquent Technologies / IBM have long since ceased supporting this version.
Some have noticed artifacting at the ends of words with this version. This may be a consequence of now playing the audio through the NVWave player rather than our own player object, but it has not been investigated deeply. To test, have it say numbers like 3, or words like “tree”; you will notice a strange artifact at the end.
Note that just as with Old Eloquence, this creates another synthesizer in your list: Eloquence 4.7.
Soft Voice for NVDA 2024.4
SoftVoice adapted to work on NVDA 2024.4 is available here for download. This is a classic synthesizer, extracted from the Microsoft Plus! pack. The DLL is not very stable with Say All, because unlike with BestSpeech, we cannot gauge when speech stops. Quite unfortunate.
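To illustrate why that matters, here is a rough sketch of the pattern an NVDA synth driver normally uses to let Say All advance: wait until the fed audio has finished playing, then raise the synthDoneSpeaking notification. The svSpeakBuffer call is a made-up stand-in, not SoftVoice’s real API; the problem is that without a reliable end-of-speech signal from the DLL, there is no safe moment to fire this notification.

# Rough sketch of signalling "done speaking" so NVDA's Say All can move on.
# svSpeakBuffer() is hypothetical; nvwave and synthDriverHandler are NVDA modules.
import threading
import nvwave
from synthDriverHandler import synthDoneSpeaking

player = nvwave.WavePlayer(channels=1, samplesPerSec=11025, bitsPerSample=16)

def speak_and_notify(synth, text):
    audio = svSpeakBuffer(text)  # hypothetical: the full PCM buffer from the DLL
    def worker():
        player.feed(audio)
        player.idle()            # block until the fed audio has finished playing
        synthDoneSpeaking.notify(synth=synth)  # only now is it safe to advance Say All
    threading.Thread(target=worker, daemon=True).start()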
BraiLab Synthesizer
BraiLab is an old Hungarian speech system, developed through the 80s and 90s. A BraiLab add-on updated for NVDA 2024.4 is available here, and it supports basic artificial indexing. You can find an emulator of Homelab and BraiLab here (account required); alternatively, a number of ROM dumps from various BraiLab versions exist here (Hungarian link).
The DLL was initially developed by BME-TMIT and adapted to work in NVDA by Robert Osztolykan and Áron Ócsvári. They are heroes for helping to make this come alive and for securing the right licensing to compile a driver for NVDA. A few others and I simply enhanced it over the years to work with modern NVDA versions.
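If you’re wondering what “artificial” indexing means here: when a DLL can’t report its own progress, a driver can estimate when each index position should be reached from the amount of audio generated before it, and fire NVDA’s synthIndexReached notification on a timer. The sketch below is only my illustration of that general idea, with an assumed audio format, not the BraiLab driver’s actual code.

# Rough idea of artificial indexing: estimate when an index will be reached from
# the audio generated before it, then notify NVDA on a timer.
import threading
from synthDriverHandler import synthIndexReached

SAMPLE_RATE = 22050   # assumed output rate
BYTES_PER_SAMPLE = 2  # assumed 16-bit mono

def schedule_index(synth, index, audio_bytes_before_index):
    # Seconds of audio that will play before this index position is spoken.
    delay = audio_bytes_before_index / (SAMPLE_RATE * BYTES_PER_SAMPLE)
    timer = threading.Timer(delay, synthIndexReached.notify,
                            kwargs={"synth": synth, "index": index})
    timer.daemon = True
    timer.start()
    return timer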
SMP (syntezator mowy SMP)
An SMPSoft (Polish) driver for NVDA 2024.4 is available here.
SMP Soft (also sometimes called SMP or Syntezator Mowy Polskiej) is a Polish text-to-speech engine from roughly the late 1990s to 2000s. It is not widely documented these days, but people in Poland sometimes used it as an alternative to voices like Ivona or RealSpeak. It reads Polish text decently, though it can struggle with certain diacritics or advanced rules. This driver dates from 2013 and was updated to run on modern NVDA, so some features may not work as expected. SMPSoft is “one-shot”: you call dll.speak(…) and immediately pull the entire wave data from memory in a single go. We then manually append silence, hoping the TTS audio is fully recognized by NVWave. Because SMPSoft doesn’t provide partial audio callbacks or an explicit “done” signal, NVWave can’t tell whether more audio is coming or the TTS is truly finished. That is why short utterances get truncated: SMPSoft hands us everything at once, with no incremental updates. Unless someone else steps up to the plate to solve this problem, it most likely won’t get resolved.
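For those wondering what appending silence looks like in practice, here is a minimal sketch of the idea. The smp_speak_to_buffer call and the 22050 Hz, 16-bit mono format are assumptions for illustration; the real driver’s DLL calls differ.

# Minimal sketch of the one-shot pattern described above: pull the whole wave
# buffer at once, pad it with silence, and hand everything to NVWave in one feed.
import nvwave

SAMPLE_RATE = 22050
PAD_MS = 300  # extra silence so short utterances don't get their tails cut off

player = nvwave.WavePlayer(channels=1, samplesPerSec=SAMPLE_RATE, bitsPerSample=16)

def speak(text):
    pcm = smp_speak_to_buffer(text)  # hypothetical: the entire utterance at once
    silence = b"\x00" * (SAMPLE_RATE * 2 * PAD_MS // 1000)  # 16-bit mono silence
    player.feed(pcm + silence)       # no partial callbacks, so one big feed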
How to obtain Piper
Piper is available as an NVDA add-on called Sonata from developer Mush42 (opens release page in a new window).
It is also available for the Raspberry Pi. If you are into Linux things, you can hack together a Speech Dispatcher module that will also let you run Piper there.
Training your own voice in Colab, and a shoutout
If you want me to train any other old speech system, high-quality samples of that voice are welcome. Use the contact form on this site, or if you’re a friend on social media, get in touch!
If you wish to train your own voice, I highly recommend reading through Zachary Bennoui’s post on training with Piper, which was a huge help for me at the time. It will get you started on the basic notebooks, as the main Piper training notebook can still be copied and used just the same. Very little in that post has changed, although be aware that paid Colab subscriptions will do better, especially if you can get the Pro tier with 24-hour runtimes.
I also wish to shout out the developer of the Sonata voices for helping me get the streaming variants to work. This person is simply a brilliant dev. Here’s a Colab notebook file for exporting your last.ckpt file to your drive. Please note that this creates a folder called model_final in the root of your drive. The notebook will eventually be updated with headings and more options, but for now it’s basic enough to work. You will also need to update the configuration JSON file for your voice model, setting the streaming boolean to true like so:
"num_symbols": 256,
"num_speakers": 1,
"speaker_id_map": {},
"piper_version": "1.0.0",
"streaming": true,
"key": "en_US-voice+RT-medium"
}
(Obviously, replace the voice name with the name of your model.) In order for Sonata to recognize the voice as a “Fast” variant, you must include +RT in the folder name.
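If you’d rather flip that flag from a script, something like this works; the file name below is just an example, so point it at your own model’s JSON config.

# Set the streaming flag in a voice's JSON config. The file name is an example.
import json

path = "en_US-voice+RT-medium.onnx.json"
with open(path, encoding="utf-8") as f:
    config = json.load(f)

config["streaming"] = True

with open(path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)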
Check out Mush42’s Piper fork repo too, if you’re a curious enough nerd. If you can contribute, I highly encourage keeping up with the project through the branches in that repo.
Pre-training steps: How do I create a good dataset?
A lot of you kind readers have asked for steps on how to create a good dataset for pre-training. This is the step before you load the notebook. You will need Whisper installed. There are many Git repos of Whisper for Windows, and this links to a standalone one. If you use the large-v2 model, you will get better results, although the model itself can be a few gigabytes to download and store. The medium model may do OK at transcribing, but I have noticed it struggle with unclear (or non-native) English speakers.
- Steps for getting this done:
- Create your dataset. Whether human or robot, you need good-quality recordings. You don’t need to splice these yourself, since Whisper will help us, but you need at least 1 to 2 hours of good data, more if possible. Fine-tuning should be doable on a single hour of audio to get a fairly close-matching voice.
- Once you have your dataset, use Whisper to transcribe it. After placing Whisper on your command-line path, open a command prompt with the CMD command in the Run dialog (Windows + R), or open a terminal on Mac and adjust the command, then run:
whisper-faster XXX.mp3 --model large-v2 --output_format vtt --output_dir .
This will save a VTT file with the results in the directory you ran the command from. Replace XXX.mp3 with the path to your file. If you run Whisper from a local repo, be aware that it will place the transcript into that folder instead.
- Now we get to the fun part: splitting. For now, I have created a BASH splitting script (which I’m sure will move to Git soon) that does the hard work of reading that VTT file and the MP3 for you, then splitting the audio into files prefixed with the word “data” and a number. Here’s the syntax for this command:
./split_audio.sh xxx.mp3 xxx.vtt xxx
Fill in the MP3 and VTT filenames. The last XXX is a number you specify to begin counting from, which is useful when adding further audio to a dataset later and continuing its numbering.
Warning for Windows users: This is a BASH (Linux) script. You will need something like Windows Subsystem for Linux, plus an installed copy of FFmpeg, for the script to work. Mac users, be sure to install FFmpeg for macOS (opens in a new window) before using the script.
- The script will add 100 ms of silence automatically and save a file called transcript.txt. You must review transcript.txt yourself after each run, because it gets overwritten! Do not split multiple files in succession, or you will lose your transcripts. Manually copy out the lines, review them, and paste them above new chunks of transcript data. This is intentional: you should review the transcribed data and how the audio matches up to it.
- Review: This can take hours. Your task is to listen to many of the files and check whether any words bleed over that are not in the transcribed text; this is called aligning. Focus on quality over quantity. VITS is particularly sensitive to misalignments and can quickly learn noise from bad data. If your model is turning out poorly, revisit this step and spend more hours on it. (A rough duration check, like the sketch after this list, can help you triage which clips to listen to first.) Still easier than manually splicing large audio files yourself, isn’t it? And then transcribing those fragments by hand? No way.
- At the end of all this, you should have hundreds of small file chunks. Zip those up into a folder, upload to your drive, and use it in the pre-training steps.
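Here is a rough, hypothetical triage helper for that review step. It assumes your chunks have been converted to WAV files named data1.wav, data2.wav, and so on, with a transcript.txt containing one line per clip in order; the 8 to 25 characters-per-second range is an arbitrary guess you should tune to your speaker.

# Flag clips whose duration doesn't fit their transcript, so you can listen to
# the suspicious ones first. All file names and thresholds here are assumptions.
import wave

with open("transcript.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

for i, text in enumerate(lines, start=1):
    with wave.open(f"data{i}.wav", "rb") as clip:
        seconds = clip.getnframes() / clip.getframerate()
    rate = len(text) / seconds if seconds else 0
    if not 8 <= rate <= 25:
        print(f"data{i}.wav: {seconds:.1f}s for {len(text)} chars "
              f"({rate:.0f} chars/sec) - listen to this one")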
Bonus: If you have a dataset you wish to add 100 ms of silence to, use this BASH script that automates adding silence to the start of each file in a folder, which can be useful for aligning your data.
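If you prefer Python over BASH, here is an equivalent idea using only the standard library; it assumes your chunks are WAV files sitting in a folder called dataset.

# Prepend 100 ms of silence to every WAV file in a folder, as an alternative to
# the BASH script above. Rewrites each file in place.
import pathlib
import wave

SILENCE_MS = 100

for path in pathlib.Path("dataset").glob("*.wav"):
    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    pad_frames = int(params.framerate * SILENCE_MS / 1000)
    silence = b"\x00" * (pad_frames * params.sampwidth * params.nchannels)
    with wave.open(str(path), "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + frames)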
Voices list
To install a voice, download the files below and import them into your environment. If you use NVDA, you can do this from the Sonata Voice Manager in the NVDA menu by loading the local file.
- RT (Real-time) voices
- Mac OS Alex (fast, medium, last modified 4/20/2024)
- Votrax SSI263 (fast, medium, last modified 4/22/2024)
- Keynote Gold (fast, medium, last modified 4/28/2024)
- Non-fast (standard) voices
- Mac OS Alex (standard, medium, last modified 10/27/2023)
- Keynote Gold (standard, medium, last modified 3/30/2024)
- Votrax SSI263 (standard, medium, last modified 3/27/2024)
You might wonder: why the Alex voice? This was a hard decision, as it is technically still available. However, I am not here to make commercial income from Apple’s voice, and proving direct harm to sales is harder when the voice was never licensed to run on other platforms in the first place. For now, there’s an Alex voice. Should Apple send me a cease and desist over the matter, it will be revisited swiftly.
Happy Pipering!