
Mimic 3

image: Mimic 3 running on the Mark II

A fast and local neural text to speech system for Mycroft and the Mark II.

Use Cases

Dependencies

Mimic 3 requires:

Installation

eSpeak

Some voices depend on eSpeak-ng, specifically libespeak-ng.so. For those voices, make sure that libespeak-ng is installed with:

sudo apt-get install libespeak-ng1

Mycroft TTS Plugin

Install the plugin:

mycroft-pip install plugin-tts-mimic3[all]

Enable the plugin in your mycroft.conf file:

mycroft-config set tts.module mimic3_tts_plug

See the plugin's documentation for more options.

Using pip

Install the command-line tool:

pip install mimic3-tts[all]

Once installed, the following commands will be available:

* `mimic3`
* `mimic3-download`

Install the HTTP web server:

pip install mimic3-http[all]

Once installed, the following commands will be available:

* `mimic3-server`
* `mimic3-client`

Language support can be selectively installed by replacing `all` with:

  • de - German
  • es - Spanish
  • fr - French
  • it - Italian
  • nl - Dutch
  • ru - Russian
  • sw - Kiswahili

Excluding the bracketed extras entirely will install support for English only.

From Source

Clone the repository:

git clone https://github.com/MycroftAI/mimic3.git

Run the install script:

cd mimic3/
./install.sh

A virtual environment will be created in mimic3/.venv and each of the Python modules will be installed in editable mode (`pip install -e`).

Once installed, the following commands will be available in .venv/bin:

* `mimic3`
* `mimic3-server`
* `mimic3-client`
* `mimic3-download`

Voice Keys

Mimic 3 references voices with the format:

  • <language>/<name>_<quality> for single speaker voices, and
  • <language>/<name>_<quality>#<speaker> for multi-speaker voices
    • <speaker> can be a name or number starting at 0
    • Speaker names come from a voice's speakers.txt file

For example, the default Alan Pope voice key is en_UK/apope_low. The CMU Arctic voice contains multiple speakers, with a commonly used voice being en_US/cmu-arctic_low#slt.

Voices are automatically downloaded from GitHub and stored in ${HOME}/.local/share/mimic3 (technically ${XDG_DATA_HOME}/mimic3). You can also manually download them.
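As a sketch of working with voice keys, the CLI can list the voices it knows about (assuming the `--voices` flag is available in your release), and the key format splits cleanly with shell parameter expansion:

```shell
# List known voice keys, one per line
mimic3 --voices

# Voice keys have the shape <language>/<name>_<quality>[#<speaker>];
# the parts can be pulled out with standard shell expansion:
voice='en_US/cmu-arctic_low#slt'
lang="${voice%%/*}"       # language part: en_US
speaker="${voice##*#}"    # speaker part (multi-speaker keys only): slt
echo "$lang $speaker"
```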

Running

Command-Line Tools

The mimic3 command can be used to synthesize audio on the command line:

mimic3 --voice 'en_UK/apope_low' 'My hovercraft is full of eels.' > hovercraft_eels.wav

See voice keys for how to reference voices and speakers.

See mimic3 --help or the CLI documentation for more details.
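The CLI can also read its text from standard input when no text argument is given, which is convenient for synthesizing a file of sentences (a sketch; exact stdin behavior may vary by version):

```shell
# Write a sentence to a file, then synthesize it from stdin.
# Output is WAV audio on stdout.
echo 'My hovercraft is full of eels.' > input.txt
mimic3 --voice 'en_UK/apope_low' < input.txt > hovercraft_eels.wav
```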

Downloading Voices

Mimic 3 automatically downloads voices when they're first used, but you can manually download them too with mimic3-download.

For example:

mimic3-download 'en_US/*'

will download all U.S. English voices to ${HOME}/.local/share/mimic3.

See mimic3-download --help for more options.

Web Server and Client

Start a web server with mimic3-server and visit http://localhost:59125 to view the web UI.

screenshot of web interface

The following endpoints are available:

  • /api/tts
    • POST text or SSML and receive WAV audio back
    • Use ?voice= to select a different voice/speaker
    • Set Content-Type to application/ssml+xml (or use ?ssml=1) for SSML input
  • /api/voices
    • Returns a JSON list of available voices
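As a sketch, the endpoints above can be exercised with curl while the server is running on localhost:59125:

```shell
# Plain text in, WAV audio out; ?voice= selects the voice/speaker
curl -X POST --data 'Hello world.' \
    --output hello.wav \
    'http://localhost:59125/api/tts?voice=en_UK/apope_low'

# SSML input: set Content-Type to application/ssml+xml
curl -X POST -H 'Content-Type: application/ssml+xml' \
    --data '<speak>Hello world.</speak>' \
    --output hello_ssml.wav \
    'http://localhost:59125/api/tts'

# JSON list of available voices
curl 'http://localhost:59125/api/voices'
```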

An OpenAPI test page is also available at http://localhost:59125/openapi.

See `mimic3-server --help` or the web server documentation for more details.

Web Client

The mimic3-client program provides an interface to the Mimic 3 web server that is similar to the mimic3 command.

Assuming you have started mimic3-server and can access http://localhost:59125, then:

mimic3-client --voice 'en_UK/apope_low' 'My hovercraft is full of eels.' > hovercraft_eels.wav

See mimic3-client --help for more options.

CUDA Acceleration

If you have a GPU with support for CUDA, you can accelerate synthesis with the --cuda flag when running mimic3 or mimic3-server. This requires you to install the onnxruntime-gpu Python package.
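For example (a sketch, assuming `onnxruntime-gpu` is installed into the same Python environment as Mimic 3):

```shell
# onnxruntime-gpu provides the CUDA execution provider for ONNX models
pip install onnxruntime-gpu

# Synthesize on the GPU from the command line
mimic3 --cuda --voice 'en_UK/apope_low' 'GPU accelerated speech.' > out.wav

# Or run the web server with CUDA acceleration
mimic3-server --cuda
```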

Using nvidia-docker is highly recommended. See Dockerfile.gpu for an example of how to build a compatible container.

MaryTTS Compatibility

Use the Mimic 3 web server as a drop-in replacement for MaryTTS, for example with Home Assistant.

Make sure to use a compatible voice key like en_UK/apope_low.

For Mycroft, you can use this instead of the plugin by running:

mycroft-config edit user

and then adding the following:

"tts": {
    "module": "marytts",
    "marytts": {
        "url": "http://localhost:59125",
        "voice": "en_UK/apope_low"
    }
}
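Home Assistant and other MaryTTS clients talk to the server's MaryTTS-compatible endpoint. A minimal sketch with curl, assuming the server exposes the usual MaryTTS `GET /process` route with `INPUT_TEXT` and `VOICE` parameters:

```shell
# MaryTTS-style request; VOICE takes a Mimic 3 voice key.
# -G sends the --data-urlencode values as query parameters.
curl -G 'http://localhost:59125/process' \
    --data-urlencode 'INPUT_TEXT=Hello world.' \
    --data-urlencode 'VOICE=en_UK/apope_low' \
    --output hello.wav
```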

SSML

A subset of SSML (Speech Synthesis Markup Language) is supported.

For example:

<speak>
  <voice name="en_UK/apope_low">
    <s>
      Welcome to the world of speech synthesis.
    </s>
  </voice>
  <break time="3s" />
  <voice name="en_US/cmu-arctic_low#slt">
    <s>
      <prosody volume="soft" rate="150%">
        This is a <say-as interpret-as="number" format="ordinal">2</say-as> voice.
      </prosody>
    </s>
  </voice>
</speak>

will speak the two sentences with different voices and a 3 second pause in between. The second sentence will also have the number "2" pronounced as "second" (ordinal form).

SSML <say-as> support varies between voice types:

  • gruut
  • eSpeak-ng
  • Character-based voices do not currently support <say-as>
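To feed SSML like the document above through the command line, the CLI accepts an `--ssml` flag that switches the input parser from plain text to SSML (a sketch; reading the markup from a file via stdin):

```shell
# Write an SSML document to a file
cat > greeting.ssml << 'EOF'
<speak>
  <voice name="en_UK/apope_low">
    <s>Welcome to the world of speech synthesis.</s>
  </voice>
</speak>
EOF

# --ssml tells mimic3 to interpret the input as SSML rather than plain text
mimic3 --ssml < greeting.ssml > greeting.wav
```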

License

See the LICENSE file.