Speech-02-Turbo

pro

audio_video

Created Mar 29, 2025

$ 0.06 / thousand input tokens
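As a rough illustration of the pricing, the sketch below estimates the cost of one request. It assumes billing is strictly linear at the listed rate and that one character counts as one token, as the `text` argument description states. The function name `estimate_cost_usd` is hypothetical, and any rounding, minimum charge, or special handling of pause markers by the service is not covered here.

```python
# Listed price: $0.06 per thousand input tokens.
PRICE_PER_THOUSAND_TOKENS = 0.06


def estimate_cost_usd(text: str) -> float:
    """Estimate the cost in USD of synthesizing `text`.

    Assumption: one character is one token, and billing is linear
    with no rounding or minimum charge.
    """
    tokens = len(text)  # Python counts Unicode code points here
    return tokens / 1000 * PRICE_PER_THOUSAND_TOKENS


if __name__ == "__main__":
    sample = "Once upon a time, there lived a poor woodcutter."
    print(f"{len(sample)} characters, estimated ${estimate_cost_usd(sample):.4f}")
```

Under these assumptions, a request at the 5000-character maximum would cost about $0.30.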

MiniMax's Speech-02-Turbo is a Text-to-Audio (T2A) model for high-quality voice synthesis, designed for low-latency, real-time applications. It supports built-in system voices as well as personal voices cloned via minimax/voice-cloning. Speed, volume, and pitch are adjustable, and emotional expression can be set explicitly or detected automatically. With its multilingual support, it is well suited to applications that need real-time voice interaction.

Technical Specifications

Service Type: Text-to-Speech Generation
Supported Formats:
- Input: Text
- Output: MP3 audio

Usage Examples

Generate fast speech synthesis

User prompt:

Give me the candy house story and generate the audio file

Result:

Here is the audio generated by (minimax/speech-02-turbo) and ensure to show in this format: [audio](https://storage.oaphub.ai/19/192881631470551040/1385310226572747431d830?k=a462da2e)

Raw tool call (how an LLM might use this tool)

{
  "name": "Speech-02-Turbo",
  "arguments": {
    "text": "Once upon a time, there lived a poor woodcutter with his wife and two children, Hansel and Gretel. During a time of great famine, the family had barely enough food to survive.\n\nOne night, the children overheard their parents talking. \"We must take the children deep into the forest and leave them there,\" said their stepmother. \"Otherwise, we'll all starve.\"\n\nThe next morning, the parents led Hansel and Gretel into the woods. But clever Hansel had filled his pockets with white pebbles and dropped them along the path. When their parents abandoned them, the children followed the pebbles home by moonlight.\n\nThe parents tried again, but this time Hansel only had bread crumbs to mark the trail. Birds ate the crumbs, and the children became truly lost in the forest.\n\nAfter wandering for days, they discovered a marvelous house made entirely of gingerbread, with walls of cake and windows of clear sugar candy. Starving, they began to eat the house.\n\n\"Nibble, nibble, little mouse, who is nibbling at my house?\" called a voice from inside.\n\nAn old woman emerged, seemingly kind and welcoming. She invited them in, fed them a feast, and prepared soft beds. But she was actually a wicked witch who trapped children to eat them.\n\nThe witch locked Hansel in a cage and made Gretel work as her servant. Each day, she checked if Hansel was fat enough to eat. But clever Hansel fooled her by holding out a thin bone instead of his finger.\n\nWhen the witch finally decided to cook Hansel anyway, she asked Gretel to check if the oven was hot enough. Gretel pretended not to know how and asked the witch to show her. When the witch leaned into the oven, Gretel pushed her in and slammed the door shut.\n\nThe children filled their pockets with the witch's jewels and precious stones, then found their way home. They discovered their stepmother had died, and their father welcomed them back with tears of joy. With the witch's treasure, they never went hungry again.\n\nAnd they all lived happily ever after.",
    "voice_id": "Friendly_Person",
    "emotion": "neutral",
    "speed": 0.9
  }
}

Raw tool result from the MCP server

{
  "type": "text",
  "text": "Here is the audio generated by (minimax/speech-02-turbo) and ensure to show in this format: [audio](https://storage.oaphub.ai/19/192881631470551040/1385310226572747431d830?k=a462da2e)",
  "annotations": null
}
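
The tool result carries the audio location as a Markdown link inside the `text` field rather than as a separate field. A client that wants the bare URL could pull it out with a regular expression, as in this minimal sketch. The helper name `extract_audio_url` is hypothetical, and the code assumes the result always uses the `[audio](...)` link form shown above.

```python
import re
from typing import Optional

# Matches a Markdown link of the form [audio](https://...)
AUDIO_LINK_RE = re.compile(r"\[audio\]\((https?://[^)\s]+)\)")


def extract_audio_url(tool_result: dict) -> Optional[str]:
    """Return the audio URL from a text-type tool result, or None if absent."""
    if tool_result.get("type") != "text":
        return None
    match = AUDIO_LINK_RE.search(tool_result.get("text") or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    result = {
        "type": "text",
        "text": "Here is the audio generated by (minimax/speech-02-turbo) "
                "and ensure to show in this format: "
                "[audio](https://example.com/sample.mp3)",
        "annotations": None,
    }
    print(extract_audio_url(result))
```

Returning `None` instead of raising keeps the caller in control when the server returns an error message or a differently formatted result.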

Tools

Usage: Generate speech with Speech-02-Turbo

Input Arguments:

Name	Type	Required	Description
`text`	string	✓	Text to convert to speech. Every character is 1 token. Maximum 5000 characters. Insert <#x#> between words to add a pause of x seconds (0.01-99.99).
`pitch`	number		Speech pitch. Range: -12 to 12 (integer). Default: 0.
`speed`	number		Speech speed. Range: 0.5 to 2. Default: 1.
`volume`	number		Speech volume. Range: 0 to 10. Default: 1.
`bitrate`	number		Bitrate for the generated speech. Must be one of: 32000, 64000, 128000, 256000. Default: 128000.
`channel`	string		Number of audio channels. Options: "mono", "stereo". Default: "mono".
`emotion`	string		Speech emotion. Options: "auto", "neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised". Default: "auto".
`voice_id`	string		Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning), or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl. Default: "Wise_Woman".
`sample_rate`	number		Sample rate for the generated speech. Must be one of: 8000, 16000, 22050, 24000, 32000, 44100. Default: 32000.
`language_boost`	string		Enhance recognition of specific languages and dialects. Options: "None", "Automatic", "Chinese", "Chinese,Yue", "English", "Arabic", "Russian", "Spanish", "French", "Portuguese", "German", "Turkish", "Dutch", "Ukrainian", "Vietnamese", "Indonesian", "Japanese", "Italian", "Korean", "Thai", "Polish", "Romanian", "Greek", "Czech", "Finnish", "Hindi". Default: "None".
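
The constraints in the table above can be checked on the client before a call is made. The following is a sketch only: the helper names `pause` and `validate_arguments` are hypothetical, the ranges are treated as inclusive, and the exact number formatting accepted inside `<#x#>` is an assumption. `voice_id` is not checked because cloned voice IDs are user-specific.

```python
from typing import Any, Dict, List

EMOTIONS = {"auto", "neutral", "happy", "sad", "angry",
            "fearful", "disgusted", "surprised"}
BITRATES = {32000, 64000, 128000, 256000}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
CHANNELS = {"mono", "stereo"}
MAX_TEXT_CHARS = 5000


def pause(seconds: float) -> str:
    """Build a pause marker such as <#1.5#> to place between words."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    # Assumption: at most two decimals, without trailing zeros.
    return f"<#{round(seconds, 2):g}#>"


def validate_arguments(args: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in `args`; empty means it looks valid."""
    problems = []
    text = args.get("text")
    if not isinstance(text, str):
        problems.append("text is required and must be a string")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")

    pitch = args.get("pitch", 0)
    if not (isinstance(pitch, int) and -12 <= pitch <= 12):
        problems.append("pitch must be an integer from -12 to 12")
    if not 0.5 <= args.get("speed", 1) <= 2:
        problems.append("speed must be from 0.5 to 2")
    if not 0 <= args.get("volume", 1) <= 10:
        problems.append("volume must be from 0 to 10")

    if args.get("bitrate", 128000) not in BITRATES:
        problems.append("bitrate must be one of " + str(sorted(BITRATES)))
    if args.get("sample_rate", 32000) not in SAMPLE_RATES:
        problems.append("sample_rate must be one of " + str(sorted(SAMPLE_RATES)))
    if args.get("channel", "mono") not in CHANNELS:
        problems.append("channel must be 'mono' or 'stereo'")
    if args.get("emotion", "auto") not in EMOTIONS:
        problems.append("emotion must be one of " + str(sorted(EMOTIONS)))
    return problems


if __name__ == "__main__":
    args = {
        "text": f"Nibble, nibble, little mouse,{pause(0.5)}who is nibbling at my house?",
        "voice_id": "Friendly_Person",
        "emotion": "neutral",
        "speed": 0.9,
    }
    print(validate_arguments(args) or "arguments look valid")
```

Collecting every problem in a list, rather than raising on the first one, lets a caller report all issues at once before retrying.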