
Speech-02-Turbo
pro
audio_video
Created Mar 29, 2025
$ 0.06 / thousand input tokens
MiniMax's Speech-02-Turbo is a Text-to-Audio (T2A) model for high-quality voice synthesis, designed for low-latency, real-time applications. It supports built-in system voices as well as personal voices cloned via minimax/voice-cloning. Speed, volume, and pitch are adjustable, and emotional expression can be set explicitly or detected automatically. With multilingual support, it suits applications that need real-time voice interaction.
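As a rough illustration of the pricing, the listed rate of $0.06 per thousand input tokens can be combined with the rule from the Input Arguments table that every character counts as one token. The helper name `estimate_cost_usd` is hypothetical, and the sketch assumes that Python's `len` matches how the service counts characters, which this page does not confirm.

```python
PRICE_PER_THOUSAND_TOKENS_USD = 0.06  # listed price on this page


def estimate_cost_usd(text: str) -> float:
    """Estimate the input cost of one request.

    Assumes 1 character = 1 input token, as stated for the `text` argument.
    How the service counts pause markers or multi-byte characters is not
    documented here, so treat the result as an approximation.
    """
    return len(text) * PRICE_PER_THOUSAND_TOKENS_USD / 1000
```

Under these assumptions, a request at the 5000-character maximum would cost about $0.30.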
Technical Specifications
- Service Type: Text-to-Speech Generation
- Supported Formats:
- Input: Text
- Output: MP3 audio
Usage Examples
Generate fast speech synthesis
User prompt:
Give me the candy house story and generate the audio file
Result:
Here is the audio generated by (minimax/speech-02-turbo) and ensure to show in this format: [audio](https://storage.oaphub.ai/19/192881631470551040/1385310226572747431d830?k=a462da2e)
Raw Tool call (How LLM might use this tool)
{
"name": "Speech-02-Turbo",
"arguments": {
"text": "Once upon a time, there lived a poor woodcutter with his wife and two children, Hansel and Gretel. During a time of great famine, the family had barely enough food to survive.\n\nOne night, the children overheard their parents talking. \"We must take the children deep into the forest and leave them there,\" said their stepmother. \"Otherwise, we'll all starve.\"\n\nThe next morning, the parents led Hansel and Gretel into the woods. But clever Hansel had filled his pockets with white pebbles and dropped them along the path. When their parents abandoned them, the children followed the pebbles home by moonlight.\n\nThe parents tried again, but this time Hansel only had bread crumbs to mark the trail. Birds ate the crumbs, and the children became truly lost in the forest.\n\nAfter wandering for days, they discovered a marvelous house made entirely of gingerbread, with walls of cake and windows of clear sugar candy. Starving, they began to eat the house.\n\n\"Nibble, nibble, little mouse, who is nibbling at my house?\" called a voice from inside.\n\nAn old woman emerged, seemingly kind and welcoming. She invited them in, fed them a feast, and prepared soft beds. But she was actually a wicked witch who trapped children to eat them.\n\nThe witch locked Hansel in a cage and made Gretel work as her servant. Each day, she checked if Hansel was fat enough to eat. But clever Hansel fooled her by holding out a thin bone instead of his finger.\n\nWhen the witch finally decided to cook Hansel anyway, she asked Gretel to check if the oven was hot enough. Gretel pretended not to know how and asked the witch to show her. When the witch leaned into the oven, Gretel pushed her in and slammed the door shut.\n\nThe children filled their pockets with the witch's jewels and precious stones, then found their way home. They discovered their stepmother had died, and their father welcomed them back with tears of joy. With the witch's treasure, they never went hungry again.\n\nAnd they all lived happily ever after.",
"voice_id": "Friendly_Person",
"emotion": "neutral",
"speed": 0.9
}
}
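A tool call of this shape can be assembled programmatically. The following is a minimal Python sketch, not the platform's client: the function name `build_tool_call` is hypothetical, and how the resulting JSON is delivered to the MCP server is outside its scope. Serializing with `json.dumps` ensures that line breaks and quotation marks inside the story text are escaped correctly.

```python
import json


def build_tool_call(text: str, voice_id: str = "Wise_Woman",
                    emotion: str = "auto", speed: float = 1.0) -> str:
    """Serialize a Speech-02-Turbo tool call as a JSON string.

    Defaults mirror the documented defaults for voice_id, emotion, and speed.
    """
    call = {
        "name": "Speech-02-Turbo",
        "arguments": {
            "text": text,
            "voice_id": voice_id,
            "emotion": emotion,
            "speed": speed,
        },
    }
    # ensure_ascii=False keeps non-English text readable in the payload.
    return json.dumps(call, ensure_ascii=False)
```

For instance, `build_tool_call(story, voice_id="Friendly_Person", emotion="neutral", speed=0.9)` would produce a payload equivalent to the example, where `story` is a placeholder for the narrated text.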
Raw tool result from the MCP server
{
"type": "text",
"text": "Here is the audio generated by (minimax/speech-02-turbo) and ensure to show in this format: [audio](https://storage.oaphub.ai/19/192881631470551040/1385310226572747431d830?k=a462da2e)",
"annotations": null
}
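Because the result arrives as text containing a Markdown-style `[audio](...)` link rather than as a structured URL field, a caller that wants the file itself has to extract the link. The sketch below shows one way to do that in Python. The helper name `extract_audio_url` is hypothetical, and the pattern assumes the result format matches the example shown here.

```python
import re
from typing import Optional

# Matches the [audio](https://...) link seen in the example tool result.
AUDIO_LINK = re.compile(r"\[audio\]\((https?://[^)\s]+)\)")


def extract_audio_url(tool_result: dict) -> Optional[str]:
    """Return the audio URL from a text-type tool result, or None if absent."""
    if tool_result.get("type") != "text":
        return None
    match = AUDIO_LINK.search(tool_result.get("text", ""))
    return match.group(1) if match else None
```

Returning `None` instead of raising leaves it to the caller to decide how to handle a result that carries no link.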
Tools
Speech-02-Turbo
Usage: Generate speech with Speech-02-Turbo
Input Arguments:
Name | Type | Required | Description |
---|---|---|---|
text | string | ✓ | Text to convert to speech. Every character is 1 token. Maximum 5000 characters. Use <#x#> between words to control pause duration (0.01-99.99s). |
pitch | number | | Speech pitch. Range: -12 to 12 (integer). Default: 0. |
speed | number | | Speech speed. Range: 0.5 to 2. Default: 1. |
volume | number | | Speech volume. Range: 0 to 10. Default: 1. |
bitrate | number | | Bitrate for the generated speech. Must be one of: 32000, 64000, 128000, 256000. Default: 128000. |
channel | string | | Number of audio channels. Options: "mono", "stereo". Default: "mono". |
emotion | string | | Speech emotion. Options: "auto", "neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised". Default: "auto". |
voice_id | string | | Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning), or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl. Default: "Wise_Woman". |
sample_rate | number | | Sample rate for the generated speech. Must be one of: 8000, 16000, 22050, 24000, 32000, 44100. Default: 32000. |
language_boost | string | | Enhance recognition of specific languages and dialects. Options: "None", "Automatic", "Chinese", "Chinese,Yue", "English", "Arabic", "Russian", "Spanish", "French", "Portuguese", "German", "Turkish", "Dutch", "Ukrainian", "Vietnamese", "Indonesian", "Japanese", "Italian", "Korean", "Thai", "Polish", "Romanian", "Greek", "Czech", "Finnish", "Hindi". Default: "None". |
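The documented ranges and option lists can be checked on the client side before a call is made. The following Python sketch is illustrative only: `validate_arguments` and `pause` are hypothetical helpers, not part of the tool, and the server may apply stricter or different rules. The `pause` helper builds the `<#x#>` marker and assumes that a plain decimal such as `0.5` is an accepted value for `x`.

```python
from typing import Any, Dict, List

MAX_TEXT_CHARS = 5000
# Enumerated options and documented defaults, taken from the argument table.
OPTIONS = {
    "bitrate": ((32000, 64000, 128000, 256000), 128000),
    "sample_rate": ((8000, 16000, 22050, 24000, 32000, 44100), 32000),
    "channel": (("mono", "stereo"), "mono"),
    "emotion": (("auto", "neutral", "happy", "sad", "angry",
                 "fearful", "disgusted", "surprised"), "auto"),
}
# Numeric ranges as (low, high, default).
RANGES = {"pitch": (-12, 12, 0), "speed": (0.5, 2, 1), "volume": (0, 10, 1)}


def pause(seconds: float) -> str:
    """Return a <#x#> pause marker for a duration in seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause duration must be between 0.01 and 99.99 seconds")
    return f"<#{round(seconds, 2):g}#>"


def validate_arguments(args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems: List[str] = []
    text = args.get("text")
    if not isinstance(text, str):
        problems.append("text is required and must be a string")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")

    for key, (low, high, default) in RANGES.items():
        value = args.get(key, default)
        if not isinstance(value, (int, float)) or not low <= value <= high:
            problems.append(f"{key} must be a number from {low} to {high}")
    if not isinstance(args.get("pitch", 0), int):
        problems.append("pitch must be an integer")

    for key, (allowed, default) in OPTIONS.items():
        if args.get(key, default) not in allowed:
            problems.append(f"{key} must be one of {allowed}")
    return problems
```

A text such as `"Hello," + pause(0.5) + " world"` would then carry a half-second pause between the two words. The sketch does not check `voice_id` or `language_boost`, since custom voice IDs cannot be enumerated in advance.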
© 2025 Open Agent Platform. All Rights Reserved.