Inline control tokens that shape emotion, style, prosody, and sound effects inside the input text.
Inline tags control delivery at the token level. Insert them anywhere in the input, and the model adjusts the surrounding speech. For example:
<|emotion:enthusiasm|>Welcome to the show!<|prosody:pause|>Let's get started!
Tags fall into four categories:
emotion: <|emotion:…|>, such as elation, fear, anger.
style: <|style:…|>, such as shouting, whispering.
sound effects: <|sfx:…|>, such as cough, sneeze.
prosody: <|prosody:…|>, including speed, pause, pitch, and expressiveness.
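To make the tag syntax concrete, the sketch below splits an input string into tags and plain text. It is only an illustration: the helper name `split_tags` is hypothetical, and the regular expression assumes tag names consist of lowercase letters, digits, and underscores, which this page does not state.

```python
import re
from typing import List, Tuple

# Assumption: tag names use lowercase letters, digits, and underscores only.
TAG_RE = re.compile(r"<\|(emotion|style|sfx|prosody):([a-z0-9_]+)\|>")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Return the input as ordered pieces: ("text", chunk) or (category, name)."""
    pieces: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            # Plain text between the previous tag (or the start) and this tag.
            pieces.append(("text", text[pos:match.start()]))
        pieces.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        # Trailing text after the last tag.
        pieces.append(("text", text[pos:]))
    return pieces


example = "<|emotion:enthusiasm|>Welcome to the show!<|prosody:pause|>Let's get started!"
for kind, value in split_tags(example):
    print(kind, repr(value))
```

Anything that does not match one of the four category prefixes is left in the text pieces, so a mistyped tag shows up as ordinary text rather than being silently dropped.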
Recommended usage:
Lead the turn with delivery tokens. Emotion, style, speed, pitch, and expressiveness tags set how the entire turn is delivered, so place them at the start of the input before any text. Positional tokens are the exception: <|prosody:pause|> and <|prosody:long_pause|> go exactly where the break should fall, and each <|sfx:…|> goes right before the sound it triggers.
Pair every sound effect with onomatopoeia. A <|sfx:…|> token works best when the matching written sound follows immediately, such as <|sfx:laughter|>Haha, <|sfx:sigh|>Uh, or <|sfx:sneeze|>Achoo. The written cue helps the model realize the sound effect.
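The placement guidance above can be checked mechanically before sending input. The following is a minimal sketch, not part of any official tooling: `lint_turn` is a hypothetical helper, and the tag-name pattern is an assumption. It can only confirm that some written text follows a sound-effect tag, not that the text is a fitting onomatopoeia.

```python
import re
from typing import List

# Assumption: tag names use lowercase letters, digits, and underscores only.
TAG_RE = re.compile(r"<\|(emotion|style|sfx|prosody):([a-z0-9_]+)\|>")

# Positional prosody tags may appear anywhere in the turn.
POSITIONAL = {("prosody", "pause"), ("prosody", "long_pause")}


def lint_turn(text: str) -> List[str]:
    """Return warnings for tags that ignore the recommended placement."""
    warnings: List[str] = []
    seen_text = False
    pos = 0
    for match in TAG_RE.finditer(text):
        if text[pos:match.start()].strip():
            seen_text = True
        category, name = match.group(1), match.group(2)
        if category == "sfx":
            rest = text[match.end():].lstrip()
            # A sound effect should be followed by written text, not by
            # another tag or the end of the input.
            if not rest or rest.startswith("<|"):
                warnings.append(f"{match.group(0)}: no written sound follows")
        elif (category, name) not in POSITIONAL and seen_text:
            # Emotion, style, speed, pitch, and expressiveness tags belong
            # at the start of the turn, before any text.
            warnings.append(f"{match.group(0)}: delivery tag appears after text")
        pos = match.end()
    return warnings


good = ("<|emotion:elation|><|prosody:speed_fast|>We won! "
        "<|prosody:pause|> <|sfx:laughter|>Haha, I can't believe it.")
bad = "Hello <|emotion:anger|>there <|sfx:cough|>"
print(lint_turn(good))
print(lint_turn(bad))
```

The first input follows both recommendations and produces no warnings. The second places an emotion tag after text and ends on a sound-effect tag with nothing written after it, so it produces one warning for each.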
Prosody tags control speed, pitch, pauses, and overall expressiveness.
Tag: <|prosody:speed_very_slow|>
Effect: ~0.65× speed
Sample input: <|prosody:speed_very_slow|>Take your time, <|prosody:pause|> there is, <|prosody:pause|> really no need <|prosody:pause|> to rush <|prosody:pause|> at all.

Tag: <|prosody:speed_slow|>
Effect: ~0.85× speed
Sample input: <|prosody:speed_slow|>Let me explain this slowly, so everyone can follow along.

Tag: <|prosody:speed_fast|>
Effect: ~1.2× speed
Sample input: <|prosody:speed_fast|>Quick, quick, we have to go right now, we're running late!

Tag: <|prosody:speed_very_fast|>
Effect: ~1.4× speed
Sample input: <|prosody:speed_very_fast|>Wait wait wait, two minutes till the train, no time to talk, just run run, go go go!

Tag: <|prosody:pitch_low|>
Effect: ~−3 semitones
Sample input: <|prosody:pitch_low|>In a deep and serious voice, he delivered the grave news.

Tag: <|prosody:pitch_high|>
Effect: ~+2.5 semitones
Sample input: <|prosody:pitch_high|>Oh hello there little one, aren't you just the cutest!

Tag: <|prosody:pause|>
Effect: ~400–700 ms pause
Sample input: Wait for it <|prosody:pause|> and there it is, the big reveal.

Tag: <|prosody:long_pause|>
Effect: ~700–1500 ms pause
Sample input: I have something important to tell you. <|prosody:long_pause|> I'm leaving tomorrow.

Tag: <|prosody:expressive_high|>
Effect: More expressive delivery
Sample input: <|prosody:expressive_high|>This is incredible, absolutely magnificent, beyond my wildest dreams!

Tag: <|prosody:expressive_low|>
Effect: Flatter delivery
Sample input: <|prosody:expressive_low|>The meeting is at noon. Bring your reports. That's all.
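The approximate figures for the prosody tags can be turned into rough planning numbers. The sketch below copies the listed values into plain dictionaries and applies two standard conversions: a shift of s semitones corresponds to a frequency ratio of 2^(s/12), and a speed factor k divides speaking time by k. The function names are hypothetical, and the duration estimate assumes the speed factor applies uniformly to the whole utterance, which this page does not guarantee.

```python
# Approximate values as listed for the prosody tags.
SPEED = {
    "speed_very_slow": 0.65,
    "speed_slow": 0.85,
    "speed_fast": 1.2,
    "speed_very_fast": 1.4,
}
PITCH_SEMITONES = {"pitch_low": -3.0, "pitch_high": 2.5}
PAUSE_MS = {"pause": (400, 700), "long_pause": (700, 1500)}


def pitch_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def scaled_duration(base_seconds: float, speed: float) -> float:
    """Estimated speaking time, assuming speed scales the utterance uniformly."""
    return base_seconds / speed


def pause_budget_ms(pauses: int, long_pauses: int) -> tuple:
    """Minimum and maximum total silence added by pause tags, in milliseconds."""
    low = pauses * PAUSE_MS["pause"][0] + long_pauses * PAUSE_MS["long_pause"][0]
    high = pauses * PAUSE_MS["pause"][1] + long_pauses * PAUSE_MS["long_pause"][1]
    return low, high


for name, semitones in PITCH_SEMITONES.items():
    print(name, round(pitch_ratio(semitones), 3))
for name, speed in SPEED.items():
    print(name, round(scaled_duration(10.0, speed), 2), "s for a 10 s baseline")
print(pause_budget_ms(pauses=4, long_pauses=0))
```

Under these assumptions, <|prosody:pitch_low|> lowers the fundamental frequency to roughly 0.84 of its original value and <|prosody:pitch_high|> raises it to roughly 1.16. The four <|prosody:pause|> tags in the very slow sample add between 1600 and 2800 ms of silence on top of the slowed speech.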