Inline tags control delivery at the token level. Insert them anywhere in the input text, and the model adjusts the surrounding speech. For example:
<|emotion:enthusiasm|>Welcome to the show!
<|prosody:pause|>Let's get started!
Tags fall into four categories:
  • emotion: <|emotion:…|>, such as elation, fear, anger.
  • style: <|style:…|>, such as shouting, whispering.
  • sound effects: <|sfx:…|>, such as cough, sneeze.
  • prosody: <|prosody:…|>, including speed, pause, pitch, and expressiveness.
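
As a rough illustration of the tag syntax, the sketch below splits an input string into tags and plain text. It assumes every tag has the `<|category:name|>` shape seen in the examples here, with lowercase names and underscores; `TAG_RE` and `split_tags` are hypothetical names, not part of any published API.

```python
import re

# Assumption: tags look like <|category:name|>, with one of the four
# categories listed above and a lowercase/underscore name.
TAG_RE = re.compile(r"<\|(emotion|style|sfx|prosody):([a-z_]+)\|>")

def split_tags(text):
    """Return ("tag", category, name) and ("text", chunk) items in input order."""
    items = []
    pos = 0
    for m in TAG_RE.finditer(text):
        if m.start() > pos:
            # Plain text between the previous tag and this one.
            items.append(("text", text[pos:m.start()]))
        items.append(("tag", m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(text):
        items.append(("text", text[pos:]))
    return items

parsed = split_tags(
    "<|emotion:enthusiasm|>Welcome to the show!<|prosody:pause|>Let's get started!"
)
for item in parsed:
    print(item)
```

Anything that does not match the pattern is kept as ordinary text, so a mistyped tag shows up in a "text" item instead of being silently dropped.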
Recommended usage:
  • Lead the turn with delivery tags. Emotion, style, speed, pitch, and expressiveness tags set how the entire turn is delivered, so place them at the start of the input, before any text. Positional tags are the exception: <|prosody:pause|> and <|prosody:long_pause|> go exactly where the break should fall, and each <|sfx:…|> goes right before the sound it triggers.
  • Pair every sound effect with onomatopoeia. A <|sfx:…|> tag works best when the matching written sound follows immediately, such as <|sfx:laughter|>Haha, <|sfx:sigh|>Uh, or <|sfx:sneeze|>Achoo. The written cue helps the model produce the sound effect.
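
The placement rules above can be checked mechanically before sending input. This is a minimal sketch, not official tooling: `lint_turn` and `is_positional` are hypothetical helpers, the tag pattern is inferred from the examples in this document, and "follows immediately" is interpreted strictly as "the next character is a letter". The check cannot tell whether the following text is a fitting onomatopoeia.

```python
import re

# Assumption: tags look like <|category:name|>, as in the examples above.
TAG_RE = re.compile(r"<\|(emotion|style|sfx|prosody):([a-z_]+)\|>")

# Positional tags per the guidance above: pauses and all sound effects.
POSITIONAL_PROSODY = {"pause", "long_pause"}

def is_positional(category, name):
    return category == "sfx" or (category == "prosody" and name in POSITIONAL_PROSODY)

def lint_turn(text):
    """Return a list of warnings about tag placement in one turn."""
    warnings = []
    seen_text = False
    pos = 0
    for m in TAG_RE.finditer(text):
        if text[pos:m.start()].strip():
            seen_text = True  # some spoken text precedes this tag
        category, name = m.group(1), m.group(2)
        if not is_positional(category, name):
            if seen_text:
                warnings.append(f"{m.group(0)}: delivery tag after text; move it to the start")
        elif category == "sfx":
            # Strict reading (an assumption): a letter must follow directly.
            if not text[m.end():m.end() + 1].isalpha():
                warnings.append(f"{m.group(0)}: no written sound follows immediately")
        pos = m.end()
    return warnings

print(lint_turn("<|emotion:sadness|><|sfx:crying|>I... I'm sorry."))   # no warnings
print(lint_turn("Hello there <|emotion:anger|>how dare you!"))         # one warning
print(lint_turn("<|sfx:sneeze|> Achoo!"))                              # one warning
```

The space after `<|sfx:sneeze|>` triggers a warning only because of the strict reading chosen here; loosen that check if leading whitespace turns out to be harmless in practice.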

Emotion

Tag                        Effect                        Sample input
<|emotion:elation|>        Elation / joy                 <|emotion:elation|>This is the best news ever, I am absolutely thrilled!
<|emotion:amusement|>      Amusement / playful laughter  <|emotion:amusement|>Oh that's hilarious, I can't stop giggling about it.
<|emotion:enthusiasm|>     Enthusiasm / excitement       <|emotion:enthusiasm|>Let's go, I can't wait, this is going to be amazing!
<|emotion:determination|>  Determination / firmness      <|emotion:determination|>No matter what happens, I will not give up on this.
<|emotion:pride|>          Pride / confidence            <|emotion:pride|>Look at how far I've come. I never once doubted that I'd make it here.
<|emotion:contentment|>    Calm satisfaction             <|emotion:contentment|>Everything feels just right, I'm perfectly at peace.
<|emotion:affection|>      Warmth / affection            <|emotion:affection|>I'm so grateful to have you in my life, truly.
<|emotion:relief|>         Relief                        <|emotion:relief|>Oh thank goodness, it's finally over, what a relief.
<|emotion:contemplation|>  Thoughtful / reflective       <|emotion:contemplation|>Sometimes I just sit and wonder what it all really means.
<|emotion:confusion|>      Confused                      <|emotion:confusion|>Wait, I don't understand, what is even going on here?
<|emotion:surprise|>       Surprised                     <|emotion:surprise|>What? No way, I did not see that coming at all!
<|emotion:awe|>            Awe / wonder                  <|emotion:awe|>Wow, look at that, I have never seen anything so breathtaking.
<|emotion:longing|>        Longing / yearning            <|emotion:longing|>I miss you more than words can say, please come back to me.
<|emotion:arousal|>        Heightened desire             <|emotion:arousal|>Come closer, I can't stop thinking about you tonight.
<|emotion:anger|>          Anger                         <|emotion:anger|>How dare you do that to me, this is completely unacceptable!
<|emotion:fear|>           Fear                          <|emotion:fear|>Did you hear that? Something is out there, I'm really scared.
<|emotion:disgust|>        Disgust                       <|emotion:disgust|>Ugh, that is absolutely revolting, I can't even look at it.
<|emotion:bitterness|>     Bitterness                    <|emotion:bitterness|>After everything I did for them, this is how they repay me.
<|emotion:sadness|>        Sadness                       <|emotion:sadness|>I really thought things would be different, it hurts so much.
<|emotion:shame|>          Shame                         <|emotion:shame|>I can't believe I did that, I'm so embarrassed and ashamed.
<|emotion:helplessness|>   Helplessness                  <|emotion:helplessness|>There's nothing I can do anymore, I just feel so powerless.

Style

Tag                   Effect                      Sample input
<|style:singing|>     Singing                     <|style:singing|>Dancing in the rain, feeling so alive again.
<|style:shouting|>    Shouting / projected voice  <|style:shouting|>Hey! Over here! Everybody listen to me right now!
<|style:whispering|>  Whisper                     <|style:whispering|>Lean in, I have a little secret just for you, don't tell anyone.

Sound effects

Sound effects are vocalized: the model produces them in the speaker's voice rather than mixing in separate audio assets.
Tag                Effect     Sample input
<|sfx:cough|>      Cough      <|sfx:cough|>Ahem, could I have everyone's attention, please?
<|sfx:laughter|>   Laughter   <|sfx:laughter|>Haha, that's the funniest thing I've heard all day!
<|sfx:crying|>     Crying     <|emotion:sadness|><|sfx:crying|>I... I’m sorry.
<|sfx:screaming|>  Screaming  <|sfx:screaming|>Ah, no, no, this can't be happening!
<|sfx:burping|>    Burping    <|sfx:burping|>Burp—ugh, whoa, sorry, that one snuck up on me.
<|sfx:humming|>    Humming    <|sfx:humming|>Hmm, that's a tricky question, isn't it?
<|sfx:sigh|>       Sigh       <|sfx:sigh|>Ahh, well, there's nothing left to do but wait.
<|sfx:sniff|>      Sniff      <|sfx:sniff|>Sff..., it's so cold out, my nose won't stop.
<|sfx:sneeze|>     Sneeze     <|sfx:sneeze|>Achoo! Bless you, oh wait, that was me.

Prosody

Prosody tags control speed, pitch, pauses, and overall expressiveness.
Tag                          Effect                    Sample input
<|prosody:speed_very_slow|>  ~0.65× speed              <|prosody:speed_very_slow|>Take your time, <|prosody:pause|> there is, <|prosody:pause|> really no need <|prosody:pause|> to rush <|prosody:pause|> at all.
<|prosody:speed_slow|>       ~0.85× speed              <|prosody:speed_slow|>Let me explain this slowly, so everyone can follow along.
<|prosody:speed_fast|>       ~1.2× speed               <|prosody:speed_fast|>Quick, quick, we have to go right now, we're running late!
<|prosody:speed_very_fast|>  ~1.4× speed               <|prosody:speed_very_fast|>Wait wait wait, two minutes till the train, no time to talk, just run run, go go go!
<|prosody:pitch_low|>        ~−3 semitones             <|prosody:pitch_low|>In a deep and serious voice, he delivered the grave news.
<|prosody:pitch_high|>       ~+2.5 semitones           <|prosody:pitch_high|>Oh hello there little one, aren't you just the cutest!
<|prosody:pause|>            ~400–700 ms pause         Wait for it <|prosody:pause|> and there it is, the big reveal.
<|prosody:long_pause|>       ~700–1500 ms pause        I have something important to tell you. <|prosody:long_pause|> I'm leaving tomorrow.
<|prosody:expressive_high|>  More expressive delivery  <|prosody:expressive_high|>This is incredible, absolutely magnificent, beyond my wildest dreams!
<|prosody:expressive_low|>   Flatter delivery          <|prosody:expressive_low|>The meeting is at noon. Bring your reports. That's all.
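
To get a feel for what the approximate figures in the prosody table mean, the sketch below turns them into rough estimates. It rests on assumptions rather than documented behavior: a speed factor of k× is read as dividing the spoken duration by k, semitone shifts are converted to frequency ratios with the standard equal-temperament formula 2^(n/12), and the table values are treated as nominal even though they are only approximate. All names are hypothetical.

```python
# Nominal values copied from the prosody table; the real output is approximate.
SPEED = {
    "speed_very_slow": 0.65,
    "speed_slow": 0.85,
    "speed_fast": 1.2,
    "speed_very_fast": 1.4,
}
PITCH_SEMITONES = {"pitch_low": -3.0, "pitch_high": 2.5}
PAUSE_MS = {"pause": (400, 700), "long_pause": (700, 1500)}

def scaled_duration(base_seconds, speed_tag):
    """Estimated duration, assuming k-times speed means duration / k."""
    return base_seconds / SPEED[speed_tag]

def pitch_ratio(pitch_tag):
    """Frequency ratio for a semitone shift (equal temperament: 2 ** (n / 12))."""
    return 2 ** (PITCH_SEMITONES[pitch_tag] / 12)

def pause_budget_ms(pause_tags):
    """(min, max) total silence in milliseconds added by a list of pause tags."""
    low = sum(PAUSE_MS[t][0] for t in pause_tags)
    high = sum(PAUSE_MS[t][1] for t in pause_tags)
    return low, high

# A line that takes 10 s at normal speed, delivered with speed_very_slow:
print(round(scaled_duration(10.0, "speed_very_slow"), 1))  # about 15.4 s
print(round(pitch_ratio("pitch_low"), 3))                  # about 0.841
# The speed_very_slow sample input contains four <|prosody:pause|> tags:
print(pause_budget_ms(["pause"] * 4))                      # (1600, 2800)
```

These numbers are planning estimates only, useful for things like budgeting clip length, and should be checked against real output.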