You can embed XML tags in the text to change how the text-to-speech (TTS) engine produces output. Depending on your speech engine, you might use SSML (Speech Synthesis Markup Language) or an engine-specific markup, such as Apple’s embedded speech commands or Microsoft’s SAPI TTS markup.
Examples of Microsoft Server Speech using SSML Markup
Audio
Play a pre-recorded audio file as part of a prompt. The text inside the element is spoken if the file is not found. “<audio src="c:\the_name_of_your_file.wav"> This is what I will say if the file is not found. </audio>”
Break
Add a pause between words. “Five hundred milliseconds of silence <break time="500ms" /> just occurred.”
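Because these tags are XML, attribute values need straight quotes; typographic quotes pasted from a word processor make the markup unparseable. The following is a minimal sketch, not part of any speech API, that builds the break tag shown above and checks a prompt for well-formedness. The helper names `break_tag` and `is_well_formed` are hypothetical, and the temporary `<speak>` wrapper is used only so the parser sees a single root element.

```python
import xml.etree.ElementTree as ET


def break_tag(milliseconds: int) -> str:
    """Return an SSML <break> element for a pause of the given length."""
    if milliseconds < 0:
        raise ValueError("pause length must not be negative")
    return f'<break time="{milliseconds}ms" />'


def is_well_formed(fragment: str) -> bool:
    """Parse the fragment inside a temporary root element (an assumption
    made for checking only) and report whether it is well-formed XML."""
    try:
        ET.fromstring(f"<speak>{fragment}</speak>")
        return True
    except ET.ParseError:
        return False


prompt = f"Five hundred milliseconds of silence {break_tag(500)} just occurred."
print(prompt)
print(is_well_formed(prompt))
```

Running a check like this before sending a prompt to the engine can catch quoting mistakes early; it does not tell you whether a particular engine supports a given element.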
Emphasis
Add emphasis to a word in a sentence. “The word <emphasis level="strong"> boo </emphasis> is emphasized!”
- Note: Microsoft TTS engines currently do not support the emphasis element. See the MSDN article, under the paragraph “Specify speech output characteristics”.
Say-As Element
The say-as element provides guidance about how to pronounce content such as dates, cardinal and ordinal numbers, characters, times, and telephone numbers. See Say-As Element for more details. Here are a few examples:
- To have each letter spelled out individually: “<say-as interpret-as="characters"> TEST </say-as>”
- To have text spoken as an ordinal number (e.g. 3rd, 4th): “Select the <say-as interpret-as="ordinal">3rd</say-as> option.”
- To specify a date format: “Today is <say-as interpret-as="date" format="mdy"> 4-24-2017 </say-as>”
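When the spoken value comes from application data, it can help to generate the say-as element rather than type it by hand. The sketch below is one possible approach; `say_as` is a hypothetical helper name, and it uses only the `interpret-as` and `format` attributes shown in the examples above. Text and attribute values are escaped with the Python standard library so characters such as `&` do not break the markup.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as, fmt=None):
    """Wrap text in a <say-as> element with an optional format attribute."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("TEST", "characters"))
print("Select the " + say_as("3rd", "ordinal") + " option.")
print("Today is " + say_as("4-24-2017", "date", "mdy"))
```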
Phoneme Element
Using the <phoneme> element, you can specify a phonetic pronunciation for a word or phrase. “His name is Mike <phoneme alphabet="x-microsoft-ups" ph="JH AU"> Zhou </phoneme>”
Prosody
Prosody specifies the pitch, contour, range, rate, duration, and volume for speaking the contained text. See Prosody Element for more details. Here are a few examples:
- “This is normal volume. <prosody volume="40"> This is a whisper. </prosody>”
- “This is normal rate and volume. <prosody rate="-20%" volume="40"> This is slow and quiet. </prosody>”
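Prosody attributes can be combined, so a small builder keeps the markup consistent. This is an illustrative sketch only: `prosody` is a hypothetical helper, and the set of accepted attribute names is taken from the list in the description above (pitch, contour, range, rate, duration, volume). It does not check whether a value is valid for a given engine.

```python
from xml.sax.saxutils import escape, quoteattr

# Attribute names taken from the prosody description above.
ALLOWED = {"pitch", "contour", "range", "rate", "duration", "volume"}


def prosody(text, **attributes):
    """Wrap text in <prosody>, emitting attributes in the order given."""
    unknown = set(attributes) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported prosody attributes: {sorted(unknown)}")
    attrs = "".join(
        f" {name}={quoteattr(str(value))}" for name, value in attributes.items()
    )
    return f"<prosody{attrs}>{escape(text)}</prosody>"


print(prosody("This is a whisper.", volume=40))
print(prosody("This is slow and quiet.", rate="-20%", volume=40))
```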
Voice
To define the voice to be used, you can use the Voice markup or use the overload on the PlayTTS command:
- Voice Markup: “<voice name="Microsoft Server Speech Text to Speech Voice (en-US, Helen)"> This is the text that the application will speak. </voice>”
- PlayTTS Command: m_VoiceResource.PlayTTS("This is the text that the application will speak.", "Microsoft Server Speech Text to Speech Voice (en-US, Helen)");
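If you use the voice markup form, the voice name has to match the installed voice exactly, so it is worth generating the element from a single constant. The following sketch assumes nothing about the PlayTTS API itself; `voice` is a hypothetical helper that only produces the markup string, and the parse step is there to confirm that the name survives quoting unchanged.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Voice name copied from the example above.
HELEN = "Microsoft Server Speech Text to Speech Voice (en-US, Helen)"


def voice(text, name):
    """Wrap text in a <voice> element that selects a voice by name."""
    return f"<voice name={quoteattr(name)}>{escape(text)}</voice>"


markup = voice("This is the text that the application will speak.", HELEN)
print(markup)

# Parse the element back to confirm the attribute round-trips intact.
element = ET.fromstring(markup)
print(element.get("name") == HELEN)
```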
Examples of SAPI 5.3 Markups:
Volume Control: <volume level="nn"/> where nn is a value from 0 to 100.
Absolute Rate Of Speech: <rate absspeed="nn"/> where nn is a value from -10 to 10.
Relative Rate Of Speech: <rate speed="nn"/> where nn is a value from -10 to 10.
Absolute Pitch: <pitch absmiddle="nn"/> where nn is a value from -10 to 10.
Relative Pitch: <pitch middle="nn"/> where nn is a value from -10 to 10.
Emphasis: “The word <emph> boo </emph> is emphasized!”
Spelling Control: “The Word <spell> Spell </spell> is spelled out.”
Silence Control: “Five hundred milliseconds of silence <silence msec="500"/> just occurred.”
Pronunciation: “<pron sym="h eh l l ow & w er l l d"/>”
Bookmarks: “<bookmark mark="bookmark_one"/>Simple Text”
Parts Of Speech: “<partofsp part="noun"> A </partofsp> is the first letter of the alphabet.”
Context: “<context id="date_mdy"> 03/04/01 </context> should be March fourth, two thousand one.”
Voice: “<voice required="Age=Teen">A teen should speak this sentence – if a female, non-child teen is present, she will be selected over a male teen, for example.</voice>”
Language: “<voice required="Language=409">A U.S. English voice should speak this.</voice>”
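The SAPI rate and pitch tags listed above accept only values from -10 to 10, so a range check before building the tag can catch mistakes early. This is a minimal sketch under that assumption; `sapi_rate`, `sapi_pitch`, and `sapi_silence` are hypothetical helper names that return markup strings and do not call any speech API.

```python
def _check_range(name, value, low, high):
    """Return value if it is an integer within [low, high], else raise."""
    if not isinstance(value, int) or not low <= value <= high:
        raise ValueError(
            f"{name} must be an integer from {low} to {high}, got {value!r}"
        )
    return value


def sapi_rate(value, absolute=False):
    """Build a <rate> tag; absolute=True uses absspeed instead of speed."""
    attr = "absspeed" if absolute else "speed"
    return f'<rate {attr}="{_check_range(attr, value, -10, 10)}"/>'


def sapi_pitch(value, absolute=False):
    """Build a <pitch> tag; absolute=True uses absmiddle instead of middle."""
    attr = "absmiddle" if absolute else "middle"
    return f'<pitch {attr}="{_check_range(attr, value, -10, 10)}"/>'


def sapi_silence(msec):
    """Build a <silence> tag for a pause of msec milliseconds."""
    if not isinstance(msec, int) or msec < 0:
        raise ValueError(f"msec must be a non-negative integer, got {msec!r}")
    return f'<silence msec="{msec}"/>'


print(sapi_rate(5, absolute=True))
print(sapi_pitch(-3))
print("Five hundred milliseconds of silence " + sapi_silence(500) + " just occurred.")
```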