Report:
Speak, Cybert, Speak
Originally written for Albert on 6/29/99
NOTE: (Added 8/20/99)
I recently found a URL that will allow you to compare various text-to-speech engines (it's
free). http://www.research.att.com/projects/tts/
Goals
It has always been my intention to give Cybert a large vocabulary of words
and phrases. Not only will this make him more entertaining, he will be better able
to communicate his moods and even what he is "thinking" (which will be helpful
for debugging). Since my PC has a permanent connection to the Internet I plan to
have Cybert "tell me" about current news, weather, stock reports, a "joke
of the day" and more. I will merely have to extract the text from the HTML and
other data I receive.

Since Cybert is controlled from my home PC and not a smaller processor with a
small amount of RAM, I will have nearly limitless storage capacity for Cybert's
"dialogue." I've done a lot of creative writing and even scripted my own
material while doing standup comedy in high school and college, so I look forward to
writing the endless streams of clever commentary that will really bring Cybert to life for
my family and friends.
Here is a more succinct list of my speech goals:
- Give Cybert lots of things to say
- Vary what Cybert says based on his mood
- Vary the sound of Cybert's voice based on his mood
- Make dialogue entry (and updates) simple and painless
- Allow Cybert to instantly vocalize information downloaded off the Internet
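The last goal mostly comes down to stripping the markup from a downloaded page so that only the readable text is left to speak. Here is a minimal sketch of that step, written in Python rather than Visual Basic and using only the standard library. The names TextExtractor and html_to_speech_text and the sample page are hypothetical, made up for this illustration; real news or weather pages would need more specific handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping script and style blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_speech_text(html):
    """Return the page's visible text as one string, ready for a speech routine."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page standing in for a downloaded weather report.
sample = ("<html><head><style>b{color:red}</style></head><body>"
          "<h1>Weather</h1><p>Sunny, high of <b>75</b> degrees.</p></body></html>")
print(html_to_speech_text(sample))  # Weather Sunny, high of 75 degrees.
```

The resulting string is the kind of plain text that could then be handed to a speech routine.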
How Does a Text-to-Speech (TTS) Engine Work?
Text-to-speech fundamentally functions as a pipeline that converts text into
PCM digital audio. The elements of the pipeline are:
Text normalization
Converts input text into a series of words, and handles abbreviations, numbers,
etc. ("Dr. Smith" becomes "Doctor Smith".
"$25.00" becomes "Twenty five dollars".)
Homograph disambiguation
Uses contextual rules to figure out the pronunciation of words that are spelled
the same, but sound different. (A good engine will correctly voice: "Tomorrow I
will read the same story I read today")
Word pronunciation
Breaks the final list of words down into a sequence of phonemes, much like the
pronunciation guide in a dictionary. Thus "hello" becomes "h eh l oe".
Prosody
Chooses speed, pitch, and volume for words in a sentence. This helps the
voice sound less monotonous and actually makes it easier to understand. Most
engines, for example, will emphasize the last word in a question.
Concatenate wave segments
This important last step "munges" the various phonemes together for a
more natural sound. (No matter how fast you say "sh" and "oo"
together, it's not going to sound very much like the word "shoe".)
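To make the first and third stages of the pipeline above more concrete, here is a toy sketch in Python (not the Visual Basic used elsewhere in this report). It is only an illustration under stated assumptions: the abbreviation table, the 0 to 99 number speller, and the one-entry phoneme table are all made up for this example, and a real engine does far more at every stage.

```python
import re

# Tiny illustrative tables; a real engine ships far larger ones.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
PHONEMES = {"hello": "h eh l oe"}  # the only pronunciation given in the text above


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def normalize(text):
    """Text normalization: expand known abbreviations and whole-dollar amounts under $100."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return re.sub(r"\$(\d{1,2})(?:\.00)?(?!\.?\d)",
                  lambda m: number_to_words(int(m.group(1))) + " dollars",
                  text)


def pronounce(text):
    """Word pronunciation: look each word up in the phoneme table (None = unknown word)."""
    words = re.findall(r"[A-Za-z']+", text)
    return [PHONEMES.get(word.lower()) for word in words]


print(normalize("Dr. Smith paid $25.00"))  # Doctor Smith paid twenty five dollars
print(pronounce("Hello"))                  # ['h eh l oe']
```

Note how crude the abbreviation step is: it would also turn "Elm Dr." into "Elm Doctor", which is exactly the kind of ambiguity the contextual rules of a real engine are there to resolve.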
The Search for a Cost Effective Engine
I spent several weeks downloading and evaluating TTS engines. Some of
the inexpensive shareware offerings I found were laughably bad, while many of the high
quality products from companies like AT&T and Lucent were prohibitively
expensive.
Fortunately, I managed to find a very high quality engine that is also
FREE! It's the Speech SDK from
Microsoft. The large download not only provides TTS capabilities, it includes a top
notch voice recognition engine. I also found extensive documentation, sample
applications, and more. Plus, you can use it with Visual Basic or C++, so it was
perfect for my needs.
Hear Cybert's Voice! -
Microsoft's Speech SDK (4.0)
Using Microsoft's Speech SDK
After I read the documents and looked at the sample code, it was very easy
to add speech to my Visual Basic program. I listened to all the available voices and
chose "RoboSoft Three." I then dragged the DirectSS1 control onto my form and added
the following code to my Form_Load event:
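' Step through the installed voices until the one named "RoboSoft Three" turns up.
' (This assumes the voice is installed; the loop has no other exit condition.)
i = 1
While DirectSS1.ModeName(i) <> "RoboSoft Three"
    i = i + 1
Wend
DirectSS1.Select i    ' Make that voice the active one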
My "Say" Subroutine
I chose to use the TextData method, rather than the Speak method, because it
allows me to use "tags" to embed comments and bookmarks, or to emphasize or
de-emphasize different words, etc. A typical command looks like this:
Form1.DirectSS1.TextData CHARSET_TEXT, 1, "Hello, my name is Cybert."
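As a rough illustration of what tagged text might look like, here is a small Python sketch (not Visual Basic) that builds a tagged string. The tag spellings \Emp\ (emphasize the next word) and \Mrk=number\ (bookmark) are given from memory of the Speech SDK 4.0 control tags and should be treated as assumptions to check against the SDK documentation. The helper names emphasize and add_bookmark are hypothetical.

```python
import re


def emphasize(text, word):
    """Put an \\Emp\\ tag in front of the first whole-word match of `word`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function is used as the replacement so the backslashes are inserted literally.
    return re.sub(pattern, lambda m: "\\Emp\\ " + m.group(0), text, count=1)


def add_bookmark(text, number):
    """Append a \\Mrk=number\\ bookmark tag to the end of the text."""
    return text + " \\Mrk=" + str(number) + "\\"


tagged = add_bookmark(emphasize("Hello, my name is Cybert.", "Cybert"), 1)
print(tagged)  # Hello, my name is \Emp\ Cybert. \Mrk=1\
```

A string built this way would take the place of the plain text in the TextData call.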
I didn't want to have to type this lengthy line every time I wanted to make
Cybert talk, so I wrote a subroutine called "Say" that allows me to pass a text
string or variable. The sub currently looks like this (I'm going to change it to
allow Cybert to "interrupt" himself on some occasions):
Public Sub Say(Arg As String)
    Form1.DirectSS1.TextData CHARSET_TEXT, 1, Arg
End Sub
With this subroutine I can have Cybert speak any text string I want:
Say "Hello, world."
My "Speak" Subroutine
Unlike the "Say" routine, which is literal, the Speak command passes a category,
allowing the robot to choose an appropriate comment based on his mood. To explain this
further: when Cybert is first turned on there is code to initialize his variables
and systems, and a line that looks like this:
Speak ("Greeting")
The Speak subroutine references Cybert's current mood (1-9) and picks the
appropriate remark. If he's angry, for example, he might say, "I don't want to
wake up. Go away."
I'm actually kind of pleased with the code I wrote to handle the mood-based
commentary. It takes very few lines, is easy to understand and update, and it allows
me to add new comments at any time. After experimenting with nested Case statements
and multiple If/Then commands to handle random variations, I hit on the idea of creating a
two-dimensional array called Vocal(x,y). The first index (x) is Cybert's
mood, represented as a number from 1 to 9. The second index (y) is the
response number, which lets me add up to 10 "random" responses per mood (or more if I enlarge the array).
Here's an example to help clarify things:
Vocal(happy,1) = "I'm happy"
Happy=6, so the array element Vocal(6,1) holds the string "I'm
happy". This gets passed to Say() for voicing. (The 1 indicates that the
string is response #1. There can be nine others, and the computer will automatically pick
one at random.)
EXAMPLE: To make the computer pick from one of three possible
"happy" responses, I would set Vocal(happy,0) equal to 3. Thus:
Vocal(happy, 0) = 3
Vocal(happy, 1) = "I'm happy"                 '(The first possible response)
Vocal(happy, 2) = "I'm quite happy today"     '(Second possible response)
Vocal(happy, 3) = "What a great day today!"   '(Third possible response)
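The same table layout can be sketched outside Visual Basic. The Python version below is only an illustration of the idea: slot 0 of each mood's row holds the number of responses, and slots 1 and up hold the responses themselves. The names new_table and pick_response are hypothetical, and the rng parameter exists only so the random choice can be replaced with a fixed value when experimenting.

```python
import random

HAPPY = 6  # mood number used in the example above


def new_table(moods=9, max_responses=10):
    """Rows 0..moods, columns 0..max_responses; column 0 holds the response count."""
    return [[None] * (max_responses + 1) for _ in range(moods + 1)]


def pick_response(vocal, mood, rng=random):
    """Pick one of the responses stored for this mood."""
    count = vocal[mood][0] or 0            # an unset count acts like 0, so response 1 is used
    index = int(count * rng.random()) + 1  # 1..count (or just 1 when the count is unset)
    return vocal[mood][index]


vocal = new_table()
vocal[HAPPY][0] = 3
vocal[HAPPY][1] = "I'm happy"
vocal[HAPPY][2] = "I'm quite happy today"
vocal[HAPPY][3] = "What a great day today!"

print(pick_response(vocal, HAPPY))  # one of the three strings, chosen at random
```

Keeping the count in slot 0 means a new response can be added by writing one more string and bumping one number, with no change to the picking logic.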
Here's the source for the Speak routine. (Note that Mood() is a
function that returns Cybert's current mood.)
Public Sub Speak(Arg As String)
    Dim Vocal(9, 10)    '(mood, room for up to 10 random responses)
    Randomize
    Select Case Arg
        Case "Greeting"    '(Used when Cybert is first turned on)
            Vocal(deprs, 1) = "I'm depressed"
            Vocal(secur, 1) = "I'm secure"
            Vocal(plyfl, 1) = "I'm playful."
            Vocal(upset, 1) = "I'm upset"
            Vocal(norml, 1) = "I'm normal"
            Vocal(happy, 1) = "I'm happy"
            Vocal(angry, 1) = "I'm angry"
            Vocal(afrad, 1) = "I'm afraid"
            Vocal(cncrn, 0) = 3
            Vocal(cncrn, 1) = "I'm concerned #1"    '(1/3 chance for this response)
            Vocal(cncrn, 2) = "I'm concerned #2"    '(1/3 chance for this response)
            Vocal(cncrn, 3) = "I'm concerned #3"    '(1/3 chance for this response)
        Case "BatteriesLow"    '(Triggered when battery level falls below 11 volts)
            Vocal(deprs, 1) = "I'm running out of power... but who cares..."
            Vocal(secur, 1) = "My power is running low, I'm off to the charger!"
            Vocal(plyfl, 1) = "When my battery is charged I want to play a game!"
            Vocal(upset, 1) = "Great! Now I'm running low on battery power."
            Vocal(norml, 1) = "I need to recharge my batteries."
            Vocal(happy, 1) = "Oops! Low battery. Gotta charge!"
            Vocal(angry, 1) = "DAMN IT! I'm out of juice."
            Vocal(afrad, 1) = "My battery is low! I'm going to die! HELP ME!"
            Vocal(cncrn, 1) = "My battery is really low. I'd better find my charger."
    End Select
    'Vocal(mood, 0) holds the response count; if it was never set it counts as 0, so response 1 is used
    Say (Vocal(Mood(), Int((Vocal(Mood(), 0) * Rnd) + 1)))
End Sub
There are some additional benefits to this code that make it even more useful.
If there is no dialogue for a particular mood/category, then Cybert will just skip it and
not say anything. Also, I am able to set Vocal(x,0) to a higher number than the
total number of random responses, and the robot will just ignore the excess. This allows
me to easily tell Cybert to say a random comment only X% of the time. To illustrate,
let's say that the "category" BumpedWall has 3 random responses for the Normal
mood, but I only want Cybert to say something half the time. I could set this up by
making Vocal(norml,0) = 6 in the BumpedWall case. If the random number comes up "2" then
Cybert would say the second response to the BumpedWall event (based on his mood).
But if the random number was a "5" then he wouldn't say anything, since there is
no Vocal(norml,5) comment for BumpedWall.
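The arithmetic behind this trick can be checked with a short sketch. It is written in Python rather than Visual Basic, and the function names and the three sample remarks are hypothetical. With n responses and a count of c stored in slot 0, Cybert speaks n out of every c times on average.

```python
import random


def pick_or_stay_silent(responses, count, rng=random):
    """responses: list of remarks; count: the value stored in slot 0 (may exceed len(responses))."""
    index = int(count * rng.random()) + 1  # a number from 1 to count
    if index <= len(responses):
        return responses[index - 1]
    return None                            # empty slot: say nothing


def chance_of_speaking(n_responses, count):
    """Fraction of the time a remark is actually spoken."""
    return min(n_responses, count) / count


# Hypothetical remarks for the BumpedWall example.
bumped_wall = ["Ouch!", "Who put that wall there?", "Excuse me."]
print(chance_of_speaking(len(bumped_wall), 6))  # 0.5
print(pick_or_stay_silent(bumped_wall, 6))      # a remark about half the time, otherwise None
```

In the Visual Basic version the count cannot exceed the second dimension of the Vocal array (10 as declared above), so three responses can be thinned down to 30% of the time at most unless the array is enlarged.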