Speak, Albert, Speak
6/29/99
NOTE: (Added 8/20/99)
I recently found a URL that will allow you to compare various text-to-speech engines (it's
free). http://www.research.att.com/projects/tts/
Many home robots don't include the luxury of a synthesized voice, and
those that do generally rely on a speech board of some kind. (The RC Systems V8600 is generally considered
to be the best board-level speech solution for small robots.) But since Albert uses
a laptop computer for high-level control, this report focuses on PC-based speech engines,
specifically the one I chose for this project: Microsoft's
Speech SDK.
Goals
It has always been my intention to give Albert a large vocabulary of words
and phrases. Not only will this make him more entertaining, he will be able to
communicate his moods and even what he is "thinking" (which will be helpful for
debugging). When I get the RF connection to my home PC (and thus the Internet), he
will also be able to relate interesting, humorous, and possibly even useful information. I
will merely have to parse text files from the HTML data I receive.
My decision to use a laptop "brain" was based, at least in part, on
the knowledge that this option would provide me with nearly limitless storage capacity for
Albert's "dialogue." I've done a lot of creative writing and even scripted
my own material while doing standup comedy in high school and college; so I look forward
to writing the endless streams of clever commentary that will really bring Albert to life
for my family and friends.
Here is a more succinct list of my speech goals:
- Give Albert lots of things to say
- Vary what Albert says based on his mood
- Make dialogue entry (and updates) simple and painless
- Allow Albert to instantly vocalize information downloaded off the Internet
How Does a Text-to-Speech (TTS) Engine Work?
Text-to-speech fundamentally functions as a pipeline that converts text into
PCM digital audio. The elements of the pipeline are:
Text normalization
Converts input text into a series of words, and handles abbreviations, numbers,
etc. ("Dr. Smith" becomes "Doctor Smith".
"$25.00" becomes "Twenty five dollars".)
Homograph disambiguation
Uses contextual rules to figure out the pronunciation of words that are spelled
the same, but sound different. (A good engine will correctly voice: "Tomorrow I
will read the same story I read today")
Word pronunciation
Breaks down the final list of words into a sequence of dictionary-like
phonemes. Thus "hello" becomes "h eh l oe".
Prosody
Chooses speed, pitch, and volume for words in a sentence. This helps the
voice sound less monotonous and actually makes it easier to understand. Most
engines, for example, will emphasize the last word in a question.
Concatenate wave segments
This important last step "munges" the various phonemes together for a
more natural sound. (No matter how fast you say "sh" and "oo"
together, it's not going to sound very much like the word "shoe".)
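To make the first stage of the pipeline concrete, here is a minimal sketch of text normalization. It is written in Python rather than the Visual Basic used for Albert, and it is only a toy: the abbreviation table, the function names, and the limit of amounts under 100 dollars are all assumptions made for illustration, not how the Speech SDK (or any real engine) does it.

```python
import re

# Hypothetical abbreviation table; a real engine uses a much larger lexicon
# and context rules ("Dr." can also mean "Drive").
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def expand_dollars(match):
    """Turn a regex match like "$25.00" into spoken words."""
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    if dollars > 99:
        return match.group(0)  # outside this toy's range: leave the text alone
    words = number_to_words(dollars) + (" dollar" if dollars == 1 else " dollars")
    if cents:
        words += " and " + number_to_words(cents) + (" cent" if cents == 1 else " cents")
    return words


def normalize(text):
    """Expand known abbreviations, then dollar amounts."""
    for abbreviation, full in ABBREVIATIONS.items():
        text = text.replace(abbreviation, full)
    return re.sub(r"\$(\d+)(?:\.(\d{2}))?", expand_dollars, text)


print(normalize("Dr. Smith paid $25.00."))
```

The later stages (homograph disambiguation, pronunciation, prosody, and concatenation) need dictionaries and recorded audio, so they are not sketched here.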
The Search for a Cost-Effective Engine
I spent several weeks downloading and evaluating TTS engines. Some of
the inexpensive shareware offerings I found were laughably bad, while many of the high
quality products from companies like AT&T and Lucent were prohibitively
expensive.
Fortunately, I managed to find a very high quality engine that is also
FREE! It's the Speech SDK from
Microsoft. The large download not only provides TTS capabilities, it includes a top
notch voice recognition engine. I also found extensive documentation, sample
applications, and more. Plus, you can use it with Visual Basic or C++, so it was
perfect for my needs.
Sample #4 (Speech SDK)
- Microsoft Corporation (Albert's voice!)
Using Microsoft's Speech SDK
After I read the documents and looked at the sample code, it was very easy
to add speech to my Visual Basic program. I listened to all the possible voices and
chose "RoboSoft Three." I then dragged the DirectSS1 control onto my form and added
the following code to my Form_Load event:
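' Step through the installed voices until the one named "RoboSoft Three" is found.
' (This assumes the voice is installed; otherwise the loop runs past the last engine.)
i = 1
While DirectSS1.ModeName(i) <> "RoboSoft Three"
    i = i + 1
Wend
DirectSS1.Select i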
My "Say" Subroutine
I chose to use the TextData method, rather than the Speak method, because it
allows me to use "tags" to embed comments and bookmarks, or to emphasize or
de-emphasize different words, etc. A typical command looks like this:
Form1.DirectSS1.TextData CHARSET_TEXT, 1, "Hello, my name is Albert."
I didn't want to have to type this lengthy line every time I wanted to make
Albert talk, so I wrote a subroutine called "Say" that allows me to pass a text
string or variable. The sub currently looks like this (I'm going to change it to
allow Albert to "interrupt" himself on some occasions):
Public Sub Say(Arg As String)
    Form1.DirectSS1.TextData CHARSET_TEXT, 1, Arg
End Sub
With this subroutine I can have Albert speak any text string I want:
Say "Hello, world."
Or I can pass a variable or function to the Say subroutine like this:
Say Direction()
[Direction() is a function that returns a string, such as
"South"]
My "Speak" Subroutine
Unlike the "Say" routine, which is literal, the Speak command passes a category,
allowing the robot to choose an appropriate comment based on his mood. To explain this
further: when Albert is first turned on there is code to initialize his variables
and systems, and a line that looks like this: Speak ("Greeting"). The
Speak subroutine references Albert's mood and picks the appropriate remark. If
he's angry, for example, he might say, "I don't want to get up. Go away."
I'm actually kind of pleased with the code I wrote to handle the mood-based
commentary. It takes very few lines, is easy to understand and update, and it allows
me to add new comments at any time. After experimenting with nested Case statements
and multiple If/Then commands to handle random variations, I hit on the idea of creating a
two-dimensional array called Vocal(x,y). The first index (x) is Albert's
mood, represented as a number from 1 to 9. The second index (y) is the
response number, which currently ranges from 1 to 10. Here's an example to help
clarify things:
Vocal(happy,1) = "I'm happy"
Happy=6, so the array variable at Vocal(6,1) = the string "I'm
happy". This gets passed to Say() for voicing. (The 1 indicates that the
string is response #1.)
Sometimes I'll want to have several responses for a particular mood and I'd like
to have the computer pick one randomly. In this case, I would set Vocal(happy,0) equal to
the total number of possibilities. Then I add new options by simply increasing the
reference number (y). Thus:
Vocal(happy, 0) = 3 'There are 3 responses for this mood, so pick randomly
Vocal(happy, 1) = "I'm happy"
Vocal(happy, 2) = "I'm quite happy today"
Vocal(happy, 3) = "What a great day today!"
Here's the source for the Speak Routine: (Note that mood() is a
function.)
Public Sub Speak(Arg As String)
    Dim Vocal(9, 10) 'mood, room for random responses
    Randomize
    Select Case Arg
        Case "Greeting" 'Used when Albert is first turned on
            Vocal(deprs, 1) = "I'm depressed"
            Vocal(secur, 1) = "I'm secure"
            Vocal(plyfl, 1) = "I'm playful."
            Vocal(upset, 1) = "I'm upset"
            Vocal(norml, 1) = "I'm normal"
            Vocal(happy, 1) = "I'm happy"
            Vocal(angry, 1) = "I'm angry"
            Vocal(afrad, 1) = "I'm afraid"
            Vocal(cncrn, 0) = 3
            Vocal(cncrn, 1) = "I'm concerned 1"
            Vocal(cncrn, 2) = "I'm concerned 2"
            Vocal(cncrn, 3) = "I'm concerned 3"
    End Select
    'Vocal(mood, 0) is the number of responses to pick from; if unset, response 1 is used
    Say (Vocal(Mood(), Int((Vocal(Mood(), 0) * Rnd) + 1)))
End Sub
There are some additional benefits to this code that make it even more useful:
if there is no dialogue for a particular mood/category then Albert will just skip it and
not say anything. Also, I am able to set Vocal(x,0) to a higher number than the
total of random responses, and the robot will stay silent whenever one of the empty slots is picked. To illustrate,
let's say that the "category" BumpedWall has 3 random responses for the Normal
mood, but I only want Albert to say something half the time. I could set this up by
making Vocal(norml, 0) = 6 in the BumpedWall case.
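The selection trick can be sketched outside Visual Basic as well. The following Python version is an illustration only, not Albert's actual code: the dictionary layout, the function name pick_response, the mood number chosen for Normal, and the three wall remarks are all assumptions. It mirrors the expression Int(Vocal(Mood(), 0) * Rnd) + 1 and shows why setting slot 0 to 6 with only three responses filled in leaves Albert silent about half the time.

```python
import random

# Mood numbers are assumptions for illustration; the article only states Happy = 6.
NORML, HAPPY = 5, 6


def pick_response(vocal, mood, rng=random):
    """Pick a remark the way Speak does: Int(Vocal(mood, 0) * Rnd) + 1."""
    count = vocal.get((mood, 0), 0)        # slot 0 holds the range to draw from; unset counts as 0
    index = int(count * rng.random()) + 1  # 1..count, or always 1 when count is 0
    return vocal.get((mood, index), "")    # an empty slot yields "", so nothing is spoken


# Hypothetical BumpedWall table: three remarks, but the draw covers slots 1 to 6.
bumped_wall = {
    (NORML, 0): 6,
    (NORML, 1): "Ouch.",
    (NORML, 2): "Who put that wall there?",
    (NORML, 3): "Excuse me.",
}

rng = random.Random(1)  # fixed seed so repeated runs behave the same
draws = [pick_response(bumped_wall, NORML, rng) for _ in range(6000)]
print("silent fraction:", draws.count("") / len(draws))
```

Because slots 4 through 6 are empty, roughly three draws in six return an empty string, which is the "say something half the time" behavior described above. A mood with no entries at all always returns an empty string.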