Speak, Albert, Speak
6/29/99

NOTE:  (Added 8/20/99)
I recently found a URL that will allow you to compare various text-to-speech engines (it's free).  http://www.research.att.com/projects/tts/

Many home robots don't include the luxury of a synthesized voice, and those that do generally rely on a speech board of some kind.  (The RC Systems V8600 is generally considered to be the best board-level speech solution for small robots.)  But since Albert uses a laptop computer for high-level control, this report focuses on PC-based speech engines, specifically the one I chose for this project: Microsoft's Speech SDK.

Goals
It has always been my intention to give Albert a large vocabulary of words and phrases.  Not only will this make him more entertaining, but he will also be able to communicate his moods and even what he is "thinking" (which will be helpful for debugging).  Once I have the RF connection to my home PC (and thus the Internet), he will also be able to relate interesting, humorous, and possibly even useful information.  I will merely have to extract the text from the HTML data I receive.


My decision to use a laptop "brain" was based, at least in part, on the knowledge that this option would provide me with nearly limitless storage capacity for Albert's "dialogue."  I've done a lot of creative writing and even scripted my own material while doing standup comedy in high school and college, so I look forward to writing the endless streams of clever commentary that will really bring Albert to life for my family and friends.

Here is a more succinct list of my speech goals:  

  • Give Albert lots of things to say
  • Vary what Albert says based on his mood
  • Make dialogue entry (and updates) simple and painless
  • Allow Albert to instantly vocalize information downloaded off the Internet
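
The last goal comes down to turning downloaded HTML into plain text before handing it to the speech engine.  As a rough sketch only (StripTags is a hypothetical helper, not something from Albert's code or the Speech SDK), the markup could be dropped with a simple character scan:

```vb
' Hypothetical helper: drop everything between "<" and ">" so that only
' the visible text remains. A stray ">" outside a tag is also dropped.
Public Function StripTags(ByVal Html As String) As String
    Dim i As Long
    Dim ch As String
    Dim InTag As Boolean
    Dim Result As String

    For i = 1 To Len(Html)
        ch = Mid$(Html, i, 1)
        If ch = "<" Then
            InTag = True            ' a tag starts here; stop copying
        ElseIf ch = ">" Then
            InTag = False           ' the tag has ended; resume copying
        ElseIf Not InTag Then
            Result = Result & ch    ' ordinary text outside any tag
        End If
    Next i

    StripTags = Result
End Function
```

With this in place, a call such as Say StripTags("<b>Hello</b>, world.") would voice only "Hello, world."  Character entities such as &amp; are left untouched, so a real page would need more cleanup than this.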

How Does a Text-to-Speech (TTS) Engine Work?
Text-to-speech fundamentally functions as a pipeline that converts text into PCM digital audio. The elements of the pipeline are:

Text normalization
Converts input text into a series of words, and handles abbreviations, numbers, etc.  ("Dr. Smith" becomes "Doctor Smith".   "$25.00" becomes "Twenty five dollars".)

Homograph disambiguation
Uses contextual rules to figure out the pronunciation of words that are spelled the same, but sound different.  (A good engine will correctly voice: "Tomorrow I will read the same story I read today")

Word pronunciation
Breaks the final list of words down into a sequence of dictionary-like phonemes.  Thus "hello" becomes "h eh l oe".

Prosody
Chooses speed, pitch, and volume for words in a sentence.  This helps the voice sound less monotonous and actually makes it easier to understand.  Most engines, for example, will emphasize the last word in a question.

Concatenate wave segments
This important last step "munges" the various phonemes together for a more natural sound.  (No matter how fast you say "sh" and "oo" together, it's not going to sound very much like the word "shoe".)
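
To make the first stage of the pipeline more concrete, here is a toy sketch of text normalization.  It is not how the Speech SDK does it: NormalizeText and its small abbreviation table are made up for illustration, and it only performs fixed substitutions.

```vb
' Toy text normalization: expand a few abbreviations before synthesis.
' The abbreviation table is invented for this example.
Public Function NormalizeText(ByVal Text As String) As String
    Dim Abbrev As Variant
    Dim Spoken As Variant
    Dim i As Long

    Abbrev = Array("Dr. ", "Mr. ", "vs. ")
    Spoken = Array("Doctor ", "Mister ", "versus ")

    ' Replace each abbreviation with its spoken form, one at a time.
    For i = LBound(Abbrev) To UBound(Abbrev)
        Text = Replace(Text, Abbrev(i), Spoken(i))
    Next i

    NormalizeText = Text
End Function
```

NormalizeText("Dr. Smith") returns "Doctor Smith".  A real engine has to use context as well, since "Dr." can also mean "Drive", which is the same kind of problem the homograph stage deals with.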

The Search for a Cost Effective Engine
I spent several weeks downloading and evaluating TTS engines.  Some of the inexpensive shareware offerings I found were laughably bad, while many of the high quality products from companies like AT&T and Lucent were prohibitively expensive. 

Fortunately, I managed to find a very high quality engine that is also FREE!  It's the Speech SDK from Microsoft.  The large download not only provides TTS capabilities, it includes a top notch voice recognition engine.  I also found extensive documentation, sample applications, and more.   Plus, you can use it with Visual Basic or C++, so it was perfect for my needs.

Sample #4 (Speech SDK) - Microsoft Corporation (Albert's voice!)

Using Microsoft's Speech SDK
After I read the documents and looked at the sample code, it was very easy to add speech to my Visual Basic program.  I listened to all the available voices and chose "RoboSoft Three".  I then dragged the DirectSS1 control onto my form and added the following code to my Form_Load event:

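     ' Find the "RoboSoft Three" voice among the installed TTS modes.
     ' CountEngines is assumed to be the control's count of installed modes;
     ' it bounds the loop, so a missing voice falls back to the last mode
     ' instead of running past the end of the list.
     i = 1
     While i < DirectSS1.CountEngines And DirectSS1.ModeName(i) <> "RoboSoft Three"
          i = i + 1
     Wend
     DirectSS1.Select i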

My "Say" Subroutine
I chose to use the TextData method, rather than the Speak method, because it allows me to use "tags" to embed comments and bookmarks, or to emphasize or de-emphasize different words, etc.  A typical command looks like this: 

     Form1.DirectSS1.TextData CHARSET_TEXT, 1, "Hello, my name is Albert."

I didn't want to have to type this lengthy line every time I wanted to make Albert talk, so I wrote a subroutine called "Say" that accepts a text string or variable.  The sub currently looks like this (I'm going to change it to allow Albert to "interrupt" himself on some occasions):


     ' Voice a literal string through the DirectSS control on Form1.
     Public Sub Say(Arg As String)
          Form1.DirectSS1.TextData CHARSET_TEXT, 1, Arg
     End Sub


With this subroutine I can have Albert speak any text string I want:

   Say "Hello, world."

Or I can pass a variable or function to the Say subroutine like this:

     Say Direction() 
     [Direction() is a function that returns a string, such as "South"]

My "Speak" Subroutine
Unlike the "Say" routine, which is literal, the Speak command takes a category, allowing the robot to choose an appropriate comment based on his mood.  To explain this further: when Albert is first turned on, there is code to initialize his variables and systems, and a line that looks like this: Speak ("Greeting").  The Speak subroutine checks Albert's mood and picks the appropriate remark.  If he's angry, for example, he might say, "I don't want to get up.  Go away."

I'm actually kind of pleased with the code I wrote to handle the mood-based commentary.  It takes very few lines, is easy to understand and update, and it allows me to add new comments at any time.  After experimenting with nested Case statements and multiple If/Then commands to handle random variations, I hit on the idea of creating a two-dimensional array called Vocal(x,y).  The first index (x) is Albert's mood, represented as a number from 1 to 9.  The second index (y) is the response number, which currently ranges from 1 to 10.  Here's an example to help clarify things:

     Vocal(happy,1) = "I'm happy"

Happy=6, so the array element Vocal(6,1) holds the string "I'm happy".  This gets passed to Say() for voicing.  (The 1 indicates that the string is response #1.)

Sometimes I'll want to have several responses for a particular mood and I'd like to have the computer pick one randomly. In this case, I would set Vocal(happy,0) equal to the total number of possibilities. Then I add new options by simply increasing the reference number (y). Thus:

     Vocal(happy, 0) = 3 'There are 3 responses for this mood, so pick randomly
     Vocal(happy, 1) = "I'm happy"
     Vocal(happy, 2) = "I'm quite happy today"
     Vocal(happy, 3) = "What a great day today!"

Here's the source for the Speak routine.  (Note that Mood() is a function.)


     Public Sub Speak(Arg As String)
     Dim Vocal(9, 10) 'mood, room for random responses

     Randomize

     Select Case Arg

          Case "Greeting" 'Used when Albert is first turned on
                Vocal(deprs, 1) = "I'm depressed"
                Vocal(secur, 1) = "I'm secure"
                Vocal(plyfl, 1) = "I'm playful."
                Vocal(upset, 1) = "I'm upset"
                Vocal(norml, 1) = "I'm normal"
                Vocal(happy, 1) = "I'm happy"
                Vocal(angry, 1) = "I'm angry"
                Vocal(afrad, 1) = "I'm afraid"
                Vocal(cncrn, 0) = 3
                Vocal(cncrn, 1) = "I'm concerned 1"
                Vocal(cncrn, 2) = "I'm concerned 2"
                Vocal(cncrn, 3) = "I'm concerned 3"

     End Select

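     ' Pick a random response numbered 1 to Vocal(mood, 0). A count that
     ' was never set is treated as 0, which always selects response 1.
     Say (Vocal(Mood(), Int((Vocal(Mood(), 0) * Rnd) + 1)))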

     End Sub


There are some additional benefits to this code that make it even more useful.  If there is no dialogue for a particular mood/category, then Albert will just skip it and not say anything.  Also, I am able to set Vocal(x,0) to a higher number than the total of random responses (up to the array's limit of 10), and the robot will simply stay silent whenever one of the empty slots is picked.  To illustrate, let's say that the "BumpedWall" category has 3 random responses for the Normal mood, but I only want Albert to say something half the time.  I could set this up by making Vocal(norml,0) = 6 in the BumpedWall case.
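
The index arithmetic behind this trick can be pulled out into a small helper to see why it works.  This is only a sketch: ResponseIndex and SpeakChance are hypothetical names that are not part of the Speak routine above, and R stands in for the value that Rnd would return.

```vb
' Hypothetical helper that isolates the index arithmetic used in Speak.
' Count is the value stored in Vocal(mood, 0); R stands in for Rnd and
' must satisfy 0 <= R < 1.
Public Function ResponseIndex(ByVal Count As Long, ByVal R As Single) As Long
    ResponseIndex = Int(Count * R) + 1
End Function

' Fraction of the equally likely slots that actually hold text, given
' that only the first Filled slots were assigned a string.
Public Function SpeakChance(ByVal Filled As Long, ByVal Count As Long) As Double
    If Count < 1 Then Count = 1              ' an unset count always picks slot 1
    If Filled > Count Then Filled = Count    ' slots past Count are never picked
    SpeakChance = Filled / Count
End Function
```

ResponseIndex(0, R) is always 1, which is why a mood with a single response needs no count at all.  ResponseIndex(6, 0.5) is 4, an empty slot when only three responses exist, so nothing is said; SpeakChance(3, 6) gives 0.5, matching the "half the time" setup.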