Report: Speak, Cybert, Speak
Originally written for Albert on 6/29/99

NOTE:  (Added 8/20/99)
I recently found a URL that will allow you to compare various text-to-speech engines (it's free).  http://www.research.att.com/projects/tts/

Goals
It has always been my intention to give Cybert a large vocabulary of words and phrases.  Not only will this make him more entertaining, but he will also be better able to communicate his moods and even what he is "thinking" (which will be helpful for debugging).  Since my PC has a permanent connection to the Internet, I plan to have Cybert "tell me" about current news, weather, stock reports, a "joke of the day" and more.  I will merely have to extract the text from the HTML and other data I receive.

Since Cybert is controlled from my home PC and not a smaller processor with a small amount of RAM, I will have nearly limitless storage capacity for Cybert's "dialogue."  I've done a lot of creative writing and even scripted my own material while doing standup comedy in high school and college; so I look forward to writing the endless streams of clever commentary that will really bring Cybert to life for my family and friends.

Here is a more succinct list of my speech goals:  

  • Give Cybert lots of things to say
  • Vary what Cybert says based on his mood
  • Vary the sound of Cybert's voice based on his mood
  • Make dialogue entry (and updates) simple and painless
  • Allow Cybert to instantly vocalize information downloaded off the Internet

How Does a Text-to-Speech (TTS) Engine Work?
Text-to-speech fundamentally functions as a pipeline that converts text into PCM digital audio. The elements of the pipeline are:

Text normalization
Converts input text into a series of words, and handles abbreviations, numbers, etc.  ("Dr. Smith" becomes "Doctor Smith".   "$25.00" becomes "Twenty five dollars".)

Homograph disambiguation
Uses contextual rules to figure out the pronunciation of words that are spelled the same, but sound different.  (A good engine will correctly voice: "Tomorrow I will read the same story I read today")

Word pronunciation
Breaks down the final list of words into a sequence of dictionary-like phonemes.  Thus "hello" becomes "h eh l oe".

Prosody
Chooses speed, pitch, and volume for words in a sentence.  This helps the voice sound less monotonous and actually makes it easier to understand.  Most engines, for example, will emphasize the last word in a question.

Concatenate wave segments
This important last step "munges" the various phonemes together for a more natural sound.  (No matter how fast you say "sh" and "oo" together, it's not going to sound very much like the word "shoe".)
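The first and third stages above can be sketched in a few lines.  This is a toy illustration in Python, not how the Microsoft engine (or any real engine) is implemented.  The lookup tables and the function names normalize, number_to_words and pronounce are all hypothetical, and the number handling only covers whole-dollar amounts under $100.

```python
import re

# Hypothetical lookup tables, just big enough for the examples in the text.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Toy pronunciation dictionary using the phoneme spelling from the text.
PHONEMES = {"hello": "h eh l oe"}


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def normalize(text):
    """Expand known abbreviations and whole-dollar amounts under $100."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)

    def dollars(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return number_to_words(amount) + " " + unit

    return re.sub(r"\$(\d{1,2})\.00\b", dollars, text)


def pronounce(word):
    """Look a word up in the toy dictionary; None means 'unknown word'."""
    return PHONEMES.get(word.lower())


print(normalize("Dr. Smith owes me $25.00"))
print(pronounce("Hello"))
```

A real engine would fall back on letter-to-sound rules for words missing from its dictionary; the sketch simply returns None in that case.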

The Search for a Cost Effective Engine
I spent several weeks downloading and evaluating TTS engines.  Some of the inexpensive shareware offerings I found were laughably bad, while many of the high quality products from companies like AT&T and Lucent were prohibitively expensive. 

Fortunately, I managed to find a very high quality engine that is also FREE!  It's the Speech SDK from Microsoft.  The large download not only provides TTS capabilities, it includes a top notch voice recognition engine.  I also found extensive documentation, sample applications, and more.   Plus, you can use it with Visual Basic or C++, so it was perfect for my needs.

Hear Cybert's Voice! - Microsoft's Speech SDK (4.0)

Using Microsoft's Speech SDK
After I read the documents and looked at the sample code, it was very easy to add speech to my Visual Basic program.  I listened to all the possible voices and chose "RoboSoft Three".  Then I dragged the DirectSS1 control to my form and added the following code to my Form_Load event:

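     ' Step through the installed voices until the one named "RoboSoft Three" is found.
     ' (This assumes that voice is installed; there is no check for running off the end of the list.)
     i = 1
     While DirectSS1.ModeName(i) <> "RoboSoft Three"
          i = i + 1
     Wend
     DirectSS1.Select i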

My "Say" Subroutine
I chose to use the TextData method, rather than the Speak method, because it allows me to use "tags" to embed comments and bookmarks, or to emphasize or de-emphasize different words, etc.  A typical command looks like this: 

     Form1.DirectSS1.TextData CHARSET_TEXT, 1, "Hello, my name is Cybert."

I didn't want to have to type this lengthy line every time I wanted to make Cybert talk, so I wrote a subroutine called "Say" that allows me to pass a text string or variable.  The sub currently looks like this (I'm going to change it to allow Cybert to "interrupt" himself on some occasions):


     Public Sub Say(Arg As String)
     Form1.DirectSS1.TextData CHARSET_TEXT, 1, Arg
     End Sub


With this subroutine I can have Cybert speak any text string I want:

   Say "Hello, world."

My "Speak" Subroutine
Unlike the "Say" routine, which is literal, the Speak command passes a category, allowing the robot to choose an appropriate comment based on his mood.  To explain this further:  when Cybert is first turned on there is code to initialize his variables and systems, and a line that looks like this:

    Speak ("Greeting") 

The Speak subroutine checks Cybert's current mood (1-9) and picks the appropriate remark.  If he's angry, for example, he might say, "I don't want to wake up.  Go away."

I'm actually kind of pleased with the code I wrote to handle the mood-based commentary.  It takes very few lines, is easy to understand and update, and lets me add new comments at any time.  After experimenting with nested Case statements and multiple If/Then commands to handle random variations, I hit on the idea of creating a two dimensional array called Vocal(x,y).  The first index (x) is Cybert's mood, represented as a number from 1 to 9.  The second index (y) is the response number, which lets me store up to 10 "random" responses per mood.  Here's an example to help clarify things:

     Vocal(happy,1) = "I'm happy"

Happy=6, so the array element Vocal(6,1) holds the string "I'm happy".  This gets passed to Say() for voicing.  (The 1 indicates that the string is response #1; there can be up to nine others, and the computer will automatically pick one at random.)

EXAMPLE:  To make the computer pick from one of three possible "happy" responses, I would set Vocal(happy,0) equal to 3.  Thus:

     Vocal(happy, 0) = 3
     Vocal(happy, 1) = "I'm happy" '(The first possible response)
     Vocal(happy, 2) = "I'm quite happy today" '(Second possible response)
     Vocal(happy, 3) = "What a great day today!" '(Third possible response)
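The same "count in slot 0" layout can be tried outside Visual Basic.  Below is a minimal Python sketch of the idea, not a translation of my actual routine.  The names vocal and pick_response are made up for the illustration, and a dictionary stands in for the two dimensional array.

```python
import random

HAPPY = 6  # mood number used in the text

# vocal[mood][0] holds the number of responses; slots 1..n hold the responses.
vocal = {
    HAPPY: {
        0: 3,
        1: "I'm happy",
        2: "I'm quite happy today",
        3: "What a great day today!",
    }
}


def pick_response(table, mood, rng=random):
    """Pick one response at random for the given mood."""
    responses = table.get(mood, {})
    count = responses.get(0, 1)           # no count stored: use response 1
    slot = int(count * rng.random()) + 1  # same idea as Int(count * Rnd) + 1
    return responses.get(slot, "")        # empty string means "say nothing"


print(pick_response(vocal, HAPPY))
```

Because rng.random() is always less than 1, the slot number runs from 1 to the stored count, so each of the three responses has an equal chance.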

Here's the source for the Speak Routine:   (Note that mood() is a function.)


     Public Sub Speak(Arg As String)
     Dim Vocal(9, 10) '(mood, room for up to 10 random responses)

     Randomize

     Select Case Arg

          Case "Greeting" '(Used when Cybert is first turned on)
                Vocal(deprs, 1) = "I'm depressed"
                Vocal(secur, 1) = "I'm secure"
                Vocal(plyfl, 1) = "I'm playful."
                Vocal(upset, 1) = "I'm upset"
                Vocal(norml, 1) = "I'm normal"
                Vocal(happy, 1) = "I'm happy"
                Vocal(angry, 1) = "I'm angry"
                Vocal(afrad, 1) = "I'm afraid"
                Vocal(cncrn, 0) = 3
                Vocal(cncrn, 1) = "I'm concerned #1" '(1/3 chance for this response)
                Vocal(cncrn, 2) = "I'm concerned #2" '(1/3 chance for this response)
                Vocal(cncrn, 3) = "I'm concerned #3" '(1/3 chance for this response)

          Case "BatteriesLow" '(Triggered when battery level falls below 11 volts)
                Vocal(deprs, 1) = "I'm running out of power... but who cares..."
                Vocal(secur, 1) = "My power is running low, I'm off to the charger!"
                Vocal(plyfl, 1) = "When my battery is charged I want to play a game!"
                Vocal(upset, 1) = "Great! Now I'm running low on battery power."
                Vocal(norml, 1) = "I need to recharge my batteries."
                Vocal(happy, 1) = "Oops! Low battery.  Gotta charge!"
                Vocal(angry, 1) = "DAMN IT!  I'm out of juice."
                Vocal(afrad, 1) = "My battery is low!  I'm going to die!  HELP ME!"
                Vocal(cncrn, 1) = "My battery is really low.  I'd better find my charger."

     End Select

     Say (Vocal(Mood(), Int((Vocal(Mood(), 0) * Rnd) + 1)))

     End Sub


This code has some additional benefits.  If there is no dialogue for a particular mood/category, Cybert will just skip it and not say anything.  Also, I can set Vocal(x,0) to a number higher than the actual count of random responses, and the robot will say nothing whenever the random pick lands on an empty slot.  This lets me easily tell Cybert to say a random comment only X% of the time.  To illustrate, let's say that the "BumpedWall" category has 3 random responses for the Normal mood, but I only want Cybert to say something half the time.  I could set this up by making Vocal(norml,0) = 6 in the BumpedWall case.  If the random number comes up "2", Cybert says his second response to the BumpedWall event.  But if the random number is "5", he doesn't say anything, since there is no Vocal(norml,5) comment.
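The "say something only part of the time" trick is easy to check with a quick simulation.  This Python sketch is only an illustration under assumed names (maybe_respond and the three sample lines are made up, not taken from Cybert's code).  It stores 3 responses, sets the count to 6, and measures how often anything gets said.

```python
import random


def maybe_respond(responses, count, rng=random):
    """Pick slot 1..count; slots beyond the stored responses stay silent."""
    slot = int(count * rng.random()) + 1
    if slot <= len(responses):
        return responses[slot - 1]
    return ""  # empty slot: say nothing


# Hypothetical BumpedWall lines for the Normal mood.
bumped_wall = ["Ouch.", "Who put that wall there?", "Excuse me."]

rng = random.Random(1999)  # fixed seed so the run is repeatable
trials = 10000
spoken = sum(1 for _ in range(trials) if maybe_respond(bumped_wall, 6, rng))
print("Spoke on", spoken, "of", trials, "bumps")
```

With 3 responses and a count of 6, roughly half of the bumps should produce a comment.  In general the chance of speaking is the number of stored responses divided by the count.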
 

 


Copyright 1999-2000, John Cutter.
For feedback, problems, questions, or to share your own stories or ideas, please contact john@home-robot.com.
Last updated: October 25, 1999.