Changing Alexa’s Voice

Variety is a hallmark of human speech. It’s what keeps us from sounding robotic and canned. When asking a question, we might apply a different inflection than when making a statement. When we’re excited or disappointed about something, we’ll speak with a great deal of emotion, perhaps at a different rate or in a different tone than our default speech patterns.

Likewise, if Alexa is to be accepted as anything more than a machine, it’s important that her responses exhibit some of the same variety as human speech, with variations in tone, rate, volume, excitement, and disappointment. With SSML we can adjust all of these things and even swap Alexa’s voice for a completely different one.

Let’s have a look at how to fine-tune how loudly, how quickly, and at what pitch Alexa speaks.

Adjusting Prosody

In linguistics, prosody is a term used to describe things such as tone, stress, and rhythm of speech. In SSML, prosody can be specified with the <prosody> tag and is specifically concerned with volume, rate, and pitch.

As an example of using the <prosody> tag, let’s say you want Alexa to speak a word or phrase at a different volume than normal. The <prosody> tag’s volume attribute can help, as shown in the following SSML snippet:

 <speak>
  Welcome to <prosody volume="x-loud">Star Port 75 Travel</prosody>!
 </speak>

The volume attribute accepts one of six predefined values: “x-loud”, “loud”, “medium”, “soft”, “x-soft”, and “silent” (no sound whatsoever). While none of these values will result in a volume that is dramatically different from Alexa’s normal volume, they do have a subtle effect on how loudly she speaks.

You can also specify a relative volume that is either greater than or less than the current volume:

 <speak>
  Welcome to <prosody volume="+6dB">Star Port 75 Travel</prosody>!
 </speak>

In this case, the value “+6dB” is about twice the current volume. Similarly, “-6dB” would be approximately half the current volume. Take care; volumes much greater than “+6dB” relative to the default volume might result in distortion in the response.
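
As a sketch of the opposite adjustment, the following snippet lowers the volume by the same amount. If these decibel values are read as changes in amplitude, 6 dB corresponds to a factor of 10^(6/20), or roughly 2, which is where "twice" and "half" come from:

```xml
<speak>
  Welcome to <prosody volume="-6dB">Star Port 75 Travel</prosody>!
</speak>
```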

The <prosody> tag can also alter the rate at which Alexa speaks one or more words. For example, consider the following use of <prosody> to have her speak slower than normal:

 <speak>
  Welcome to <prosody rate="x-slow">Star Port 75</prosody> Travel!
 </speak>

If you were to paste this into the text-to-speech simulator, Alexa would speak most of the sentence at a normal rate, but would slow down significantly when saying, “Star Port 75.”

Like the volume attribute, the rate attribute accepts a handful of predefined values, including “x-slow”, “slow”, “medium”, “fast”, and “x-fast”. But you can also specify a relative value as a percentage of the current rate, where “100%” is equal to the current rate. For example, to have her speak “Star Port 75” twice as fast as normal, set the rate attribute to “200%”:

 <speak>
  Welcome to <prosody rate="200%">Star Port 75</prosody> Travel!
 </speak>

On the other hand, if you want Alexa to slow down significantly when saying, “Star Port 75,” you can set the rate to “20%”, which is the minimum allowed value:

 <speak>
  Welcome to <prosody rate="20%">Star Port 75</prosody> Travel!
 </speak>

One other attribute of prosody that can be controlled is pitch. Pitch specifies how high or low Alexa’s voice is when she speaks. For example, if you want her to speak in a rather low tone, you can set the pitch attribute to “x-low”:

 <speak>
  Welcome to <prosody pitch="x-low">Star Port 75</prosody> Travel!
 </speak>

In addition to predefined pitch values—“x-low”, “low”, “medium”, “high”, and “x-high”—you can specify a relative pitch as a positive or negative percentage (where “+0%” is normal pitch):

 <speak>
  Welcome to <prosody pitch="-33.3%">Star Port 75</prosody> Travel!
 </speak>

In this case, “-33.3%” is the minimum allowed value and is equivalent to setting pitch to “x-low”. Similarly, “+50%” is the maximum allowed value and is the same as setting pitch to “x-high”.
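
For comparison, here is a sketch of the other extreme, raising the pitch to the maximum described above. Pasting both versions into the text-to-speech simulator is probably the easiest way to hear the full range:

```xml
<speak>
  Welcome to <prosody pitch="+50%">Star Port 75</prosody> Travel!
</speak>
```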

You are welcome to mix and match volume, rate, and pitch as you see fit for some interesting effects. The following snippet of SSML, for instance, has Alexa saying “Star Port 75” in an amusingly low, slow, and loud voice:

 <speak>
  Welcome to
  <prosody pitch="-33.3%"
  rate="20%"
  volume="x-loud">Star Port 75</prosody> Travel!
 </speak>

You’ll definitely want to paste this example into the text-to-speech simulator and give it a try. It should give you some idea of how to alter Alexa’s voice to make her sound as if she may have had too much to drink.

While the <prosody> tag gives you near complete control over volume, rate, and pitch, these attributes are often used in combination to apply emphasis to a word or phrase. When emphasis is the desired outcome, the <emphasis> tag offers a simpler way to control rate and volume:

 <speak>
  Welcome to
  <emphasis level="strong">Star Port 75</emphasis> Travel!
 </speak>

Here, a strong emphasis is applied, resulting in a slower rate and increased volume, much as a parent might speak to a child when they’re in trouble. On the other hand, setting level to “reduced” will have Alexa speak quicker and at a lower volume, much like a teenager might speak when telling their parent that they’ve wrecked the family car.
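
A sketch of the “reduced” level, using the same greeting, might look like the following. Amazon’s SSML reference also lists a “moderate” level that falls between the two:

```xml
<speak>
  Welcome to
  <emphasis level="reduced">Star Port 75</emphasis> Travel!
</speak>
```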

While prosody can add some dramatic effects to the words that Alexa speaks, it has its limits. Meanwhile, language is filled with brief words or phrases that express a great deal of emotion beyond what prosody can handle. Let’s see how Alexa can speak with excitement and disappointment with interjections.

Adding Interjections

Imagine you are watching your favorite sports team compete in the championship game. The competition is down to the final moments and the score is tied. It’s clear that the coach must put in his star players. They’re the only ones who can pull out a win. Inexplicably, however, the coach calls in a rookie who hasn’t seen any game time all season long. In response, you yell “Boo!” with a tone of judgmental negativity. As you continue to watch the game and the timer ticks down to the final seconds, the unseasoned rookie shocks everyone when they score, winning the game for the team! Everyone jumps to their feet and shouts “Yay!” enthusiastically.

Now, imagine that same scenario, only with the words “boo” and “yay” said in an emotionless, deadpan tone. If you can imagine it, you realize that those words, as insignificant as they may seem, carried a lot more value when they were said with emotion than when said without.

Certain words and phrases are expected to be said with excitement or disappointment. “Boo” and “yay” are two such words. “Holy smokes,” “aw man,” and “Great Scott” are a few more. These are interjections, usually used as an exclamation in speech, that just sound wrong when said flat and without emotion.

Although it is possible to have Alexa speak interjections without any special SSML handling, she’ll say them in a matter-of-fact tone, without any emotion or expression. For example, try the following SSML in the text-to-speech simulator:

 <speak>
  Great Scott!
 </speak>

Despite the exclamation mark, Alexa will simply say the words without any feeling. But by applying SSML’s <say-as> tag, we can liven up the phrase:

 <speak>
  <say-as interpret-as="interjection">Great Scott!</say-as>
 </speak>

The interpret-as attribute indicates that the contents of the <say-as> tag should be read as an interjection, more expressively than without the <say-as> tag.

It’s important to understand that officially, there are only so many words and phrases that can be used as interjections. These are referred to as speechcons in Alexa’s documentation.[28] Even so, you might find other phrases that sound good when wrapped with <say-as> as interjections, so feel free to experiment as much as you like.
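
Interjections can also sit alongside ordinary text in the same response. The following is a minimal sketch that assumes “holy smokes” is on the speechcon list for your skill’s locale; check the list before relying on it:

```xml
<speak>
  <say-as interpret-as="interjection">Holy smokes!</say-as>
  Welcome to Star Port 75 Travel!
</speak>
```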

Prosody and speechcons help give more character to Alexa’s natural voice. But we can take it further. Let’s see how to make Alexa speak with excitement or disappointment.

Applying Emotion

Imagine that you’re developing a game skill and want to cheer on the user, congratulating them on doing well. If you were to have Alexa simply say, “Way to go! That was awesome!” it would come out kind of flat and emotionless. Similarly, if you want Alexa to console the player when things don’t go well, you might have her say, “Aw, that’s too bad.” But without emotion, it will seem insincere.

Because Alexa is a computerized voice assistant, there’s not much you can do to make her actually feel excitement or disappointment. But with the <amazon:emotion> tag, you can make her sound as if she’s excited or bummed out.

Applying the <amazon:emotion> tag, we can have her congratulate the user like this:

 <amazon:emotion name="excited" intensity="medium">
  Way to go! That was awesome!
 </amazon:emotion>

Or, to express a sincere feeling of disappointment when things don’t work out, you can use the <amazon:emotion> tag like this:

 <amazon:emotion name="disappointed" intensity="medium">
  Aw, that's too bad.
 </amazon:emotion>

In either case, the intensity attribute controls how excited or disappointed Alexa sounds. If, after trying the <amazon:emotion> tag, you feel that she could be more excited or disappointed (or perhaps less so), set intensity to “low”, “medium”, or “high”.
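
To hear the range, you might try the two ends of the scale side by side. This sketch wraps both phrases in a single <speak> element purely so they can be compared in the text-to-speech simulator; a real response would normally use one or the other:

```xml
<speak>
  <amazon:emotion name="excited" intensity="high">
    Way to go! That was awesome!
  </amazon:emotion>
  <amazon:emotion name="disappointed" intensity="low">
    Aw, that's too bad.
  </amazon:emotion>
</speak>
```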

Applying Domain-Specific Speech

Have you ever noticed how the anchorpersons on the news often speak in a peculiar tone that, despite not sounding natural, is clearly recognizable as the “news voice”? Or how the radio DJ on the local Top 40 station speaks with an energetic voice that hypes up the music to be played?

While Alexa’s native voice could be used to read news articles or announce the next hit song on the radio, it wouldn’t express the same tone we’ve come to expect from newscasters and radio DJs.

To put Alexa into those roles, you can use the <amazon:domain> tag. This tag has a single name attribute that specifies the desired voice domain: “news”, “music”, “long-form”, “conversational”, or “fun”. Each of these domain-specific voice styles gives a unique twist to how Alexa says the given text.

For example, to have Alexa read a news article in a voice like that of a newscaster, you can use <amazon:domain> like this:

 <amazon:domain name="news">
  This just in: A local resident reported that he was frightened by a
  mysterious bright light shining through the trees behind his home.
  Officers responded, relying on their extensive training, and
  determined that the offending light source was not an alien spacecraft
  as originally suspected, but was, in fact, the earth's moon.
 </amazon:domain>

The “news” domain can be combined with the <voice> tag for a few of the alternate voices, including Matthew, Joanna, and Lupe. Joanna’s voice sounds particularly good when reading the news clip:

 <voice name="Joanna">
  <amazon:domain name="news">
  This just in: A local resident reported that he was frightened by a
  mysterious bright light shining through the trees behind his home.
  Officers responded, relying on their extensive training, and
  determined that the offending light source was not an alien spacecraft
  as originally suspected, but was, in fact, the earth's moon.
  </amazon:domain>
 </voice>

Similarly, if you set the name attribute to “music”, her tone will make her sound like she is announcing the next song on the local radio station’s rush hour playlist:

 <amazon:domain name="music">
  That was "Immigrant Song" by Led Zeppelin. We've got songs by Ozzy
  Osbourne, Van Halen, and Scorpions coming up for your drive home. But
  first, here's "Highway Star" by Deep Purple on the Rockin' 98.1 FM.
 </amazon:domain>

Unfortunately, the “music” domain is incompatible with the <voice> tag. If used together, the <amazon:domain> tag will have no effect.

Another domain that Alexa may speak in is the “long-form” domain. This is useful when she is reading a lengthy span of text, such as a passage from a book. For example, here’s how to use the “long-form” domain to read the opening lines of Moby Dick:

 <amazon:domain name="long-form">
  Call me Ishmael. Some years ago - never mind how long precisely - having
  little or no money in my purse, and nothing particular to interest
  me on shore, I thought I would sail about a little and see the
  watery part of the world. It is a way I have of driving off the
  spleen and regulating the circulation.
 </amazon:domain>

As with the “music” domain, you can’t combine <voice> and <amazon:domain> when the domain name is “long-form”.

The “conversational” domain causes Alexa to speak in a relaxed voice, as if she’s speaking with friends. Try the following SSML, both with and without the <amazon:domain> tag, and see if you can hear the difference:

 <voice name="Matthew">
  <amazon:domain name="conversational">
  Have you read any good books lately? I just finished a
  book about anti-gravity. I couldn't put it down.
  </amazon:domain>
 </voice>

Per Amazon’s documentation, the “conversational” domain must be used with either Matthew’s or Joanna’s voice. But even if you try it with Alexa’s natural voice, there’s still a noticeable difference.
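
Swapping in Joanna, the other voice the documentation pairs with this domain, is a one-attribute change. A sketch reusing the same text:

```xml
<speak>
  <voice name="Joanna">
    <amazon:domain name="conversational">
      Have you read any good books lately? I just finished a
      book about anti-gravity. I couldn't put it down.
    </amazon:domain>
  </voice>
</speak>
```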

There’s one more domain that you can try. The “fun” domain causes Alexa to speak in a friendly and animated tone. It can be used like this:

 <lang xml:lang="ja-JP">
  <amazon:domain name="fun">
  今日はこれまでで最高の日です!
  </amazon:domain>
 </lang>

Unfortunately, the “fun” domain only works with Japanese skills. Even if you set the language to “ja-JP”, if the skill’s manifest does not designate this as a Japanese skill, then the <amazon:domain name="fun"> tag will have no effect. Also, the “fun” domain cannot be mixed with the <voice> tag.

All of the SSML tricks we’ve seen thus far make subtle changes to Alexa’s tone. Even so, maybe you’d rather your responses be in a completely different voice than Alexa’s. Let’s see how to swap out Alexa’s voice for one of several alternate voices.

Switching to an Alternate Voice

In addition to Alexa, Amazon has several other voice-related projects, including an interesting one called Polly.[29] Polly employs machine learning to synthesize voices that sound very natural and realistic, including voices suited for a number of languages and voices with specific accents associated with a locale.

Using SSML’s <voice> tag, we can tap into a select subset of Polly voices and use them in responses from our Alexa skills as a direct replacement for Alexa’s voice.

For example, suppose that we would like our skill to greet users with the voice of a young boy. In that case, we can apply the voice of Justin, one of the Polly voices supported by Alexa:

 <speak>
  <voice name="Justin">Welcome to Star Port 75 Travel!</voice>
 </speak>

Similarly, we can use the voice of Amy to hear the greeting spoken as if by a British woman:

 <speak>
  <voice name="Amy">Welcome to Star Port 75 Travel!</voice>
 </speak>

Not all Polly voices are supported by Alexa. There are, however, over two dozen voices, both female and male, spanning several languages and locales.[30] It’s also important to know that the <voice> tag can be used in combination with all other SSML tags except <say-as> when using interjections.
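
As a sketch of combining <voice> with another tag, the following nests <prosody> inside Amy’s greeting. The slow rate is chosen only for illustration:

```xml
<speak>
  <voice name="Amy">
    Welcome to <prosody rate="slow">Star Port 75</prosody> Travel!
  </voice>
</speak>
```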

Even if you’re happy with how Alexa’s voice sounds, you may need to adjust how she pronounces words. Let’s look at a few ways to guide Alexa’s pronunciation.
