Chapter 2. Existing APIs and Libraries

Although digital voice interfaces have been around for several decades, we are really just now at the cusp of advanced voice interfaces, especially in the realm of Voice-as-a-Service (VaaS) and intelligent, near human-like voice-enabled virtual assistants. In this chapter, we will review some of the services currently available, including Amazon Alexa, Microsoft Cognitive Services, Cortana, and Google Cloud Speech API.

While there are plenty of services to choose from these days, each offers unique functionality that you will want to consider when building out your own service or device. Some are easier than others to implement but aren’t as flexible or customizable (yet may be sufficient for your use case). For example, Amazon Alexa might be a great choice for a virtual assistant with a lot of pre-packaged features, but if you’re looking for more straightforward speech-to-text, Nuance Mix or Amazon Lex may be a better option.

In addition to exploring some of the services available today, we will also touch a bit on some technical architectures, both those you might encounter as well as ones you might build while designing your own project. To start things off, let’s take a deeper look at Amazon Alexa.

Amazon Alexa

With the launch of the Amazon Echo in November 2014, Amazon helped propel home-based voice interfaces into the mainstream: using its cloud-based voice assistant “Alexa,” anyone within proximity of the device can make a multitude of requests without lifting a finger. The Echo essentially acts as an IoT hub: it’s connected to the internet and can send commands to other IoT devices. In adding a voice interface, Amazon has successfully integrated voice and IoT together into a simple, non–screen-based home appliance that just about anyone can use (Figure 2-1).

Figure 2-1. Amazon Echo 2014

In developing the Echo and Alexa, Amazon immediately recognized the value in developing a rich, third-party developer community by exposing some extremely simple RESTful HTTP APIs and developer tools. This allowed Amazon to integrate with services such as Uber, Nest, IFTTT, Insteon, Wemo, SmartThings, and many others. In turn, this gave end users a broader selection of voice-enabled services.

Additionally, Alexa and Echo are actually two separate offerings: Alexa is a cloud-based voice service while Echo is a Bluetooth and WiFi device that connects to Alexa. This also allows Amazon to connect Alexa to additional proprietary devices (e.g., the Amazon Fire product line, Echo Dot, and Amazon Tap) as well as third-party hardware makers (e.g., Nucleus, CoWatch, and Triby).

Alexa Skills Kit (ASK)

As a developer, you can also take advantage of the Amazon APIs and make similar products. Currently, there are two main Alexa developer toolkits: the Alexa Skills Kit (ASK) and Alexa Voice Service (AVS). The ASK is what you use to create skills for Alexa. Skills are essentially voice apps; for instance, when you say, “Alexa, ask Uber for a ride,” the word “Uber” refers to the Uber skill. Just like a human, the more skills Alexa learns, the smarter she gets.

The ASK supports several request types. The first one we’ll look at is the LaunchRequest, which can be used to open a dialogue with the user. When a user invokes a skill with the wrong command—for example, “Alexa, start Uber” instead of “Alexa, ask Uber for a ride”—you could program the skill to explain how it is used, suggest possible commands or questions to ask, and then follow up with a question about what to do next in order to keep the conversation going or execute a command. Here’s an example of a basic skill request of type LaunchRequest posted from Alexa to a third-party endpoint:

POST / HTTP/1.1
Content-Type: application/json;charset=UTF-8
Host: your.application.endpoint
Content-Length:
Accept: application/json
Accept-Charset: utf-8
Signature: [ omitted for brevity ]
SignatureCertChainUrl: https://s3.amazonaws.com/echo.api/echo-api-cert.pem

{
  "version": "1.0",
  "session": {
    "new": true,
    "sessionId": "amzn1.echo-api.session.0000000-0000-0000-0000-00000000000",
    "application": {
      "applicationId": "amzn1.echo-sdk-ams.app.000000-d0ed-0000
-ad00-000000d00ebe"
    },
    "attributes": {},
    "user": {
      "userId": "amzn1.account.AM3B00000000000000000000000"
    }
  },
  "context": {
    "System": {
      "application": {
        "applicationId": "amzn1.echo-sdk-ams.app.000000-d0ed-0000
-ad00-000000d00ebe"
      },
      "user": {
        "userId": "amzn1.account.AM3B00000000000000000000000"
      },
      "device": {
        "supportedInterfaces": {
          "AudioPlayer": {}
        }
      }
    },
    "AudioPlayer": {
      "offsetInMilliseconds": 0,
      "playerActivity": "IDLE"
    }
  },
  "request": {
    "type": "LaunchRequest",
    "requestId": "amzn1.echo-api.request.0000000-0000-0000-0000-00000000000",
    "timestamp": "2015-05-13T12:34:56Z",
    "locale": "string"
  }
}

As this example demonstrates, Amazon sends a straightforward JSON-formatted HTTP POST request; the receiving endpoint simply processes it and replies with a JSON-formatted response containing plain-text or SSML-formatted strings for Alexa to read back to the user. Recently, Amazon also added the ability to send back the URL of an MP3 file, which the calling Alexa-enabled device can then stream and play for the user.
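For instance, a minimal response to the LaunchRequest above, keeping the session open for a follow-up, might look something like this (the prompt text is purely illustrative):

{
  "version": "1.0",
  "response": {
    "outputSpeech": {
      "type": "PlainText",
      "text": "Welcome! You can ask me for a ride by saying, ask Uber for a ride."
    },
    "shouldEndSession": false
  }
}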

Alexa Voice Service (AVS)

The other toolkit, AVS, is what you can use to connect your hardware to Alexa. For instance, you can start prototyping immediately with a Raspberry Pi 3, a microphone, and a speaker. Drop in some AVS code, and there you have it: your own limited-edition Echo. It’s “limited edition” not just because there is only one, but because the Echo has some additional hardware such as its seven-microphone array, which greatly improves recognition performance. You will still have the ability to ask for things like the weather and time, and all the other skills Alexa has to offer.

Some of AVS’s limitations are geographical. For example, at the time of this writing, iHeartRadio, Kindle, and traffic reports are not available in the UK or Germany, and the availability of some third-party Alexa skills made with the Alexa Skills Kit also depends on geographic location. In addition to geographic limitations, you will want to read the latest version of the AVS Functional Design Guide for the most up-to-date best practices on taking your device to production. For example, you can currently only use “Alexa” as the wake word if you are developing a voice-initiated product.

Another way Amazon and its partners are helping make voice enablement easier is by offering development kits for AVS. In one example, Amazon has partnered with Conexant to make available the Conexant DS20924 AudioSmart 4-Mic Development Kit for AVS (shown in Figure 2-2), which allows developers to rapidly prototype an Alexa voice-enabled device. With a price tag of $349, the kit doesn’t come cheap, but it includes some powerful features such as four-microphone far-field voice interaction with 360-degree Smart Source Pickup and Smart Source Locator, full-duplex Acoustic Echo Cancellation (AEC), and Conexant’s CX22721 Audio Playback CODEC for optimal audio quality.

Figure 2-2. Conexant DS20924 AudioSmart 4-Mic Dev Kit for AVS

Conexant also has a 2-mic version for $299 that’s comparable in price to another AVS kit, Microsemi’s ZLK38AVS AcuEdge Development Kit for AVS (shown in Figure 2-3), which also has a 2-mic array. The AcuEdge also supports beamforming, two-way communication, trigger-word recognition for hands-free support, and Smart Automatic Gain Control. Additionally, it’s square rather than round and is designed as a Raspberry Pi HAT, which makes for a great Pi topping. In other words, it’s designed so the pins plug directly into the Pi for a nice, snug fit.

Figure 2-3. Microsemi’s ZLK38AVS AcuEdge Development Kit for AVS

If you’re on a tight budget, the Matrix Voice (shown in Figure 2-4) is another solid choice. The Matrix Voice is not a listed Amazon AVS–compatible device, but it is designed with AVS in mind. It may be cheaper, but it packs a punch in quality with its 7-microphone array, open source design, and ring of 18 RGB LEDs. Additionally, it supports not only AVS but also Microsoft Cognitive Services, the Google Speech API, Houndify, and others. There’s also a standalone version with an embedded ESP32, a WiFi- and Bluetooth-enabled 32-bit microcontroller.

Figure 2-4. Matrix Voice Open Source Dev Board

The three official Amazon AVS dev kits, in addition to the Matrix Voice, all work with the Raspberry Pi, with undoubtedly more options on the way. While the cost of a kit on top of the Pi and any other components you want to include adds up quickly, it may be the best option for aspiring hardware engineers just starting out: it shortens the learning curve and provides hands-on experience with what the pros are bringing to market. If making your own Alexa-enabled device sounds like something you want to do, you’re in luck! We will take a deep dive into the ASK and AVS starting in Chapter 3, so make sure your workbench is ready to rock.

Amazon Lex

At AWS re:Invent 2016, Amazon announced the opening of additional APIs to help drive the future of conversational interfaces. These APIs aren’t necessarily part of the Alexa offerings, but are among the APIs that actually power Amazon Alexa. Among these newly released APIs is Amazon Lex, Alexa’s deep learning ASR and NLU engine. Now that it is abstracted and available for developers to integrate into their own applications, we can leverage the power of Lex and get even more flexibility out of the Amazon AI stack.

One of the great features Lex offers is the ability to take both plain text and audio streams as input, whereas some services out there only accept plain text and others, such as AVS, only accept an audio stream. However, that doesn’t necessarily mean that Lex is the right solution for you; you might instead want to leverage all the skills available in AVS, such as weather, music, and calendar, or the many thousands of third-party skills available today.
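To give a sense of the text side of the runtime API, a PostText request might look something like the following sketch (the bot name, alias, and user ID are placeholders; consult the Lex documentation for the exact API):

POST /bot/BookTrip/alias/prod/user/user-1/text HTTP/1.1
Content-Type: application/json

{
  "inputText": "Book me a hotel in Miami",
  "sessionAttributes": {}
}

The JSON response includes fields such as intentName, slots, dialogState, and a message string that you can hand off to a TTS engine such as Polly.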

Amazon Polly

While Amazon Lex will help you convert natural language from text and voice into intents, Amazon Polly will help you convert plain-text sentences into voice. In other words, Polly is the TTS engine that gives Alexa her voice. By leveraging Polly as your TTS layer, you can take advantage of additional features such as a selection of 47 near-natural-sounding voices in over 24 languages.

With Polly, you can submit plain text or SSML where you can control the pronunciation, volume, pitch, and speech rate. The output synthesized speech can be encoded in MP3, Ogg Vorbis, or PCM (audio/pcm in a signed 16-bit, 1 channel [mono], little-endian format). This is specified by setting the "OutputFormat" parameter when posting a request to the Amazon Polly service.

Here is an example of the request structure:

POST /v1/speech HTTP/1.1
Content-type: application/json

{
   "LexiconNames": [ "string" ],
   "OutputFormat": "string",
   "SampleRate": "string",
   "Text": "string",
   "TextType": "string",
   "VoiceId": "string"
}

As you can see in the preceding example, in addition to the required OutputFormat parameter, there are other required and optional parameters you can use to get your desired voice response. Take the optional LexiconNames parameter, for instance. Here, you can specify preset lexicons that customize the pronunciation of words. These lexicons must conform to the W3C’s Pronunciation Lexicon Specification (PLS). This is a great way to transform l33t speak into voice.
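For example, a minimal PLS lexicon that tells Polly to read “l33t” as “leet” might look something like this (adapted from the structure defined in the PLS specification):

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>l33t</grapheme>
    <alias>leet</alias>
  </lexeme>
</lexicon>

You upload the lexicon under a name of your choosing (Polly provides a lexicon management endpoint for this) and then reference that name in the LexiconNames array when requesting synthesis.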

With the optional SampleRate parameter, you can set the audio frequency in Hz to 8000, 16000, or 22050 for MP3 and Ogg Vorbis (the default is 22050). For PCM, you can specify 8000 or 16000 (the default is 16000). Another optional parameter is TextType, which defaults to text; set it to ssml if you are sending SSML-structured text in the required Text parameter, which is where you include the plain text or SSML to be synthesized.

Finally, with the required VoiceId parameter, you will need to specify which voice you would like used for the synthesis. There is an endpoint you can submit a GET request to in order to receive the latest list of voices, their respective IDs, and other information. Consult the documentation for further details, including up-to-date API information.
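Putting it together, a request for MP3 output using one of Polly’s English voices might look something like this (the voice and SSML text here are illustrative; check the voices endpoint for current IDs):

POST /v1/speech HTTP/1.1
Content-type: application/json

{
   "OutputFormat": "mp3",
   "Text": "<speak>Hello <break time='300ms'/> world!</speak>",
   "TextType": "ssml",
   "VoiceId": "Joanna"
}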

Microsoft Cognitive Services, Cortana, and More

Another company with a long history of voice-based offerings is Microsoft, which released its Speech Application Programming Interface (SAPI) with the initial rollout of Windows 95. SAPI included a basic form of TTS and STT that allowed developers to add voice into their Windows-based applications. This included features like voice dictation for word processors and screen readers for the visually impaired.

Fast-forward to 2014, when Microsoft publicly released Cortana, its digital personal assistant. Cortana is the culmination of years of research, development, and acquisitions, giving Microsoft an AI personality that competes with Apple, Google, and Amazon in the voice and AI space. But Microsoft faced the same problem as Apple: its digital assistant was only as smart as its developers made it. To make it smarter, Microsoft needed to open it up.

To expand the ecosystem further, in 2016 Microsoft opened up the APIs that make up Cortana via Cognitive Services. This cloud-based offering includes speech as well as other services around vision, language, knowledge, and search. Under speech and language alone you will find APIs for Translator, Bing Speech, Bing Spell Check, Speaker Recognition, LUIS, Linguistic Analysis, Text Analytics, and Custom Speech Service. Additionally, as an alternative to Alexa, Microsoft announced the release of Cortana Skills Kit, which allows third-party developers to essentially make Cortana smarter.

With the Cortana Skills Kit, developers create bots using the Microsoft Bot Framework, which can plug into LUIS for natural language understanding. The Microsoft Bot Framework has a pretty robust API that you can simply import into your C# or Node.js project. Here’s an example of what a basic endpoint looks like in a C# ApiController class:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using System.Web.Configuration;
using System.Web.Http;
// QnAMakerService and QnAMakerAttribute ship with the Bot Builder CognitiveServices
// package; the namespaces below are the ones assumed for this sketch.
using Microsoft.Bot.Builder.CognitiveServices.QnAMaker;
using Microsoft.Bot.Connector;

public class MyCortanaSkillController : ApiController
{
    public async Task<HttpResponseMessage> Post([FromBody]Activity activity)
    {
        // The connector lets us post our reply back to the channel that called us.
        var connector = new ConnectorClient(new Uri(activity.ServiceUrl));
        var reply = activity.CreateReply();

        // QnA Maker credentials pulled from Web.config (the key names are arbitrary).
        var qnaService = new QnAMakerService(new QnAMakerAttribute(
            WebConfigurationManager.AppSettings["MyAppKey"],
            WebConfigurationManager.AppSettings["MyAppId"]));

        // Send the user's utterance (or typed text) to QnA Maker for an answer.
        var result = await qnaService.QueryServiceAsync(activity.Text);

        // Speak the top answer if we got one; otherwise fall back to a reprompt.
        reply.Speak = (result.Answers != null && result.Answers.Count > 0)
                ? result.Answers[0].Answer
                : "Sorry, I didn't get that.";

        await connector.Conversations.ReplyToActivityAsync(reply);

        return Request.CreateResponse(HttpStatusCode.OK);
    }
}

In this example, we create a class called MyCortanaSkillController that inherits ApiController from the Web API framework. Then we create the endpoint method that accepts a POST request of type Activity. The activity object contains a ServiceUrl, which we use to create a connector for the callback, and we generate a reply object using the activity.CreateReply() method. Note the QnAMakerService object being referenced here. It’s not required, but we are using it in lieu of LUIS to show that we can hit essentially any NLP/NLU service we want when Cortana’s spoken text, or text input from another channel such as Slack or SMS, comes in on the activity.Text property. In other words, we could use Cortana with API.AI, Wit.ai, Amazon Lex, or even Watson if we wanted to.

Once we get the results back from the NLP/NLU service (in this case, QnA Maker, Microsoft’s domain-specific question-answering NLU service), we set the reply.Speak property to the response and send it back through the connector to Cortana or whichever channel the initial request came in from. While we won’t look at Cortana or the Bot Framework in great detail, we will explore similar logic in later chapters using Windows IoT Core.

Google Cloud Speech API

In 2016, Google announced the beta release of its Cloud Speech API, available on the Google Cloud Platform, and the results are impressive. With some of the most advanced neural networks and machine learning algorithms behind it, the API supports over 80 languages, advanced noise cancellation and signal processing, streaming recognition, and word hints for context-aware recognition.

The biggest caveat with the Cloud Speech API is that it’s only one side of the coin: it’s an STT/ASR API and does not support voice synthesis (TTS) at this time. Additionally, while audio under 60 minutes is free to process, anything over 60 minutes incurs a small charge. Lastly, it does not have NLP built in. For that, Google offers the separate Cloud Natural Language API, which takes text input and extracts information you can use for text analysis as well as understanding sentiment and intent.
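For reference, the Natural Language API follows the same JSON-over-REST pattern as the Speech API. A sentiment analysis request posted to the documents:analyzeSentiment endpoint might look something like this (the sample sentence is illustrative):

{
  "document": {
    "type": "PLAIN_TEXT",
    "content": "Say hello to Siri and Alexa for me."
  },
  "encodingType": "UTF8"
}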

Here’s an example of what a request to the “recognize” RESTful endpoint on the Google Cloud Speech API looks like:

{
  "config": {
      "encoding":"FLAC",
      "sampleRateHertz": 16000,
      "languageCode": "en-US",
      "profanityFilter": true
  },
  "audio": {
      "uri":"gs://cloud-speech-samples/speech/utterance.flac"
  }
}

The interesting thing to note here is the FLAC encoding type. Google recommends using FLAC (which stands for Free Lossless Audio Codec) because its lossless compression preserves audio quality and therefore does not compromise voice recognition. However, Google also accepts 16-bit linear PCM (LINEAR16), PCMU/mu-law (MULAW), Adaptive Multi-Rate Narrowband at 8,000 Hz (AMR), Adaptive Multi-Rate Wideband at 16,000 Hz (AMR_WB), Opus-encoded audio frames in an Ogg container (OGG_OPUS), and Speex Wideband at 16,000 Hz with header byte (SPEEX_WITH_HEADER_BYTE).

Once the request has been posted, the response might look something like this:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "say hello to Siri and Alexa for me",
          "confidence": 0.98534203
        }
      ]
    }
  ]
}

Based on the response, you can then run the “transcript” value through a text-to-speech engine and play the resulting audio through a speaker. In addition to posting audio files, Google also allows for streaming audio via the open source gRPC, which is a high-performance, universal remote procedure call (RPC) framework.

Other Notable Services

In September 2016, Google acquired API.AI, which was another startup in this growing field of NLP/NLU services. API.AI at the time had a community of over 60,000 developers using its tools and services to create a conversational experience for both voice and text. This gave Google an instant API with a loyal following to create skills (or what Google calls actions) for its Google Home and Google Assistant offerings.

In addition to the powerful NLP/NLU toolset, another key feature that attracted developers to API.AI in the first place was the built-in support for multiple domain-specific functions and services such as weather, math, time, calendar, and more. Additionally, built-in support for communications applications such as Facebook Messenger, Slack, Kik, and others made it extremely efficient to deploy chatbots to a broader audience.

Another great service is Wit.ai, which was acquired by Facebook and boasts over 65,000 developers. While there are some differences in built-in capabilities and overall experience (e.g., the number of built-in elements, domains, and whether or not you need an open source private solution), it really comes down to a matter of preference and your requirements. Features change daily, so if you decide to go down this route, do some research to see what the latest offerings are and give these services a test run to see what works for your use case.

In addition to Wit.ai and API.AI, you will want to check out IBM Watson and Watson Virtual Agent, as well as tools such as Jasper, PocketSphinx, Houndify, and Festival. You should also check out the latest offerings from Nuance and be on the lookout for startups such as Jibo. Jibo is an interesting offering in that it’s an actual physical robot that moves, blinks, and reacts physically to voice input and output.

While at the time of this writing Jibo isn’t publicly available, there are tools developers can download such as the Jibo SDK, which has Atom IDE integration, as well as a Jibo Simulator (shown in Figure 2-5), which is great for visualizing how your code would affect Jibo and how users can engage with the robot.

Figure 2-5. Screen grab of Jibo Simulator

Technical Architecture

Now that we are familiar with the landscape and have a good idea of what to look for in a voice interface service offering, let’s quickly highlight some of the architectural aspects of integrating with these services. Because architectures vary from service to service, we won’t go into great detail here, but we will cover some common conceptual and application architecture approaches to help you integrate a voice service with your home-brewed IoT device.

Conceptual Architecture

For a high-level understanding of how voice-enabled devices handle voice commands, we’ll turn to conceptual architecture. Figure 2-6 illustrates a basic architecture that conceptually visualizes how a command or utterance flows from the user through to the intent handler and back out to the user and any additional graphical displays.

Figure 2-6. High-level conceptual architecture for voice-enabled devices

Let’s break this diagram down a bit further. We begin with the user providing an utterance or verbal command that flows to the voice-enabled device by way of a microphone. The microphone sends the audio to the audio processor within the device, which then sends an audio stream to the NLP layer (which can live either on the device or in the cloud). The NLP layer parses the audio via speech-to-text (STT) and then maps the resulting text to an intent. As mentioned earlier, in some cases the STT component may be part of a separate service altogether. In either case, the intent is then routed to its appropriate intent handler for processing.

Inside the intent handlers can live all sorts of business logic for any number of domains. For instance, a weather intent handler can send a request to a weather service for weather data, then return the response in human-readable form, which the TTS engine can then convert to audio.
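As a rough sketch in C# (the weather service URL and JSON field names here are hypothetical), such a handler might look something like this:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

// A minimal sketch of a weather intent handler; the endpoint and fields are
// placeholders for whatever weather service you integrate with.
public class GetWeatherIntentHandler
{
    private static readonly HttpClient Http = new HttpClient();

    public async Task<string> HandleAsync(string city)
    {
        var json = await Http.GetStringAsync(
            "https://api.example.com/weather?city=" + Uri.EscapeDataString(city));
        var weather = JObject.Parse(json);

        // Build a human-readable sentence for the TTS engine to synthesize.
        return $"It's currently {weather["tempF"]} degrees and {weather["summary"]} in {city}.";
    }
}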

To put this all into a bit more context, let’s take a look at the Amazon Alexa version of this conceptual architecture diagram (Figure 2-7).

Figure 2-7. Conceptual architecture for Amazon Alexa

Figure 2-7 shows a similar flow to Figure 2-6, with the user utterance going to a voice-enabled device, in this case an Alexa-enabled device like the Amazon Echo or Dot. The audio stream is then pushed via HTTP streaming to the Alexa Voice Service on AWS for STT and NLP processing. There, the intent is determined and the request is routed to the appropriate skill service via a REST HTTP POST request in JSON format. This can be an internal Amazon service or a skill produced by a third-party developer who specifies the endpoint URL in the Amazon Developer Console, the details of which should be outlined in an application architecture.

Keep in mind that the conceptual architecture examples shown here are as basic as it gets. Based on your own product, you will want to add to or modify the architecture as you begin to think about the functions of your device. For example, you’ll first want to determine whether your device will use a screen, LEDs, additional protocols or services, and so on. Once you have a solid conceptual architecture in place for how you conceive your own device, then you can move on to designing a solid application architecture.

Application Architecture

You will most likely need multiple application architectures, but those architectures may vary depending on your device. However, at minimum, you will need to focus on what your hardware architecture, embedded software architecture, and web services architecture might look like. Figure 2-8 illustrates an application architecture example for web services that you can use as a foundation.

Figure 2-8. Web services architecture

The web services architecture shown here is a bit monolithic, with Content Management and a Graphical User Interface bundled in for brevity and for clarity around serving visual content, as with the Alexa mobile app cards. It might also be overkill if you are creating a simple “Hello, World” application, so think of it as something of a future-state architecture. For now, let’s break it down a bit further: we start with the web at the top layer, which is where all the traffic comes in and goes out. On the two pillars we have a REST HTTP Web API component, exchanging JSON and filtering requests into the Intent Routing Logic. This is where you determine which intent handler to route to when requests come in to your endpoint.

There will be multiple intent handlers, one for each of your intents. For example, if someone says, “What’s the weather in Miami?” that could be routed to the GetWeatherIntent handler. If someone says, “What’s the stock price for AMZN?” that would go to something like GetStockPriceIntent, and so on. All these components can access the Common Business Logic, which in turn accesses the Data Access Layer for data and files such as images and videos.
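A minimal sketch of what that routing logic might look like in C# follows (the handler registrations and fallback text are hypothetical):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// A bare-bones intent router: each intent name maps to a handler delegate
// that takes the slot values and returns the text to speak back.
public class IntentRouter
{
    private readonly Dictionary<string, Func<IDictionary<string, string>, Task<string>>> _handlers =
        new Dictionary<string, Func<IDictionary<string, string>, Task<string>>>();

    public void Register(string intentName, Func<IDictionary<string, string>, Task<string>> handler)
    {
        _handlers[intentName] = handler;
    }

    public Task<string> RouteAsync(string intentName, IDictionary<string, string> slots)
    {
        Func<IDictionary<string, string>, Task<string>> handler;
        return _handlers.TryGetValue(intentName, out handler)
            ? handler(slots)
            : Task.FromResult("Sorry, I can't help with that yet.");
    }
}

At startup you would register each handler, for example router.Register("GetWeatherIntent", slots => weatherHandler.HandleAsync(slots["City"])), and your Web API endpoint would simply call RouteAsync with the intent name and slots it extracted from the incoming JSON.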

The Content Management component provides access to images and content for the Graphical User Interface by way of Pub/Sub event handling or when requested directly via the web. This can be omitted altogether and files can be accessed directly via their URLs, but having a content manager or basic handler in place helps with things like security and analytics; and if you have a complete Content Management System in place, it can ease content deployment, since nondevelopers can contribute and upload content.

Figure 2-9 represents a basic hardware diagram using prototyping hardware such as a Raspberry Pi, a breadboard, and additional components. This is a great way to start rapidly prototyping and testing your product before going into full production. If you do want to go into full production and develop your own board with embedded circuits, sensors, and other components, you will need to design several core and mechanical schematics, along with drawings and documentation, so that the factory knows exactly how you want the boards produced.

Figure 2-9. Basic hardware architecture diagram for Raspberry Pi with AVS 

Figure 2-10 illustrates what one of those advanced schematics could potentially look like.

Figure 2-10. Arduino NG hardware schematic

The schematic just shown is rather complex, and even for seasoned engineers this level of work could take countless hours to fine-tune. As we continue our exploration throughout this book, we’ll stick to basic rapid prototyping diagrams and concepts to keep things simple.

Conclusion

This chapter briefly reviewed some of the service offerings out in the wild today, including Amazon Alexa and its ASK and AVS tools. We touched a bit on architecture and what you can expect down the road as things become more and more complicated. It’s important to keep things documented and well organized as you evolve your product. Next, we’ll start to piece together some rapid prototyping components such as the Raspberry Pi, a speaker, and a microphone. Then we’ll download some code, configure some settings, kick some tires, and finally take the Alexa Voice Service out for a spin!
