Speech recognition on a handheld

Voice is arguably as unique as our fingerprints. Voice is not only produced by the voice box, but is also shaped by the throat, tongue, lips, teeth, chest and head. Put that together with accents and dialects and you have quite a challenge understanding spoken language. Now not only can Windows on the PC comprehend 8 languages, but so can Windows on the mobile phone, or at least one anyway.

Recently the Live Search for Mobile team introduced voice input in their application for Windows Mobile. And it works remarkably well according to some early responses. Check out this quick demo by Microsoft’s speech-guru Rob Chambers.

Update: Interestingly enough, the speech recognition doesn’t actually occur on the mobile device itself. Instead, as explained on the Speech team blog, “the phone takes your speech input, sends it to a server, the server does it’s recognition magic, and sends the results back to the phone.”

One of the people behind the technology and implementation is Oliver Scholz, Program Manager in the Speech Components Group, who, not surprisingly, also worked on the speech recognition technology in Windows Vista. I had the opportunity to ask him some quick questions.

When and why was the decision made to build voice recognition into Live Search for Mobile?

Oliver: We decided to build voice recognition into Live Search for Mobile late in 2006, and started working on the project in earnest in February of 2007. We had received lots of feedback from users that search with a phone keyboard (10 key or Qwerty) was too hard. Voice was the natural alternative to this.

Is the technology based on the speech-recognition in Windows Vista?

O: Not really. The SR engine used here is the telephony engine. SAPI, the Speech API, is still used, but that’s all.

What were some of the challenges of building such a complex system for a mobile device?

O: The biggest challenge was in getting the grammars right. Grammars are the files that determine what the system is listening for. We’re listening for all businesses and categories of business listings in the US, as well as city, state, zip, and full addresses. That’s quite a lot of stuff.

The other hard part was building a user experience that made sense and satisfied user needs.
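To make the idea of a grammar a little more concrete, here is a toy sketch in Python. It is not how the Live Search for Mobile grammars are built; a real speech grammar constrains what the recogniser listens for in the audio itself, whereas this only checks already-transcribed text against short phrase lists. The names `CATEGORIES`, `CITIES` and `parse_query`, and the sample entries, are made up for illustration.

```python
# Toy illustration only: a real speech grammar constrains the recogniser's
# search over audio; here we just check transcribed text against phrase lists.
CATEGORIES = {"pizza", "coffee", "gas station"}  # hypothetical sample entries
CITIES = {"seattle", "redmond", "portland"}      # hypothetical sample entries


def parse_query(utterance: str):
    """Return (category, city) if the text fits '<category> [in <city>]', else None."""
    # Normalise case and collapse runs of whitespace.
    text = " ".join(utterance.lower().split())
    # Split on the first " in " to separate the category from an optional city.
    category, sep, city = text.partition(" in ")
    if category not in CATEGORIES:
        return None
    if not sep:
        return (category, None)
    if city not in CITIES:
        return None
    return (category, city)
```

Anything outside the lists is simply rejected, which hints at why covering every business, category and address in the US is, as Oliver puts it, quite a lot of stuff.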

Are there any major differences in quality compared to the PC?

O: The quality for both is high, but the products are different enough for a comparison to not be super useful. Windows Vista gets higher fidelity input from a microphone than Live Search for Mobile does on a Windows Mobile phone. Windows Vista also listens for a whole lot more than Live Search for Mobile.

Windows Vista accepts dictation, correction, and full commanding. Live Search for Mobile is limited to business names, categories, city, state, zip and full addresses.

Are there plans to enable it to work with other languages?

O: We don’t have any firm plans for that yet, but we are looking at enabling it for other markets, and ultimately other languages.

What’s the next step? More features? Better accuracy?

O: We are constantly working on improving accuracy. We do this by looking at what users say, and what they ultimately accept as search results. We want to make the distance between these two things as short as possible. We are also working on additional features, additional languages, and improving the current feature set. There’s always something you can do better.

I find it interesting that voice recognition is built into an application and not the mobile OS. Will Win Mobile 7 provide voice support on a platform level?

O: Great question! In many ways it is easier to build voice recognition into just one application, because the scope of that application is limited. We are definitely looking into providing voice support at a platform level. It’s too early to say what that might look like, but just like you, we thought it was clearly something we should think about.

Whilst the technology is not there yet for dictation or system-wide speech recognition, imagine how much easier and safer a mobile phone could be hands-free when you’re driving. Late to a meeting? Dictate a message, and SMS it to someone without even touching a button.

15 insightful thoughts

  1. The only problem with that is you usually (well I do) have lots of background noise going on when using a phone. Say in the city, shopping centre, bus, train, car etc.

    So it would need to be very good at detecting background noise and what you are actually saying.

    Though that probably would be corrected via better quality microphones or by using array mics.

  2. @Insomniac
    Isn’t background noise generally eliminated at the microphone level on a cell phone? In which case, problem solved.

  3. @Steve,
    you would think so.
    I’m sure there are phones out there that do it.

    But I’m just going by the experience I’ve had using voice commands on my Nokia mobiles in the past.

    I have a Dopod 810x now and it works better for voice commands. But it still seems to pick up someone in the bus talking and will launch a command and ignore my command.

  4. The application seems a little “light” to have a built in speech recognition feature.

    Is the recognition done locally in the app, or remotely (at MS’s Live servers), or both (the client app reduces the voice stream to just the essential sound components, which then get processed further on the Live servers)?
