Dear aunt. I sat down, or rather slept like a baby, whilst Oliver Scholz, a Microsoft employee working on the speech recognition experience in Windows Vista, spent days and nights responding to the questions I sent him earlier about speech recognition and Windows Vista. I didn’t expect him to provide such an in-depth and extensive behind-the-scenes look at what he does and the technology he’s working on, but he did, so you all better read it.
My commentary is surrounded by [square brackets] in italics.
Thanks to Nexus for providing this image.
The opinions expressed by Oliver Scholz herein are his own personal opinions and do not represent his employer’s views in any way. However, if you are his employer, you ought to give him a raise, since he’s obviously doing a better job than the Microsoft Partners marketing people.
What do you do at Microsoft?
I am a program manager on the Speech User Experience for Windows Vista. What does that mean? [I was just about to ask, seriously.] I work with a team of software developers and testers to design, implement and test the speech recognition user experience for Windows Vista.
We’ve worked on Windows Vista for a long time [You kidding?]. Over the course of the project cycle, I worked on different things. I was part of the team that designed the Speech User Experience you see in Windows Vista today. We designed this experience based on user feedback from real speech recognition users. I then worked with a team of developers and testers to manage the different features through the implementation and test cycle. This means that I represent customers and their needs throughout the project cycle, making sure that what we implement will meet our users’ and customers’ needs.
Where do you work, Redmond? And how many people do you work with?
Yes, I work in Redmond [Damn, I bet on Cupertino]. I work with a number of teams on speech recognition. We don’t generally talk about how many people or even teams we work with. Not because we are secretive [Liar], but because it’s really hard to measure. Microsoft is a very dynamic organization, and throughout a project cycle, people might focus on different aspects of a technology problem. For example, parts of the speech organization might work on Windows Vista, while others focus on speech integration into Exchange Server. Also, to make Speech Recognition work on Windows Vista, many things have to work together. Not all of them are related only to speech recognition. For example: Windows Vista Speech Recognition uses MSAA (Microsoft Active Accessibility) and Windows UI Automation to enable commanding of applications. Therefore, developers in Internet Explorer make sure that they enable IE for MSAA and WUIA. In a very real way, these people work on Speech Recognition. [That could mean the Solitaire people are part of the Speech Recognition team too!]
I can say this though: We have different teams working on different aspects of Windows Vista Speech Recognition. I am part of the User Experience team that worries about how users want to interact with Windows using speech, or what users want to say to their computer. We work with a team that creates SAPI, the Speech API that the Speech User Experience, as well as 3rd party developers, use to write speech applications. We also work with the team behind the speech recognition engine, which does the actual work of recognizing your voice.
What sort of education/experience did you have before joining Microsoft? Why did you choose speech recognition?
My background is in economics and linguistics. I used to own a translation agency. When I started at Microsoft, I worked on localizing Microsoft Project into German (my native language). Localization is the process of “translating” an application from one language into another. It’s much more than translation though, because you have to make sure that the resulting application is still applicable for and useful in the culture and business environment of your target language/culture. I also worked on Microsoft CRM (Customer Relationship Management) and Windows Media Internet Services before I joined the Speech User Experience team in early 2004.
Why Speech Recognition? Multiple reasons. 🙂 I am super passionate about making computers easier to use [Long Zheng, complicating the world one step at a time!]. I think Speech Recognition has a huge opportunity to be at the center of that. Graphical user interfaces are awesome, fun and engaging (if done right). However, each graphical user interface places a template in front of the user that the user has to work through. Think of when you want to send an email. You first have to break down the task in your head. Let’s say you want to send an email to your mother [or your aunt]. You know you have to start Microsoft Outlook (or another application, or even the web browser). You then have to break that down to know that you start an application by going to the Start Menu, or the Quick Launch Toolbar, or your desktop shortcut. Then you have to know to start a new email message, fill out the right forms, and hit send. And before you could do all of that, you had to set up your email program with the right servers, etc., or create an email account at your favorite web email provider. The graphical user interface makes this reasonably easy, but there is still a lot to absorb that new users often have problems with.
Speech has the opportunity to let you express this in natural language: “Send an email to my mother”. Now, this isn’t possible today, but it will be possible in the not so distant future. You can make computers both easier and more efficient to use this way. Today, you can already say things like “Start Outlook”, “How do I install a printer”, “Switch to Internet Explorer”, etc. Also, it’s a lot easier to dictate text into your computer than it is to learn to type. Sure, you and I may be able to type fast, but we had to learn. Kids that go to school today shouldn’t have to learn to type. By the time they enter the work force, speech recognition will be ubiquitous. In fact, many school districts in the US already teach speech recognition today in anticipation of it being available to everyone when the current generation of kids graduates.
I believe we have a historical opportunity before us to make a huge difference in how people think of interacting with their computers. When we all speak with our homes in 20+ years (turn off the lights, turn the television to CBS, etc.) the work we are doing today to speech enable Windows Vista will be at the root of all that functionality. [And you will be in Wikipedia!]
Speech recognition as a technology is a pretty big subject, which area do you focus on?
The user experience part of it, as I already mentioned above [I’m sorry, I fell asleep]. For now, this means Windows Vista. However, I will continue to work on speech recognition for the next release of Windows. For that release we are going to focus on making speech work better in more applications, and we will go deeper in the area of commanding. We’ll focus on taking advantage of speech recognition’s potential to make speech a more efficient way to interact with your computer than the mouse and keyboard are today. For example, wouldn’t it be cool if you could say “Send an email to my mother about lunch next week”, and the computer would open your email program, fill in your mother’s email address, fill in the subject line with “Lunch next week” and put the cursor in the email body so that you could just start to dictate? We’ll have to see if our users would like that, but if they do, then I think we’ll build it. 🙂
What is the most valuable contribution you’ve made to make speech recognition in Vista better?
That’s nearly impossible to say. A program manager does a lot of things [and gets paid lots of things too]. We come up with cool ideas, we take our colleagues’ cool ideas and help turn them into great features, we help keep the team structured to make sure that the organization is actually set up to succeed, and most importantly, we are the voice of the customer and user. Some of us even write a little code now and then. It’s nearly impossible to judge what the most important contribution was.
One particular feature I worked on for Windows Speech Recognition that I think will make a very big difference to users is the Speech Recognition tutorial. I worked with a number of people on this. Like everything we do, this feature was a team effort. I think the Speech Recognition tutorial will be very valuable to users, because it lets them experience and learn how to use speech recognition, and it does this in a safe environment where users don’t need to worry that anything unintended happens. [I give the Speech Recognition tutorial 3 thumbs up! It’s really a great concept.]
Beyond this I worked on the commanding feature set. I am really happy about the “Say what you see” and “Click what you see” concepts. Along with the “Show Numbers” command, which I also worked on, these three concepts are all you need to successfully command your computer. It’s actually very simple and intuitive. If you then learn and understand how to dictate (speak in longer sentences, say punctuation and line and paragraph breaks), you are set to successfully use speech recognition. This is hugely easier than any speech recognition solution that Microsoft ever released before. As I mentioned above, I think ease of use is incredibly important. These four concepts are hopefully easy enough for people to learn, with the help of the tutorial.
Since Vista Beta 2, has there been much improvement to the speech recognition engine? Or even between every public build?
The whole user experience part of Windows Vista Speech Recognition has had many improvements since Beta 2. These improvements are on every level, from the speech recognition engine to the user interface. The control panel has been cleaned up a little, the flow of certain scenarios has improved, and we’ve even added some commands, like the Move Mouse command, which moves the mouse pointer to a specific element on the screen.
The most noteworthy improvements are in the audio system on which Speech Recognition depends. You can now flawlessly dictate “Dear aunt, let’s set so double the killer delete select all”. 🙂 For those of you who saw the video about the Speech Recognition demo gone bad, rest assured that those problems have been fixed. The underlying Windows audio system had a problem that caused all microphone audio input, including that going to the Speech Recognition user experience, to be too loud. Audio input that’s too loud has the same effect as turning your stereo up too far, or screaming at the top of your lungs: the audio gets distorted and clipped. When audio is distorted and clipped, speech recognition can’t do a good job understanding what the user said. In the demo video, you could even see how the audio meter in the Speech Recognition user interface consistently went up into the red whenever Shannen (the guy who did the demo) said anything. From our perspective, it was a miracle that the commanding part he did first (which wasn’t shown in the video at all) worked as well as it did.
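[A rough illustration of the clipping described above, written in Python. This is not Microsoft's audio code. The function names, the sample range of -1.0 to 1.0, and the gain values are all assumptions made for the example. It only shows why input that is too loud loses information: every sample beyond the limit is flattened to the same value.]

```python
import math


def sine_wave(gain, n=100):
    """One cycle of a sine wave with n samples, scaled by a hypothetical input gain."""
    return [gain * math.sin(2 * math.pi * i / n) for i in range(n)]


def clip(samples, limit=1.0):
    """Hard-limit each sample to the range [-limit, limit], as an overloaded input would."""
    return [max(-limit, min(limit, s)) for s in samples]


def clipped_fraction(samples, limit=1.0):
    """Fraction of samples sitting at the limit: a crude 'meter in the red' check."""
    if not samples:
        return 0.0
    at_limit = sum(1 for s in samples if abs(s) >= limit)
    return at_limit / len(samples)


normal = clip(sine_wave(gain=0.5))    # comfortable level, nothing is flattened
too_loud = clip(sine_wave(gain=4.0))  # far too loud, most of the wave is flattened

print(clipped_fraction(normal))
print(clipped_fraction(too_loud))
```

[With the quiet signal no samples reach the limit. With the loud one most of the waveform is flattened, so the shape the recognizer needs is gone.]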
Anyway, that’s water under the bridge, and the RC1 (release candidate 1) release of Windows Vista shouldn’t have any of these issues. Even past the RC1 release, on our road to RTM (release to manufacturing), we have made some significant performance improvements to the Speech Recognition user experience.
We are all looking forward to Windows Vista being available to the public. We can’t wait to get your feedback. Feel free to send mail to [email protected] with your feedback about Windows Vista Speech Recognition.
What is the “goal” for speech recognition in Vista? What level of accuracy are you guys aiming for when it comes to shipping Vista?
Let me answer these in reverse order. [FINE.] Speech Recognition accuracy for Vista should be between 95 and 99.x%. Accuracy is usually measured in terms of dictation accuracy. Commanding accuracy is usually not measured, because it’s almost always 99% or higher. Even if commanding accuracy isn’t 100%, the user is still in control when using Windows Vista Speech Recognition. Let’s say you are working in an application that has a View menu and a New button. If you say “new” to click the New button, but Windows Vista Speech Recognition doesn’t understand you completely, it will disambiguate between New and View on the application user interface, by placing numbered buttons above the View menu and the New button. Or, if Windows Vista Speech Recognition didn’t understand you at all, it will say “What was that?” What was the accuracy in this case? Also, why did the system not understand what you said? Was it because you mumbled, or because there was background noise?
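[The New/View behaviour above can be sketched as a small decision rule in Python. This is only a guess at the shape of the logic, not how Windows Vista Speech Recognition is implemented. The confidence scores, the thresholds, and the function name are hypothetical.]

```python
def resolve_command(scores, min_score=0.5, margin=0.15):
    """Decide what to do with a spoken command.

    scores maps each on-screen label to a hypothetical recognizer
    confidence between 0 and 1. Returns one of:
      ("click", label)             one clear winner
      ("show_numbers", {n: label}) several close candidates, numbered
      ("ask", "What was that?")    nothing plausible was heard
    """
    plausible = {label: s for label, s in scores.items() if s >= min_score}
    if not plausible:
        return ("ask", "What was that?")

    best = max(plausible.values())
    # Every candidate within `margin` of the best score counts as ambiguous.
    close = sorted(label for label, s in plausible.items() if best - s <= margin)
    if len(close) == 1:
        return ("click", close[0])
    return ("show_numbers", {n: label for n, label in enumerate(close, start=1)})


print(resolve_command({"New": 0.90, "View": 0.30}))
print(resolve_command({"New": 0.62, "View": 0.58}))
print(resolve_command({"New": 0.20, "View": 0.10}))
```

[The three calls cover the three outcomes in the answer: a clean click, numbered buttons over New and View, and “What was that?”. None of them yields a tidy accuracy percentage, which is the point being made.]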
At any rate, this explains why accuracy is usually measured in terms of dictation accuracy. In dictation, you can usually measure what percentage of dictated text the system understood correctly. As I said, that accuracy should be 95 to 99.x%.
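[The interview doesn't say exactly how the percentage is calculated. A common way to score dictation is to count word-level mistakes (substitutions, insertions, deletions) against what was actually said. The Python sketch below assumes that approach. The function names and the sample sentences are made up for illustration.]

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the recognized text into the reference text."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the output
                            curr[j - 1] + 1,      # extra word in the output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def dictation_accuracy(reference, hypothesis):
    """Share of the reference words that came out right, between 0.0 and 1.0."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference text must contain at least one word")
    errors = word_edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / ref_len)


said = "dear mom the printer is not working"
heard = "dear aunt the printer is not working"
print(f"{dictation_accuracy(said, heard):.1%}")
```

[One wrong word out of seven already drops this sentence well below the 95% mark, which gives a feel for how few mistakes a 95 to 99.x% system is allowed to make.]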
Why the relatively large range? Speech Recognition accuracy always depends on the subject matter that is dictated. When you dictate general subject matter text, accuracy will be very high, and it will get better as you use Speech Recognition. Every time you correct dictated text by voice, the Speech Recognition system takes note of the correction and is unlikely to keep making the same mistake.
For other subject matters, such as legal or medical dictation, accuracy will likely be lower than 99.x%. That accuracy will also get better over time. In addition, Windows Vista Speech Recognition supports third party language models that can be added to help Speech Recognition better understand specialized vocabulary. For example, you could add a legal language model that will dramatically increase accuracy of legal terminology.
Now, what is the goal for Windows Vista Speech Recognition? We have a number of goals. First, we target users with Repetitive Stress Injury (RSI), a condition that causes users to lose some or all of their ability to use the mouse and keyboard. We also target speech recognition enthusiasts. We target both of these user groups, because we feel that whatever works for them will work for any other user, too. People who suffer from RSI are first and foremost normal users, just like everybody else. They want to browse the Internet, send email, work with word processors, and do their jobs on a Windows computer. They are no different than anybody else. In fact, I have RSI myself, albeit a still relatively mild case.
People who suffer from RSI and speech recognition enthusiasts are also very vocal. They give us lots of feedback about our feature set. They tell us what works great, what doesn’t, and what additional functionality they’d like to see.
More than anything else, we depend on feedback from users. We are nothing without our users. We realize that Speech Recognition, although very useful in Windows Vista, still has a lot of room for improvement. We want to work closely with our customers to make improvements to the system. For that we need feedback. I mentioned [email protected]. Users send us lots of email to tell us what they think. We appreciate every single mail we get. We read every piece of feedback, and respond to most. So, the goal of adding Speech Recognition to Windows Vista is to make people’s lives better, and to get it in front of as many people as possible, to get as much useful feedback as possible. Then we’ll work hard to make the solution even better for future releases of Windows.
What about beyond Vista? What new goals or problems are you guys trying to tackle?
I already touched on a few ideas we are toying around with. There are others. 🙂 [Phew. Otherwise this section would have really sucked.]
One big thing we are working on is macro support. Windows Vista will support a lot of Speech Recognition commands, but you cannot create your own commands yet. We have a solution to add that support for users. We don’t know yet how and when we are going to make that available, but we are working out the details of that.
We also want to make the system easier to configure and personalize. For example, some users may want to turn off everything but commanding, others may want to turn off everything but dictation. Other users may want to tweak how commanding works. For example, you can now add a registry key that will cause certain commands to be confirmed (this looks like the OK box that shows up over a button when you say Show Numbers, and then a number). While this feature is available through the registry now, we haven’t had enough time to expose it through the user interface. We have a design that will make changing this option, as well as many others I haven’t mentioned here, very easy and intuitive.
Beyond that, we want to take Windows Vista Speech Recognition into applications where it doesn’t work very well today, such as Microsoft Excel and PowerPoint. Dictating into these applications could work better. We are hoping that this will be addressed by Office 2007 SP1, but we are still working on the details here.
Then there are the obvious next steps, such as deeper commanding support, multi language speech recognition for bilingual users, and some other, very cool things. 🙂
That was longer than most of my major assignments. I guess writing about “Honda’s brand media strategy” isn’t as interesting as Windows Vista’s Speech Recognition technology.
I want to thank Oliver Scholz for taking time out of his busy schedule to give me and everyone else who might read this a very helpful insight into Microsoft and the Speech Recognition technology. I look forward to using Speech Recognition in Windows Vista.
No aunts or killers were harmed in the making of this interview.