Microsoft's Speech Technology Research
Published: Jun 23, 2007
Speech technology research is underway around the world. The R&D labs of universities and corporations have hundreds of researchers developing new products.
Microsoft Corporation has two research groups focused on speech, based in Redmond, Washington and Beijing, China. These groups create state-of-the-art spoken language components and focus on improving speech recognition and text-to-speech in areas like accuracy and grammar, with special emphasis on telephony applications and call centers. The speech projects also focus on improving natural language recognition, language modeling and the aspects of the technology that drive the usefulness of speech, such as better user interfaces on any device. Projects that have been transferred into products or are in product development include:
- Multimodal Interactive Pad (MiPad) - a multimodal interactive notepad prototype
- Speech Recognition (Whisper)
- Text-to-Speech (Whistler)
- Speaker Identification (Whisper ID)
- Speech Application Programming Interface (SAPI) - an interface and developer toolkit
- Speech Application Language Tags (SALT) - a markup language for the multimodal Web
Multimodal user interfaces are a critical area of research for the speech group. Research also continues in projects like:
- Noise Robustness - to improve system accuracy when background noise is present
- Acoustic Modeling - to solve how to model phones and acoustic variations
- Language Model - to predict which words are likely to be spoken so that the recognizer makes the best choice, independent of acoustics
- Automatic Grammar Induction - to understand how to create grammar rules to ease the development of spoken language systems
- Multimodal Conversational User Interface Personalized Language Models - for improved accuracy of speech-enabled agents
Microsoft’s focus has been on incorporating speech into telephony applications such as phones and call centers, as well as personal interaction with a user. A speech-enabled agent accepts speech input, understands what the person is asking and then acts on it.
Audio Information Management and Extraction (AIME) moves beyond core speech technologies. It seeks to make computers intelligent about speech and audio recordings, with the aim of producing a search engine that can mine recorded conversations. AIME will be able to search through content from voicemails, presentations, lectures, meetings, teleconferences and broadcast news programs, and gain a better understanding of the structure of conversations.
When two different sentences sound the same, language models hone the speech recognizer’s ability to figure out which is the right choice to make.
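To illustrate the idea, here is a minimal sketch of a language model choosing between two hypotheses that sound alike. It assumes a toy bigram model with add-one smoothing trained on a handful of made-up sentences. The corpus, the function names and the smoothing choice are all illustrative assumptions, not a description of Microsoft's recognizer.

```python
import math
from collections import Counter

# Toy training text: a hypothetical stand-in for a real corpus.
CORPUS = [
    "it is hard to recognize speech",
    "we want to recognize speech in noise",
    "the storm may wreck a nice beach",
    "speech is hard to recognize",
]

def train_bigram(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        # Add-one smoothing keeps unseen word pairs from scoring zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total

def pick_best(hypotheses, model):
    """Return the hypothesis the language model considers most likely."""
    return max(hypotheses, key=lambda h: log_prob(h, model))

model = train_bigram(CORPUS)
candidates = [
    "it is hard to wreck a nice beach",
    "it is hard to recognize speech",
]
print(pick_best(candidates, model))
```

In a real recognizer the language model score would be combined with an acoustic score rather than used alone. The sketch isolates the language model's contribution for the case where the acoustics cannot separate the candidates.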
Microsoft Research has language modeling projects in:
- Language Model Adaptation - A general-domain statistical language model can be adapted to a new domain/user despite having limited amounts of sample data from the new domain/user.
- Incorporation of Syntactic Constraints in a Statistical Language Model - Reduces the word error rate, or improves speech and language understanding.
- Speech Utterance Classification - Assigns a given speech utterance to one of a limited set of classes.
- Language Modeling for Other Applications - Facilitates handwriting recognition or spelling correction by helping to resolve ambiguous input.
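The adaptation idea in the first item above can be sketched with linear interpolation, one common way to combine a general model with a small amount of in-domain data. The article does not say which method Microsoft's researchers use, so the technique, the sample texts, the function names and the weight `lam` below are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample texts: a larger general sample and a small domain sample.
GENERAL_TEXT = "please call me later and send the report to the team please check the report"
DOMAIN_TEXT = "check my balance and transfer the balance"

def unigram_probs(text, vocab):
    """Add-one smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def interpolate(general, domain, lam):
    """Mix the two models: lam weights the domain model, 1 - lam the general one."""
    return {w: lam * domain[w] + (1 - lam) * general[w] for w in general}

vocab = set(GENERAL_TEXT.split()) | set(DOMAIN_TEXT.split())
general = unigram_probs(GENERAL_TEXT, vocab)
domain = unigram_probs(DOMAIN_TEXT, vocab)
adapted = interpolate(general, domain, lam=0.5)

# Domain-specific words gain probability relative to the general model.
print(adapted["balance"] > general["balance"])
```

The weight trades off the two sources. With very little domain data, a smaller `lam` leans on the general model. As more domain data arrives, `lam` can be raised.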
Microsoft covers nearly all aspects of speech technologies. Its long-term goal, however, is to link together all computing environments and applications and to use speech to enhance the efficiency and user experience of those applications.
Source: Speech Technology Magazine

