Interacting With Computers by Voice: Automatic Speech Recognition and Synthesis

DOUGLAS O'SHAUGHNESSY, SENIOR MEMBER, IEEE

Invited Paper

Abstract—This paper examines how people communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis [or text-to-speech (TTS)] performs the reverse task. ASR has largely developed based on speech coding theory, while simulating certain spectral analyses performed by the ear. Typically, a Fourier transform is employed, but one that follows the auditory Bark scale and simplifies the spectral representation through decorrelation into cepstral coefficients. Current ASR provides good accuracy and performance on limited practical tasks, but exploits only the most rudimentary knowledge about human production and perception phenomena. The popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are efficient but ignore long-range correlations in actual speech. Common language models use a time window of three successive words in their syntactic–semantic analysis.

Speech synthesis is the automatic generation of a speech waveform, typically from an input text. As with ASR, TTS starts from a database of information previously established by analysis of much training data, both speech and text. Previously analyzed speech is stored in small units in the database, for concatenation in the proper sequence at runtime. TTS systems first perform text processing, including "letter-to-sound" conversion, to generate the phonetic transcription. Intonation must be properly specified to approximate the naturalness of human speech. Modern synthesizers using large databases of stored spectral patterns or waveforms output highly intelligible synthetic speech, but naturalness remains to be improved.
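The three-word language model mentioned in the abstract can be illustrated with a maximum-likelihood trigram estimate, P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2). The sketch below is a minimal toy illustration, not the modeling pipeline described in the paper; the function name and corpus are invented, and a practical system would add smoothing for unseen trigrams.

```python
from collections import Counter

def trigram_probs(tokens):
    """Maximum-likelihood trigram model: P(w3 | w1, w2).

    Note: the bigram counts include the corpus-final bigram, which
    starts no trigram; a production model would count prefixes only
    and smooth unseen events.
    """
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2, w3): c / bi[(w1, w2)]
            for (w1, w2, w3), c in tri.items()}

tokens = "the cat sat on the mat the cat sat on the rug".split()
p = trigram_probs(tokens)
# "the cat" is always followed by "sat" in this toy corpus,
# while "on the" splits evenly between "mat" and "rug".
```

Such a model scores candidate word sequences during recognition, biasing the search toward syntactically and semantically plausible three-word windows.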
Keywords—Continuous speech recognition, distance measures, hidden Markov models (HMMs), human–computer dialogues, language models (LMs), linear predictive coding (LPC), spectral analysis, speech synthesis, text-to-speech (TTS).

Manuscript received November 7, 2002; revised March 13, 2003. The author is with INRS—Telecommunications, Montreal, QC, Canada H5A 1K6 (e-mail: [email protected]). Digital Object Identifier 10.1109/JPROC.2003.817117

I. INTRODUCTION

People interact with their environment in many ways. We examine the means they use to communicate with computer-based machines. Transfer of information between human and machine is normally accomplished via one's senses. As humans, we receive information through many modalities: sight, audition, smell, and touch. To communicate with our environment, we send out signals or information visually, auditorily, and through gestures. The primary means of communication are visual and auditory. Human–computer interactions often use a mouse and keyboard as machine input, and a computer screen or printer as output.

Speech, however, has always had a high priority in human communication, having developed long before writing. In terms of communication bandwidth, speech pales before images by any quantitative measure; i.e., one can read text and understand images much more quickly on a two-dimensional (2-D) computer screen than when listening to a [one-dimensional (1-D)] speech signal. However, most people can speak more quickly than they can type, and are much more comfortable speaking than typing. When multimedia is available [181], combining modalities can enhance accuracy; e.g., a visual talking face (animation) is more pleasant and intelligible than a synthetic voice alone [154]. Similarly, automatic speech recognition (ASR) is enhanced if an image of the speaker's face is available to aid the recognition [261]. This paper deals with human–c