Interacting With Computers by Voice: Automatic Speech Recognition and Synthesis

DOUGLAS O'SHAUGHNESSY, SENIOR MEMBER, IEEE

Invited Paper

This paper examines how people communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis [or text-to-speech (TTS)] performs the reverse task. ASR has developed largely from speech coding theory, while simulating certain spectral analyses performed by the ear. Typically, a Fourier transform is employed, but warped to follow the auditory Bark scale, and the spectral representation is simplified by decorrelation into cepstral coefficients. Current ASR provides good accuracy and performance on limited practical tasks, but exploits only the most rudimentary knowledge about human production and perception phenomena. The popular mathematical model called the hidden Markov model (HMM) is examined; first-order HMMs are efficient but ignore long-range correlations in actual speech. Common language models use a time window of three successive words in their syntactic–semantic analysis. Speech synthesis is the automatic generation of a speech waveform, typically from an input text. As with ASR, TTS starts from a database of information previously established by analysis of much training data, both speech and text. Previously analyzed speech is stored in small units in the database, for concatenation in the proper sequence at runtime. TTS systems first perform text processing, including “letter-to-sound” conversion, to generate the phonetic transcription. Intonation must be properly specified to approximate the naturalness of human speech. Modern synthesizers using large databases of stored spectral patterns or waveforms output highly intelligible synthetic speech, but naturalness remains to be improved.

Keywords—Continuous speech recognition, distance measures, hidden Markov models (HMMs), human–computer dialogues, language models (LMs), linear predictive coding (LPC), spectral analysis, speech synthesis, text-to-speech (TTS).

Manuscript received November 7, 2002; revised March 13, 2003. The author is with INRS—Telecommunications, Montreal, QC, Canada H5A 1K6 (e-mail: [email protected]). Digital Object Identifier 10.1109/JPROC.2003.817117.
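For readers who want a concrete picture of the spectral front end summarized in the abstract, the following sketch (not taken from this paper) computes cepstral features in the manner described: short-time Fourier analysis, pooling of spectral energy into bands spaced on a Bark-like perceptual scale, log compression, and decorrelation by a discrete cosine transform. The sampling rate, frame sizes, number of bands, the Traunmüller approximation of the Bark scale, and the triangular filter shapes are common textbook choices assumed here for illustration, not details of any particular recognizer discussed below.

    # Illustrative sketch of a Bark-scaled cepstral front end (numpy only).
    import numpy as np

    def hz_to_bark(f):
        # Traunmueller's approximation of the Bark scale (assumed here).
        return 26.81 * f / (1960.0 + f) - 0.53

    def bark_filterbank(n_fft, sample_rate, n_bands):
        # Triangular filters spaced uniformly on the Bark scale.
        freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
        bark = hz_to_bark(freqs)
        edges = np.linspace(bark[0], bark[-1], n_bands + 2)
        bank = np.zeros((n_bands, freqs.size))
        for i in range(n_bands):
            lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
            rising = (bark - lo) / (mid - lo)
            falling = (hi - bark) / (hi - mid)
            bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
        return bank

    def cepstral_features(signal, sample_rate=16000, frame_len=400, hop=160,
                          n_fft=512, n_bands=18, n_ceps=13):
        # Short-time analysis: Hamming window, power spectrum, Bark-band
        # energies, log compression, then a DCT-II to decorrelate the bands
        # into cepstral coefficients.
        bank = bark_filterbank(n_fft, sample_rate, n_bands)
        window = np.hamming(frame_len)
        k = np.arange(n_bands)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_bands)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
            log_energy = np.log(bank @ power + 1e-10)
            frames.append(dct @ log_energy)
        return np.array(frames)   # one cepstral vector per 10-ms frame

Each row of the returned array is a frame-level cepstral vector of the kind typically used as an observation in the HMM framework examined later; the log compression and DCT decorrelate the band energies, which is what makes such features convenient for the simple statistical models common in ASR.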
I. INTRODUCTION

People interact with their environment in many ways. We examine the means they use to communicate with computer-based machines. Transfer of information between human and machine is normally accomplished via one's senses. As humans, we receive information through many modalities: sight, audition, smell, and touch. To communicate with our environment, we send out signals or information visually, auditorily, and through gestures. The primary means of communication are visual and auditory. Human–computer interactions often use a mouse and keyboard as machine input, and a computer screen or printer as output.

Speech, however, has always had a high priority in human communication, having developed long before writing. In terms of communication bandwidth, speech pales before images by any quantitative measure; i.e., one can read text and understand images on a two-dimensional (2-D) computer screen much more quickly than when listening to a [one-dimensional (1-D)] speech signal. However, most people can speak more quickly than they can type, and are much more comfortable speaking than typing. When multimedia is available [181], combining modalities can enhance accuracy; e.g., a visual talking face (animation) is more pleasant and intelligible than a synthetic voice alone [154]. Similarly, automatic speech recognition (ASR) is enhanced if an image of the speaker's face is available to aid the recognition [261]. This paper deals with human–c