Speech recognition

Speech recognition, a subfield of natural language understanding, aims to convert an acoustic signal to the words that it represents (either to be transcribed or used as an input for further processing).

Basic Process

The aim of speech recognition can be stated as follows: given an acoustic signal (a sequence of acoustic samples of human speech), find the sequence of words that most probably corresponds to that signal. The process can be broken down into two main parts: acoustic processing and linguistic decoding.
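
In probabilistic terms this is often stated as a noisy-channel search; the notation below is the standard formulation rather than anything specific to this article:

    \hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)

where A is the acoustic signal, W ranges over candidate word sequences, P(A | W) is supplied by the acoustic model, and P(W) by the language model.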

  • Acoustic processing. (Most of this front-end design draws on signal processing.)
    • The speech sound must be picked up by some kind of device (a microphone), which converts the sound into an electrical signal.
    • The resulting signal must be sampled, or converted from a continuous signal into a discrete one that can be processed digitally.
    • The system must extract a set of acoustic measurements (features) from the sampled signal.
  • Linguistic decoding.
    • The measurements found from acoustic processing are used to search for the most likely candidate word sequence. The constraints of the recognizer's language model and acoustic model guide the search. This search through the hypothesis space is typically done with algorithms based on the Viterbi algorithm (a minimal decoding sketch follows this list).
    • The output, or most likely candidate words, is transcribed or used as input for some other kind of processing.
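
As a rough illustration of the decoding step, the sketch below runs Viterbi search over a toy hidden Markov model. The states, observation symbols, and probabilities are invented for the example; a real recognizer would use phone- or word-level models and acoustic feature vectors rather than letters.

    # Minimal Viterbi decoding over a toy HMM (illustrative values only).
    states = ["S1", "S2"]
    start_p = {"S1": 0.6, "S2": 0.4}
    trans_p = {"S1": {"S1": 0.7, "S2": 0.3},
               "S2": {"S1": 0.4, "S2": 0.6}}
    emit_p  = {"S1": {"a": 0.5, "b": 0.5},
               "S2": {"a": 0.1, "b": 0.9}}

    def viterbi(observations):
        # V[t][s] = probability of the best path ending in state s at time t
        V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
        back = [{}]
        for t in range(1, len(observations)):
            V.append({})
            back.append({})
            for s in states:
                prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                                 for p in states)
                V[t][s] = prob
                back[t][s] = prev
        # Trace back the most probable state sequence.
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(observations) - 1, 0, -1):
            path.insert(0, back[t][path[0]])
        return path, V[-1][last]

    print(viterbi(["a", "b", "b"]))

In practice the hypothesis space is far too large to enumerate, so decoders combine this kind of dynamic programming with pruning (beam search) to keep only promising partial hypotheses.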

Problems in speech recognition

Speech recognition is a difficult problem for various reasons.

  • Variability of the input signal. Since speech recognition is attempting to process natural human language - a very complex, diverse, and irregular entity - input signals are not likely to be very consistent (even if they are meant to represent the same thing).
    • Intra-speaker variation and performance errors. An individual speaker may pronounce the same word differently in different contexts, may have several pronunciations of one word in free variation, or may simply make errors.
    • Speaker dependency. Recognizers must go through a training process, which is usually done with data from only one speaker. As a result, a system trained on one speaker may perform poorly for others.
    • Accents. Accents are another form of variability of the input signal; discrepancies between accented speech and the dictionary or training data being used by the recognition system can cause bad performance.
    • Co-articulation. In connected speech, words are acoustically dependent on each other: pronunciation of phones at the end of one word may influence the pronunciation of phones at the beginning of the next word, and the same process can happen inside individual words. Hidden Markov Models have difficulty modeling such dependencies.
    • Noise. Speech recognition systems must have a method for distinguishing speech signals from non-speech noise.
  • Lack of human judgment. Variation, errors, or accents may be easily decipherable by other human listeners, but cause great difficulty for speech recognition systems.
  • Large vocabularies. Systems must be trained and dictionaries must be constructed for them; this is a tedious task for large vocabularies.
  • Similar sounding words, such as "sad" and "sat". Systems must be able to deal with vocabularies that contain words which differ by a few fine details.
  • Determining word boundaries. "Ice cream," or "I scream?" Unless speech is carefully articulated, this is a very difficult task (a toy language-model sketch follows this list).
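
As a toy illustration of how a language model helps with word boundaries and similar-sounding hypotheses, the sketch below scores two candidate segmentations of the same sounds with made-up bigram probabilities; the word lists and numbers are invented for the example, and a real system would estimate them from a large corpus and combine them with acoustic scores.

    import math

    # Invented bigram log-probabilities (illustrative values only).
    bigram_logp = {
        ("<s>", "ice"): math.log(0.002),
        ("ice", "cream"): math.log(0.30),
        ("<s>", "i"): math.log(0.05),
        ("i", "scream"): math.log(0.0005),
    }

    def score(words):
        # Sum of bigram log-probabilities for a candidate word sequence.
        total = 0.0
        prev = "<s>"
        for w in words:
            total += bigram_logp.get((prev, w), math.log(1e-6))  # floor for unseen pairs
            prev = w
        return total

    candidates = [["ice", "cream"], ["i", "scream"]]
    for c in candidates:
        print(c, score(c))
    print("preferred:", max(candidates, key=score))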

Applications

External Links
