ASR transcribes audio into text by building a digital representation of the spoken words. It does so by working with phonemes: the smallest units of sound that can change a word's meaning in a particular language.
Phonemes carry no meaning on their own, and ASR systems usually rely on a fixed phoneme set. ARPABET is one example of such a collection; the variant used by the CMU Pronouncing Dictionary contains 39 phonemes for General American English. The list may be longer or shorter depending on the target language we'd like our ASR to understand.
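To make the idea concrete, here is a minimal sketch of a pronunciation lookup using ARPABET phonemes. The dictionary below is a tiny illustrative sample (the entries follow the CMU Pronouncing Dictionary convention, where digits on vowels mark lexical stress); a real system would load a full lexicon with tens of thousands of words.

```python
# Toy pronunciation lexicon mapping words to ARPABET phoneme sequences.
# Digits on vowel symbols mark stress (0 = no stress, 1 = primary stress).
PRONUNCIATIONS = {
    "cat":    ["K", "AE1", "T"],
    "hello":  ["HH", "AH0", "L", "OW1"],
    "speech": ["S", "P", "IY1", "CH"],
}

def phonemes_for(word):
    """Return the ARPABET phoneme sequence for a word, or None if unknown."""
    return PRONUNCIATIONS.get(word.lower())

print(phonemes_for("Hello"))  # ['HH', 'AH0', 'L', 'OW1']
```

Note that a single written word may have several valid pronunciations; full lexicons store one entry per variant.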
To make this possible, an ASR system consists of several components, including:
- The decoder (also called the recognizer) performs the actual recognition: it receives the audio input and generates a so-called recognition hypothesis, the system's best guess at the spoken words.
- The language model defines the phrases the ASR system can recognize, together with a pronunciation dictionary (lexicon) that maps each word to its phonetic representation.
- The acoustic model captures the relationship between the audio signal and the phonemes, and is trained for a specific language or dialect. For better accuracy, the acoustic model should match the speaker: for example, a US English acoustic model to recognize speech from a person in the US.
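The components above can be sketched as follows. In a classical decoder, each hypothesis gets an acoustic score (how well the audio matches the phonemes) and a language-model score (how likely the word sequence is), and the decoder picks the hypothesis with the best combined score. The scores and the LM weight below are made-up illustrative values, not output from a real system.

```python
# Hypothetical log-probability scores for two competing hypotheses.
# "acoustic": how well the audio matches the hypothesized phonemes.
# "lm": how likely the word sequence is under the language model.
hypotheses = {
    "recognize speech":   {"acoustic": -12.1, "lm": -4.2},
    "wreck a nice beach": {"acoustic": -11.8, "lm": -9.5},
}

LM_WEIGHT = 1.5  # decoders typically scale the LM score; value is illustrative

def total_score(scores):
    """Combine acoustic and language-model log-probabilities."""
    return scores["acoustic"] + LM_WEIGHT * scores["lm"]

best = max(hypotheses, key=lambda h: total_score(hypotheses[h]))
print(best)  # 'recognize speech'
```

Here the acoustics alone slightly favor the wrong phrase, but the language model knows "recognize speech" is the far more plausible word sequence, so the combined score picks it. This is exactly the division of labor the component list describes.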