A Connected Speech Recognition System Based on Spotting Diphone-like Segments - Preliminary Results
01 January 1987
A template-based, connected speech recognition system that represents words as sequences of diphone-like segments has been implemented and evaluated. The inventory of segments is divided into two principal classes: "steady-state" speech sounds, such as vowels, fricatives, and nasals, and "composite" speech sounds, consisting of sequences of two or more speech sounds in which the transitions from one sound to another are intrinsic to the representation of the composite sound. Templates representing these segments are extracted from labeled training utterances; they are constructed by extracting the shortest possible segments from the utterances consistent with obtaining stable and reliable representations of the speech sounds. Words are represented by network models whose branches are diphone segments, and a path through such a network represents a particular pronunciation of the word. Word-juncture phenomena are accommodated by including final-segment branches that characterize the transitional pronunciation for specified classes of words. Word models also specify the maximum expected separation between adjacent segments. A word is recognized in a given utterance by "spotting" all the segments contained in the model of the word. Spotting is accomplished with an "open-end" DTW (dynamic time warping) comparison between a segment template and every frame of the utterance.
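The "open-end" DTW spotting step described above can be illustrated with a minimal subsequence-DTW sketch. This is not the paper's implementation: the feature representation, local distance measure, and path constraints are assumptions here (Euclidean frame distance, symmetric single-step moves). The idea shown is that a warping path must consume the whole segment template but may begin at any utterance frame, yielding a match score for every possible ending frame.

```python
import numpy as np

def spot_segment(template, utterance):
    """Subsequence ("open-end") DTW sketch.

    template:  (T, d) array of feature frames for one segment template
    utterance: (U, d) array of feature frames for the utterance
    Returns a length-U array: for each utterance frame j, the best
    length-normalized distance of a path that consumes the whole
    template and ends at frame j.
    """
    T, U = len(template), len(utterance)
    # Local frame-to-frame distances (Euclidean is an assumption;
    # the abstract does not specify the distance measure).
    dist = np.linalg.norm(template[:, None, :] - utterance[None, :, :], axis=2)
    D = np.full((T, U), np.inf)
    D[0] = dist[0]  # free start: a path may begin at any utterance frame
    for i in range(1, T):
        for j in range(1, U):
            # standard DTW recurrence: insert, delete, or match moves
            D[i, j] = dist[i, j] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1] / T  # spotting score for a match ending at each frame
```

Scanning the returned scores for local minima below a threshold gives candidate occurrences of the segment anywhere in the utterance, which is the behavior the abstract describes.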
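The word-model structure described above (networks whose branches are diphone segments, alternative final branches for juncture contexts, and maximum-separation specifications) might be sketched as a simple data structure. All labels and values below are hypothetical illustrations, not the paper's actual inventory or parameters.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    """One network branch: a diphone-like segment plus the maximum
    expected separation (in frames) before the next segment begins."""
    segment: str       # hypothetical segment label, e.g. "d-ey"
    max_gap: int = 3   # illustrative value; the abstract gives none

@dataclass
class WordModel:
    name: str
    # Each entry is one path through the word's network, i.e. one
    # allowed pronunciation of the word as a sequence of segments.
    paths: list = field(default_factory=list)

# Hypothetical model of "data": two pronunciation paths, each ending
# in a different final segment to characterize a different class of
# word-juncture context.
data = WordModel("data", paths=[
    [Arc("d-ey"), Arc("t-ax"), Arc("ax-sil")],   # pre-pause juncture
    [Arc("d-ey"), Arc("t-ax"), Arc("ax-vowel")], # pre-vowel juncture
])
```

Recognition would then require spotting every segment on some path, in order, with each inter-segment gap no larger than the preceding arc's `max_gap`.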