An Improved Training Procedure for Connected-Digit Recognition
01 July 1982
Recently, several algorithms have been proposed for recognizing a connected string of words based on a set of discrete word reference patterns.1-5 Although the details and implementations of these algorithms differ substantially, they all try to find the optimum (smallest-distance) concatenation of isolated-word reference patterns that best matches the spoken word string. Thus, the success of all these algorithms hinges on how well a connected string of words can be matched by concatenating members of a set of isolated-word reference patterns. Clearly, for talking rates that are comparable to the rate of articulation of isolated words [e.g., on the order of 100 words per minute (wpm)], these algorithms have the potential of performing quite accurately.1-6 However, when the talking rate of the connected-word strings becomes substantially higher than the articulation rate for isolated words, for example, around 150 wpm, all the pattern-matching algorithms tend to become unreliable (yield high error rates). The breakdown mechanism in such cases is easily understood. The first effect of a higher talking rate is a shortening of the duration of each word within the connected-word string. This shortening is highly nonlinear (i.e., vowels and steady fricatives are first shortened and then reduced, whereas sound transitions are substantially unaffected) and is not easily compensated by the dynamic time warping (DTW) alignment procedure, which has its own inherent local and global warping path constraints.
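The role of the warping path constraints can be illustrated with a minimal sketch. It is not the implementation used in any of the cited algorithms: the function name `dtw_distance`, the scalar per-frame features, the absolute-difference frame distance, the three-step local constraint, and the band-style global constraint with its `band` parameter are all assumptions chosen for brevity.

```python
import math


def dtw_distance(test, ref, band=None):
    """Accumulated DTW distance between two sequences of scalar features.

    Local constraint (assumed): each path step moves by (1, 0), (0, 1), or (1, 1).
    Global constraint (assumed): the path must stay within |i - j| <= band;
    band=None disables this constraint.
    Returns math.inf when no path satisfies the constraints.
    """
    n, m = len(test), len(ref)
    # D[i][j] is the best accumulated distance aligning test[:i] with ref[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell lies outside the global constraint region
            cost = abs(test[i - 1] - ref[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# Hypothetical isolated-word reference: 10 frames with a long steady region.
reference = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
# Hypothetical fast-rate token of the same word: the steady region is shortened.
fast_token = [0, 1, 1, 0]

unconstrained = dtw_distance(fast_token, reference)
constrained = dtw_distance(fast_token, reference, band=2)
```

In this toy case the unconstrained alignment absorbs the shortening completely, whereas a band of two frames admits no valid path at all, because the two sequences differ in length by six frames. Real systems use multidimensional spectral features and more elaborate constraints, but the same effect appears: once the shortening exceeds what the constraints allow, the reference pattern can no longer be matched well.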