Application of Clustering Techniques to Speaker-Trained Isolated Word Recognition

01 December 1979

Although a great deal has been learned about isolated word speech recognition systems,1-14 several key issues are not as well understood as others. One such issue is the manner in which the word reference templates for such a system are obtained. To date, there have been at least three distinct ways of obtaining templates:

(i) Casual training, in which the designated talker (for a speaker-trained system) speaks each word of the vocabulary one or more times and a reference template is created for each spoken word.1,4 Thus, for casual training, there is a direct correspondence between a spoken token of the word and the reference template.

(ii) Averaging methods, in which the designated talker (for a speaker-trained system) or a set of talkers (for a speaker-independent system) speaks the word a number of times and a weighted, time-normalized average of the feature sets for that word is used as the reference template.1,7,15

(iii) Statistical clustering methods, in which a set of talkers speak the word and a statistical pattern recognition algorithm is used to group the feature sets of the tokens into a set of clusters.14,16 The similarity of tokens within a cluster is high (small intratoken distances), whereas the similarity of tokens in different clusters is low (large intertoken distances). Reference templates are obtained by representing each cluster by a single template (either using a minimax approach,14 or via averaging techniques17). Thus, a word is generally represented by a set of templates rather than one or two templates.
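
As an illustration of how a cluster might be reduced to a single template, the following is a minimal sketch of a minimax selection. It assumes that the minimax template is the token whose largest distance to any other token in its cluster is smallest, and that pairwise token distances (for example, from a time-normalized comparison) and cluster assignments are already available. The function names, the distance matrix, and the labels are hypothetical and are not taken from the paper.

```python
def minimax_center(dist, members):
    """Return the index of the token whose maximum distance to the
    other tokens in `members` is smallest (assumed minimax criterion)."""
    best, best_radius = None, float("inf")
    for i in members:
        # Largest distance from token i to any other token in the cluster;
        # a single-token cluster has radius 0.
        radius = max((dist[i][j] for j in members if j != i), default=0.0)
        if radius < best_radius:  # strict '<' keeps the first token on ties
            best, best_radius = i, radius
    return best


def templates_from_clusters(dist, labels):
    """Map each cluster label to the index of its minimax token."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    return {label: minimax_center(dist, members)
            for label, members in clusters.items()}


# Hypothetical symmetric distance matrix for five tokens of one word.
dist = [
    [0, 1, 2, 9, 9],
    [1, 0, 1, 9, 9],
    [2, 1, 0, 9, 9],
    [9, 9, 9, 0, 3],
    [9, 9, 9, 3, 0],
]
labels = [0, 0, 0, 1, 1]  # hypothetical cluster assignment per token

print(templates_from_clusters(dist, labels))
```

In this toy case the word would be represented by two templates, one token per cluster, which matches the idea that a clustered word is described by a set of templates. An averaging variant would instead combine the time-aligned feature sets of all tokens in each cluster.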