A Comparison of Four Metrics for Auto-Inducing Semantic Classes
01 January 2001
A speech understanding system typically includes a natural language understanding module that defines groups, or concepts, of semantically related words. Building a set of concepts for a new domain is a major challenge when prior knowledge and training data are limited. These concepts can be auto-induced from unannotated training data if there is an appropriate metric for comparing the similarity of candidate words and phrases. We compare four different context-dependent metrics by auto-inducing concepts from training data for each of four tasks: movie information, a children's game, travel reservations, and news articles from the Wall Street Journal. Two of these metrics are based on the Kullback-Leibler (KL) distance measure, a third is the Manhattan norm, and the fourth is the vector product similarity measure. The KL distance consistently underperforms the other three metrics.
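To make the compared metrics concrete, the following is a minimal sketch of the three distinct measures named in the abstract, applied to context distributions of candidate words. The abstract does not specify how its two KL-based metrics differ, so only a single plain KL distance is shown here; the toy distributions and the smoothing constant `eps` are illustrative assumptions, not from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # Kullback-Leibler distance D(p || q) between two discrete context
    # distributions over the same vocabulary; eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def manhattan(p, q):
    # Manhattan (L1) norm of the difference between two distributions;
    # smaller means more similar.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def vector_product(p, q):
    # Vector (dot) product similarity; larger means more similar.
    return sum(pi * qi for pi, qi in zip(p, q))

# Toy context distributions for two candidate words over a 4-word vocabulary.
p = [0.5, 0.3, 0.1, 0.1]
q = [0.4, 0.4, 0.1, 0.1]

print(kl_divergence(p, q))
print(manhattan(p, q))
print(vector_product(p, q))
```

In an induction loop, such a metric would score every pair of candidate words by comparing the distributions of their left and right contexts, and the most similar pairs would be merged into the same concept.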