A comparison of text-based methods for detecting duplication in scanned document databases

This paper presents an experimental evaluation of several text-based methods for detecting duplication in scanned document databases using uncorrected OCR output. This task is made challenging both by the wide range of degradations printed documents can suffer, and by conflicting interpretations of what it means to be a "duplicate." We report results for four sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.
