Handwritten Document Image Analysis at Los Alamos: Script, Language, and Writer Identification
1999
Abstract
A system for automatically identifying the script used in a handwritten document image is described. The system was developed using a 496-document dataset representing six scripts, eight languages, and 281 writers. Documents were characterized by the mean, standard deviation, and skew of five connected component features. Linear discriminant analysis was used to classify new documents and was tested using writer-sensitive cross-validation. Classification accuracy averaged 88% across the six scripts. The same method, applied within the Roman subcorpus, discriminated English and German documents with 85% accuracy. Pilot results indicate that a variation of the method may be applicable to writer identification.
1. Introduction
Script and language identification are important parts of the automatic processing of document images in an international environment. A document's script (e.g., Cyrillic or Roman) must be known in order to choose an appropriate optical character recognition ...
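The document characterization and the writer-sensitive cross-validation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that skew means the standard third-moment skewness, that each connected component is represented by a tuple of numeric features (the abstract does not name the five features), and that "writer-sensitive" means all documents by one writer are held out together. The function names summarize, document_vector, and writer_folds are hypothetical, and the linear discriminant classifier itself is omitted.

```python
import math
from collections import defaultdict


def summarize(values):
    """Return (mean, population std, skewness) of one feature over a document."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Assumption: third-moment skewness; reported as 0.0 when all values are equal.
    if std == 0:
        skew = 0.0
    else:
        skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    return mean, std, skew


def document_vector(components):
    """Build one document's feature vector.

    components: list of equal-length numeric tuples, one per connected
    component. With five features per component this yields 15 numbers.
    """
    vector = []
    for feature_values in zip(*components):
        vector.extend(summarize(feature_values))
    return vector


def writer_folds(writer_ids, n_folds):
    """Split document indices into folds so that no writer spans two folds."""
    by_writer = defaultdict(list)
    for index, writer in enumerate(writer_ids):
        by_writer[writer].append(index)
    folds = [[] for _ in range(n_folds)]
    # Place the most prolific writers first, each into the currently smallest fold.
    for writer in sorted(by_writer, key=lambda w: (-len(by_writer[w]), str(w))):
        min(folds, key=len).extend(by_writer[writer])
    return folds
```

Each fold would serve once as the test set, with a classifier trained on the remaining folds, so that no test document shares a writer with any training document.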