Script and Language Identification for Handwritten Document Images

1998

Abstract

A system for automatically identifying the script used in a handwritten document image is described. The system was developed using a 496-document dataset representing six scripts, eight languages, and 281 writers. Documents were characterized by the mean, standard deviation, and skew of five connected component features. A linear discriminant analysis was used to classify new documents, and tested using writer-sensitive cross-validation. Classification accuracy averaged 88% across the six scripts. The same method, applied within the Roman subcorpus, discriminated English and German documents with 85% accuracy.

Keywords: script, language, handwriting, discrimination, features

1. Introduction

Script and language identification are important parts of the automatic processing of document images in an international environment. A document's script (e.g., Cyrillic or Roman) must be known in order to choose an appropriate optical character recognition (OCR) algorithm. For scripts use...
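The document characterization and the writer-sensitive cross-validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the five connected component features, so the sketch accepts any equal-length feature tuples, and the function names `document_descriptor` and `writer_folds`, the population (divide-by-n) estimators, and the greedy fold assignment are all assumptions made for illustration. The linear discriminant classifier itself is omitted.

```python
import math
from collections import defaultdict


def document_descriptor(components):
    """Summarize per-component features as mean, std, and skew per feature.

    `components` is a list of equal-length feature tuples, one tuple per
    connected component (which features to measure is left to the caller).
    Returns a flat list: [mean_0, std_0, skew_0, mean_1, std_1, skew_1, ...].
    """
    if not components:
        raise ValueError("need at least one connected component")
    n = len(components)
    descriptor = []
    for column in zip(*components):
        mean = sum(column) / n
        # Population variance (assumption: the estimator is not specified).
        var = sum((x - mean) ** 2 for x in column) / n
        std = math.sqrt(var)
        # Third standardized moment; defined here as 0.0 for a constant feature.
        skew = sum((x - mean) ** 3 for x in column) / n / std ** 3 if std > 0 else 0.0
        descriptor.extend([mean, std, skew])
    return descriptor


def writer_folds(writer_ids, n_folds):
    """Assign document indices to folds so that no writer spans two folds."""
    by_writer = defaultdict(list)
    for doc_index, writer in enumerate(writer_ids):
        by_writer[writer].append(doc_index)
    folds = [[] for _ in range(n_folds)]
    # Place the most prolific writers first, each into the currently smallest fold.
    for writer in sorted(by_writer, key=lambda w: (-len(by_writer[w]), str(w))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_writer[writer])
    return folds
```

With five features per component, `document_descriptor` yields a 15-value vector per document. Keeping every document by a given writer in the same fold means a classifier is never tested on handwriting from a writer it was trained on, which is one plausible reading of "writer-sensitive".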
