Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks
Related papers
End-to-End Text Recognition with Convolutional Neural Networks
Full end-to-end text recognition in natural images is a challenging problem that has received much attention recently. Traditional systems in this area have relied on elaborate models incorporating carefully hand-engineered features or large amounts of prior knowledge. In this paper, we take a different route and combine the representational power of large, multilayer neural networks with recent developments in unsupervised feature learning, which allows us to use a common framework to train highly accurate text detector and character recognizer modules. Then, using only simple off-the-shelf methods, we integrate these two modules into a full end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks, namely Street View Text and ICDAR 2003.
EASTER: Simplifying Text Recognition using only 1D Convolutions
Proceedings of the Canadian Conference on Artificial Intelligence, 2021
Recurrent units and complex gated layers are key components of most text recognition models. Their sequential nature and complex mechanisms demand large labelled training datasets and substantial computation, and lead to slower inference times. In this paper, we present an Efficient And Scalable TExt Recognizer (EASTER) to perform optical character recognition on both machine printed and handwritten text. Our model utilises only 1-D convolutional layers without any recurrence or complex gating mechanisms. Our proposed architecture achieves performance similar to the best-performing recurrent architectures while using only 4% of the training data for the offline handwritten text recognition task. We present results of our model on different machine printed text recognition datasets as well. We also showcase improvements over the current best results on the line-level offline handwritten text recognition task. Our work presents a highly scalable and deployable model for real-world settings while being highly performant.
Recurrence-free unconstrained handwritten text recognition using gated fully convolutional network
2020
Unconstrained handwritten text recognition is a major step in most document analysis tasks. This is generally processed by deep recurrent neural networks and more specifically with the use of Long Short-Term Memory cells. The main drawbacks of these components are the large number of parameters involved and their sequential execution during training and prediction. One alternative to using LSTM cells is to compensate for the loss of long-term memory with a heavy use of convolutional layers, whose operations can be executed in parallel and which involve fewer parameters. In this paper we present a Gated Fully Convolutional Network architecture that is a recurrence-free alternative to the well-known CNN+LSTM architectures. Our model is trained with the CTC loss and shows competitive results on both the RIMES and IAM datasets. We release all code to enable reproduction of our experiments: https://github.com/FactoDeepLearning/LinePytorchOCR.
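The abstract above says the model is trained with the CTC loss but does not say how its outputs are decoded. As an illustration only, the following minimal sketch shows greedy (best-path) CTC decoding, a common choice: take the best class per frame, collapse consecutive repeats, then drop the blank symbol. The function name, the blank index of 0, and the toy alphabet are assumptions made for this example and are not taken from the paper or its repository.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy CTC decoding.

    frame_scores: list of per-frame score lists, one score per class.
    alphabet: string whose character i corresponds to class index i + 1.
    """
    # 1. Pick the highest-scoring class for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Collapse runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks and map the remaining indices to characters.
    return "".join(alphabet[i - 1] for i in collapsed if i != blank)


# Toy example: three classes (blank, "a", "b") and a hand-made best path.
path = [1, 1, 0, 1, 2, 2, 0]
scores = [[1.0 if c == p else 0.0 for c in range(3)] for p in path]
print(ctc_greedy_decode(scores, "ab"))  # prints "aab"
```

The blank between the two runs of class 1 is what allows the repeated letter "a" to survive the collapse step.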
Reading Text in the Wild with Convolutional Neural Networks
In this work we present an end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character-classifier-based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.
Easter2.0: Improving convolutional models for handwritten text recognition
arXiv (Cornell University), 2022
Convolutional Neural Networks (CNNs) have shown promising results for the task of Handwritten Text Recognition (HTR), but they still fall behind Recurrent Neural Network (RNN) and Transformer based models in terms of performance. In this paper, we propose a CNN based architecture that bridges this gap. Our work, Easter2.0, is composed of multiple layers of 1D Convolution, Batch Normalization, ReLU, Dropout, Dense Residual connections and a Squeeze-and-Excitation module, and makes use of the Connectionist Temporal Classification (CTC) loss. In addition to the Easter2.0 architecture, we propose a simple and effective data augmentation technique, 'Tiling and Corruption (TACo)', relevant for the task of HTR/OCR. Our work achieves state-of-the-art results on the IAM handwriting database when trained using only publicly available training data. In our experiments, we also present the impact of TACo augmentations and Squeeze-and-Excitation (SE) on text recognition accuracy. We further show that Easter2.0 is suitable for few-shot learning tasks and outperforms current best methods, including Transformers, when trained on a limited amount of annotated data. Code and model
World Journal of Advanced Engineering Technology and Sciences (WJAETS), 2024
Handwritten text recognition (HTR) is a pivotal technology with extensive applications in document digitization, postal automation, and educational tools. This paper describes the implementation of a deep learning-based system for recognizing handwritten digits using TensorFlow and the MNIST dataset. The MNIST dataset, a widely used benchmark, comprises 60,000 training images and 10,000 testing images of handwritten digits, each standardized to a 28x28 pixel grayscale format. Leveraging the power of Convolutional Neural Networks (CNNs), our model effectively extracts features and classifies digits with high accuracy. The model architecture consists of multiple convolutional layers for feature extraction, pooling layers for dimensionality reduction, and fully connected layers for classification. Preprocessing steps include normalizing pixel values and one-hot encoding the labels, ensuring the data is suitably formatted for training. The TensorFlow framework, known for its robustness and scalability, facilitates the development and deployment of the model. Through a series of experiments, the model demonstrates impressive performance, achieving high accuracy on the MNIST test set. This paper underscores the potential of deep learning in handwritten text recognition. It sets the stage for future enhancements, such as recognizing more complex handwritten texts and optimizing the system for practical applications. The results highlight the effectiveness of deep learning techniques in overcoming the challenges associated with handwritten text recognition, paving the way for advanced, real-world implementations.
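The two preprocessing steps named in this abstract, normalizing pixel values and one-hot encoding the labels, can be sketched with NumPy as follows. This is a hedged illustration, not the paper's code: it assumes 8-bit pixel values scaled by 255 (the abstract does not state the normalization used), adds a trailing channel axis as many CNN frameworks expect, and uses random arrays in place of the real MNIST data. The function name `preprocess` is hypothetical.

```python
import numpy as np


def preprocess(images, labels, num_classes=10):
    """Scale 8-bit grayscale images to [0, 1] and one-hot encode labels.

    images: uint8 array of shape (N, 28, 28); labels: int array of shape (N,).
    """
    # Normalize pixel values (assumed range 0..255) to floats in [0, 1].
    x = images.astype(np.float32) / 255.0
    # Add a channel axis so each image has shape (28, 28, 1).
    x = x.reshape(-1, 28, 28, 1)
    # Row i of the identity matrix is the one-hot vector for class i.
    y = np.eye(num_classes, dtype=np.float32)[labels]
    return x, y


# Stand-in data with the MNIST image shape (not the real dataset).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9, 3])

x, y = preprocess(fake_images, fake_labels)
print(x.shape, y.shape)  # (4, 28, 28, 1) (4, 10)
```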
Towards efficient unconstrained handwriting recognition using Dilated Temporal Convolution Network
Expert Systems with Applications, 2020
Recognition of cursive handwritten images has advanced well with recent recurrent architectures and attention mechanisms. Most works focus on improving transcription performance in terms of Character Error Rate (CER) and Word Error Rate (WER), yet existing models are slow to train and test. Furthermore, recent studies have recommended that models be not only efficient in terms of task performance but also environmentally friendly in terms of carbon footprint. A review of recent state-of-the-art models suggests that training and retraining time should be considered during model design. High training time increases costs not only in terms of resources but also in carbon footprint. This is challenging for handwriting recognition models built on popular recurrent architectures, and it is especially critical because line images are usually very wide, resulting in a longer sequence to decode. In this work, we present a fully convolutional deep network architecture for cursive handwriting recognition from line-level images. The architecture is a combination of 2-D convolutions and 1-D dilated non-causal convolutions with a Connectionist Temporal Classification (CTC) output layer. This offers high parallelism with a smaller number of parameters. We further run experiments with various re-scaling factors of the images and show how they affect the performance of the proposed model. A data augmentation pipeline is also analyzed during model training. The experiments show that our model has comparable performance on CER and WER measures with recurrent architectures. A comparison is made with state-of-the-art models with different architectures based on Recurrent Neural Networks (RNNs) and their variants. The analysis reports training performance and network details on three different datasets of English and French handwriting.
This shows our model has fewer parameters and takes less training and testing time, making it suitable for low-resource and environment-friendly deployment. The authors are thankful to the Ministry of Electronics and IT (MeitY), Government of India, for supporting this work under the Visvesvaraya Ph.D. fellowship from MeitY.
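To make the "1-D dilated non-causal convolutions" mentioned in this abstract concrete, here is a small pure-Python sketch. It is not the paper's architecture: the kernel, the dilation schedule, and the function names are assumptions chosen for illustration. "Non-causal" here means each output position looks at inputs on both sides of it, and the helper shows how the receptive field of a stack grows with the dilation factors while every output position can still be computed independently.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Non-causal, zero-padded 1-D dilated convolution (cross-correlation form).

    Assumes an odd kernel size so the window is centred on each position.
    """
    half = (len(kernel) - 1) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # Taps are spaced `dilation` steps apart, left and right of t.
            idx = t + (j - half) * dilation
            if 0 <= idx < len(signal):  # positions outside count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Input positions seen by one output of a stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2))
# prints [4.0, 6.0, 9.0, 6.0, 8.0]
print(receptive_field(3, [1, 2, 4, 8]))  # prints 31
```

With the hypothetical schedule 1, 2, 4, 8, four layers of kernel size 3 cover 31 input positions, which is one way a convolutional stack can capture long context in wide line images without recurrence.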
EASTER: Efficient and Scalable Text Recognizer
ArXiv, 2020
Recent progress in deep learning has led to the development of Optical Character Recognition (OCR) systems which perform remarkably well. Most research has centred on recurrent networks and complex gated layers, which make the overall solution complex and difficult to scale. In this paper, we present an Efficient And Scalable TExt Recognizer (EASTER) to perform optical character recognition on both machine printed and handwritten text. Our model utilises 1-D convolutional layers without any recurrence, which enables parallel training with a considerably smaller volume of data. We experimented with multiple variations of our architecture, and one of the smallest variants (in terms of depth and number of parameters) performs comparably to complex RNN based choices. Our deepest variant, with 20 layers, outperforms RNN architectures by a good margin on benchmarking datasets like IIIT-5k and SVT. We also showcase improvements over the current best results on the offline handwritten text recognition task....
Cursive Text Recognition in Natural Scene Images using Deep Convolutional Recurrent Neural Network
IEEE Access
Text recognition in natural scene images is a challenging problem in computer vision. Unlike optical character recognition (OCR), text recognition in natural scene images is more complex due to variations in text size, colors, fonts, orientations, complex backgrounds, occlusion, illumination and uneven lighting conditions. In this paper, we propose a segmentation-free method based on a deep convolutional recurrent neural network to solve the problem of cursive text recognition, particularly focusing on Urdu text in natural scenes. Compared to non-cursive scripts, Urdu text recognition is more complex due to variations in writing styles, several shapes of the same character, connected text, ligature overlapping, and stretched, diagonal and condensed text. The proposed model takes a whole word image as input without pre-segmenting it into individual characters, and then transforms it into a sequence of the relevant features. Our model is based on three components: a deep convolutional neural network (CNN) with shortcut connections to extract and encode the features, a recurrent neural network (RNN) to decode the convolutional features, and a connectionist temporal classification (CTC) layer to map the predicted sequences into the target labels. To increase the text recognition accuracy further, we explore deeper CNN architectures like VGG-16, VGG-19, ResNet-18 and ResNet-34 to extract more appropriate Urdu text features, and compare the recognition results. To conduct the experiments, a new large-scale benchmark dataset of cropped Urdu word images in natural scenes is developed. The experimental results show that the proposed deep CRNN network with shortcut connections outperforms the other network architectures. The dataset is publicly available and can be downloaded from https://data.mendeley.com/datasets/k5fz57zd9z/1.
INDEX TERMS Cursive text recognition in natural images, Urdu scene text recognition, natural scene text recognition, convolutional recurrent neural network, segmentation-free scene text recognition
Stretching deep architectures for text recognition
2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015
In recent years, many deep architectures have been proposed for handwritten text recognition. However, most of the previous deep models need large-scale training data and a long training time to obtain good results. In this paper, we propose a novel deep learning method based on "stretching" the projection matrices of stacked feature learning models. We call the proposed method "stretching deep architectures" (or SDA). In the implementation of SDA, stacked feature learning models are first learned layer by layer, and then the stretching technique is applied to the weight matrices between successive layers. As the feature learning models can be efficiently optimized and the stretching results can be easily computed, the training of SDA is very fast and no back propagation is needed. We have tested SDA on handwritten digit recognition, Arabic subword recognition and English letter recognition tasks. Extensive experiments demonstrate that SDA outperforms not only shallow feature learning models but also state-of-the-art deep learning models.