TreyNet: A Neural Model for Text Localization, Transcription and Named Entity Recognition in Full Pages

2019

In recent years, the consolidation of deep neural network architectures for information extraction in document images has brought substantial improvements in the performance of each of the tasks involved in this process: text localization, transcription, and named entity recognition. However, these tasks are traditionally performed with separate methods. In this work we propose an end-to-end model that jointly performs handwritten text detection, transcription, and named entity recognition at page level, capable of benefiting from features shared across these tasks. We exhaustively evaluate our approach on different datasets, discussing its advantages and limitations compared to sequential approaches.

Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-End Model

2018 13th IAPR International Workshop on Document Analysis Systems (DAS), 2018

When extracting information from handwritten documents, text transcription and named entity recognition are usually treated as separate, subsequent tasks. This has the disadvantage that errors in the first module heavily affect the performance of the second. In this work we propose to perform both tasks jointly, using a single neural network with a common architecture used for plain text recognition. The approach has been tested experimentally on a collection of historical marriage records. Results are presented showing the effect on performance of different configurations: different ways of encoding the information, with or without transfer learning, and processing at text-line or multi-line region level. The results are comparable to the state of the art reported in the ICDAR 2017 Information Extraction competition, even though the proposed technique does not use any dictionaries, language modeling, or post-processing.

OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Text recognition is a major computer vision task with a large set of associated challenges. One of those traditional challenges is the coupled nature of text recognition and segmentation. This problem has been progressively solved over the past decades, moving from segmentation-based recognition to segmentation-free approaches, which proved more accurate and much cheaper to annotate data for. We take a step from segmentation-free single-line recognition towards segmentation-free multi-line / full-page recognition. We propose a novel and simple neural network module, termed OrigamiNet, that can augment any CTC-trained, fully convolutional single-line text recognizer, converting it into a multi-line version by providing the model with enough spatial capacity to properly collapse a 2D input signal into 1D without losing information. Such modified networks can be trained using exactly the same simple procedure as the original, using only unsegmented image and text pairs. We carry out a set of interpretability experiments that show that our trained models learn an accurate implicit line segmentation. We achieve state-of-the-art character error rates on both the IAM and ICDAR 2017 HTR benchmarks for handwriting recognition, surpassing all other methods in the literature. On IAM we even surpass single-line methods that use accurate localization information during training. Our code is available online at https:
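The core "unfolding" idea is that a multi-line 2D feature map can be flattened row-by-row into one long 1D sequence that a standard CTC decoder reads top-to-bottom. A minimal sketch of that reshaping step, with invented shapes (H lines, W time steps, C channels) rather than the paper's actual architecture:

```python
import numpy as np

# Hypothetical output of a fully convolutional encoder:
# H vertical positions (text lines), W time steps each, C channels.
H, W, C = 4, 50, 64
feat = np.random.rand(H, W, C)

# "Unfolding": concatenate the rows along the time axis so the
# multi-line 2D signal becomes a single 1D sequence; a CTC head can
# then treat the page as one long line read top-to-bottom.
seq = feat.reshape(H * W, C)

print(seq.shape)  # (200, 64)
```

In the actual model this collapse is learned (the encoder is given enough spatial capacity to do it losslessly); the reshape above only shows the target shape transformation.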

TextBoxes: A Fast Text Detector with a Single Deep Neural Network

This paper presents an end-to-end trainable fast scene text detector, named TextBoxes, which detects scene text with both high accuracy and efficiency in a single network forward pass, involving no post-processing except for standard non-maximum suppression. TextBoxes outperforms competing methods in terms of text localization accuracy and is much faster, taking only 0.09s per image in a fast implementation. Furthermore, combined with a text recognizer, TextBoxes significantly outperforms state-of-the-art approaches on word spotting and end-to-end text recognition tasks.
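The only post-processing step mentioned is standard non-maximum suppression: greedily keep the highest-scoring box and discard overlapping lower-scoring ones. A minimal numpy sketch of that standard procedure (not the paper's implementation), with an invented IoU threshold:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Standard non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 0.81 and is suppressed; the third is disjoint and survives.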

Challenges of Deep Learning-based Text Detection in the Wild

Journal of Computational Vision and Imaging Systems

The reported accuracy of recent state-of-the-art text detection methods, mostly deep learning approaches, is in the order of 80% to 90% on standard benchmark datasets. These methods have relaxed some of the restrictions on structured text and environment (i.e., "in the wild") which are usually required for classical OCR to function properly. Even with this relaxation, there are still circumstances where these state-of-the-art methods fail. Several remaining challenges in wild images, such as in-plane rotation, illumination reflection, partial occlusion, complex font styles, and perspective distortion, cause existing methods to perform poorly. In order to evaluate current approaches in a formal way, we standardize the datasets and metrics used for comparison; inconsistencies between these had made comparison between methods difficult in the past. We use three benchmark datasets for our evaluations: ICDAR13, ICDAR15, and COCO-Text V2.0. The objective of the paper is to quantify the current shortcomin...

End-to-End Text Recognition with Convolutional Neural Networks

Full end-to-end text recognition in natural images is a challenging problem that has received much attention recently. Traditional systems in this area have relied on elaborate models incorporating carefully hand-engineered features or large amounts of prior knowledge. In this paper, we take a different route and combine the representational power of large, multilayer neural networks with recent developments in unsupervised feature learning, which allows us to use a common framework to train highly accurate text detector and character recognizer modules. Then, using only simple off-the-shelf methods, we integrate these two modules into a full end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks, namely Street View Text and ICDAR 2003.

Reading Text in the Wild with Convolutional Neural Networks

In this work we present an end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at once, departing from the character-classifier-based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human-labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.

EASTER: Efficient and Scalable Text Recognizer

ArXiv, 2020

Recent progress in deep learning has led to the development of Optical Character Recognition (OCR) systems which perform remarkably well. Most research has centred on recurrent networks and complex gated layers, which make the overall solution complex and difficult to scale. In this paper, we present an Efficient And Scalable TExt Recognizer (EASTER) that performs optical character recognition on both machine-printed and handwritten text. Our model utilises 1-D convolutional layers without any recurrence, which enables parallel training with a considerably smaller volume of data. We experimented with multiple variations of our architecture, and one of the smallest variants (in terms of depth and number of parameters) performs comparably to complex RNN-based alternatives. Our 20-layered deepest variant outperforms RNN architectures by a good margin on benchmark datasets like IIIT-5k and SVT. We also showcase improvements over the current best results on the offline handwritten text recognition task....
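The recurrence-free design amounts to stacking 1-D convolutions along the time (width) axis of the image, so every output frame can be computed in parallel. A minimal numpy sketch of such a stack, with invented shapes and random weights rather than the paper's architecture:

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution along the time axis, followed by ReLU.
    x: (T, C_in), w: (K, C_in, C_out), b: (C_out,)."""
    T, _ = x.shape
    K, _, C_out = w.shape
    out = np.empty((T - K + 1, C_out))
    for t in range(T - K + 1):
        # Every time step is independent -> trivially parallelisable,
        # unlike an RNN whose steps depend on the previous state.
        out[t] = np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 8))           # image columns as a sequence
w1 = rng.standard_normal((3, 8, 16)) * 0.1  # kernel size 3, 8 -> 16 channels
w2 = rng.standard_normal((3, 16, 32)) * 0.1 # kernel size 3, 16 -> 32 channels
h = conv1d(conv1d(x, w1, np.zeros(16)), w2, np.zeros(32))
print(h.shape)  # (96, 32)
```

A character classifier plus CTC loss on top of such features completes the recognizer; the point of the sketch is only that no step of the forward pass depends on a previous time step's output.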

Deep Learning in Text Recognition and Text Detection : A Review

IRJET, 2022

Detecting text in natural scenes is considerably harder than extracting text from images in which the background and foreground are clearly separated and every character is isolated. Text in natural scenes may occur in many conditions: dark text on a light background and vice versa, a broad diversity of fonts (even among letters of the same word), and sections of words occluded by objects in the environment, which can make those parts impossible to detect. Deep learning, a subset of machine learning, employs neural networks, a technique inspired by how the brain analyses data. An Optical Character Recognition engine has two parts: i) text detection and ii) text recognition. Text detection is the process of locating the regions of text in a document; this task has historically proved difficult because different documents (invoices, newspapers, etc.) have varied structures. A text recognition system, on the other hand, takes a portion of a document containing text (a word or a line of text) and outputs the associated text. Both text detection and text recognition have shown considerable promise with deep learning algorithms.

Deep Learning Based Object Detection Models Applied to Document Images

2020

In general, this thesis follows three main lines of research. It is first focused on the understanding of floor plan images, using deep learning-based object detection methods for text and object recognition. Next, the lack of large data sets of floor plan images that could be used to investigate object detection techniques is addressed by creating a comprehensive data set. Then, text alignment in document images is considered.
• Chapter 5 introduces two novel floor plan image data sets, ISTA and Flo2plan, and highlights the labeling approach adopted to annotate them.
• Chapter 6 concludes this dissertation by highlighting the suitability of the presented deep learning-based object detection models for document image understanding, together with a brief discussion of future research directions.