Text Extraction Research Papers - Academia.edu

2025, Real-Time Image and Video Processing 2017

Several approaches have been proposed to extract text from scanned documents. However, text extraction from heterogeneous documents remains a real challenge. It is a difficult task because the text varies in size, style and orientation, and because the background of document regions is complex. Recently, we proposed the improved hybrid binarization based on K-means method (I-HBK) [5] to extract text reliably from heterogeneous documents. In this method, the Page Layout Analysis (PLA), part of the Tesseract OCR engine, is used to identify text and image regions. Our hybrid binarization is then applied separately to each kind of region: gamma correction is applied to image regions before they are processed, whereas text regions are binarized directly. Then, a foreground and background color study is performed to correct inverted region colors. Finally, characters are located in the binarized regions using the PLA algorithm. In this work, we extend the integration of the PLA algorithm within the I-HBK method. In addition, to speed up the separation of text and image regions, we employ an efficient GPU acceleration. Our experiments show that the PLA algorithm reaches an F-measure of 95% on the LRDE dataset. We also compare the sequential and parallel PLA versions: the parallel implementation on a GTX 660 GPU achieves a speedup of 3.7x over the CPU version.
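
As a rough illustration of K-means-based binarization, the sketch below clusters gray levels into two groups with a one-dimensional two-means loop and labels the darker group as foreground. It is not the I-HBK method itself: the function name `kmeans_binarize`, the min/max initialization and the "dark means text" convention are assumptions made only for this example.

```python
def kmeans_binarize(pixels, iterations=20):
    """Split gray levels (0-255) into two clusters with 1-D k-means.

    Returns (threshold, mask); mask[i] is True when pixel i falls in the
    darker cluster. Illustrative only, not the I-HBK algorithm.
    """
    lo, hi = float(min(pixels)), float(max(pixels))
    if lo == hi:
        # A flat region has no foreground/background split.
        return lo, [False] * len(pixels)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        dark = [p for p in pixels if p <= mid]
        bright = [p for p in pixels if p > mid]
        # Both lists are non-empty because min <= mid < max always holds.
        new_lo = sum(dark) / len(dark)
        new_hi = sum(bright) / len(bright)
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    threshold = (lo + hi) / 2.0
    return threshold, [p <= threshold for p in pixels]


threshold, mask = kmeans_binarize([10, 12, 11, 200, 210, 205])
```

If a region turns out to have light text on a dark background, the mask would simply be inverted, which is the kind of correction the foreground/background color study addresses.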

2025

Text mining is the process of analyzing a document or set of documents to understand the content and meaning of the information they contain. It enhances humans' ability to process massive quantities of information and has high commercial value. Text mining, sometimes referred to as text data mining, is roughly the process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, analysis and entity relation modeling (i.e., learning relations between named entities).

2025, ACM SIGKDD Explorations Newsletter

This paper presents an approach for knowledge discovery in texts extracted from the Web. Instead of analyzing words or attribute values, the approach is based on concepts, which are extracted from texts and used as characteristics in the mining process. Statistical techniques are applied to concepts in order to find interesting patterns in concept distributions or associations. In this way, users can perform discovery at a high level, since concepts describe real-world events, objects, thoughts, etc. To identify concepts in texts, a categorization algorithm is used together with a prior classification task for concept definitions. Two experiments are presented: one for political analysis and the other for competitive intelligence. Finally, the approach is discussed, examining its problems and advantages in the Web context.

2025, JADT 2004

How did Italian newsgroups react to the war in Iraq? During the 28 days of "declared" war, what were the prevailing political and ideological orientations? Monitoring the 8 newsgroups of the it.politica category made it possible to explore this topic, starting from the most frequent graphical form in the corpus (excluding empty words), which has 15,605 occurrences out of 5,220,932 in total. The aim is to minimize the loss of information while gaining significant thematic depth from a source known to be very "noisy" and "dirty". For this purpose, the analysis is performed on a sub-text extracted from the corpus and made up of the concordances of that form. Overall, a climate of general opposition to the war emerges, although expressed in different ways and with different arguments. It is not always possible to characterize a newsgroup precisely by the political orientation that nominally identifies it. Newsgroup users seek discussion and polemical confrontation with their political opponents.

2025, Transactions of the Japanese Society for Artificial Intelligence

Directory services are popular among people who search for information on the Web. These services provide hierarchical categories for finding pages of interest. Pages on the Web are assigned to a category by hand. Many existing studies classify a web page using the text in the page. Recently, some studies have used text not only from the target page to be categorized but also from the source pages that link to it. The text taken from the source pages has to be narrowed down, because those pages include many passages unrelated to the target page. However, these studies always apply a single extraction method to all pages, even though web pages differ greatly in format. We have already developed a method for extracting anchor-related text, and we use the text extracted by our method to classify web pages. Experimental results show that our extraction method improves classification accuracy.

2025

People with visual impairments make up a growing segment of modern society. To cater to their special needs, society ought to consider designing special aids that enable them to meet their daily needs. This research proposes a new method for text extraction from indoor signage that will help people with visual impairments navigate unfamiliar indoor environments, thus enhancing their independence and quality of life. In this thesis, images are acquired through a video camera mounted on the glasses of the walking person. Frames are then extracted and used in an integrated framework that applies Maximally Stable Extremal Regions (MSER) to detect letters, along with a morphological dilation operation to identify clusters of letters (words). The proposed method can localize these clusters and detect their orientation. A rotation transformation is performed when needed to realign the text horizontally, so that it is acceptable input for any of the available optical character recognition (OCR) systems. Analytical and simulation results verify the validity of the proposed system.

2025, Ubiquitous Information Technologies and Applications: CUTE 2013

Since born-digital images usually have low resolution, they are distinctly different from natural scene images. Extracting text from born-digital images has attracted increasing interest in the document analysis and recognition field. We propose an automatic method to recognize words in low-resolution color images. First, the image is smoothed using a bilateral filter, which preserves edge information. Then, it is binarized using a global thresholding method and cleaned of noise. Finally, an open-source Optical Character Recognition engine, combined with a post-processor trained on knowledge of the English language, is applied to obtain meaningful words from the binary image. We evaluate the proposed system on the ICDAR 2011 and music sheet datasets, and the results show better performance than several previous works.
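
The abstract does not say which global thresholding method is used. As one common choice, the sketch below implements Otsu's method in plain Python; treating Otsu as the thresholding step, and the names `otsu_threshold` and `binarize`, are assumptions for illustration only.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their gray levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(pixels):
    """Map each pixel to 1 if it is brighter than the Otsu threshold, else 0."""
    t = otsu_threshold(pixels)
    return [1 if p > t else 0 for p in pixels]
```

A single global threshold like this tends to work when the smoothing step has already evened out the background, which is presumably why it follows the bilateral filter.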

2025, Journées Internationales D' …

In recent years the new Italian Communist party (Rifondazione Comunista) has acquired such national and international importance that it has attracted the interest of political scientists. The 6th conference of the party (spring 2005) was very important because it marked the change from a traditional opposition stance to the decision to form an alliance with the future Left government, should Romano Prodi win the contest with Berlusconi. This change of perspective was discussed at length within the party. During the 6th conference, five alternative motions were discussed: the first was proposed by the secretary, Fausto Bertinotti, who decided on the alliance with the other left-wing parties, while the other four motions proposed alternative strategies. To analyse this process we considered 72 articles published before the conference in the party newspaper (Liberazione) in support of the different motions (18 for each motion). The textual analysis we carried out with the new release of TaltaC2 allows us to describe the contents of the five groups of articles, the specific language used, the kinds of actors cited in the corpus and the negative language used. This description is also carried out from a multidimensional perspective.

2025, International Conference on Intelligent Sensing and Information Processing, 2004. Proceedings of

Reasonable success has been achieved in developing monolingual OCR systems for Indian scripts. Scientists, optimistically, have started to look beyond. The development of bilingual OCR systems and of OCR systems capable of identifying text areas are some of the pointers to future activities in the Indian scenario. The separation of text and non-text regions before considering the document image for OCR is an important task. In this paper, we present a biologically inspired, multichannel filtering scheme for page layout analysis. The same scheme has been used for script recognition as well. Parameter tuning is mostly done heuristically. It has also been seen to be computationally viable for commercial OCR system development.

2024, HAL (Le Centre pour la Communication Scientifique Directe)

In this paper, we present our contribution to the FinTOC-2022 Shared Task "Financial Document Structure Extraction". We participated in the three tracks dedicated to English, French and Spanish document processing. Our main contribution consists in treating a financial prospectus as a bundle of documents, i.e., a set of merged documents, each with its own layout and structure. Document Layout and Structure Analysis (DLSA) therefore starts by detecting the boundaries of each document using general layout features. The process is then applied inside each single document, taking advantage of its local properties. DLSA simultaneously considers the text content, vector shapes and images embedded in the native PDF document. For the Title Detection task in English and French, we observed a significant improvement in F-measure compared with our previous participation.

2024

In this paper we propose a coarse-grained NLP approach to text segmentation based on the analysis of lexical cohesion within text. Most work in this area has focused on the discovery of textual units that discuss subtopic structure within documents. In contrast, our segmentation task requires the discovery of topical units of text, i.e. distinct news stories from broadcast news programmes. Our system, SeLeCT, first builds a set of lexical chains in order to model the discourse structure of the text. A boundary detector is then used to search for breaking points in this structure, indicated by patterns of cohesive strength and weakness within the text. We evaluate this technique on a test set of concatenated CNN news story transcripts and compare it with an established statistical approach to segmentation called TextTiling.
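
To make the idea of cohesion-based boundary detection concrete, here is a deliberately simplified sketch: it measures word overlap between adjacent sentences with cosine similarity and places a boundary wherever the overlap drops below a cut-off. This is neither SeLeCT's lexical chaining nor the full TextTiling algorithm; the function names and the `threshold` value are assumptions chosen for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def boundaries(sentences, threshold=0.1):
    """Return indices i such that a topic boundary falls before sentence i.

    A boundary is placed where adjacent sentences share almost no words,
    i.e. where lexical cohesion is weakest.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    return [i + 1 for i in range(len(bags) - 1)
            if cosine(bags[i], bags[i + 1]) < threshold]


story = [
    "the senate passed the budget bill",
    "the budget bill cuts taxes",
    "storm winds hit coastal towns",
    "coastal towns lost power",
]
print(boundaries(story))
```

Real systems compare larger blocks rather than single sentences and use synonyms or lexical chains rather than exact word matches, which makes the cohesion signal far less brittle.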

2024, Proceedings of SPIE

2024, 2015 IEEE International Conference on Image Processing (ICIP)

Text extraction from complex colored images involves the suppression of unwanted background while keeping text features. Imaging devices are almost omnipresent and the unrestricted conditions of the images present new challenges for real-time OCR systems. The recently proposed Gamma Correction Method [1] is a robust and good quality method for text extraction in complex colored images. However it requires a large amount of computing resources and is not well suited for real-time applications. In this paper we propose an efficient acceleration of the GCM to drastically reduce its execution-time, while preserving the text extraction quality. Experimental results on ICDAR dataset show that our approach is effective and can reach a speedup of up to 11.430.
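
For readers unfamiliar with gamma correction, the sketch below shows only the basic per-pixel mapping, out = 255 * (in / 255) ** gamma, applied through a lookup table. It does not reproduce the rest of the GCM pipeline or the acceleration proposed in the paper; `gamma_lut` and `gamma_correct` are names invented for this example.

```python
def gamma_lut(gamma):
    """Build a 256-entry lookup table for out = 255 * (in / 255) ** gamma."""
    return [round(255 * (v / 255) ** gamma) for v in range(256)]


def gamma_correct(pixels, gamma):
    """Apply gamma correction to a flat list of 8-bit gray levels."""
    lut = gamma_lut(gamma)  # computed once, reused for every pixel
    return [lut[p] for p in pixels]


darker = gamma_correct([0, 51, 255], 2.0)    # gamma > 1 darkens mid-tones
brighter = gamma_correct([64], 0.5)          # gamma < 1 brightens them
```

Because the mapping depends only on the input gray level, the table costs 256 evaluations regardless of image size, which is one reason such a step parallelizes well.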

2024, Intelligent Systems for Molecular Biology

Bioinformatics is a new research field that aims at using computer technology to uncover biological knowledge of high relevance to the biotechnology community. An important research topic in Bioinformatics involves the exploration of vast amounts of biological and biomedical scientific literature (BioLiterature). Over the last few decades, text-mining systems have exploited this BioLiterature to reduce the time spent by researchers in its analysis. However, many of these systems rely on manually inserted domain knowledge, which is time-consuming. This thesis proposes an approach where domain knowledge is automatically acquired from publicly available biological databases, instead of using manually inserted domain knowledge. Based on this approach, innovative methods for retrieval, extraction and validation of information published in BioLiterature were developed and evaluated. The results show that the proposed approach is an efficient alternative to domain knowledge explicitly provided by experts. The new methods were also integrated into a system for automatic annotation of genes and proteins, which was successfully demonstrated in several applications.

2024

Given the large amount of data stored in biological databases, the management of uncertainty and incompleteness in them is a non-trivial problem. To cope with the large amount of sequences being produced, a significant number of genes and proteins have been functionally characterized by automated tools. However, these tools have also produced a significant number of misannotations that are now present in the databases. This paper proposes a new approach for validating the automated annotations, which uses the ...

2024, Lecture Notes in Computer Science

It is well known that the text that appears in a video scene or is graphically added to it is an important source of semantic information for indexing and retrieval, notably in the context of video databases. This paper proposes an improved algorithm for the automatic extraction of text in digital video; its major strengths are its robustness to text skew and its improved performance in dealing with scene text. The system is based on a segmentation approach, using geometrical and spatial analyses for text detection. Temporal redundancy is then exploited to improve detection performance by means of motion analysis. The output of the text detection step is passed directly to a standard OCR software package in order to obtain the detected text as ASCII characters.

2024

Text categorization is a challenging task when the text comes from different sources such as images, videos, and handwriting. Handwritten text varies from one writer to another. Hence, it is difficult to find the best technique for categorizing such texts, given the lack of standard datasets and evaluation measures. Our system presents a standard method for recognizing text from all of the aforementioned input sources using a Support Vector Machine (SVM) classifier. Additionally, it classifies the words into predefined part-of-speech classes for the English language using Deep Learning algorithms.

2024, International Journal of Innovative Research in Computer and Communication Engineering

Extracting textual information from a sequence of images (video) is the main objective of video segmentation. To extract and search important information in a huge amount of video data, we focus on the extraction of text from video. However, variations in text style, font, size, orientation and alignment, as well as low image contrast and complex backgrounds, make automatic text extraction extremely difficult and challenging. A large number of techniques have been proposed to address this problem; the purpose of this paper is to design algorithms for each phase of extracting text from a video using Java libraries and classes. First, we split the input video, which may be a real-time stream or a video from the database, into a stream of images using the Java Media Framework (JMF), considering the connected component analysis form. We apply preprocessing algorithms to convert each shot frame to gray scale and to remove disturbances such as lines superimposed over the text, discontinuities and dots. We then continue with the algorithms for localization, segmentation, tracking and recognition.

2024, ijcset.net

This paper describes a method of text extraction from visiting card images with fanciful designs containing complicated color backgrounds and reverse-contrast regions. The proposed method is based on edge detection followed by grouping of edges based on size, orientation and color attributes. After detecting the edges of the text on the card, we can apply image processing steps such as noise removal and segmentation to extract text from the complex image. This text image can then be recognized using OCR, which can be designed using an Artificial Neural Network. Post-processing such as color identification and binarization helps to obtain a pure binary text image for OCR.

2024

As digital libraries have grown, so has the need for developing more effective ways to access collections. This talk will present an overview of the CLiMB project (Computational Linguistics for Metadata Building), funded by the Mellon Foundation and currently underway at Columbia University. The goal of the project is to use computational linguistic techniques to extract metadata relevant to image collections, and thus to improve cataloging access. This research addresses the access bottleneck by applying the latest natural language processing techniques to the problem of identifying descriptive metadata. Our goal is to load our results into a database for image search, although we have not yet reached this phase of the project. This talk will report on research in CLiMB’s first phase. In addition, the talk will provide an overview of selected digital library projects at Columbia, in terms of collections, access and technology. • Mellon Technical Meeting, November 21st, 2003. Presen...

2024, researchtrend.net

License Plate Recognition systems have been developed in the past. Our objective is to design a system implemented on a standard camera-equipped mobile phone, capable of recognizing vehicle license number...

2024, Discourse Studies

This article presents recent developments in the Geneva modular and interactionist approach to discourse organization. The first section analyses the main epistemological, theoretical and methodological properties of the Geneva Model by examining its relationship to data, communicative action, complexity and discourse organization, and then outlines the Geneva Model's modular methodology. The second section of the article focuses on a text extract from a service encounter and applies some aspects of the modular methodology to the analysis of request sequences. The authors argue that requests cannot be reduced to the utterance of single speech acts but are best described as complex discourse practices linking praxeological information, conceptual knowledge and textual competence.

2024

Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. This paper describes the KDT system for Knowledge Discovery in Text, in which documents are labeled by keywords, and knowledge discovery is performed by analyzing the co-occurrence frequencies of the various keywords labeling the documents. We show how this keyword-frequency approach supports a range of KDD operations, providing a suitable foundation for knowledge discovery and exploration for collections of unstructured text. Keywords: data mining, text mining, text categorization, distribution comparison, trend analysis
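
The keyword co-occurrence idea can be sketched in a few lines. The code below counts how often each pair of keyword labels appears on the same document and derives a simple conditional proportion from those counts. It is an illustration under assumed names (`cooccurrence`, `proportion`) and toy data, not the KDT system's actual interface.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count, for every keyword pair, the documents labeled with both."""
    counts = Counter()
    for keywords in docs:
        # Sort so each pair has one canonical (alphabetical) ordering.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


def proportion(docs, given, target):
    """Fraction of documents labeled `given` that are also labeled `target`."""
    with_given = [d for d in docs if given in d]
    if not with_given:
        return 0.0
    return sum(1 for d in with_given if target in d) / len(with_given)


# Hypothetical keyword labels for three documents.
docs = [{"iran", "oil"}, {"oil", "opec"}, {"iran", "oil", "opec"}]
pairs = cooccurrence(docs)
```

Comparing such proportions across keyword groups or across time periods is the kind of operation the abstract refers to as distribution comparison and trend analysis.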

2024

This paper presents a new approach to recognizing named entities in Indian languages. A phonetic matching technique is used to match strings in different languages on the basis of their similar sound. We have tested our system on a comparable corpus of English and Hindi data. The approach is language independent and requires only a set of rules appropriate for each language.

2024, HAL (Le Centre pour la Communication Scientifique Directe)

2024, … de Informática em …

The pursuit of social inclusion has driven research into tools that broaden computer use by people with special needs, such as keyboard and mouse emulators, adapted interfaces and usage accelerators...

2024, 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

Text extraction plays an important role in data processing workflows in digital libraries. For example, it is a crucial prerequisite for evaluating the quality of migrated textual documents. Complex file formats make the extraction process error-prone and have made it very challenging to verify the correctness of extraction components. Based on digital preservation and information retrieval scenarios, three quality requirements in terms of the effectiveness of text extraction tools are identified: 1) is a certain text snippet correctly extracted from a document, 2) does the extracted text appear in the right order relative to other elements, and 3) is the structure of the text preserved. A number of text extraction tools are available that fulfil these three quality requirements to various degrees. However, systematic benchmarks to evaluate those tools are still missing, mainly due to the lack of datasets with accompanying ground truth. The contribution of this paper is twofold. First, we describe a dataset generation method based on model-driven engineering principles and use it to synthesize a dataset and its ground truth directly from a model. Second, we define a benchmark for text extraction tools and complete an experiment to calculate performance measures for several tools that cover the three quality requirements. The results demonstrate the benefits of the approach in terms of scalability and effectiveness in generating ground truth for content and structure of text elements.
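
The first two quality requirements lend themselves to very simple automated checks once ground-truth snippets are available. The sketch below is one hypothetical way to phrase them; the function names and the exact-substring matching are assumptions for illustration and are not the performance measures defined in the paper.

```python
def snippet_extracted(extracted, snippet):
    """Requirement 1: does the ground-truth snippet appear in the output?"""
    return snippet in extracted


def order_preserved(extracted, snippets):
    """Requirement 2: do the snippets appear in the given order?

    Each snippet must be found after the end of the previous one.
    """
    pos = 0
    for s in snippets:
        idx = extracted.find(s, pos)
        if idx == -1:
            return False
        pos = idx + len(s)
    return True


output = "Title Introduction Body text Footer"
```

Exact matching is strict: a tool that changes whitespace or hyphenation would fail these checks, so a practical benchmark would likely normalize the text first.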

2024, 2006 Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution (AXMEDIS'06)

This paper describes the part of the European PrestoSpace project dedicated to the study and development of a Metadata Access and Delivery (MAD) system for television broadcast archives. The mission of the MAD system, inside the wider perspective of the PrestoSpace factory, is to generate, validate and deliver to the archive users metadata created through the employment of both automatic and manual information extraction tools. Automatic tools include audiovisual content analysis and semantic analysis of text extracted by automatic speech recognition (ASR). The MAD publication platform provides access and search facilities to the imported and newly produced metadata in a synergic and easy-to-use interface.

2024, Lecture Notes in Computer Science

In this paper, a new algorithm for traffic sign recognition is presented. It is based on a shape detection algorithm that classifies the shape of the content of a sign using the capabilities of a Support Vector Machine (SVM). Basically, the algorithm extracts the shape inside a traffic sign, computes the projection of this shape and classifies it into one of the shapes previously trained with the SVM. The most important advantages of the algorithm are its robustness against image rotation and scaling due to camera projections, and its good performance on images with different levels of illumination. This work is part of a traffic sign detection and recognition system; in this paper we focus solely on the recognition step.

2024, Journal of Computer Science

Problem statement: The data entry form is a convenient and successful tool for collecting information, filled in by hand with a pen. One of the most important fields in these forms is the data filled boxes. Extracting the handwriting from data entry forms is important for many purposes, such as documenting and archiving. The extraction process is also important when the handwriting must be passed to a handwriting recognition process. Approach: A simple and effective approach is presented to extract handwritten characters, including digits and letters of any language, from the data filled boxes of a data entry form and to deal with overlaps between the handwritten characters and the boxes' lines. The proposed approach is based on line shape characteristics: it detects and removes the vertical and horizontal straight lines of the boxes while preserving the curved lines that represent the handwritten characters. The problem of handwritten characters overlapping the boxes' lines is solved using morphological dilation to reconstruct the broken characters after the boxes' lines have been removed. Results: Experimental results demonstrate that the proposed approach can extract handwriting from data filled boxes with an overall rate of 94.052% on a collection of 150 forms. Conclusion: The proposed algorithm has been successfully implemented and tested, and achieves the objective of extracting handwriting of any language from data filled boxes. However, this work cannot deal with situations in which characters touch neighbouring characters.
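
The two steps described in the approach, removing straight box lines and then repairing the strokes they cut through, can be sketched on a small binary image. In the code below, a "line" is any horizontal or vertical run of foreground pixels at least `min_len` long, and repair uses a 3x3 dilation. The run-length criterion, the structuring element and all function names are assumptions for illustration; the paper's own detection rules may differ.

```python
def long_run_mask(line, min_len):
    """Mark positions that belong to a run of 1s at least min_len long."""
    mask = [False] * len(line)
    start = None
    for i, v in enumerate(list(line) + [0]):  # sentinel closes a trailing run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_len:
                for k in range(start, i):
                    mask[k] = True
            start = None
    return mask


def remove_box_lines(img, min_len):
    """Clear long horizontal and vertical runs; shorter strokes survive."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x, hit in enumerate(long_run_mask(img[y], min_len)):
            if hit:
                out[y][x] = 0
    for x in range(w):
        column = [img[y][x] for y in range(h)]
        for y, hit in enumerate(long_run_mask(column, min_len)):
            if hit:
                out[y][x] = 0
    return out


def dilate(img):
    """3x3 binary dilation: a pixel is set if any neighbour (or itself) is set."""
    h, w = len(img), len(img[0])
    return [[int(any(img[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]


# A 5x5 box line (row 2) crossed by a short vertical pen stroke (column 2).
form = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
cleaned = remove_box_lines(form, min_len=5)  # stroke is now broken at row 2
repaired = dilate(cleaned)                   # gap at (2, 2) is filled again
```

Note that dilation thickens every stroke, not only the broken ones, so the structuring element has to be kept small relative to the pen width.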

2024, Zenodo (CERN European Organization for Nuclear Research)

Text data present in multimedia contains useful information for automatic annotation and indexing. The extracted information is used to recognize overlay or scene text in a given video or image, and the extracted text can be used to retrieve videos and images. In this paper, we first discuss the different techniques for text extraction from images and videos. Second, we review the techniques for indexing and retrieving images and videos using the extracted text.

2024

This paper investigates the use of a language independent model for named entity recognition based on iterative learning in a co-training fashion, using word-internal and contextual information as independent evidence sources. Its bootstrapping process begins with only seed entities and seed contexts extracted from the provided annotated corpus. F-measure exceeds 77 in Spanish and 72 in Dutch.

2024, proceeding of the 6th conference on Natural language learning - COLING-02

2024, International Business Management

The study considers the main text analytics programs and their functions. The researchers studied the functions of 54 leading text analysis software packages and identified the areas of marketing research in which they can be applied. The authors offer marketing (customer) research methods based on text analytics tools. In this study, we found that automated text mining and analysis is especially useful for researching the decision-making process of target customers, which is not the same as their psychological portrait. Studying the decision-making process requires a specific research design based on combining and adjusting different text analytics tools. Understanding customers' decision-making process is relevant for individual work with clients, setting up contextual advertising and e-mail customization, and designing products, proposals and marketing campaigns for targeted customer groups.

2024, 2009 10th International Conference on Document Analysis and Recognition

Text within a camera-grabbed image can contain a huge amount of metadata about the scene. Such metadata can be useful for identification, indexing and retrieval purposes. Detection of coloured scene text is a new challenge for all camera-based images. A common problem for text extraction from camera-based images is the lack of prior knowledge of text features such as colour, font, size and orientation. In this paper we propose a new algorithm for the extraction of text from an image which can overcome these problems. In addition, problems due to an unconstrained complex background in the scene have also been addressed. Here a new technique is applied to determine the discrete edges around the text boundaries. A novel methodology is also proposed to extract the text by exploiting its appearance in terms of colour and spatial distribution.

2024, Proceedings of the Second Workshop on Computational Approaches to Code Switching

We address the problem of Part of Speech (POS) tagging in the context of linguistic code switching (CS). CS is the phenomenon where a speaker switches between two languages or variants of the same language within or across utterances, known as intra-sentential or inter-sentential CS, respectively. Processing CS data is especially challenging in the intra-sentential case, since state-of-the-art monolingual NLP technology is geared toward processing one language at a time. In this paper we explore multiple strategies for applying state-of-the-art POS taggers to CS data. We investigate the landscape in two CS language pairs, Spanish-English and Modern Standard Arabic-Arabic dialects. We compare the use of two POS taggers vs. a unified tagger trained on CS data. Our results show that applying a machine learning framework using two state-of-the-art POS taggers achieves better performance than all the other approaches we investigate.

2024

The modern era has seen enormous growth in media information in the form of audio, video and image data. Retrieval and indexing of content-oriented video has become an intriguing research area with the enormous growth in the production of digital mass media. Beyond the audio and visual information, text appearing in videos can serve as an effective tool for semantic abstraction, video analysis and retrieval of video data. An efficient algorithm and high-quality news videos are required to accomplish this task. This paper proposes a system based on gray-scale edge features for the localization and extraction of evenly arranged English ticker text from news videos. The framework exploits edge-based localization of text regions to extract textual material from videos. For low-quality videos, contrast enhancement operations are first applied to the video frames, and morphological operators are then applied to segment out the ticker text regions. Finally, these regions are cropped from the video frames and, if they satisfy certain geometrical constraints, are accepted as text regions. No assumptions are made about the ticker color, font style, text size or ticker type, because no standard ticker format exists: news channels in different countries use different ticker formats and colors. The proposed algorithm is evaluated on a data set of CNN and BBC news videos and shows promising results.

2024

Decision making in a real-world domain like logistics is challenging for an autonomous technical system like a software agent. In this paper the problem of planning in such an environment is addressed. Classical planning and probabilistic criteria-directed scheduling components are tied together by a metalevel control and supplemented by a sophisticated world model and a risk management module to form a plan-based decision support system for autonomous control of logistic entities. The system is designed to be integrated in a multi-agent-based simulation for evaluation and will later be used to support autonomous decision making in real-world logistic domains.

2024, Seventh IEEE/ACIS International Conference on Computer and Information Science (ICIS 2008)

Texts in web pages, images and videos contain important clues for information indexing and retrieval. Most existing text extraction methods depend on the language type and text appearance. In this paper, a novel and universal method of image text extraction is proposed. A coarse-to-fine text location method is implemented. Firstly, a multi-scale approach is adopted to locate texts with different font sizes. Secondly, projection profiles are used in the location refinement step. Color-based k-means clustering is adopted for text segmentation. Compared with the grayscale images used in most existing methods, color images are more suitable for clustering-based segmentation. The method treats corner points, edge points and other points equally, which solves the problem of handling multilingual text. Experimental results demonstrate that the best performance is obtained when k is 3. Comparative experimental results on a large number of images show that our method is accurate and robust in various conditions.
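
As a rough illustration of color-based k-means segmentation with k = 3, the sketch below clusters RGB tuples in plain Python. The choice of RGB space, squared Euclidean distance and the deterministic brightness-ordered seeding are assumptions; the abstract does not specify them, and `kmeans_colors` is a hypothetical name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two color tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_colors(pixels, k=3, iterations=20):
    """Cluster RGB tuples into k groups (k >= 2); returns (centers, labels)."""
    # Assumed seeding: spread the initial centers over the brightness ordering.
    ordered = sorted(pixels, key=sum)
    centers = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: each pixel goes to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in pixels]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(pixels, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(ch) / len(members) for ch in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

With k = 3, one plausible reading is that the clusters capture text, background and a transition or noise color, although the abstract does not say how the clusters are interpreted.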

2024, International Journal of Advanced Science and Technology

Most of the image retrieval systems published and implemented have focused on basic features like color, shape and texture of an image, with little or no consideration of the text regions the image contains. In this work, we have developed an image retrieval system that finds similar composite images containing graphical shapes as well as text in a database of thousands of images. We propose a novel method for text localization and extraction followed by detection, and demonstrate that it outperforms commercial OCR tools. The significant feature of this work is its invariance to font size, design and text region orientation, and its ability to give accurate results even in the presence of complex backgrounds and graphical elements. The methodology has been tested on English text but is capable of handling any other language.

2024, Proceedings of the 2011 International Conference on Electrical Engineering and Informatics

This research is conducted to accommodate the needs of visually impaired people through an intelligent system that reads textual information on paper and produces the corresponding voice output. The Indonesian Automated Document Reader (I-ADR) is operated via a voice-based user interface to scan a document page. Textual information from the scanned page is then extracted using Optical Character Recognition (OCR) techniques. A user can then choose to have the system read the whole page or listen to a summary of the information on the page. SIDoBI (Sistem Ikhtisar Dokumen untuk Bahasa Indonesia) is integrated into the system to provide the summarization feature. The result of either the whole-page reading or the summarization is converted to speech through a text-to-speech synthesizer. The whole system is developed under the Free Open Source Software policy and will be distributed openly to all users in need without any cost. This paper focuses on the text segmentation algorithm implemented in I-ADR to extract text from documents with complex layouts. We implemented the I-ADR text segmentation module using Enhanced CRLA and propose an improved algorithm for text extraction. Evaluation of the proposed system with various page layouts showed promising results.
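
For readers unfamiliar with run-length methods, the following sketch shows basic run-length smoothing on a binary page image, assuming that "CRLA" in this abstract refers to the constrained run-length family of algorithms. It is not the Enhanced CRLA of the paper, whose details the abstract does not give; the function names and gap parameters are hypothetical.

```python
def smooth_row(row, max_gap):
    """Fill background runs (0s) of length <= max_gap lying between two foreground pixels (1s)."""
    out = list(row)
    last_fg = None
    for x, value in enumerate(row):
        if value == 1:
            if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                for i in range(last_fg + 1, x):
                    out[i] = 1
            last_fg = x
    return out


def smooth_horizontal(image, max_gap):
    """Apply run-length smoothing to every row."""
    return [smooth_row(row, max_gap) for row in image]


def smooth_vertical(image, max_gap):
    """Apply run-length smoothing to every column by transposing twice."""
    columns = [smooth_row(list(col), max_gap) for col in zip(*image)]
    return [list(row) for row in zip(*columns)]


def rlsa(image, h_gap, v_gap):
    """Combine horizontal and vertical smoothing with a pixel-wise AND."""
    horizontal = smooth_horizontal(image, h_gap)
    vertical = smooth_vertical(image, v_gap)
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(horizontal, vertical)]
```

Smoothing merges characters into word and line blobs, which can then be grouped into text blocks; the gap limits control how aggressively neighbouring marks are joined.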

2023, International Journal of Advanced Research in Computer Science and Software Engineering

Character recognition of both typed and handwritten characters still has a long way to go in terms of research. Although significant success has been achieved for typewritten characters, handwritten character recognition has yet to reach an appreciable level. Most of the methods proposed in this regard have huge computational complexity. This review provides an in-depth survey of OCR methods, covering segmentation, classification and recognition of characters independent of size and texture. It also summarizes the literature and provides a comparative analysis of various OCR techniques.

2023, 2009 10th International Conference on Document Analysis and Recognition

With the increasing popularity of digital cameras attached to various handheld devices, many new computational challenges have gained significance. One such problem is the extraction of text from natural scene images captured by such devices. The extracted text can be sent to an OCR engine for recognition or to a text-to-speech engine. In this article, we propose a novel and effective scheme based on analysis of connected components for extraction of Devanagari and Bangla texts from camera-captured scene images. A common distinctive feature of these two scripts is the presence of a headline, and the proposed scheme uses mathematical morphology operations for its extraction. Additionally, we consider a few criteria for robust filtering of text components from such scene images. Moreover, we studied the problem of binarization of such scene images and observed that there are situations when repeated binarization by a well-known global thresholding approach is effective. We tested our algorithm on a repository of 100 scene images containing texts of Devanagari and/or Bangla
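
The connected-component analysis mentioned here can be sketched as follows. The labeling is a standard breadth-first search with 8-connectivity; the headline check is a simple row-coverage heuristic substituted for the paper's mathematical morphology operations, and its parameters (`min_width_ratio`, `min_width`) are assumptions.

```python
from collections import deque


def connected_components(image):
    """Return components of a binary image as lists of (row, col) pixels (8-connectivity)."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if image[r][c] != 1 or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def has_headline(pixels, min_width_ratio=0.8, min_width=3):
    """Heuristic: a row in the upper half of the component covers most of its width."""
    rows = [y for y, _ in pixels]
    cols = [x for _, x in pixels]
    top, bottom = min(rows), max(rows)
    width = max(cols) - min(cols) + 1
    if width < min_width:
        return False  # too narrow to carry a headline stroke
    counts = {}
    for y in rows:
        counts[y] = counts.get(y, 0) + 1
    upper_limit = top + (bottom - top) / 2
    return any(n >= min_width_ratio * width
               for y, n in counts.items() if y <= upper_limit)
```

A component that passes the headline check is a candidate Devanagari or Bangla word; the additional filtering criteria mentioned in the abstract are not modeled here.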

2023, … de Informática em …

2023, Journal of Signal and Information Processing

Many text extraction methodologies have been proposed, but none of them is suitable to be part of a real system implemented on a device with low computational resources, either because their accuracy is insufficient or because they are too slow. We therefore propose a text extraction algorithm for language translation of scene text images on mobile phones that is both fast and accurate. The algorithm uses very efficient computations to calculate the Principal Color Components of a previously quantized image and decides which ones are the main foreground and background colors, after which it extracts the text in the image. We have compared our algorithm with other algorithms using commercial OCR, achieving accuracy rates more than 12% higher and running twice as fast. Our methodology is also more robust against common degradations such as uneven illumination or blurring. Thus, we developed a very attractive system that accurately separates foreground and background in scene text images on devices with low computational resources.
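
A simplified illustration of picking foreground and background colors from a quantized image is given below. It replaces the paper's Principal Color Components computation, which the abstract does not detail, with a plain histogram of quantized colors, and it assumes that the most frequent color is the background and the second most frequent is the text. All names and the `levels` parameter are hypothetical.

```python
from collections import Counter


def quantize(pixel, levels=4):
    """Map each 0-255 channel to one of `levels` bins."""
    step = 256 // levels
    return tuple(min(channel // step, levels - 1) for channel in pixel)


def dominant_colors(pixels, levels=4, top=2):
    """Return the `top` most frequent quantized colors, most frequent first."""
    histogram = Counter(quantize(p, levels) for p in pixels)
    return [color for color, _ in histogram.most_common(top)]


def extract_foreground(pixels, levels=4):
    """Binary mask: 1 where a pixel falls in the assumed foreground color bin."""
    ranked = dominant_colors(pixels, levels)
    if len(ranked) < 2:
        return [0] * len(pixels)  # a single color: nothing to separate
    foreground = ranked[1]        # assumption: second most frequent color is text
    return [1 if quantize(p, levels) == foreground else 0 for p in pixels]
```

Working on a handful of quantized bins keeps the cost to one pass over the pixels, which is in the spirit of the low-resource setting the abstract targets.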

2023, 2021 13th International Conference on Knowledge and Systems Engineering (KSE)

The number of lecture videos is growing rapidly due to the popularity of massive open online courses in academic institutions. Thus, an efficient method for lecture video retrieval in various languages is needed. In this paper, we propose an approach for automated lecture video indexing and retrieval. First, the lecture video is segmented into keyframes in such a way that duplication among these frames is minimal. The textual information embedded in each keyframe is then extracted. We treat this as a text detection and recognition problem. Text detection is handled by our segmentation network, for which we propose a binarization approach for optimizing text locations in an image. For text recognition, we take advantage of VietOCR to recognize both English and Vietnamese text. Lastly, we integrate a vector-based semantic search in ElasticSearch to enhance lecture video search. The experimental results show that our approach achieves high performance in detecting and recognizing text content in both English and Vietnamese, and improves the speed and accuracy of lecture video retrieval.
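
The keyframe segmentation step can be pictured with the minimal sketch below, which keeps a frame only when it differs enough from the last kept keyframe. The mean absolute difference measure and the `threshold` value are assumptions for illustration; the abstract does not state how duplication is measured.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized flattened grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames, threshold=10.0):
    """Return indices of frames that differ from the previously kept keyframe.

    `threshold` is an illustrative value, not one reported in the paper.
    """
    keyframes = []
    for i, frame in enumerate(frames):
        if not keyframes or mean_abs_diff(frame, frames[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes
```

Comparing against the last kept keyframe rather than the immediately preceding frame avoids keeping near-duplicates of a slide that changes only through slow drift or noise.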