Document processing Research Papers - Academia.edu

2025

In this work we present a methodology for the annotation of Attribution Relations (ARs) in speech, which we apply to create a pilot corpus of spoken informal dialogues. This represents the first step towards the creation of a resource for the analysis of ARs in speech and the development of automatic extraction systems. Despite its relevance for speech recognition systems and spoken language understanding, the relation holding between quotations and opinions and their sources has so far been studied and extracted only in written corpora, characterized by a formal register (news, literature, scientific articles). The shift to the informal register and to a spoken corpus widens our view of this relation and poses new challenges. Our hypothesis is that the decreased reliability, in the fragmented structure of speech, of the linguistic cues found for written corpora can be overcome by including prosodic clues in the system. The analysis of SARC confirms this hypothesis, showing the crucial role played by the acoustic level in providing the missing lexical clues.

2025, World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering

In this paper, an efficient structural approach for recognizing on-line handwritten digits is proposed. After reading the digit from the user, the slope is estimated and normalized for adjacent nodes. Based on the changes in sign of the slope values, the primitives are identified and extracted. The names of these primitives are represented as strings, and a finite state machine, which contains the grammars of the digits, is then traced to identify the digit. Finally, any remaining ambiguity is resolved. Experiments showed that this technique is flexible and can achieve high recognition accuracy for the shapes of the digits represented in this work.
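
As a rough illustration of the slope-sign idea, assuming a stroke is given as a list of (x, y) nodes (the point format, the primitive labels, and the run-collapsing rule below are our own simplifications, not the paper's):

```python
def slope_sign_primitives(points):
    """Split a stroke into primitives wherever the slope sign changes.

    points: list of (x, y) nodes sampled along the stroke.
    Returns a string of primitive labels: 'U' for rising, 'D' for falling,
    'F' for flat segments (labels are illustrative only).
    """
    def sign(a, b):
        dy = b[1] - a[1]
        return 'U' if dy > 0 else 'D' if dy < 0 else 'F'

    labels = [sign(points[i], points[i + 1]) for i in range(len(points) - 1)]
    # Collapse runs of the same sign into a single primitive.
    primitives = []
    for s in labels:
        if not primitives or primitives[-1] != s:
            primitives.append(s)
    return ''.join(primitives)

# A V-like stroke: falls, then rises -> two primitives.
print(slope_sign_primitives([(0, 2), (1, 1), (2, 0), (3, 1), (4, 2)]))  # DU
```

The resulting primitive string would then be fed to a finite state machine encoding each digit's grammar.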

2025, Informatica (lithuanian Academy of Sciences)

Recognition of On-line Handwritten Arabic Digits Using Structural Features and Transition Network

2025, International Journal for Research in Applied Science & Engineering Technology (IJRASET)

The need for a versatile mobile and web application that integrates document and multimedia conversion and accessibility features has increased due to advances in digital technologies. This research presents the design and development of a cross-platform mobile application built using Flutter, utilizing cutting-edge technologies to meet various users' requirements. The application includes a set of tools such as PDF-to-Word, Word-to-PDF, and PPT-to-PDF converters, image-to-PDF and XLS-to-PDF conversion, PDF merging, image-based text extraction through Optical Character Recognition (OCR), speech-to-text, and video summarization through YouTube link parsing. The application also offers access to several AI tools needed for different tasks through a single search interface, enabling efficient discovery and use of AI-based features by users. The app has a scalable architecture that utilizes Flutter's responsive design capabilities to ensure optimal usability on browser, tablet, and mobile devices. The Provider state management solution provides efficient state management for seamless navigation and interaction. Integrating AI-driven capabilities, such as summarization and OCR, enriches the user experience by offering precision and automation in data processing. This paper outlines the technical realization, architectural design, and development issues.

2025, International Journal for Research in Applied Science & Engineering Technology (IJRASET)

In an era where synthetic media is becoming increasingly sophisticated, this project introduces an advanced AI-powered solution designed to detect deepfake content in both images and videos. Deepfakes, media that has been digitally altered or artificially created using machine learning techniques, pose growing threats by facilitating the spread of misinformation, fabricating news content, and infringing on individual privacy. As these manipulated visuals become more convincing and widespread, the need for reliable detection methods becomes more urgent. To address this issue, the system leverages two state-of-the-art artificial intelligence models. For analyzing static visuals, it utilizes YOLOv8 (You Only Look Once, version 8), a model renowned for its real-time object detection capabilities, blending high speed and accuracy. YOLOv8 excels in scrutinizing image content to flag potential signs of tampering or fabrication. For video-based analysis, the system incorporates the ViViT (Video Vision Transformer) model. ViViT is designed to interpret not only the spatial characteristics within individual frames but also the temporal relationships between frames, enabling robust detection of manipulated video sequences. A user-friendly web interface built with the Flask framework in Python serves as the front end of the system. Users can upload media files, either images or videos, through the interface for authenticity evaluation. The system processes the input and displays the outcome along with a confidence score, indicating how certain the model is about its classification. The ultimate goal of this initiative is to provide an effective and easy-to-use platform that empowers users to authenticate digital media. As deepfake technology continues to evolve, especially across social media and digital journalism, such tools are essential for preserving the trustworthiness of visual information.
Future enhancements may include support for detecting synthetic audio and implementing real-time detection for live video streams, broadening the system's scope in combating digital disinformation.
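
As an illustration of how per-frame model outputs might be combined into a video-level verdict with a confidence score (the mean-score aggregation rule and threshold below are stand-ins, not the system's actual method):

```python
def video_verdict(frame_scores, threshold=0.5):
    """Aggregate per-frame fake probabilities into a video-level verdict.

    frame_scores: floats in [0, 1], one per analyzed frame (stand-ins
    for a ViViT-style model's outputs). Returns (label, confidence).
    """
    mean = sum(frame_scores) / len(frame_scores)
    if mean >= threshold:
        return "fake", mean
    return "real", 1.0 - mean

label, conf = video_verdict([0.9, 0.8, 0.85, 0.7])
print(label, round(conf, 2))  # fake 0.81
```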

2025, HAL (Le Centre pour la Communication Scientifique Directe)

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

2025, Pattern Recognition

This paper deals with the problem of off-line handwritten text recognition. It presents a text recognition system that exploits an original principle of adaptation to the handwriting to be recognized. The adaptation principle is based on the automatic learning, during recognition, of the graphical characteristics of the handwriting. This on-line adaptation of the recognition system relies on the iteration of two steps: a word recognition step that labels the writer's representations (allographs) over the whole text, and a re-evaluation step for the character models. Tests carried out on a sample of 15 writers, all unknown to the system, show the value of the proposed adaptation scheme, since recognition rates improve over the iterations at both the letter and word levels.
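
The alternation of the two steps can be caricatured with one-dimensional "features" and nearest-mean character models; everything about this toy (the feature space, the mean-update rule) is our simplification, not the paper's allograph model:

```python
def adapt(models, samples, iterations=3):
    """Toy recognize/re-estimate loop: label each sample with the nearest
    character model (1-D features for simplicity), then re-estimate each
    model as the mean of its assigned samples.
    """
    for _ in range(iterations):
        assigned = {c: [] for c in models}
        for x in samples:
            best = min(models, key=lambda c: abs(models[c] - x))
            assigned[best].append(x)
        for c, xs in assigned.items():
            if xs:
                models[c] = sum(xs) / len(xs)
    return models

models = adapt({'a': 0.0, 'b': 10.0}, [1.0, 1.5, 2.0, 8.0, 9.0, 9.5])
print(models)
```

After a few iterations the models drift toward the writer's actual letter shapes, which is the essence of the adaptation scheme.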

2025, arXiv (Cornell University)

This paper presents a cognitive typology of reuse processes and a cognitive typology of documenting processes. Empirical studies on design with reuse and on software documenting provide evidence for a generalized cognitive model. First, these studies emphasize the cyclical nature of design: cycles of planning, writing and revising occur. Second, natural language documentation follows the hierarchy of cognitive entities manipulated during design. Similarly, software reuse involves exploiting various types of knowledge depending on the phase of design in which reuse is involved. We suggest that these observations can be explained based on cognitive models of text processing: the van Dijk and Kintsch model of text comprehension, and the model of text production. Based on our generalized cognitive model, we suggest a framework for documenting reusable components.

2025

This paper describes a NEURAL NETWORK based technique for feature extraction applicable to segmentation-based word recognition systems. The proposed system extracts the geometric features of the character contour and gives a feature vector as its output. The feature vectors generated from a training set were then used to train a pattern recognition engine based on Neural Networks so that the system could be benchmarked. An attempt was made to develop a system that uses the methods humans use to perceive handwritten characters; hence, a system that recognizes handwritten characters using pattern recognition was developed. Here, the data generated by the comparison of two images was stored in Excel format, with each entry then called as an individual input for generation of

2025, International Journal of Imaging Systems and Technology

This article presents a new method for the binarization of color document images. Initially, the colors of the document image are reduced to a small number using a new color reduction technique. Specifically, this technique estimates the dominant colors and then assigns the original image colors to them so that the background and text components become uniform. Each dominant color defines a color plane in which the connected components (CCs) are extracted. Next, in each color plane a CC filtering procedure is applied, followed by a grouping procedure. At the end of this stage, blocks of CCs are constructed, which are then redefined by obtaining the direction of connection (DOC) property for each CC. Using the DOC property, the blocks of CCs are classified as text or non-text. The identified text blocks are binarized properly using suitable binarization techniques, considering the rest of the pixels as background. The final result is a binary image which always contains black characters on a white background, independently of the original colors of each text block. The proposed document binarization approach can also be used for binarization of noisy color (or grayscale) document images. Several experiments that confirm the effectiveness of the proposed technique are presented.
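
The color-assignment step can be sketched as nearest-dominant-color mapping in RGB space; the dominant colors below are given by hand, whereas the paper estimates them with its own color reduction technique:

```python
def assign_to_dominant(pixels, dominant):
    """Map each pixel to its nearest dominant color (squared Euclidean
    distance in RGB), so background and text regions become uniform.
    """
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [min(dominant, key=lambda d: dist2(p, d)) for p in pixels]

dominant = [(0, 0, 0), (255, 255, 255)]          # black text, white background
pixels = [(10, 12, 8), (250, 248, 252), (30, 20, 25)]
print(assign_to_dominant(pixels, dominant))
```

Each dominant color then defines a color plane from which connected components can be extracted.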

2025

The main goal of this project was to use crowdsourcing to make handwritten Malayalam family records from India publicly accessible. It involved engaging volunteers and native speakers to verify and correct computer-generated transcriptions. A small group of volunteers was trained to create a training set using web-aletheia. Ultralytics Yolo was then trained to identify important information areas, while Keras-based Connectionist Temporal Classifiers were used to generate the transcriptions. The training accuracy was measured using Yolo's mAP@0.50 and Keras' edit distance, showing promising results. The accuracies were high enough (edit distance of 6.70 on predicted transcriptions) to enable quick crowd-sourced indexing through platforms like FamilySearch.org's 'Get Involved' tab.
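
Under the mAP@0.50 criterion mentioned above, a detection counts as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5; a minimal IoU sketch (the box format and example values are ours):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping on half their width: IoU = 50 / 150.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```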

2025

We describe a novel multi-pass, multistrategy architecture for natural language processing (NLP). The commercial integrated development environment (IDE), VisualText(TM), and the associated NLP++(TM) programming language, as well as... more

We describe a novel multi-pass, multistrategy architecture for natural language processing (NLP). The commercial integrated development environment (IDE), VisualText(TM), and the associated NLP++(TM) programming language, as well as derived applications, serve to illustrate the architecture and methodology.

2025, International Journal of Computer Trends and Technology

The current decade has witnessed an explosion in the volume of documents generated by businesses, academic institutions, and other organizations. Managing, analyzing, and extracting value from this vast array of documents has become a challenge. We argue that the integration of Large Language Models (LLMs) into intelligent document processing can make significant contributions to addressing this challenge. This research aims to explore the contributions of LLMs in enhancing the various stages of the Intelligent Document Processing (IDP) workflow. Specifically, we show how LLMs can enhance each stage of the current IDP workflow offered on AWS. In the initial document classification stage of the workflow, LLMs can offer improved semantic-based and hierarchical classification of documents. However, this can introduce challenges such as overfitting, bias, and increased computational overhead. During the document extraction stage, LLMs provide benefits in terms of contextual interpretation, cross-referencing data, and data transformation. In the review & validation stage, LLMs can augment human efforts by offering automated suggestions and anomaly detection, although this can sometimes result in false alarms. In the document enrichment stage, LLMs contribute by offering contextual enrichment, better sentiment analysis, and topic modeling, but risk over-enriching data. In the data integration stage, LLMs can synthesize data for consistency, generate automated narratives, and facilitate API interactions for smoother integration. Across these different stages, the use of LLMs is, however, subject to limitations like increased computational costs, dependency on training data for specialized tasks, and latency in real-time operations.
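
The staged workflow can be sketched as a chain of document-state transformations; the stage names follow the workflow described here, but the stage bodies below are placeholders, not real LLM or AWS calls:

```python
def run_idp_pipeline(document, stages):
    """Run IDP stages (classification, extraction, review, ...) in order;
    each stage is a function from document state to document state.
    """
    for stage in stages:
        document = stage(document)
    return document

# Placeholder stages; a real system would call classification and
# extraction models here.
classify = lambda d: {**d, "class": "invoice"}
extract  = lambda d: {**d, "fields": {"total": "42.00"}}
review   = lambda d: {**d, "validated": True}

result = run_idp_pipeline({"text": "Invoice #7 total 42.00"},
                          [classify, extract, review])
print(result["class"], result["validated"])
```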

2025, Journal of the American Society for Information Science

We describe a prototype Information Retrieval system, SENTINEL, under development at Harris Corporation's Information Systems Division. SENTINEL is a fusion of multiple information retrieval technologies, integrating n-grams, a vector space model, and a neural network training rule. One of the primary advantages of SENTINEL is its 3-dimensional visualization capability, which is based fully upon the mathematical representation of information within SENTINEL. This 3-dimensional visualization capability provides users with an intuitive understanding, allowing relevance feedback and query refinement techniques to be better utilized and resulting in higher retrieval accuracy (precision).
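
One way to picture the n-gram / vector-space combination is cosine similarity over character n-gram count vectors; this sketch is illustrative and not SENTINEL's actual representation:

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=3):
    """Character n-gram counts for a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

doc = ngrams("information retrieval system")
query = ngrams("retrieval of information")
print(round(cosine(doc, query), 3))
```

Character n-grams make the matching robust to small spelling variations, which is part of their appeal in fused retrieval systems.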

2025, International Conference on Computational Linguistics

This demo presents the TextCoop platform and the Dislog language, based on logic programming, which have primarily been designed for discourse processing. The linguistic architecture and the basics of discourse analysis in TextCoop are introduced. Application demos include: argument mining in opinion texts, dialog analysis, and the analysis of procedural and requirement texts. Via prototypes in the industry, this framework has now reached the TRL5 level.

2025

DEDICATION I dedicate this work to my grandparents, Mum, Dad and everyone back home in India. This work was possible due to their love, support and enduring confidence. ACKNOWLEDGMENTS I would like to express my deep and sincere gratitude to my advisor Dr. Joshua D. Summers. His enthusiasm and encouragement helped me constantly during the course of my Master's degree at Clemson University. His understanding and guidance have made this thesis possible. I would also like to express my warm and sincere thanks to Dr. Gregory G. Mocko and Dr. John C. Ziegert for reviewing my work. I am indebted to my colleagues at the AID Lab for providing a stimulating and fun environment in which I could learn and grow. I am especially grateful to Sudhakar Teegavarapu, Srinivasan Anandan and Stuart Miller. I would also like to thank my friends both at Clemson and back home in India, particularly Madhurima Dey for providing constant encouragement. Lastly, and most importantly, I wish to thank my parents, Avinash Kanda and Balwant Singh Kanda. They raised me, supported me, taught me, and loved me. To them I dedicate this thesis.

2025

Text line segmentation is an essential pre-processing stage for handwriting recognition in many Optical Character Recognition (OCR) systems. It is an important step because inaccurately segmented text lines will cause errors in the recognition stage. Text line segmentation of handwritten documents is still one of the most complicated problems in developing a reliable OCR. The nature of handwriting makes the process of text line segmentation very challenging. Text characteristics can vary in font, size, shape, style, orientation, alignment, texture, color, contrast and background information. These variations make the process of word detection complex and difficult [2]. In the case of handwritten documents, unlike machine-printed ones, the complexity of the problem increases further, since handwritten text can vary greatly depending on the writer's skill, disposition and even cultural background. A new technique to segment a handwritten document into distinct lines of text is presented. The proposed method is robust enough to handle line fluctuation.
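
The baseline idea behind projection-based line segmentation can be sketched on a binary page; the fluctuation handling that a robust method needs for real handwriting is not shown:

```python
def segment_lines(image):
    """Split a binary page (list of rows, 1 = ink) into text lines using
    the horizontal projection profile: rows with no ink separate lines.
    Returns (start_row, end_row) pairs, inclusive.
    """
    lines, start = [], None
    for r, row in enumerate(image):
        ink = any(row)
        if ink and start is None:
            start = r
        elif not ink and start is not None:
            lines.append((start, r - 1))
            start = None
    if start is not None:
        lines.append((start, len(image) - 1))
    return lines

page = [[0, 0, 0],
        [0, 1, 1],
        [1, 1, 0],
        [0, 0, 0],
        [0, 1, 0]]
print(segment_lines(page))  # [(1, 2), (4, 4)]
```

Handwritten lines that touch or undulate leave no empty rows between them, which is exactly where this baseline breaks down and dedicated techniques are needed.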

2025, International Journal

Abstract: There are immense efforts to design a complete OCR for most of the world's leading languages, however, multilingual documents either of handwritten or of printed form. As a united attempt, Unicode based OCRs were studied mostly with some positive ...

2025

Optical Character Recognition (OCR) is the process of converting image text or handwritten text into machine-understandable form; simply put, OCR means recognizing characters and converting them into computer-readable form. It is widely used as a kind of data entry from original paper data sources such as banking papers or consultation papers, whether passport documents, invoices, statements, receipts, cards, mail or any number of printed records. It is a standard method of digitizing printed texts so that they can be electronically edited, searched, and stored more compactly. OCR is a field of research in Pattern Recognition, Artificial Intelligence and Computer Vision. OCR is the electronic translation of handwritten, typewritten or printed text into machine-encoded text. It is widely used to recognize and search text from documents or to publish the text on a website. This document presents a review of Optical Character Recognition methods su...

2025, Journal of Engineering Science and Technology

Lanna characters have been popular in the northern part of Thailand as ancient characters since 1802. The segmentation of printed documents in Lanna characters poses challenging problems, such as partial overlapping of characters and touching characters. This paper focuses only on touching characters, such as touching between consonants and vowels. The segmentation method begins with a horizontal histogram and then a vertical histogram for the segmentation of text lines and characters, respectively. The results are characters consisting of correctly separated characters, partially overlapping characters, and touching characters. The proposed method computes the left edge junction points and right edge junction points, then finds their maximum numbers and the value of the corresponding row to separate a touching consonant and vowel. The trial over text documents printed in Lanna characters achieved an accuracy of 95.81%.

2025

This paper introduces HHD-Ethiopic, a new OCR dataset for historical handwritten Ethiopic script, characterized by a unique syllabic writing system, low resource availability, and complex orthographic diacritics. The dataset consists of roughly 80,000 annotated text-line images from 1,700 pages of 18th to 20th century documents, including a training set with text-line images from the 19th to 20th century and two test sets. One is distributed similarly to the training set with nearly 6,000 text-line images, and the other contains only images from the 18th century manuscripts, with around 16,000 images. The former test set allows us to check baseline performance in the classical IID setting (Independently and Identically Distributed), while the latter addresses a more realistic setting in which the test set is drawn from a different distribution than the training set (Out-Of-Distribution or OOD). Multiple annotators labeled all text-line images for the HHD-Ethiopic dataset, and an expert supervisor double-checked them. We assessed human-level recognition performance and compared it with state-of-the-art (SOTA) OCR models using the Character Error Rate (CER) and Normalized Edit Distance (NED) metrics. Our results show that the model performed comparably to human-level recognition on the 18th century test set and outperformed humans on the IID test set. However, the unique challenges posed by the Ethiopic script, such as detecting complex diacritics, still present difficulties for the models. Our baseline evaluation and HHD-Ethiopic dataset will encourage further research on Ethiopic script recognition. The dataset and source code can be accessed at .
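
The Character Error Rate used for such comparisons is edit distance normalized by reference length; a minimal sketch (the example strings are ours):

```python
def cer(prediction, reference):
    """Character Error Rate: Levenshtein distance between prediction and
    reference (insert/delete/substitute, cost 1 each), divided by the
    reference length.
    """
    m, n = len(prediction), len(reference)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i]
        for j in range(1, n + 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (prediction[i - 1] != reference[j - 1])))
        prev = cur
    return prev[n] / n

# Two substituted characters over an 11-character reference: 2/11.
print(round(cer("he1lo wor1d", "hello world"), 4))  # 0.1818
```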

2025

We establish a rigorous foundation for Relational Syntax Theory (RST), a novel mathematical framework for discrete physics based on purely syntactic principles. Central to RST is the Non-Duplication Theorem, which states that any derivable expression contains at most one occurrence of a distinguished symbol ⋆. We provide a complete mathematical formalization of derivable expressions over an admissible alphabet equipped with concatenation and enclosure operators. The proof of the Non-Duplication Theorem proceeds by structural induction and is verified using the Coq proof assistant. Our main contributions include: (1) a formal definition of derivability with explicit constraints on symbol occurrence, (2) an algorithmic Boolean parser proven equivalent to the inductive definition, (3) a rigorous proof of the Non-Duplication Theorem, and (4) the construction of a syntactic topology with a boundary operator satisfying ∂² = NIL. This establishes RST as a well-defined mathematical structure suitable for discrete formulations of physical theories. The framework's consistency is demonstrated through machine-verified proofs, and its falsifiability is established through an explicit criterion based on the Non-Duplication Theorem. These results provide a foundation for developing discrete analogues of conservation laws and symmetries without requiring pre-existing notions of space, time, or continuous parameters.
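
A toy stand-in for the Boolean parser, checking only two properties suggested by the abstract (balanced enclosure and at most one occurrence of the distinguished symbol, written '*' here); the real RST derivability rules are richer than this:

```python
def derivable(expr, alphabet="ab"):
    """Accept an expression when its parentheses (the enclosure operator)
    balance and the distinguished symbol '*' occurs at most once; reject
    any character outside the admissible alphabet.
    """
    depth = 0
    stars = 0
    for ch in expr:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:           # closing before opening
                return False
        elif ch == '*':
            stars += 1
        elif ch not in alphabet:
            return False
    return depth == 0 and stars <= 1

print(derivable("(a*(b))"))   # True
print(derivable("(a*)*"))     # False: two occurrences of '*'
```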

2025

During its first decade, TeX has been at home mainly in the academic world. Therefore it comes as a surprise to find that it has been spreading into industry during the last few years, and we try to outline some highlights of this development first. Then criteria for application areas in an industrial environment and reasons for using the structured document processing approach are discussed. It is shown what rôle TeX can play in an integrated document processing environment, and this rôle is exemplified by a case study from an application at EDS.

2025

this article. Frank Mittelbach has also had, in addition to his own published contributions, a key rôle in implementing some of the concepts described in this article.

2025, Proceedings of Conference on Computer Architectures for Machine Perception

This paper describes a new computational model for a handwritten document recognition system. It consists of a perceptive subsystem that recognizes each character image extracted from a document using a template matching method and... more

This paper describes a new computational model for a handwritten document recognition system. It consists of a perceptive subsystem that recognizes each character image extracted from a document using a template matching method, and a cognitive subsystem that recognizes a series of input character images as a sentence using semantic and syntactic knowledge. Semantic and syntactic knowledge is represented in a concept graph. The system will be realized by a parallel object-oriented model and is suitable for massively parallel processing.
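The perceptive subsystem's template matching can be sketched as normalized cross-correlation between a character image and a set of stored templates. This is an illustrative stand-in under our own naming, not the paper's implementation:

```python
import numpy as np

def match_template(image, templates):
    """Return the label of the template most similar to `image`
    under normalized cross-correlation (all arrays same shape)."""
    best_label, best_score = None, -2.0
    a = image.astype(float)
    a = (a - a.mean()) / (a.std() + 1e-9)   # zero-mean, unit-variance
    for label, t in templates.items():
        b = t.astype(float)
        b = (b - b.mean()) / (b.std() + 1e-9)
        score = float((a * b).mean())        # correlation coefficient
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A perfect match scores 1.0; unrelated shapes score near 0, so the highest-scoring template gives the recognized character.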

2025, Journal of Advanced Zoology

Optical Character Recognition (OCR) of papers has tremendous practical value given the prevalence of handwritten documents in human exchanges. A discipline known as optical character recognition makes it possible to convert many kinds of... more

Optical Character Recognition (OCR) of papers has tremendous practical value given the prevalence of handwritten documents in human exchanges. Optical character recognition makes it possible to convert many kinds of texts or photos into editable, searchable, and analysable data. In the past ten years, academics have developed systems that automatically evaluate printed and handwritten documents to convert them to electronic format. In the modern era, as demand for computer systems grew, so did the demand to convert paper text using computer vision. The need to equip computers with the capability to read text from images and videos has grown rapidly, and many software companies have stepped in to fulfil it. Handwriting recognition has been one of the active and difficult study areas in the world of pattern recognition and image processing. Among its many uses are bank checks, reading assistance for the blind, and the conversion of any handwritten document into structured text. The main aim of this paper is to create a searchable PDF from an image and to make the application easy to use and deployable both on premises and in the cloud.

2025, International Journal of Data Analysis Techniques and Strategies

Three document processing operations: comparison, categorisation, and scrutinisation are fundamental problems in any organisation that bases its business operations on electronic documents. We propose XML based architectures for documents... more

Three document processing operations: comparison, categorisation, and scrutinisation are fundamental problems in any organisation that bases its business operations on electronic documents. We propose XML-based architectures for document comparison, categorisation, and scrutinisation. We also discuss all the components used in constructing these architectures. Moreover, we implement the proposed architectures to validate the application of XML technology in document processing operations. We found motivating results from these three architectures for selected applications. Our proposed architectures have potential for solving problems such as automated resume comparison, automated matrimonial match-making, automated tender selection, automated resume scrutinisation, and so on.
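An XML-based document comparison of the kind described can be sketched with the standard library alone: extract (path, text) leaf pairs from each document and score their overlap. The similarity measure and function names here are our own illustrative choices, not the paper's architecture:

```python
import xml.etree.ElementTree as ET

def xml_similarity(doc_a, doc_b):
    """Crude structural similarity between two XML documents:
    Jaccard overlap of (element path, text) leaf pairs."""
    def leaves(xml_text):
        pairs = set()
        def walk(node, path):
            path = f"{path}/{node.tag}"
            kids = list(node)
            if not kids and node.text and node.text.strip():
                pairs.add((path, node.text.strip()))
            for child in kids:
                walk(child, path)
        walk(ET.fromstring(xml_text), "")
        return pairs
    a, b = leaves(doc_a), leaves(doc_b)
    return len(a & b) / max(len(a | b), 1)
```

For a resume-comparison application, the same score could rank candidate documents against a reference profile.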

2025

In this article we present a proposal for sequence based audiovisual markup and retrieval which can be applied to information management in television. To achieve this we analyze and study audiovisual information on TV, and from this... more

In this article we present a proposal for sequence-based audiovisual markup and retrieval which can be applied to information management in television. To achieve this we analyze and study audiovisual information on TV, and from this knowledge we generate a vocabulary for markup which might ease and improve information retrieval, providing direct access to sequences according to parameters such as distinguishing between visualized information and referenced information. This provides a working framework better adapted to the domain of audiovisual documentation.

2025

The development of an ontology for information control is proposed to manage visual images of and references to individuals in information management systems of television channels. Research has established precedents for the use of... more

The development of an ontology for information control is proposed to manage visual images of and references to individuals in information management systems of television channels. Research has established precedents for the use of ontologies in the audiovisual field, providing a baseline for measuring and understanding ongoing developments. This practical approach offers a quick, straightforward solution to onomastic control in information retrieval, permitting control and development of relationships between individuals covered in the media. This creates a network of television personalities we can draw upon to improve the outcomes of retrieval processes.

2025, Lecture Notes in Computer Science

In this paper we present a new algorithm for document clustering called Condensed Star (ACONS). This algorithm is a natural evolution of the Star algorithm proposed by Aslam et al., and improved by them and other researchers. In this... more

In this paper we present a new algorithm for document clustering called Condensed Star (ACONS). This algorithm is a natural evolution of the Star algorithm proposed by Aslam et al., and improved by them and other researchers. In this method, we introduced a new concept of star allowing a different star-shaped form; in this way we retain the strengths of previous algorithms as well as address previous shortcomings. The evaluation experiments on standard document collections show that the proposed algorithm outperforms previously defined methods and obtains a smaller number of clusters. Since the ACONS algorithm is relatively simple to implement and is also efficient, we advocate its use for tasks that require clustering, such as information organization, browsing, topic tracking, and new topic detection.
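The star-shaped clustering idea can be sketched as a greedy procedure on a thresholded similarity graph. The sketch below illustrates the classic Star algorithm of Aslam et al. that ACONS builds on, not the condensed variant itself, and the threshold is an illustrative parameter:

```python
import numpy as np

def star_clustering(sim, threshold):
    """Greedy star clustering on a similarity matrix: repeatedly pick
    the unassigned vertex with the most unassigned neighbours above
    `threshold` as a star centre and attach those neighbours as
    satellites, until every document belongs to a cluster."""
    n = sim.shape[0]
    adj = (sim >= threshold) & ~np.eye(n, dtype=bool)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # degree restricted to still-unassigned vertices
        best = max(unassigned,
                   key=lambda v: sum(1 for u in unassigned if adj[v, u]))
        members = {best} | {u for u in unassigned if adj[best, u]}
        clusters.append(sorted(members))
        unassigned -= members
    return clusters
```

With document vectors, `sim` would typically be a cosine-similarity matrix; the threshold controls how many clusters emerge.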

2025, Journal ijetrm

The objective of this project is to develop an advanced artificial intelligence (AI) system designed to accurately identify digits on bank cheques and transaction slips, including both withdrawal and deposit slips. This digit... more

The objective of this project is to develop an advanced artificial intelligence (AI) system designed to accurately identify digits on bank cheques and transaction slips, including both withdrawal and deposit slips. This digit identification system is a critical component in automating and streamlining the processing of various banking documents, which traditionally require significant manual handling and are prone to human error. The AI system leverages state-of-the-art image processing techniques and sophisticated machine learning algorithms to detect and recognize digits with a high degree of accuracy and efficiency. These technologies are instrumental in ensuring that the system can handle the variability and complexity of handwritten and printed digits on financial documents. Python has been selected as the programming language for this project due to its extensive libraries, such as OpenCV for image processing and TensorFlow/Keras for machine learning, which facilitate the implementation of the necessary functionalities. Python's ease of use and robust community support further enhance the development process, allowing for rapid prototyping and deployment. By automating the digit extraction process, this AI system significantly enhances the speed and reliability of processing financial documents. It reduces the likelihood of manual errors, increases operational efficiency, and cuts down on processing costs. This project represents a step forward in the digitization and automation of banking operations, providing a scalable solution that can be integrated into existing financial systems to improve overall workflow and accuracy.
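The OpenCV-plus-TensorFlow/Keras pipeline the abstract describes can be illustrated with a dependency-light stand-in: a nearest-centroid classifier over flattened digit images. The class and the toy data below are hypothetical placeholders for the trained CNN the project would actually use:

```python
import numpy as np

class CentroidDigitClassifier:
    """Minimal stand-in for a trained digit model: stores one mean
    image per digit class and predicts the nearest centroid."""
    def fit(self, images, labels):
        self.labels = sorted(set(labels))
        self.centroids = np.stack([
            np.mean([im for im, y in zip(images, labels) if y == lab], axis=0)
            for lab in self.labels])
        return self

    def predict(self, image):
        # Frobenius distance from the query image to each class centroid
        d = np.linalg.norm(self.centroids - image, axis=(1, 2))
        return self.labels[int(np.argmin(d))]
```

In the full system, this classification step would sit after OpenCV preprocessing (binarization, digit segmentation) and be replaced by a CNN for production accuracy.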

2025, International Conference on Intelligent Sensing and Information Processing, 2004. Proceedings of

Reasonable success has been achieved at developing monolingual OCR systems in Indian scripts. Scientists, optimistically, have started to look beyond. Development of bilingual OCR systems and OCR systems with the capability to identify the... more

Reasonable success has been achieved at developing monolingual OCR systems in Indian scripts. Scientists, optimistically, have started to look beyond. Development of bilingual OCR systems and OCR systems with the capability to identify text areas are some of the pointers to future activities in the Indian scenario. The separation of text and non-text regions before considering the document image for OCR is an important task. In this paper, we present a biologically inspired, multichannel filtering scheme for page layout analysis. The same scheme has been used for script recognition as well. Parameter tuning is mostly done heuristically. It has also been seen to be computationally viable for commercial OCR system development.
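A multichannel filtering scheme of this kind can be sketched with a small Gabor filter bank: text regions respond strongly to channels tuned to the stroke orientation and spacing. The kernel parameters below are illustrative, not the heuristically tuned values of the paper:

```python
import numpy as np

def gabor_kernel(size, theta, wavelength, sigma):
    """Real Gabor kernel: a sinusoid at orientation `theta`
    under an isotropic Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return env * np.cos(2 * np.pi * xr / wavelength)

def channel_energies(image, thetas, wavelength=4.0, sigma=2.0, size=9):
    """Mean squared filter response per orientation channel,
    computed with FFT (circular) convolution."""
    out = []
    for theta in thetas:
        k = gabor_kernel(size, theta, wavelength, sigma)
        pad = np.zeros_like(image, dtype=float)
        pad[:size, :size] = k
        resp = np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(pad)).real
        out.append(float(np.mean(resp**2)))
    return out
```

Comparing per-block channel energies against a threshold is one simple way to separate text-like texture from non-text regions.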

2025

In this article, we investigate the use of Arabic transliteration to improve the results of a linguistic approach for aligning single and compound words from French-Arabic parallel text corpora... more

In this article, we investigate the use of Arabic transliteration to improve the results of a linguistic approach for aligning single and compound words from French-Arabic parallel text corpora. This approach uses, on the one hand, a bilingual lexicon and the linguistic features of named entities and cognates to align single words, and on the other hand, syntactic dependency relations to align compound words. We evaluated the single- and compound-word aligner incorporating Arabic transliteration using two procedures: an evaluation of alignment quality against a manually constructed reference alignment, and an evaluation of the impact of this alignment on translation quality using the statistical machine translation system Moses. The results obtained show that transliteration improves both alignment quality and translation quality.
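The transliteration-based matching of cognates and named entities can be sketched as follows. The tiny character table and the distance threshold are invented for illustration and are far simpler than a real Arabic romanization scheme:

```python
# Hypothetical, tiny Arabic-to-Latin transliteration table (illustration only).
TRANSLIT = {"م": "m", "ا": "a", "ر": "r", "س": "s", "ي": "i", "ل": "l", "ب": "b"}

def transliterate(word):
    return "".join(TRANSLIT.get(ch, "?") for ch in word)

def edit_distance(a, b):
    # Levenshtein distance with a one-row DP table.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
            prev = cur
    return dp[-1]

def align_cognates(french_words, arabic_words, max_dist=2):
    """Pair each French word with the Arabic word whose
    transliteration is closest, if within `max_dist` edits."""
    pairs = []
    for f in french_words:
        cand = min(arabic_words,
                   key=lambda a: edit_distance(f.lower(), transliterate(a)))
        if edit_distance(f.lower(), transliterate(cand)) <= max_dist:
            pairs.append((f, cand))
    return pairs
```

In a full aligner, such transliteration matches would supplement the bilingual lexicon for named entities absent from it.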

2025, International Journal of Computer Applications

Handwriting recognition is an area where many researchers have presented their work, and it remains under active research to achieve higher accuracy. In the past, collecting, storing and transmitting information in the form of... more

Handwriting recognition is an area where many researchers have presented their work, and it remains under active research to achieve higher accuracy. In the past, collecting, storing and transmitting information in the form of handwritten script was the most convenient approach, and it remains a convenient medium in the era of digital technology. As technology has advanced, tablets and many similar devices allow humans to input data in the form of handwriting. Identifying handwritten characters from a scanned image of text written on paper is known as off-line handwritten text recognition; it is a challenging area because different people have different styles of writing, and every script has its own character set and complexities. Many researchers have presented their work, and many algorithms have been proposed to recognize handwritten and printed characters. One can trace extensive work on off-line handwritten recognition for English and Arabic script. This paper presents a review of work on recognizing off-line handwritten text in various Indian language scripts, reviewing methodologies with respect to the phases of character recognition.
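One of the recognition phases such reviews cover is segmentation. A common first step, sketched here with our own function name, splits a binarized page into text lines using the horizontal projection profile:

```python
import numpy as np

def segment_lines(binary_page, min_ink=1):
    """Split a binarized page (1 = ink) into text-line bands using
    the horizontal projection profile: consecutive rows containing
    ink form a line; blank rows separate lines."""
    profile = binary_page.sum(axis=1)
    lines, start = [], None
    for row, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = row                      # line begins
        elif ink < min_ink and start is not None:
            lines.append((start, row))       # line ends
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines
```

The same idea applied column-wise within each band gives a rough character segmentation, after which feature extraction and classification follow.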

2025

Demos is a modeling environment designed to help a co-operating team design, analyze, critique and refine quantitative models for policy research. Earlier research found that readers of Demos models tended to become disoriented while... more

Demos is a modeling environment designed to help a co-operating team design, analyze, critique and refine quantitative models for policy research. Earlier research found that readers of Demos models tended to become disoriented while exploring models online. In response we have designed and implemented a graphical interface to Demos named Demaps. Demaps displays diagrams of the model structure, both dependence networks and abstraction hierarchies, to provide graphic context and direct manipulation style of interaction. We describe a study of the use of Demaps to understand and compare multiple versions of models. The study employs verbal protocol analysis to evaluate the design of Demaps and to discover expert strategies for model understanding and criticism. Subjects were able to learn to use Demaps effectively in about an hour to review and compare policy models and perform sensitivity analyses. The study describes two strategies used in reading models and suggests the desirabilit...

2025, Journal of Communications Technology and Electronics

A method for processing of graphical information is proposed. The method makes it possible to code contour images with the use of complex numbers unambiguously defined by the image shape. Mapping of a noise bitmap image onto the complex... more

A method for processing of graphical information is proposed. The method makes it possible to code contour images with the use of complex numbers unambiguously defined by the image shape. Mapping of a noise bitmap image onto the complex plane is studied. The possibility of solving such recognition problems as object identification and determination of the orientation of a figure in a plane is demonstrated.
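The general idea of coding a contour with complex numbers can be illustrated with Fourier descriptors; this is a standard construction in the same spirit as the paper, not its specific method:

```python
import numpy as np

def contour_descriptor(points, n_coeffs=8):
    """Encode a closed contour as complex numbers z = x + iy and take
    magnitudes of low-order Fourier coefficients; dropping the DC term
    and normalizing by the first harmonic makes the code invariant to
    translation, rotation, scale, and starting point."""
    z = np.asarray([complex(x, y) for x, y in points])
    coeffs = np.fft.fft(z)
    mags = np.abs(coeffs[1:n_coeffs + 1])   # skip DC (translation) term
    return mags / (mags[0] + 1e-12)         # scale normalization
```

Because rotation multiplies every coefficient by a unit phase factor and translation only shifts the DC term, matching descriptors identifies the same figure in any position or orientation, as the abstract describes.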

2025

We present the Romanian legislative corpus which is a valuable linguistic asset for the development of machine translation systems, especially for under-resourced languages. The knowledge that can be extracted from this resource is... more

We present the Romanian legislative corpus, which is a valuable linguistic asset for the development of machine translation systems, especially for under-resourced languages. The knowledge that can be extracted from this resource is necessary for a deeper understanding of how law terminology is used and how it can be made more consistent. At this moment the corpus contains more than 140k documents representing the legislative body of Romania. This corpus is processed and annotated at different levels: linguistically (tokenized, lemmatized and POS-tagged), dependency parsed, chunked, with named entities identified, and labeled with IATE terms and EUROVOC descriptors. Each annotated document has a CONLL-U Plus format consisting of 14 columns: in addition to the standard 10-column format, four other types of annotations were added. Moreover, the repository will be periodically updated as new legislative texts are published. These will be automatically collected and transmitted to the processin...
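Reading a CoNLL-U Plus token line reduces to splitting on tabs against the declared column list. The four extra column names below are our own guesses based on the annotation layers the abstract lists (chunks, named entities, IATE, EUROVOC), not the corpus's actual header:

```python
def parse_conllup_line(line, columns):
    """Parse one token line of a CoNLL-U Plus file into a dict keyed
    by the column names declared in the `# global.columns` header."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} columns, got {len(fields)}")
    return dict(zip(columns, fields))

# Standard 10 CoNLL-U columns plus four hypothetical extra annotation columns.
COLUMNS = ("ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
           "CHUNK NER IATE EUROVOC").split()
```

A full reader would also handle comment lines (starting with `#`) and blank lines separating sentences.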

2024, Integrated Network Management

The rise of cloud computing has paved the way for many new applications. Many of these new cloud applications are also multi-tenant, ensuring multiple end users can make use of the same application instance. While these technologies make... more

The rise of cloud computing has paved the way for many new applications. Many of these new cloud applications are also multi-tenant, ensuring multiple end users can make use of the same application instance. While these technologies make it possible to create many new applications, many legacy applications can also benefit from the added flexibility and cost savings of cloud computing and multi-tenancy. In this paper, we describe the steps required to migrate a .NET-based medical communications application to the Windows Azure public cloud environment, and the steps required to add multi-tenancy to the application. We then discuss the advantages and disadvantages of our migration approach. We found that the migration to the cloud itself requires only a limited amount of changes to the application, but that this also limited the benefits, as individual instances would only be partially used. Adding multi-tenancy requires more changes, but when this is done, it has the potential to greatly reduce the cost of running the application.

2024, Pattern Recognition Letters

There are many types of documents where machine-printed and hand-written texts intermixedly appear. Since the optical character recognition (OCR) methodologies for machine-printed and hand-written texts are different, to achieve optimal... more

There are many types of documents where machine-printed and hand-written texts intermixedly appear. Since the optical character recognition (OCR) methodologies for machine-printed and hand-written texts are different, to achieve optimal performance it is necessary to separate these two types of texts before feeding them to their respective OCR systems. In this paper, we present a machine-printed and hand-written text classification scheme for Bangla and Devnagari, the two most popular Indian scripts. The scheme is based on the structural and statistical features of the machine-printed and hand-written text lines. The classification scheme has an accuracy of 98.6%.

2024, Pattern Recognition Letters

When a document is fed to a scanner either mechanically or by a human operator for digitization, it suffers from some degree of skew or tilt. Skew angle detection is an important component of any Optical Character Recognition (OCR) and... more

When a document is fed to a scanner, either mechanically or by a human operator, for digitization, it suffers from some degree of skew or tilt. Skew angle detection is an important component of any Optical Character Recognition (OCR) and document analysis system. In this letter we consider skew estimation of Roman script. The method considers the lowermost and uppermost pixels of some selected characters of the text, which are then subjected to the Hough transform for skew angle detection. A fast approach is also proposed which works almost as accurately as the Hough transform. Experimental results are presented and compared with results of several other skew detection methods.
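Skew estimation in this spirit can be sketched with a projection-profile criterion: rotate the ink-pixel coordinates through candidate angles and keep the angle whose horizontal profile is most peaked. This approximates the Hough-based vote on text-line pixels rather than reproducing the letter's exact method:

```python
import numpy as np

def estimate_skew(binary_image, angles=np.linspace(-5, 5, 101)):
    """Estimate page skew in degrees: for each candidate angle,
    de-rotate ink-pixel coordinates and score how sharply the
    horizontal projection profile peaks (histogram variance)."""
    ys, xs = np.nonzero(binary_image)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        rad = np.deg2rad(a)
        # y-coordinate of each ink pixel after rotating by -a
        yr = ys * np.cos(rad) - xs * np.sin(rad)
        hist, _ = np.histogram(yr, bins=binary_image.shape[0])
        score = float(np.var(hist))
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```

At the true skew angle, pixels of each text line collapse onto a single profile row, maximizing the variance of the histogram.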

2024, 2016 International Joint Conference on Neural Networks (IJCNN)

Due to the rapid increase of different digitized documents, the development of a system to automatically retrieve document images from a large collection of structured and unstructured document images is in high demand. Many techniques... more

Due to the rapid increase of different digitized documents, the development of a system to automatically retrieve document images from a large collection of structured and unstructured document images is in high demand. Many techniques have been developed to provide an efficient and effective way for retrieving and organizing these document images in the literature. This paper provides an overview of the methods which have been applied for document image retrieval over recent years. It has been found that from a textual perspective, more attention has been paid to the feature extraction methods without using OCR.