Text Segmentation Research Papers - Academia.edu
2025, Journal of Telecommunication, Electronic and Computer Engineering
In today's digital era, most scholarly publications are made available online. These include the data of a university's research publications, which can be reached through Google Scholar. Determining the prominent research areas of a university and finding its experts is the motivation of this study. Although many people may be aware of the published articles of certain university researchers, there is little or no information on the main research areas of the universities to which those researchers belong. Thus, this study investigates how the prominent research areas can be determined by implementing the Refined Text Clustering (RTC) technique for clustering scholarly data based on the titles of publications. Then, an expert search approach can be used to determine the key players who are the experts in each research cluster. The Expert Finding System (EFS) is proposed by applying statistical analysis based on the total number of each researcher's publications and their number of c...
2025, Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop
In the context of multi-domain and multimodal online asynchronous discussion analysis, we propose an innovative strategy for the manual annotation of dialog act (DA) segments. The process aims at supporting the analysis of messages in terms of DAs. Our objective is to train a sequence labelling system to detect segment boundaries. The originality of the proposed approach is to avoid manually annotating the training data and instead exploit the human effort already devoted to message reply formatting: when replying to a message, the writer inserts each part of the response just after the quoted text it addresses. We describe the approach, propose a new electronic mail corpus, and report the evaluation of the segmentation models we built.
2025
Our applied goal is to ease access to the content of a text. We take a dynamic summarization approach that adapts to a user's needs. To this end, we extract significant terms that describe the topics and the argumentative function of utterances. The techniques for extracting these descriptors draw on statistical methods combined with linguistic heuristics. The former are identified by their relevance within a given document, the latter with respect to the corpus under study, here a scientific one. Particular attention is paid to this second technique, which notably allows us to automatically build a dictionary of domain concepts and relations.
2025, Lecture Notes in Computer Science
There are two main topics in this paper: (i) Vietnamese words are recognized and sentences are segmented into words using probabilistic models; (ii) the optimum probabilistic model is constructed by an unsupervised learning process. For each probabilistic model, new words are recognized and their syllables are linked together. The syllable-linking process improves the accuracy of the statistical functions, which in turn improves new-word recognition. Hence, the probabilistic model converges to the optimum one. Our experimental corpus is generated from about 250,000 online news articles, which consist of about 19,000,000 sentences. The accuracy of the segmentation algorithm is over 90%. Our Vietnamese word and phrase dictionary contains more than 150,000 elements.
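The syllable-linking idea described above can be sketched with a simple statistical association test: link two adjacent syllables into one word when their pair is frequent and their pointwise mutual information is high. This is an illustrative sketch, not the paper's iterative probabilistic model; the threshold, minimum count, and the toy Vietnamese-like syllables are assumptions made for demonstration.

```python
import math
from collections import Counter

def train_counts(sentences):
    """Count syllable unigrams and adjacent-syllable bigrams
    from whitespace-delimited syllable sequences."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        syls = sent.split()
        uni.update(syls)
        bi.update(zip(syls, syls[1:]))
    return uni, bi

def segment(sentence, uni, bi, threshold=1.5, min_count=2):
    """Link adjacent syllables into one word when the pair is frequent
    and its pointwise mutual information exceeds the threshold."""
    syls = sentence.split()
    total = sum(uni.values())
    total_bi = sum(bi.values())
    words, current = [], [syls[0]]
    for a, b in zip(syls, syls[1:]):
        p_ab = bi[(a, b)] / total_bi
        pmi = (math.log(p_ab / ((uni[a] / total) * (uni[b] / total)))
               if p_ab > 0 else float("-inf"))
        if bi[(a, b)] >= min_count and pmi > threshold:
            current.append(b)              # link into the current word
        else:
            words.append("_".join(current))
            current = [b]
    words.append("_".join(current))
    return words
```

On a toy corpus where "ha noi" recurs as a unit, the pair is linked while hapax neighbours are left separate.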
2025
Topic segmentation attempts to divide a document into segments, where each segment corresponds to a particular discourse topic. Lexical chains are a disambiguation tool often used for text summarization, and more recently in topic segmentation. A lexical chain encapsulates the concept of a single word (or group of closely related words) that occurs repeatedly across some portion of a document. While it might be uninteresting to attempt topic segmentation on news articles, which often revolve around just a single topic for the duration of the article, meeting conversations typically move across several topics and are more interesting to segment by topic. Some work has been done using lexical chains to perform topic segmentation on transcribed meeting corpora (Galley et al., 2003), but this work used a very simple implementation of lexical chains that only counted identical words as belonging to one lexical chain. We present here an implementation of topic segmentation on meeting text...
2025, Everyday Communication in Antiquity: Frames and Framings
This study explores the textual and visual organisation of Greek letters on papyrus. While previous scholarship has focused on cataloguing formulaic elements in epistolary texts, it has often overlooked how these elements, along with other linguistic features such as discourse particles, tense-aspect marking, and pronouns, provide cues for discourse segmentation. This contribution discusses the preliminary results of an annotation framework designed to capture these aspects more effectively and examines the correspondences between generic structure and pragmatic concepts such as 'speech act'. In the second part of the study, we identify various layout elements that contribute to the visual organisation of the texts. We preliminarily assess how sensitive writers were to the type of speech act being expressed and the ways in which visual cues were used to emphasise certain thematic blocks within the letters. This integrated analysis offers new insights into the complex interactional form of communication presented by ancient letters.
2025, Information
Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experiment...
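The core reduction above, finding the shortest sequence of lexicon n-grams covering the input, can be sketched as a shortest-path dynamic program over the unspaced string. This is an illustrative reconstruction under stated assumptions, not the authors' MiNgMatch implementation; the toy lexicon and English example stand in for the Ainu data.

```python
def min_ngram_match(text, lexicon):
    """Cover `text` with the fewest lexicon entries; each entry is stored
    as its surface string, with internal spaces marking word boundaries."""
    entries = {e.replace(" ", ""): e for e in lexicon}
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)       # best[i] = fewest entries covering text[:i]
    back = [None] * (n + 1)
    best[0] = 0
    for i in range(n):
        if best[i] == INF:
            continue
        for raw, surface in entries.items():
            if text.startswith(raw, i) and best[i] + 1 < best[i + len(raw)]:
                best[i + len(raw)] = best[i] + 1
                back[i + len(raw)] = (i, surface)
    if best[n] == INF:
        return None              # no full cover exists
    out, i = [], n
    while i > 0:
        i, surface = back[i]
        out.append(surface)
    return " ".join(reversed(out))
```

Multi-word n-grams win when they shorten the cover: "new york" + "city" (2 entries) beats "new" + "york" + "city" (3).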
2025
Word segmentation is the foremost obligatory task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Like other Asian languages such as Chinese, Thai, and Myanmar, Urdu also faces word segmentation challenges. Though the Urdu word segmentation problem is not as severe as in those languages, since space is used for word delimitation, the space is not used consistently, which gives rise to both space omission and space insertion errors in Urdu. In this paper we present a word segmentation system for handling the space omission problem in Urdu script, with application to an Urdu-Devnagri transliteration system. Instead of using manually segmented monolingual corpora to train segmenters, we make use of bilingual corpora and statistical word disambiguation techniques. Though our approach is adapted to the specific transliteration task at hand by taking the corresponding target (Hindi) language into account, the techniques suggested can be adapted to independently solve the space omission problem in Urdu word segmentation. The two major components of our system are: identification of merged words for segmentation, and proper segmentation of the merged words. The system was tested on 1.61 million words of Urdu test data. The recall and precision of the merged word recognition component were found to be 99.29% and 99.38% respectively. The words are correctly segmented with 99.15% accuracy.
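The second component above, splitting an identified merged word back into valid words, can be sketched as a dictionary-driven search that prefers the fewest pieces. This is a generic sketch, not the paper's bilingual statistical method; the Latin-alphabet tokens are hypothetical stand-ins for Urdu script.

```python
def split_merged(token, lexicon):
    """If `token` is not in the lexicon, try to split it into a sequence
    of in-lexicon words (space-omission repair); prefer fewer pieces."""
    if token in lexicon:
        return [token]
    best = None
    def search(rest, acc):
        nonlocal best
        if not rest:
            if best is None or len(acc) < len(best):
                best = list(acc)
            return
        # try longest prefixes first so greedy answers are found early
        for j in range(len(rest), 0, -1):
            if rest[:j] in lexicon:
                search(rest[j:], acc + [rest[:j]])
    search(token, [])
    return best or [token]     # fall back to the unsplit token
```

A real system would score competing splits with word statistics rather than piece count; this sketch only shows the search skeleton.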
2025, Communications in computer and information science
Word segmentation is an important prerequisite for almost all Natural Language Processing (NLP) applications. Since the word is a fundamental unit of any language, almost every NLP system first needs to segment input text into a sequence of words before further processing. In this paper, Shahmukhi word segmentation is discussed in detail. The presented word segmentation module is part of a Shahmukhi-Gurmukhi transliteration system. Shahmukhi script is usually written without short vowels, leading to ambiguity. Therefore, we have designed a novel approach to Shahmukhi word segmentation in which we use target Gurmukhi script lexical resources instead of Shahmukhi resources. We employ a combination of techniques to develop an effective algorithm, applying a syntactic analysis process that uses a Shahmukhi-Gurmukhi dictionary, writing-system rules, and statistical methods based on n-gram models.
2025, Proceedings of the Fourth Arabic Natural Language Processing Workshop
Parallel corpora available for building machine translation (MT) models for dialectal Arabic (DA) are rather limited. The scarcity of resources has prompted the use of Modern Standard Arabic (MSA) abundant resources to complement the limited dialectal resources. However, clitics often differ between MSA and DA. This paper compares morphology-aware DA word segmentation to other word segmentation approaches such as Byte Pair Encoding (BPE) and Sub-word Regularization (SR). A set of experiments conducted on Egyptian Arabic (EA), Levantine Arabic (LA), and Gulf Arabic (GA) shows that a sufficiently accurate morphology-aware segmentation used in conjunction with BPE or SR outperforms the other word segmentation approaches.
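For reference, the BPE baseline the paper compares against reduces to a short merge loop: count adjacent symbol pairs over a word-frequency table and repeatedly merge the most frequent pair. This is a textbook sketch of standard BPE, not the paper's Arabic setup; the toy English words are illustrative.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict; each word is
    represented as a tuple of symbols, initially single characters."""
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, c in vocab.items():
            for p in zip(sym, sym[1:]):
                pairs[p] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for sym, c in vocab.items():
            out, i = [], 0
            while i < len(sym):            # apply the merge everywhere
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges, vocab
```

Two merges on {"low": 5, "lower": 2} first join "l"+"o", then "lo"+"w", yielding the subword "low".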
2025, 2010 8th International Conference on Communications
We present in this article our approach to building a text-to-speech system for Romanian. The main stages of this work were: voice signal analysis, region segmentation, construction of the acoustic database, text analysis, unit and prosody detection, unit matching, concatenation, and speech synthesis. In our approach we consider word syllables as basic units and stress as indicating intra-segmental prosody. A special characteristic of the current approach is the rule-based processing of both the speech signal analysis and text analysis stages.
2025, The Proceedings of the Annual Convention of the Japanese Psychological Association
2025, Libia Justo
Cohesion is a property of the text concerning the explicit linguistic elements that link its constituents. Coherence, by contrast, does not always manifest itself in this way: the reader draws inferences from what the text provides, or from world knowledge, to reconstruct the meaning of the text. When analysing how cohesion or coherence is managed, several models are available. For coherence, there are the proposals of the Prague School and at least four complete models. As for cohesion, the taxonomy proposed by Halliday and Hasan (1976) remains the only complete model of analysis. Finally, coherence and cohesion have prompted numerous empirical studies of learner texts written in L1, L2, or a foreign language. Keywords: cohesion, coherence, models of analysis, empirical research.
2025
Automatic term extraction is the first step towards automatic or semi-automatic updating of an existing domain knowledge base. Most research applies word segmentation as a preprocessing step for Chinese term extraction. However, segmentation ambiguity is unavoidable, especially in identifying unknown words in Chinese. In this paper, we discuss the effects and limitations of segmentation on Chinese terminology extraction. A detailed study shows that propagated errors caused by word segmentation have a great impact on the results of terminology extraction. Based on our analysis and experiments, character-based terminology extraction is shown to yield much better results than extraction that uses segmentation as a preprocessing step.
2025, Proceedings of the AAAI Conference on Artificial Intelligence
The quadratic memory complexity of transformers prevents long document summarization in low computational resource scenarios. State-of-the-art models need to apply input truncation, thus discarding and ignoring potentially summary-relevant content, leading to a performance drop. Furthermore, this loss is generally destructive for semantic text analytics in high-impact domains such as the legal one. In this paper, we propose a novel semantic self-segmentation (Se3) approach for long document summarization to address the critical problems of low-resource regimes, namely to process inputs longer than the GPU memory capacity and produce accurate summaries despite the availability of only a few dozen training instances. Se3 segments a long input into semantically coherent chunks, allowing transformers to summarize very long documents without truncation by summarizing each chunk and concatenating the results. Experimental outcomes show the approach significantly improves the performanc...
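The segment-then-summarize step can be sketched as greedy chunking under a token budget, starting a new chunk when coherence with the current chunk drops. Here plain word overlap stands in for the sentence-level semantics Se3 actually uses; the budget, threshold, and example sentences are illustrative assumptions, not the paper's configuration.

```python
def chunk_document(sentences, max_tokens=100, min_overlap=0.1):
    """Greedily grow chunks: start a new chunk when the next sentence
    shares too little vocabulary with the current chunk or the token
    budget would be exceeded (word overlap stands in for embeddings)."""
    chunks, current, vocab, n_tok = [], [], set(), 0
    for sent in sentences:
        words = set(sent.lower().split())
        overlap = len(words & vocab) / len(words) if vocab else 1.0
        if current and (n_tok + len(sent.split()) > max_tokens
                        or overlap < min_overlap):
            chunks.append(" ".join(current))     # close the chunk
            current, vocab, n_tok = [], set(), 0
        current.append(sent)
        vocab |= words
        n_tok += len(sent.split())
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk would then be summarized independently and the partial summaries concatenated, which is what lets a fixed-window transformer cover an arbitrarily long input.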
2025, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g. postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. As other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of this matching strategy is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores sequencing and positioning of attribute values directly learned on demand from test data, with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report with textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
2025, Proceedings of NLPRS'01
Word segmentation is the first and obligatory task for any NLP system. For inflectional languages such as English, French, and Dutch, word boundaries are simply assumed to be whitespace or punctuation, whilst in various Asian languages, including Chinese and Vietnamese, ...
2025, 2012 20th Signal Processing and Communications Applications Conference (SIU)
Within the scope of the Ottoman Text Archive Project, a Web interface has been developed for uploading Ottoman Turkish texts and for their binarization, line and word segmentation, labelling, recognition, and testing. This interface makes it possible to draw on the expertise of researchers working with the Ottoman archives and to apply the recognition technologies we have developed to manuscript archives.
2025, arXiv (Cornell University)
Non-Māori-speaking New Zealanders (NMS) are able to segment Māori words in a highly similar way to fluent speakers. This ability is assumed to derive through the identification and extraction of statistically recurrent forms. We examine this assumption by asking how NMS segmentations compare to those produced by Morfessor, an unsupervised machine learning model that operates based on statistical recurrence, across words formed by a variety of morphological processes. Both NMS and Morfessor succeed in segmenting words formed by concatenative processes (compounding and affixation without allomorphy), but NMS also succeed for words that invoke templates (reduplication and allomorphy) and other cues to morphological structure, implying that their learning process is sensitive to more than just statistical recurrence.
2025, Journal of Cancer Education
Background-Patient navigation (PN) programs are being widely implemented to reduce disparities in cancer care for racial/ethnic minorities and the poor. However, few systematic studies cogently describe the processes of PN. Methods-We qualitatively analyzed 21 transcripts of semi-structured exit interviews with three navigators about their experiences with patients who completed a randomized trial of PN. We iteratively discussed codes/categories, reflective remarks, and ways to focus/organize data and developed rules for summarizing data. We followed a three-stage analysis model: reduction, display, and conclusion drawing/verification. We used ATLAS.ti_5.2 for text segmentation, coding, and retrieval. Results-Four categories of factors affecting cancer care outcomes emerged: patients, navigators, navigation processes, and external factors. These categories formed a preliminary conceptual framework describing ways in which PN processes influenced outcomes. Relationships between processes and outcomes were influenced by patient, navigator, and external factors. The process of PN has at its core relationship-building and instrumental assistance. An enhanced understanding of the process of PN derived from our analyses will facilitate improvement in navigators' training and rational design of new PN programs to reduce disparities in cancer-related care.
2025
Text line segmentation is an essential pre-processing stage for handwriting recognition in many Optical Character Recognition (OCR) systems. It is an important step because inaccurately segmented text lines will cause errors in the recognition stage. Text line segmentation of handwritten documents is still one of the most complicated problems in developing a reliable OCR. The nature of handwriting makes the process of text line segmentation very challenging. Text characteristics can vary in font, size, shape, style, orientation, alignment, texture, color, contrast, and background information. These variations make the process of word detection complex and difficult [2]. In the case of handwritten documents, unlike machine-printed ones, the complexity of the problem increases further, since handwritten text can vary greatly depending on the writer's skill, disposition, and even cultural background. A new technique to segment a handwritten document into distinct lines of text is presented. The proposed method is robust in handling line fluctuation.
2025, arXiv (Cornell University)
A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive training for any variation. Text-driven approaches, e.g. [12], can be easily adapted to domain and genre changes yet have difficulty matching the high f-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in [15]. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick training data preparation, converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and 1 million vectors achieves comparable and reliable results. In addition, when applied to SIGHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants with robust results. In conclusion, we discuss linguistic ramifications as well as future implications of the WBD approach.
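The boundary-as-binary-decision idea can be sketched with the simplest possible "classifier": learn, for each (left character, right character) context, the majority boundary label from pre-segmented text. The paper's WBD model uses richer contextual vectors and trained classifiers; this lookup model, and the Latin-alphabet stand-ins for Chinese characters, are illustrative simplifications.

```python
from collections import Counter, defaultdict

def train_wbd(segmented_sentences):
    """Learn a boundary decision per (left char, right char) context
    from pre-segmented text (words separated by spaces)."""
    votes = defaultdict(Counter)
    for sent in segmented_sentences:
        words = sent.split()
        text = "".join(words)
        bounds, pos = set(), 0
        for w in words[:-1]:          # boundary after each non-final word
            pos += len(w)
            bounds.add(pos)
        for i in range(1, len(text)):
            votes[(text[i - 1], text[i])][i in bounds] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in votes.items()}

def segment_wbd(text, model):
    """Insert a boundary wherever the learned context says so."""
    out = [text[0]]
    for i in range(1, len(text)):
        if model.get((text[i - 1], text[i]), False):
            out.append(" ")
        out.append(text[i])
    return "".join(out)
```

The appeal of the formulation is visible even here: every character gap in the training text yields one labeled example, so training data preparation is trivial.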
2025, Online Proceedings of the 2nd International Conference on e-Social Science, June
Abstract: Social scientists face an overload of digitized information. In particular, they must often spend inordinate amounts of time coding and analyzing transcribed speech. This paper describes a study, in the field of learning science, of the feasibility of semi-automatically coding and scoring verbal data. Transcripts from 48 individual learners, comprising 2 separate data sets of 44,000 and 23,000 words, were used as test domains for the investigation of three research questions: (1) how well can utterance-type codes ...
2025
Persons with visual impairment make up a growing segment of modern society. To cater to the special needs of these individuals, society ought to consider the design of special constructs to enable them to fulfill their daily necessities. This research proposes a new method for text extraction from indoor signage that will help persons with visual impairment maneuver in unfamiliar indoor environments, thus enhancing their independence and quality of life. In this thesis, images are acquired through a video camera mounted on the glasses of the walking person. Frames are then extracted and used in an integrated framework that applies Maximally Stable Extremal Regions (MSER) to detect letters, along with a morphological dilation operation to identify clusters of letters (words). The proposed method can localize these clusters and detect their orientation. A rotation transformation is performed when needed to realign the text into a horizontal orientation, making the result an acceptable input to any available optical character recognition (OCR) system. Analytical and simulation results verify the validity of the proposed system.
2025, Journal of Engineering Science and Technology
Lanna characters have been popular as ancient characters in the northern part of Thailand since 1802. The segmentation of documents printed in Lanna characters poses challenging problems, such as partially overlapping characters and touching characters. This paper focuses only on touching characters, such as touching between consonants and vowels. The segmentation method begins with a horizontal histogram and then a vertical histogram to segment text lines and characters, respectively. The results are characters consisting of correctly separated characters, partially overlapping characters, and touching characters. The proposed method computes the left-edge junction points and right-edge junction points, finds their maxima, and uses the corresponding row values to separate touching consonants and vowels. A trial over text documents printed in Lanna characters achieved an accuracy of 95.81%.
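The histogram stages of the method (horizontal projection for lines, then vertical projection for characters) can be sketched on a binary image; the junction-point analysis for touching characters is not reproduced here. This is a generic projection-profile sketch, not the paper's implementation, and the tiny 0/1 image is illustrative.

```python
def projection_segments(profile, min_gap=1):
    """Return [start, end) runs where a projection profile is non-zero;
    runs separated by fewer than `min_gap` empty bins are merged."""
    runs, start, gap = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start = None
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs

def segment_lines_and_chars(img):
    """img: 2-D list of 0/1 ink pixels. Find lines via the horizontal
    projection, then characters via the vertical projection per line."""
    h_prof = [sum(row) for row in img]
    result = []
    for top, bot in projection_segments(h_prof):
        band = img[top:bot]
        v_prof = [sum(col) for col in zip(*band)]
        result.append((top, bot, projection_segments(v_prof)))
    return result
```

Touching characters are exactly the failure case of this baseline: their vertical profiles have no zero column between them, which is why the paper adds junction-point analysis on top.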
2025
Chinese texts do not contain spaces as word separators, unlike English and many alphabetic languages. To use Moses to train translation models, we must segment Chinese texts into sequences of Chinese words. Increasingly many software tools for Chinese segmentation have appeared on the Internet in recent years. However, some of these tools were trained on general texts, and so might not handle domain-specific terms in patent documents very well. Some machine-learning based tools require us to provide segmented Chinese text to train segmentation models. In both cases, providing segmented Chinese texts to refine a pre-trained model or to create a new model for segmentation is an important basis for successful Chinese-English machine translation systems. Ideally, high-quality segmented texts should be created and verified by domain experts, but doing so would be quite costly. We explored an approach to algorithmically generate segmented texts with parallel texts and lexical resources. Our scores in NTCIR-10 PatentMT indeed improved over our scores in NTCIR-9 PatentMT with the new approach.
2025, Zenodo (CERN European Organization for Nuclear Research)
2025
Language identification is the task of assigning a language label to a text. It is an important preprocessing step in many automatic systems operating on written text. In this paper, we present an evaluation of seven language identification methods, carried out across 285 languages with an out-of-domain test set. The evaluated methods are, furthermore, described using unified notation. We show that a method performing well with a small number of languages does not necessarily scale to a large number of languages. The HeLI method performs best on test lengths of over 25 characters, already reaching an F1-score of 99.5 at 60 characters.
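To illustrate what a frequency-based identifier of this family looks like, here is a toy character-trigram classifier. It is not the HeLI method itself, and the training snippets are invented for the example.

```python
from collections import Counter

def ngrams(text, n=3):
    """All character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    """samples: {lang: training text}. Returns per-language trigram profiles."""
    return {lang: Counter(ngrams(txt)) for lang, txt in samples.items()}

def identify(text, profiles):
    """Score each language by summed relative trigram frequency; highest wins."""
    best, best_score = None, -1.0
    grams = ngrams(text)
    for lang, prof in profiles.items():
        total = sum(prof.values()) or 1
        score = sum(prof[g] for g in grams) / total
        if score > best_score:
            best, best_score = lang, score
    return best

profiles = train({
    "en": "the quick brown fox jumps over the lazy dog the end",
    "fi": "nopea ruskea kettu hyppää laiskan koiran yli loppu",
})
identify("the dog", profiles)  # → "en"
```

Real systems differ mainly in how they smooth and back off when a test string shares few n-grams with any profile, which is exactly where short test lengths become hard.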
2025
Handwritten word recognition has been studied by many researchers. The most widely used method is line-based representation. However, it has a weakness: a high cost to recognize objects. In this research, a line detection model is proposed to determine the right, left, top, and bottom boundaries of an object. In order to separate all objects, line detection is applied iteratively for segmentation. All objects are labeled to obtain the number of objects in the image. One parameter that affects the segmentation results is the threshold value; errors in determining the threshold will degrade the segmentation results, so it is necessary to determine an adaptive threshold value. In this research, Otsu's method is proposed to obtain the best threshold value. The threshold value depends on the test image. 84 images have been used as the training set. The proposed method has been tested using 30 images. In this case, 20 images have single-line handwritten words and the 10 i...
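Otsu's method itself is standard: choose the threshold that maximizes the between-class variance of the grayscale histogram. A minimal sketch follows; the toy pixel data is invented, not drawn from the paper's image sets.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level maximizing between-class variance (Otsu's method)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = 0.0   # cumulative intensity mass of the background class
    w_b = 0       # background pixel count
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b                 # background mean
        m_f = (sum_all - sum_b) / w_f     # foreground mean
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal toy image: dark ink (~20) versus bright paper (~200).
pixels = [20, 22, 18, 25, 21] * 10 + [200, 205, 198, 210] * 10
t = otsu_threshold(pixels)  # lands between the two intensity clusters
```

Because the threshold is recomputed from each image's own histogram, it adapts automatically to lighting and paper tone, which is the "adaptive" property the abstract relies on.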
2025
With the aim of storing learner corpora as well as information about the Basque language students who wrote the texts, two different but complementary databases were created: ERREUS and IRAKAZI. Linguistic and technical information (error description, error category, tools for detection/correction…) will be stored in ERREUS, while IRAKAZI will be filled in with psycholinguistic information (error diagnosis, characteristics of the writer, grammatical competence…). These two databases will be the basis for constructing i) a robust Basque grammar corrector and ii) a computer-assisted language-learning environment for advising on the use of Basque syntax.
2025, HAL (Le Centre pour la Communication Scientifique Directe)
Any good anthology of the eighteenth century, like any historical survey of the novel as a genre, must include the works of Diderot. One generally finds there the scandalous Bijoux indiscrets, the novel disowned by its author under pressure from the Lieutenant of Police 1, La Religieuse, as well as Jacques le fataliste, generally classed as an anti-novel, and sometimes even Le Neveu de Rameau. What novelistic output could be more eclectic than his? For if La Religieuse was still causing scandal in 1966 2, it must be acknowledged that the novelist's aim, in this case, was as moral as could be, and it is a long way from that oriental tale which makes the "jewels" speak to this sometimes tearful defence of young girls shut up in convents against their will. This tragic work, belonging at once to the epistolary novel and the memoir-novel, also stands apart formally from the other works, in which dialogue dominates over narration. Henri Coulet himself attests to his unease about Le Neveu de Rameau, to which he devotes several pages of his reference work, Le Roman avant la Révolution: "Traditionally ranked among the novels, Le Neveu de Rameau is rather a philosophical dialogue; Diderot himself called it a 'satire', but his qualities as a novelist appear there better than in any other work" 3. There, no doubt, is one more paradox to attribute to Diderot. And across the dozen or so pages he devotes to Jacques le fataliste in the same work, the critic keeps calling it a "dialogue". If the author claimed to be "long accustomed to the art of the soliloquy" 4, and if this disciple of Socrates, whom his friends nicknamed Brother Plato, was fond of dialogues, a genre in which he excelled, that is not enough to explain why most of his "novels" should be dialogues to this extent.
Admittedly, Diderot himself never categorized any of these texts as a novel; Le Neveu de Rameau is even subtitled "second satire", a genre to which, mutatis mutandis, La Religieuse would seem to belong, a work about which he evokes a 1
2025, 2009 International Conference on Knowledge and Systems Engineering
Word segmentation is one of the most important tasks in NLP. Within Vietnamese, with its own distinctive features, this task faces some challenges, especially in determining word boundaries. To tackle Vietnamese word segmentation, in this paper we propose the WS4VN system, which uses a new approach based on the maximum matching algorithm combined with stochastic models using part-of-speech information. The approach can resolve word ambiguity and choose the best segmentation for each input sentence. Our system gives a promising result with an F-measure of 97%, higher than the results of existing publicly available Vietnamese word segmentation systems.
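The maximum matching backbone of such a segmenter can be sketched as below. The stochastic/POS re-ranking the paper layers on top is omitted, and the tiny lexicon is invented for illustration; Vietnamese words are sequences of space-separated syllables, which is what makes boundary detection ambiguous.

```python
def max_match(sentence, lexicon, max_len=4):
    """Greedy forward maximum matching over syllables: at each position take
    the longest lexicon entry; fall back to a single syllable."""
    tokens, i = [], 0
    syllables = sentence.split()
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in lexicon:
                tokens.append(cand)
                i += n
                break
    return tokens

# Hypothetical mini-lexicon; "học sinh" (pupil) and "sinh học" (biology) overlap.
lexicon = {"học sinh", "học", "sinh học"}
max_match("học sinh học sinh học", lexicon)
# → ["học sinh", "học sinh", "học"]
```

The example also shows the weakness the stochastic component addresses: greedy matching commits to "học sinh" even where "sinh học" might be the intended reading.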
2025
The way a text is understood depends strongly on the domain it deals with, but also on its type; one essentially distinguishes descriptive texts, argumentative texts, and narrative texts. Until the nineteen-eighties, most existing work, in artificial intelligence as well as in linguistics and psycholinguistics, was limited to narratives. Narrative comprehension has the advantage of not being oriented by a specific task and thus allows the real problems of comprehension to be studied. The mechanisms brought into play in this setting are therefore representative of the cognitive processes used for comprehension in general and can be used in a variety of applications. The first idea used to bring out the structure of a text consisted in attempting to describe global text structures and in determining how the particular text under analysis fits one of these pre-established structures. We will describe...
2025, 2010 Annual IEEE India Conference (INDICON)
Several license plate recognition systems have been developed in the past. Our objective is to design a system, implemented on a standard camera-equipped mobile phone, capable of recognising a vehicle license number. As a first step towards it, we propose a license plate text segmentation approach that is robust to various lighting conditions, complex backgrounds owing to dirty or rusted license plates, and non-conventional fonts. In the Indian scenario, some vehicle owners choose to write their vehicle number plates in regional languages. Since our method does not rely on language-specific features, it is capable of segmenting license numbers written in different languages. Using color connected component labeling, stroke width, and text heuristics, we accurately segment the number from the license plate. Experiments carried out on Indian vehicle license plate (LP) images acquired with a camera-equipped cellphone show that our system performs well on different LP images, some with different types of degradation. OCR evaluation on the LP number text extracted with the proposed method has an accuracy of 98.86%.
2025, JES. Journal of engineering sciences
The term "search engine" is traditionally used to refer to crawler based search engines, manually maintained directories, and hybrid search engines. However, current search engines do not fully satisfy the users' needs especially in terms... more
The term "search engine" is traditionally used to refer to crawler based search engines, manually maintained directories, and hybrid search engines. However, current search engines do not fully satisfy the users' needs especially in terms of accuracy and specificity of the results. This paper proposes an approach to build an intelligent search agent system on top of the Semantic Web. The presented system consists of five main parts: the Annotator, the Ontology Parser, the Indexer, the Search Agent, and the Data Repository. Two kinds of search are implemented: keyword based and concept based search. The keyword based search matches a user's query terms to concepts while concept based search allows a user to choose the concept that s/he want to search for together with some attributes for this concept.
2024, 2008 Eighth IEEE International Conference on Advanced Learning Technologies
2024
The purpose of this paper is to propose a new system for text and non-text segmentation in color images with complex backgrounds. Existing text extraction methods do not work efficiently on images with complex backgrounds. Variation in style and color, as well as a complex image background, makes reading text from images challenging. The approach used here is based on preprocessing steps, edge detection, connected-component analysis, bounding rectangles, segmentation, and finally extraction of only those blobs which contain textual content. This approach was tested successfully on various images, taken manually and from the Internet.
2024
In this paper we propose a coarse-grained NLP approach to text segmentation based on the analysis of lexical cohesion within text. Most work in this area has focused on the discovery of textual units that discuss subtopic structure within documents. In contrast, our segmentation task requires the discovery of topical units of text, i.e. distinct news stories from broadcast news programmes. Our system SeLeCT first builds a set of lexical chains in order to model the discourse structure of the text. A boundary detector is then used to search for breaking points in this structure, indicated by patterns of cohesive strength and weakness within the text. We evaluate this technique on a test set of concatenated CNN news story transcripts and compare it with an established statistical approach to segmentation called TextTiling.
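A cohesion-based boundary detector in the spirit described (though far simpler than SeLeCT's lexical chains, and closer to TextTiling's block comparison) can be sketched as follows; the sentences and window size are invented for the example.

```python
def cohesion_scores(sentences, window=2):
    """Lexical similarity (Jaccard over word sets) between the blocks on either
    side of each inter-sentence gap; a deep valley suggests a topic boundary."""
    def bag(block):
        return set(w.lower() for s in block for w in s.split())
    scores = []
    for gap in range(1, len(sentences)):
        left = bag(sentences[max(0, gap - window):gap])
        right = bag(sentences[gap:gap + window])
        union = left | right
        scores.append(len(left & right) / len(union) if union else 0.0)
    return scores

sents = [
    "the striker scored a late goal",
    "the goal sealed the match",
    "rain fell across the valley",
    "flood warnings covered the valley",
]
scores = cohesion_scores(sents)
boundary = scores.index(min(scores)) + 1  # gap before sentence index 2
```

Lexical chains refine this idea by tracking each term (and its related terms) across the whole text, so that a boundary is placed where many chains end and new ones begin rather than where raw word overlap happens to dip.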
2024
The Japanese language has absorbed large numbers of loanwords from many languages, in particular English. As well as single loanwords, compound nouns, multiword expressions (MWEs), etc. constructed from loanwords can be found in use in very large quantities. In this paper we describe a system which has been developed to segment Japanese loanword MWEs and construct likely English translations. The system, which leverages the availability of large bilingual dictionaries of loanwords and English n-gram corpora, achieves high levels of accuracy in discriminating between single loanwords and MWEs, and in segmenting MWEs. It also generates useful translations of MWEs, and has the potential to be a major aid to lexicographers in this area.
2024
Arabic is a morphologically rich language, which presents a challenge for part-of-speech tagging. In this paper, we compare two novel methods for POS tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags that describe full words and does not require any word segmentation. The second approach is segmentation-based, using a machine learning segmenter. In this approach, the words are first segmented, then the segments are annotated with POS tags. Because of the word-based approach, we evaluate full word accuracy rather than segment accuracy. Word-based POS tagging yields better results than segment-based tagging (93.93% vs. 93.41%). Word-based tagging also gives the best results on known words, while the segmentation-based approach gives better results on unknown words. Combining both methods results in a word accuracy of 94.37%, which is very close to the result obtained by using gold standard segmentation (94.91%).
2024, Lecture Notes in Computer Science
This paper describes a technique for text segmentation of machine-printed Gurmukhi script documents. Research on segmentation of Gurmukhi script faces major problems related to the unique characteristics of the script, such as connectivity of characters along the headline, two or more characters in a word having intersecting minimum bounding rectangles, multi-component characters, and touching characters, which are present even in clean documents. The segmentation problems unique to the Gurmukhi script, such as horizontally overlapping text segments and touching characters in various zonal positions within a word, are discussed in detail and a solution is proposed.
2024, Dutch Journal of Applied Linguistics
Since Saffran, Aslin and Newport (1996) showed that infants were sensitive to transitional probabilities between syllables after being exposed to a few minutes of fluent speech, there has been ample research on statistical learning. Word segmentation studies usually test learning by making use of “offline methods” such as forced-choice tasks. However, cognitive factors besides statistical learning possibly influence performance on those tasks. The goal of the present study was to improve a method for measuring word segmentation online. Click sounds were added to the speech stream, both between words and within words. Stronger expectations for the next syllable within words as opposed to between words were expected to result in slower detection of clicks within words, revealing sensitivity to word boundaries. Unexpectedly, we did not find evidence for learning in multiple groups of adults and child participants. We discuss possible methodological factors that could have influenced ou...
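The transitional probabilities that drive such segmentation studies are simple conditional frequencies over adjacent syllables. A sketch with an invented syllable stream in the style of Saffran et al. (within-word transitions are deterministic, between-word transitions are not):

```python
from collections import Counter

def transitional_probabilities(syllables):
    """P(next | current) for each adjacent syllable pair in a stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: c / first_counts[pair[0]] for pair, c in pair_counts.items()}

# Two invented trisyllabic "words" presented in a varied order.
A, B = ["bi", "da", "ku"], ["go", "la", "tu"]
order = [A, B, A, B, B, A, A, B]
stream = [syl for word in order for syl in word]

tps = transitional_probabilities(stream)
tps[("bi", "da")]  # within-word transition: 1.0
tps[("ku", "go")]  # between-word transition: lower
```

A learner sensitive to these statistics can posit word boundaries wherever the transitional probability dips, which is precisely the expectation difference the click-detection method above tries to measure online.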
2024, Procedia Technology
The process of assessing the outcomes obtained by various groups of researchers is heavily facilitated by conventional databases. This paper introduces a database (AHDB/FTR) of Arabic handwritten text images, which supports research on the recognition of Arabic handwritten text with open vocabulary, word segmentation, and writer identification, and can be freely accessed by researchers worldwide. This database consists of four hundred and ninety-seven images of Libyan cities, handwritten by five Arabic scholars.
2024, Behavioral sciences
We have previously shown that bilingual Spanish and English-learning infants can segment English iambs, two-syllable words with final stress (e.g., guiTAR), earlier than their monolingual peers. This is consistent with accelerated development in bilinguals and was attributed to bilingual infants' increased exposure to iambs through Spanish; about 10% of English content words start with an unstressed syllable, compared to 40% in Spanish. Here, we evaluated whether increased exposure to a stress pattern alone is sufficient to account for acceleration in bilingual infants. In English, 90% of content words start with a stressed syllable (e.g., KINGdom), compared to 60% in Spanish. However, we found no evidence for accelerated segmentation of Spanish trochees by Spanish-English bilingual infants compared to their monolingual Spanish-learning peers. Based on this finding, we argue that merely increased exposure to a linguistic feature in one language does not result in accelerated development in the other. Instead, only the acquisition of infrequent patterns in one language may be accelerated due to the additive effects of the other language.
2024, The Journal of the Acoustical Society of America
Speech segmentation skills develop in infancy and are influenced by many phonological properties of the native language and, in particular, the prosodic structure of the infant's native language. Studies using the HPP task show that American English infants appear to favor a stress-based procedure (Jusczyk et al., 1999) whereas Parisian French infants favor a syllable-based procedure (Nazzi et al., 2006), in line with a prosodic-based bootstrapping account of segmentation abilities (Nazzi et al., 1998). However, in a study using different stimuli, Polka and Sundara (2003) found results that might suggest a developmental trajectory for Canadian French infants that does not rely on syllable-based segmentation. Given that the stimuli used in both studies on French were different, both research teams tested their infant populations (at 8, 12, and 16 months of age) with the stimuli originally used by the other team. The results suggest a complex interaction between specific stimuli an...
2024, Journal of Child Language
Six experiments explored Parisian French-learning infants' ability to segment bisyllabic words from fluent speech. The first goal was to assess whether bisyllabic word segmentation emerges later in infants acquiring European French compared to other languages. The second goal was to determine whether infants learning different dialects of the same language have partly different segmentation abilities, and whether segmenting a non-native dialect has a cost. Infants were tested on standard European or Canadian French stimuli, in the word–passage or passage–word order. Our study first establishes an early onset of segmentation abilities: Parisian infants segment bisyllabic words at age 0;8 in the passage–word order only (revealing a robust order-of-presentation effect). Second, it shows that there are differences in segmentation abilities across Parisian and Canadian French infants, and that there is a cost for cross-dialect segmentation for Parisian infants. We discuss the...
2024, Infancy
In five experiments, we tested segmentation of word forms from natural speech materials by 8‐month‐old monolingual infants who are acquiring Canadian French or Canadian English. These two languages belong to different rhythm classes; Canadian French is syllable‐timed and Canadian English is stress‐timed. Findings of Experiments 1, 2, and 3 show that 8‐month‐olds acquiring either Canadian French or Canadian English can segment bisyllabic words in their native language. Thus, word segmentation is not inherently more difficult in a syllable‐timed than in a stress‐timed language. Experiment 4 shows that Canadian French‐learning infants can segment words in European French. Experiment 5 shows that neither Canadian French‐ nor Canadian English‐learning infants can segment bisyllabic words in the other language. Thus, the segmentation abilities of 8‐month‐olds acquiring either a stress‐timed or syllable‐timed language are language specific.
2024
In this paper we describe how the annotation methodology adopted in our approach allows us to explain the organization of indexed references in scientific research articles. We identify the semantic values of author judgments in the text segments containing indexed references. We use an automated semantic annotation platform to annotate our corpora. Exploiting this result, we obtain a representation of the annotation distribution on different scales. Finally, we present two evaluations of the annotation.