Document Structure Research Papers - Academia.edu

Artificial intelligence techniques useful in diagnosis can be classified into eight categories. Among these techniques, integrated diagnostic techniques are often presented as the most efficient at resolving practical problems. This approach may, however, lack the flexibility required in many application domains in which the role played by users forms an important part of the diagnosis process. Hypertext systems …

Modern web search engines are expected to return the top-k results efficiently. Although many dynamic index pruning strategies have been proposed for efficient top-k computation, most of them are prone to ignoring some especially important factors in ranking functions, such as term-proximity (the distance relationship between query terms in a document). In our recent work [Zhu, M., Shi, S., Li,

The World Wide Web is an increasingly important data source for business decision making; however, extracting information from the Web remains one of the challenging issues related to Web business intelligence applications. To use heterogeneous Web data for decision making, documents containing relevant data must be located, and the data of interest within the documents must be identified and extracted. Currently, most automatic information extraction systems can only cope with a limited set of document formats or do not adapt well to changes in document structure; as a result, many real-world data sources with complex document structures cannot be consistently interpreted using a single information extraction system. This paper presents an adaptive information extraction system prototype that combines multiple information extraction approaches to allow more accurate and resilient data extraction for a wide variety of Web sources. The Amorphic Web information extraction system prototype can locate data of interest based on domain knowledge or page structure, can automatically generate a wrapper for a data source, and can detect when the structure of a Web-based resource has changed and then search the updated resource to locate the desired data. The prototype Amorphic information extraction system demonstrated improved extraction accuracy for the four extraction scenarios examined when compared with traditional data extraction approaches.
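
A hypothetical, much-simplified illustration of the adaptive idea described above: a wrapper first applies its stored extraction rule and, when the page structure has changed so that the rule no longer matches, falls back to searching the page for a domain-knowledge landmark. The function name, the regular expressions and the sample pages are assumptions made for this sketch, not rules from the Amorphic prototype.

```python
import re

def extract_price(html, stored_rule=r'<span id="price">([^<]+)</span>'):
    match = re.search(stored_rule, html)
    if match:                          # page still matches the generated wrapper
        return match.group(1).strip()
    # Structure changed: fall back to a landmark search based on domain knowledge.
    fallback = re.search(r'Price\s*:?\s*</?\w*>?\s*([$€]?\s?\d+[.,]\d{2})', html)
    return fallback.group(1).strip() if fallback else None

old_page = '<div>Total: <span id="price">$19.99</span></div>'
new_page = '<div class="cost">Price: <b>$19.99</b></div>'
print(extract_price(old_page), extract_price(new_page))
```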

Safety-critical software requires integrating verification techniques into software development methods. Software architectures must guarantee that developed systems will meet safety requirements, and safety analyses are frequently used in the assessment. Safety engineers and software architects must reach a common understanding of an optimal architecture from both perspectives. Currently, the two groups of engineers apply different modelling techniques and languages: safety analysis models and software modelling languages. The solutions proposed here seek to integrate both domains by coupling the languages of each domain, and they constitute a sound example of the use of language engineering to improve efficiency in a software-related domain. A model-driven development approach and a platform-independent language are used to bridge the gap between safety analyses (failure mode, effects and criticality analysis and fault tree analysis) and software development languages (e.g. the Unified Modelling Language). Language abstract syntaxes (metamodels), profiles, language mappings (model transformations) and language refinements support the direct application of safety analysis to software architectures for the verification of safety requirements. Model consistency and the possibility of automation are among the benefits.

This paper presents the design of the text analysis component of a TTS system for the Romanian language. Our text analysis is performed in two steps: document structure detection and text normalization. The output is a tree-based representation of the processed data. Parsing is made efficient with the help of the Boost Spirit LL parser [1]; the use of this tool allows for greater flexibility in the source code and in the output representation.

Knowledge acquisition and representation has been characterised as the major bottleneck in the development of expert systems, especially in problem domains of high complexity. Financial analysis is one of the most complicated practical problems where expert systems technology is highly applicable, mainly because of its symbolic reasoning and its explanation capabilities. The aim of this paper is to present a complete methodology for knowledge acquisition and representation for expert systems development in the field of financial analysis. This methodology has been implemented in the development of the FINEVA multicriteria knowledge-based decision support system for the assessment of corporate performance and viability. The application of this methodology in the development of the FINEVA system is presented.

Compared with construction data sources that are usually stored and analyzed in spreadsheets and single data tables, data sources with more complicated structures, such as text documents, site images, web pages, and project schedules, have been less intensively studied due to additional challenges in data preparation, representation, and analysis. In this paper, our definition and vision for advanced data analysis addressing such challenges are presented, together with related research results from previous work, as well as our recent developments of data analysis on text-based, image-based, web-based, and network-based construction sources. It is shown in this paper that particular data preparation, representation, and analysis operations should be identified and integrated with careful problem investigations and scientific validation measures in order to provide general frameworks in support of information search and knowledge discovery from such information-abundant data sources.

Since the introduction of mechatronics as an integrated and integrating approach to the design, development and operation of complex systems, there have been significant developments in technology, and in particular in processing power, which have changed the nature of a wide range of products and systems, from domestic appliances and consumer goods to manufacturing systems and vehicles. In addition, the development and implementation of strategies such as those associated with concurrent engineering, and the introduction of intelligent tools to support the design of complex products and systems, have also changed the way in which such systems are conceived, implemented and manufactured.

Accessing the structured content of PDF documents is a difficult task, requiring pre-processing and reverse engineering techniques. In this paper, we first present different methods to accomplish this task, based either on document image analysis or on electronic content extraction. Then XCDF, a canonical format with well-defined properties, is proposed as a suitable solution for representing structured electronic documents and as an entry point for further research and work. The system and methods used for reverse engineering PDF documents into this canonical format are also presented. We finally present current applications of this work in various domains, ranging from data mining to multimedia navigation, all of which benefit from our canonical format in order to access PDF document content and structures.

Track. The main goals of the Ad Hoc Track were three-fold. The first goal was to investigate the impact of collection scale and markup, by using a new collection that is again based on the Wikipedia but is over four times larger, with longer articles and additional semantic annotations. For this reason the Ad Hoc Track tasks stayed unchanged, and the Thorough Task of INEX 2002-2006 returned. The second goal was to study the impact of more verbose queries on retrieval effectiveness, by using the available markup as structural constraints (now using both the Wikipedia's layout-based markup and the enriched semantic markup) and by the use of phrases. The third goal was to compare different result granularities by allowing systems to retrieve XML elements, ranges of XML elements, or arbitrary passages of text. This investigates the value of the internal document structure (as provided by the XML markup) for retrieving relevant information. The INEX 2009 Ad Hoc Track featured four tasks. For the Thorough Task a ranked list of results (elements or passages) by estimated relevance was needed. For the Focused Task a ranked list of non-overlapping results (elements or passages) was needed. For the Relevant in Context Task non-overlapping results (elements or passages) were returned grouped by the article from which they came. For the Best in Context Task a single starting point (element start tag or passage start) for each article was needed. We discuss the setup of the track, the results for the four tasks, and examine the relative effectiveness of element and passage retrieval. This is examined in the context of content-only (CO, or keyword) search as well as content-and-structure (CAS, or structured) search. In addition, we look at the effectiveness of systems using a reference run with a solid article ranking, and of systems using the phrase query. Finally, we look at the ability of focused retrieval techniques to rank articles.
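
One concrete requirement above is that Focused Task runs contain no overlapping results. A minimal sketch (not the official INEX evaluation code) of how a system might enforce this: walk a relevance-ranked list of XML element paths and drop any result that is an ancestor or descendant of an element already kept. The path syntax below is illustrative.

```python
def remove_overlap(ranked_paths):
    """ranked_paths: element paths such as '/article[1]/sec[2]/p[3]', best first."""
    kept = []
    for path in ranked_paths:
        overlaps = any(
            path == k or path.startswith(k + "/") or k.startswith(path + "/")
            for k in kept
        )
        if not overlaps:
            kept.append(path)
    return kept

ranked = ["/article[1]/sec[2]", "/article[1]/sec[2]/p[1]", "/article[1]/sec[3]/p[2]"]
print(remove_overlap(ranked))   # the descendant paragraph of sec[2] is dropped
```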

In this article we present an analysis of the textual behaviour of prepositional phrases in "selon X" ("according to X") that introduce reported speech (enunciative "selon"). In the first part, we set out the main criteria for identifying enunciative uses of "selon" (among the various uses of "selon") and, among these, those that can scope over several sentences (thereby introducing specific discourse frames called enunciative universes). Second, we list the main cues, or networks of cues, that most reliably signal the closing of these enunciative universes. Third, we show how this linguistic knowledge is exploited in our ContextO software platform.

In traditional documentary systems, the user most often receives document references, or at best the primary documents, and it is up to them to go through each document to judge whether it answers their need. The request for information is thus turned into a request for documents: the document contains the information, but the information is reached only indirectly. In an information retrieval process over a technical document, this kind of answer is not suited to the situation. The user of a technical document, who is often in charge of carrying out procedures and keeping a device in working order, searches for information in order to meet a professional need. They search in order to know how to act, and generally try to reach directly the most elementary piece of information that satisfies that need. The information retrieval process must therefore be particularly fast and effective, hence the interest in treating the technical document as a molecular construction: it is decomposed to give rise to new, usable units that support specific readings. This allows, on the one hand, a fine-grained representation of its content and, on the other hand, more localised access to information, making consultation easier for the user (Salton et al. 96). Consequently, indexing a technical document requires a preliminary step in which it is segmented into fine-grained units. It is this aspect that we address in this article, by proposing a model grounded in the reality of technical documents. 2. Properties of technical documents: by technical document we mean documents such as user manuals for complex technical devices. Such a document conveys knowledge and know-how specific to a particular technical field. It may describe a machine (an aircraft, a train, a computer system, ...), the operation of that machine and the various processes concerning it, as well as the procedures for carrying out a technical action in a well-defined environment. The purpose of using a technical document is essentially operational: to carry out a task or an action (Vigner et al. 76, Bronckart 85). The technical document, which is generally voluminous, is characterised by strong structuring with a well-defined logical organisation. From a linguistic point of view, it has features of its own: sentence grammar is simple and the lexicon is monosemous, designating a given part, tool or operation without the slightest ambiguity. This univocal and monoreferential character of the vocabulary carried by the technical document …

Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving.

XML is nowadays considered the standard meta-language for document markup and data representation. XML is widely employed in Web-related applications as well as in database applications, and there is also growing interest in it from the literary community for developing tools that support document-oriented retrieval operations. The purpose of this paper is to show the basic new requirements of this kind of application and to present the main features of a typed query language, called Tequyla-TX, designed to support them.

We present the GaiusT 2.0 framework for annotating legal documents. The framework was designed and implemented as a web-based system to semi-automate the extraction of legal concepts from text. In requirements analysis these concepts can be used to identify requirements a software system has to fulfil to comply with a law or regulation. The analysis and annotation of legal documents in prescriptive natural language is still an open problem for research in the field. In GaiusT 2.0, a multistep process exploits a number of linguistic and technological resources to offer a comprehensive annotation environment. The modules of the system are presented as evolutions from corresponding modules of the original GaiusT framework, which in turn was based on a general-purpose annotation tool, Cerno. The application of GaiusT 2.0 is illustrated with two use cases, to demonstrate the extraction process and its adaptability to different law models.

A significant part of medical data remains stored as unstructured text. Semantic search requires the introduction of markup tags. Experts use their background knowledge to categorize new documents, and knowing the category of these documents disambiguates words and acronyms. A model of document similarity that includes a priori knowledge and captures the intuition of an expert is introduced. It has only a few parameters, which may be evaluated using linear programming techniques. This approach, applied to the categorization of medical discharge summaries, provided a simpler and much more accurate model than alternative text categorization approaches.
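
The point that the model has only a few parameters evaluated with linear programming can be illustrated with a small sketch. Under assumed notation (not the paper's actual formulation), non-negative feature weights are chosen so that similar document pairs outscore dissimilar ones by a unit margin, with slack variables absorbing violations; scipy's linprog solves the resulting LP over toy data.

```python
import numpy as np
from scipy.optimize import linprog

# Each row: feature differences (similar-pair features minus dissimilar-pair features).
diffs = np.array([
    [0.8, 0.1, 0.3],
    [0.2, 0.9, 0.0],
    [0.5, 0.4, 0.2],
])
n_pairs, n_feats = diffs.shape

# Variables: [w_1..w_d, s_1..s_n]; minimize total slack.
c = np.concatenate([np.zeros(n_feats), np.ones(n_pairs)])
# Constraint  w . diff_i + s_i >= 1   ->   -w . diff_i - s_i <= -1
A_ub = np.hstack([-diffs, -np.eye(n_pairs)])
b_ub = -np.ones(n_pairs)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (n_feats + n_pairs))
print("learned weights:", res.x[:n_feats])
```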

We are working on the design of a data warehouse of documentary resources for an educational setting that integrates user modelling. Describing resources with a view to their reuse in training paths raises a number of difficulties, and we formulate proposals to fill gaps in the existing standards and to make certain descriptors more operational. Modelling the actors on the one hand, and the document types on the other, makes it possible to establish correlations in order to improve the answers returned. Relating actors to documents is made possible by the warehouse's metadata and by the meta-modelling of the data warehouse. We also define the metadata specific to the data warehouse, which specify the structural and accessibility metadata required by the steering system. In order to best carry out the development of our contribution to the strategic information system, the meta-modelling of the data warehouse allows a master plan for building the warehouse to be drawn up.

In this paper we present a novel approach to automatically restructuring HTML documents by extracting semantic structures from their header and body. The body of a web page is generally software-generated via a template, and its layout has a physical schema. Our approach is to extract trees based on hierarchical relations in HTML documents. For this task we use two algorithms: the first is a header extraction algorithm, which extracts header trees from the head of an HTML document, and the second automatically partitions HTML documents into tree-like semantic structures from the body of the page. We then use an application called the layout changer, which changes the layout of one web page to another by aligning the extracted header trees and partition trees.
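
A minimal sketch in the spirit of the header extraction algorithm mentioned above, assuming "header trees" are built from HTML heading tags nested by level with a stack; only the standard library is used and the sample page is illustrative (the real system also builds partition trees from the page body).

```python
from html.parser import HTMLParser

class HeaderTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"level": 0, "text": "ROOT", "children": []}
        self.stack = [self.root]
        self._current_level = None

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._current_level = int(tag[1])

    def handle_data(self, data):
        if self._current_level is not None and data.strip():
            node = {"level": self._current_level, "text": data.strip(), "children": []}
            # Pop until the top of the stack is a shallower heading, then attach.
            while self.stack[-1]["level"] >= node["level"]:
                self.stack.pop()
            self.stack[-1]["children"].append(node)
            self.stack.append(node)
            self._current_level = None

    def handle_endtag(self, tag):
        if len(tag) == 2 and tag[0] == "h":
            self._current_level = None

html = "<h1>Products</h1><h2>Laptops</h2><h2>Phones</h2><h1>Support</h1>"
builder = HeaderTreeBuilder()
builder.feed(html)
print(builder.root["children"])   # Laptops and Phones nest under Products
```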

This paper offers a methodological reflection on the principles for marking up a corpus of university meeting minutes. It is part of an interdisciplinary project aiming to analyse digitised university archives by relating the formal regularities observed to various determinations (socio-historical and legislative evolutions, disciplinary specificities, ...), viewed through the lens of discourse genres, their evolution and their institutionalisation. The minutes (CR) are considered a genre that "stands in" for another discourse: the CR is a written text which, within the institution where it is produced, must "stand in for" or "represent" a speech event. The question of the representation of other discourse (RDA) is therefore crucial. Materially, the corpus is built from archive boxes which contain, besides the minutes, ancillary documents such as convocations, attendance sheets and texts discussed during the meeting. Some boxes also contain a first corrected version (draft). We first present the metadata structure adopted for the project: based on an adaptation of the TEI META model developed for describing oral data, our approach fits into the broader effort to harmonise metadata for written data initiated within CORLI, with the aim of fostering interoperability and availability. The tags applied to the body of the text, whose definitions and attributes we spell out, cover the following fields: 1) the endogenous structure of the text: paratext (header, document name, signature, lists of those present and excused, table of contents and agenda, pagination), text (section titles, paragraphs, texts inserted into the minutes), etc.; 2) information tied to the content of the text: proper names and statuses of the speakers, thematic units, etc.; 3) categories identified through linguistic analysis: direct speech, speech verbs and nouns, person deictics, etc. The choice of tags and of their structuring must bring out observables allowing explorations closely articulated with the research questions. Thus the generic characterisation of the minutes as "standing in" leads to examining the forms of RDA, the metadiscursive thematic sequences devoted to the genre itself, and the relation between the minutes and the texts integrated into them. Finally, we show how these choices partially overlap with the model proposed for the texts of … (sciencesconf.org:nacla2:203501)

This study is an attempt to investigate the effects of document structure and the knowledge level of the reader on reading comprehension, browsing, and perceived control. Four types of texts are distinguished, differing in structure (linear text, hierarchical hypertext, mixed hypertext, and generative text). All the materials were presented on a PC. In all conditions, participants were allowed 1 h to read through the document. After completing the reading part of the experiment, they were asked to fill out the perceived control questionnaire followed by the reading comprehension test. As far as reading comprehension was concerned, knowledgeable participants had higher reading comprehension scores than non-knowledgeable participants only in the linear text. In addition, there were no significant differences in the reading comprehension scores of the knowledgeable participants among the four topologies. However, the performance of non-knowledgeable participants differed with respect to the type of topology; in particular, non-knowledgeable participants in the hierarchical and generative conditions performed better than those in the other two conditions. With respect to perceived control, the performance of knowledgeable and non-knowledgeable participants was equivalent in all four conditions. The results are discussed in terms of their implications for computer-based learning.

In this paper, we address the problems caused by changes in the regulations on the legal corpus and its calculative implementations. From the ontology of law suggested by R. van Kralingen, we construct a generic document structure which renders explicit the semantics of legal documents. These are represented in a sufficiently fine and rigorous way to link together documents belonging to distinct levels of right. Thus we identify the nature of the relation which links semantics emerging from legal texts at distinct hierarchical levels. A generic matching structure is then proposed that can be instantiated in terms of hyperlinks. Taken together, our proposals make it possible to model not only the semantics of each legal document, but also the relations linking them vertically within the regulations. The semantics of the regulations are then clarified globally in the form of a hyperdocument, from the most general texts to the most specific texts of regulation-implementing systems.

Scholarly digital libraries increasingly provide analytics to information within documents themselves. This includes information about the logical document structure of use to downstream components, such as search, navigation, and summarization. In this paper, the authors describe SectLabel, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields. While previous work has assumed access only to the raw text ...
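
A minimal sketch of CRF-based logical-structure labelling over the lines of a document, in the spirit of the approach described above. It uses the third-party sklearn_crfsuite package as a stand-in for SectLabel's own CRF tooling, and the features, labels and toy document are illustrative assumptions, not the paper's feature set.

```python
import sklearn_crfsuite

def line_features(lines, i):
    line = lines[i]
    return {
        "all_caps": line.isupper(),
        "starts_with_digit": line[:1].isdigit(),
        "ends_with_period": line.endswith("."),
        "num_tokens": len(line.split()),
        "prev_blank": i > 0 and not lines[i - 1].strip(),
    }

def featurize(doc_lines):
    return [line_features(doc_lines, i) for i in range(len(doc_lines))]

# Toy training data: one document, each line tagged with a structural label.
doc = ["1 INTRODUCTION", "Digital libraries increasingly ...", "Figure 1: System overview"]
labels = ["section_header", "body_text", "figure_caption"]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([featurize(doc)], [labels])
print(crf.predict([featurize(doc)]))
```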

The current fMRI study investigated correlations of low-frequency signal changes in the left inferior frontal gyrus, right inferior frontal gyrus and cerebellum in 13 adult dyslexic and 10 normal readers to examine functional networks associated with these regions. The extent of these networks to regions associated with phonological processing (frontal gyrus, occipital gyrus, angular gyrus, inferior temporal gyrus, fusiform gyrus, supramarginal gyrus and cerebellum) was compared between good and dyslexic readers. Analysis of correlations in the low-frequency range showed that regions known to activate during an "on-off" phoneme-mapping task exhibit synchronous signal changes when the task is administered continuously (without any "off" periods). Results showed that three functional networks, which were defined on the basis of documented structural deficits in dyslexics and included regions associated with phonological processing, differed significantly in spatial extent between good readers and dyslexics. The methodological, theoretical and clinical significance of the findings for advancing fMRI research and knowledge of dyslexia are discussed.

In this paper we present a systematic analysis of document retrieval using unstructured and structured queries within the score region algebra (SRA) structured retrieval framework. The behavior of different retrieval models, namely Boolean, tf.idf, GPX, language models, and Okapi, is tested using the transparent SRA framework in our three-level structured retrieval system called TIJAH. The retrieval models are …

Natural language processing for ontology learning, extraction of semantic relations, information extraction from structured documents.

We describe the type system component of a database management system that supports a multimedia news-on-demand application. The type system is an object-oriented one that represents document structure according to the SGML and HyTime standards. End-users access the news database by using a visual query interface. We also describe current work to generalize the database type system to accommodate arbitrary SGML/HyTime-compliant multimedia documents. Such a generalized type system would support a broad range of multimedia applications.
Keywords: multimedia, news-on-demand, multimedia database, SGML, HyTime.
1. INTRODUCTION. News-on-Demand is an application which provides subscribers with access to multimedia news articles that are inserted into a distributed database by news providers. Commercial news gathering/compiling organizations such as wire services, television networks, and newspapers are examples of news providers. The news items that they provide are annotated and organized i…

This paper presents a tool for prototyping ODE (Ordinary Differential Equations) based systems in the area of computational modeling. The models, tailored during the project step of the system development, are recorded in MathML, a markup language built upon XML. This design choice improves interoperability with other tools used for mathematical modeling, mainly considering that it is based on Web architecture. The resulting work is a Web portal that transforms an ODE model documented in MathML to a C++ API that offers numerical solutions for that model.
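
A minimal sketch of the pipeline's two end points under strong simplifications: a hypothetical, heavily reduced content-MathML-like fragment for dy/dt = -0.3*y is parsed and handed to a numerical ODE solver. The real tool emits a C++ API; Python and scipy are used here purely for illustration.

```python
import xml.etree.ElementTree as ET
from scipy.integrate import solve_ivp

# Hypothetical, greatly simplified MathML-style fragment for dy/dt = -0.3 * y,
# standing in for the full MathML documents handled by the tool described above.
MATHML = """
<apply>
  <eq/>
  <apply><diff/><ci>y</ci></apply>
  <apply><times/><cn>-0.3</cn><ci>y</ci></apply>
</apply>
"""

def parse_linear_decay(xml_text):
    """Extract the constant k from a dy/dt = k*y fragment (assumed shape)."""
    root = ET.fromstring(xml_text.strip())
    rhs = root.findall("apply")[1]          # the <times/> application
    return float(rhs.find("cn").text)

k = parse_linear_decay(MATHML)
sol = solve_ivp(lambda t, y: k * y, (0.0, 10.0), [1.0])
print(sol.y[0][-1])   # numerical value of y(10) for the parsed model
```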

The digitization of various printed documents involves generating text with an OCR system for different applications, including full-text retrieval and document organization. However, OCR-generated texts contain errors, given present OCR technology. Moreover, previous studies have revealed that as OCR accuracy decreases, classification performance also decreases. The reason for this is the use of absolute word frequency as the feature vector. Representing OCR texts using absolute word frequency has limitations, such as dependency on text length and word recognition rate, and consequently lower classification performance due to higher within-class variances. We describe feature transformation techniques which do not have such limitations and present improved experimental results from all the classifiers used.
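
One transformation of the kind the abstract argues for is replacing absolute word counts with length-normalized (relative) frequencies, so that OCR-ed documents of different lengths become comparable. A minimal sketch with toy counts (the paper's actual transformations may differ):

```python
import numpy as np

def relative_frequencies(count_matrix):
    """Row-wise L1 normalization: each document's counts sum to 1."""
    counts = np.asarray(count_matrix, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0          # avoid division by zero for empty docs
    return counts / totals

raw = [[4, 0, 2],      # short OCR-ed document
       [40, 0, 20]]    # ten times longer document with the same word profile
print(relative_frequencies(raw))        # both rows become [0.666..., 0., 0.333...]
```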

This paper attempts to provide an overview of the key metadata research issues and the current projects and initiatives that are investigating methods and developing technologies aimed at improving our ability to discover, access, retrieve and assimilate information on the Internet through the use of metadata.

Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related Web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass.
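
A minimal sketch of clustering documents by partitioning a similarity graph. Spectral clustering over a precomputed word-overlap affinity matrix stands in here for the authors' graph-partitioning generalizations, purely as an illustration on toy documents.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

docs = ["web search ranking", "search engine index",
        "soccer match result", "football match score"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Cosine similarity between documents serves as the graph's edge weights;
# a small constant keeps the graph connected.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
affinity = 0.99 * (unit @ unit.T) + 0.01

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)   # the two search documents and the two sports documents separate
```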

In this paper we propose a matching algorithm for measuring the structural similarity between an XML document and a DTD. The matching algorithm, by comparing the document structure against the one the DTD requires, is able to identify commonalities and differences. Differences can be due to the presence of extra elements with respect to those the DTD requires and to the absence of required elements. The evaluation of commonalities and differences gives rise to a numerical rank of the structural similarity. Moreover, in the paper, some applications of the matching algorithm are discussed. Specifically, the matching algorithm is exploited for the classification of XML documents against a set of DTDs, the evolution of the DTD structure, the evaluation of structural queries, the selective dissemination of XML documents, and the protection of XML document contents.
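
A minimal sketch of the rating idea described above: count common, extra, and missing elements between an XML document and the element set a DTD declares, and turn them into a similarity score. Since the standard library does not expose parsed DTDs, the declared-element set is supplied directly, and the scoring formula is an illustrative simplification rather than the paper's algorithm.

```python
import xml.etree.ElementTree as ET

def structural_similarity(xml_text, dtd_elements):
    doc_elements = {el.tag for el in ET.fromstring(xml_text).iter()}
    common = doc_elements & dtd_elements
    extra = doc_elements - dtd_elements      # present in the document, not declared
    missing = dtd_elements - doc_elements    # required by the DTD, absent in the doc
    return len(common) / (len(common) + len(extra) + len(missing))

doc = "<book><title>XML</title><chapter><para>...</para></chapter></book>"
dtd = {"book", "title", "author", "chapter", "para"}
print(structural_similarity(doc, dtd))   # 4 common, 0 extra, 1 missing -> 0.8
```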

Precisely identifying entities in web documents is essential for document indexing, web search and data integration. Entity disambiguation is the challenge of determining the correct entity out of various candidate entities. Our novel method utilizes background knowledge in the form of a populated ontology. Additionally, it does not rely on the existence of any structure in a document or the appearance of data items that can provide strong evidence, such as e-mail addresses, for disambiguating authors for example. Originality of our method is demonstrated in the way it uses different relationships in a document as well as in the ontology to provide clues in determining the correct entity. We demonstrate the applicability of our method by disambiguating authors in a collection of DBWorld posts using a large scale, real-world ontology extracted from the DBLP. The precision and recall measurements provide encouraging results.

Clusters of multiple news stories related to the same topic exhibit a number of interesting properties. For example, when documents have been published at various points in time or by different authors or news agencies, one finds many instances of paraphrasing, information overlap and even contradiction. The current paper presents the Cross-document Structure Theory (CST) Bank, a collection of multi-document clusters in which pairs of sentences from different documents have been annotated for cross-document structure theory relationships. We will describe how we built the corpus, including our method for reducing the number of sentence pairs to be annotated by our hired judges, using lexical similarity measures. Finally, we will describe how CST and the CST Bank can be applied to different research areas such as multi-document summarization.
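
A minimal sketch of the lexical-similarity pre-filter mentioned above: only sentence pairs whose word overlap clears a threshold are passed on for CST annotation. Jaccard overlap over lowercased tokens and the 0.25 threshold are illustrative choices, not necessarily the measures used for the CST Bank.

```python
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def candidate_pairs(cluster_a, cluster_b, threshold=0.25):
    return [
        (s1, s2)
        for s1 in cluster_a
        for s2 in cluster_b
        if jaccard(s1, s2) >= threshold
    ]

doc1 = ["The quake struck the coast on Monday.", "Rescue teams arrived overnight."]
doc2 = ["A strong quake struck the coast Monday.", "Markets were unaffected."]
print(candidate_pairs(doc1, doc2))   # only the paraphrase pair survives the filter
```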

For TREC 10 we participated in the Named Page Finding Task and the Cross-Lingual Task. In the web track, we explored the use of linear combinations of term collections based on document structure. Our goal was to examine the effects of different term collection statistics based on document structure with respect to known-item retrieval. We parsed documents into structural components and built specific term indexes based on that document structure. Each of those indexes has its own collection statistics for term weighting, based on the type of language used for that structure in the collection. For producing a single ranked list, we examined a weighted linear combination approach to merging results. Our approach to known-item retrieval was equal to or above the median 58% of the time and above the mean score of submitted runs 71% of the time. In the Arabic track we participated in Arabic cross-language information retrieval (CLIR) and in Arabic monolingual information retrieval. For the monolingual retrieval, we examined the use of two stemming algorithms: the first is a deeper approach, and the second is a pattern-based approach. For the Arabic CLIR, we explored retrieval effectiveness using a machine translation (MT) system and translation probabilities obtained from a parallel document collection provided by the United Nations (UN).
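
A minimal sketch of the weighted linear combination used to merge results from per-structure indexes. The field names, weights, and scores below are illustrative; the paper tuned its own weights against the known-item task.

```python
def merge_field_scores(field_scores, weights):
    """field_scores: {field: {doc_id: score}}; returns docs ranked by combined score."""
    combined = {}
    for field, weight in weights.items():
        for doc_id, score in field_scores.get(field, {}).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

field_scores = {
    "title":  {"d1": 2.1, "d3": 0.4},
    "body":   {"d1": 0.9, "d2": 1.5, "d3": 1.1},
    "anchor": {"d2": 0.7},
}
weights = {"title": 3.0, "body": 1.0, "anchor": 2.0}
print(merge_field_scores(field_scores, weights))   # d1 first: strong title match
```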

This paper gives an overview of a project to generate literature reviews from a set of research papers, based on techniques drawn from human summarization behavior. For this study, we identify the key features of natural literature reviews through a macro-level and clause-level discourse analysis; we also identify human information selection strategies by mapping referenced information to source documents. Our preliminary results of discourse analysis have helped us characterize literature review writing styles based on their document structure and rhetorical structure. These findings will be exploited to design templates for automatic content generation.

Purpose – An issue of increased interest in metadata research concerns finding ways to store, in the metadata of an information resource, data regarding the resource's quality. The purpose of this paper is to present a metadata schema that facilitates representation and storage of data related to the quality of an e-commerce resource, the e-commerce evaluation metadata (ECEM) schema. Design/methodology/approach

Serotonin reuptake inhibitors and cognitive-behavior therapy (CBT) are considered first-line treatments for obsessive-compulsive disorder (OCD). However, little is known about their modulatory effects on regional brain morphology in OCD patients. We sought to document structural brain abnormalities in treatment-naive OCD patients and to determine the effects of pharmacological and cognitive-behavioral treatments on regional brain volumes. Treatment-naive patients with OCD (n = 38) underwent a structural magnetic resonance imaging scan before and after a 12-week randomized clinical trial with either fluoxetine or group CBT. Matched healthy controls (n = 36) were also scanned at baseline. Voxel-based morphometry was used to compare regional gray matter (GM) volumes of regions of interest (ROIs) placed in the orbitofrontal, anterior cingulate and temporolimbic cortices, striatum, and thalamus. Treatment-naive OCD patients presented smaller GM volume in the left putamen, bilateral medial orbitofrontal, and left anterior cingulate cortices than did controls (p < 0.05, corrected for multiple comparisons). After treatment with either fluoxetine or CBT (n = 26), GM volume abnormalities in the left putamen were no longer detectable relative to controls. ROI-based within-group comparisons revealed that GM volume in the left putamen significantly increased (p < 0.012) in fluoxetine-treated patients (n = 13), whereas no significant GM volume changes were observed in CBT-treated patients (n = 13). This study supports the involvement of orbitofronto/cingulo-striatal loops in the pathophysiology of OCD and suggests that fluoxetine and CBT may have distinct neurobiological mechanisms of action.

Keywords: layout, functional programming. Highly customised variable-data documents make automatic layout of the resulting publication hard. Architectures for defining and processing such documents can benefit if the repertoire of layout methods available can be extended smoothly and easily to accommodate new styles of customisation. The Document Description Framework incorporates a model for declarative document layout and processing where documents are treated as functional programs. A canonical XML tree contains nodes describing layout instructions which will modify and combine their children component parts to build sections of the final presentation. Leaf components such as images, vector graphic fragments and text blocks are 'rendered' to make consistent graphical atoms. These parts are then processed by layout agents, described and parameterised by their parent nodes, which can range from simple layouts like translations, flows, encapsulations and tables through to highly com…
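
A minimal sketch of the declarative idea described above: the document is a tree whose leaves are rendered atoms of fixed size and whose internal nodes name a layout function that positions its children. Only a vertical "flow" and a horizontal "row" combinator are shown; the node encoding and names are assumptions for this sketch, not the framework's actual XML vocabulary.

```python
def layout(node, x=0.0, y=0.0):
    """Return a list of (name, x, y, w, h) boxes for every leaf under node."""
    kind = node["kind"]
    if kind == "leaf":
        return [(node["name"], x, y, node["w"], node["h"])]
    boxes = []
    cx, cy = x, y
    for child in node["children"]:
        placed = layout(child, cx, cy)
        boxes.extend(placed)
        child_w = max(b[1] + b[3] for b in placed) - cx
        child_h = max(b[2] + b[4] for b in placed) - cy
        if kind == "flow":     # stack children vertically
            cy += child_h
        elif kind == "row":    # place children side by side
            cx += child_w
    return boxes

page = {"kind": "flow", "children": [
    {"kind": "leaf", "name": "heading", "w": 100, "h": 20},
    {"kind": "row", "children": [
        {"kind": "leaf", "name": "image", "w": 40, "h": 40},
        {"kind": "leaf", "name": "caption", "w": 60, "h": 40},
    ]},
]}
for name, x, y, w, h in layout(page):
    print(f"{name:8s} at ({x:4.0f},{y:4.0f}) size {w}x{h}")
```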

Automated assembly of shredded or torn documents (2D) or broken pottery (3D) will support philologists, archaeologists and forensic experts. Automated solutions for this task can be divided into shape-based matching techniques (apictorial) and techniques that additionally analyze the visual content of the fragments (pictorial). In the latter case, techniques like texture-based analysis are used. Depending on the application, shape matching techniques are suitable for puzzle problems with small numbers of pieces (e.g. up to 20). Artefacts such as broken and lost pieces or overlapping parts of fragments also increase the error rate of shape-based techniques, since the matching of adjacent boundaries can fail. As a result, additional features, e.g. color and document structure, have to be used. This paper presents an overview of current puzzle applications in Cultural Heritage and also introduces the main problems in puzzle solving.

Text retrieval systems store a great variety of documents, from abstracts, newspaper articles, and web pages to journal articles, books, court transcripts, and legislation. Collections of diverse types of documents expose shortcomings in current approaches to ranking. Use of short fragments of documents, called passages, instead of whole documents can overcome these shortcomings: passage ranking provides convenient units of

After establishing the need for cross-border dissemination of national (EC) law, this paper reviews already existing initiatives like JuriFast, Dec.Nat, Caselex and Jure. For future development three basic challenges are defined and elaborated. The lack of a unique and persistent case law identifier gives rise to many problems when citing and searching case law. The paper describes the way this issue is tackled in the Netherlands, by using a national identifier together with a publicly available reference-index. Inspired by the advantages of this system, a proposal is made for a European Case Law Identifier. To facilitate the inter-European search for, and exchange of case law the paper also discusses the need for a harmonized set of metadata, and the necessity to develop a European interchange format for court decisions.

In this paper the problem of indexing heterogeneous structured documents and of retrieving semistructured documents is considered. We propose a flexible paradigm for both indexing such documents and formulating user queries specifying soft constraints on both documents' structure and content. At the indexing level we propose a model that achieves flexibility by constructing personalised document representations based on users' views of the documents. This is obtained by allowing users to specify their preferences on the documents' sections that they estimate to bear the most interesting information, as well as to linguistically quantify the number of sections which determine the global potential interest of the documents. At the query language level, a flexible query language for expressing soft selection conditions on both the documents' structure and content is proposed.

Recovering physical and logical structure from electronic documents is still an open issue. In this paper, we propose a flexible and efficient approach for recovering document structures from PDF files. After a brief introduction to the PDF format and its major features, we report on our evaluation of different existing tools and works for PDF content extraction and analysis. To overcome the weaknesses of these systems, we propose a new analysis strategy, based on an intermediate representation, called XCDF, which enables representing physical structures in a canonical way. This paper then describes the PDF reverse engineering workflow and focuses on the document logical restructuring. Finally, the paper concludes with potential future improvements.

When comparing document images based on visual similarity, it is difficult to determine the correct scale and features for document representation. We report on a new form of multivariate granulometries based on rectangles of varying size and aspect ratio. These rectangular granulometries are used to probe the layout structure of document images, and the rectangular size distributions derived from them are used as descriptors for document images. Feature selection is used to reduce the dimensionality and redundancy of the size distributions, while preserving the essence of the visual appearance of a document. Experimental results indicate that rectangular size distributions are an effective way to characterize the visual similarity of document images and provide an insightful interpretation of classification and retrieval results in the original image space rather than the abstract feature space.
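
A minimal sketch of a rectangular size distribution for a binarized document image: the fraction of foreground surviving a morphological opening is recorded over a grid of rectangle widths and heights. scipy.ndimage's binary_opening stands in for the paper's granulometry machinery, and the tiny synthetic page is illustrative.

```python
import numpy as np
from scipy.ndimage import binary_opening

def rectangular_size_distribution(binary_image, widths, heights):
    total = binary_image.sum()
    dist = np.zeros((len(heights), len(widths)))
    for i, h in enumerate(heights):
        for j, w in enumerate(widths):
            opened = binary_opening(binary_image, structure=np.ones((h, w)))
            dist[i, j] = opened.sum() / total    # fraction of foreground surviving
    return dist

page = np.zeros((40, 40), dtype=bool)
page[5:8, 2:38] = True      # a long, thin "text line"
page[15:35, 10:30] = True   # a large "image block"
print(rectangular_size_distribution(page, widths=[1, 5, 15], heights=[1, 5, 15]))
```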