Metadata Extraction Research Papers - Academia.edu
2025, Lecture Notes in Computer Science
We solve the problem of record linkage between databases where record fields are mixed and permuted in different ways. The solution method uses a conditional random fields model to find matching terms in record pairs and uses matching terms in the duplicate detection process. Although records with permuted fields may have partly reordered terms, our method can still exploit the local order of terms to find matching terms. We carried out experiments on several well-known data sets in record linkage research, and our method showed its advantages on most of the data sets. We also ran experiments on a synthetic data set, in which records combined fields in random order, and verified that our method could handle even this data set.
2025, Computer Standards & Interfaces
This paper discusses standards-based approaches for secure data sharing across organizations. In particular, current standards as well as standardization trends for data integration, multimedia data management, active real-time data management, data warehousing and mining, expert data management, semantic web data management, knowledge management, visualization, metadata extraction and management, and security management for data sharing are discussed. We illustrate the ideas with an example from the emergency response and public health awareness application domain.
2025, Lecture Notes in Computer Science
This paper describes our efforts to develop a toolset and process for automated metadata extraction from large, diverse, and evolving document collections. A number of federal agencies, universities, laboratories, and companies are placing their collections online and making them searchable via metadata fields such as author, title, and publishing organization. Manually creating metadata for a large collection is an extremely time-consuming task, yet the process is difficult to automate, particularly for collections consisting of documents with diverse layout and structure. Our automated process enables many more documents to be made available online than would otherwise have been possible due to time and cost constraints. We describe our architecture and implementation and illustrate the effectiveness of the toolset by providing experimental results on two major collections: DTIC (Defense Technical Information Center) and NASA (National Aeronautics and Space Administration).
2025
A dynamic validation process is described for an application (metadata extraction from scanned documents) where a moderate failure rate is acceptable provided that instances of failure during operation can be identified. Lacking a plausible exact oracle for the application, a series of statistical models of output characteristics is employed. Flexibility and adaptability are achieved by developing a customized scripting language describing how the various tests should be combined to obtain an overall measure of confidence in a program output. The suitability of the validator was demonstrated by an experiment measuring its ability to mimic human judgments as to which of several alternative outputs for the same document would be preferred.
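To make the combination of statistical checks concrete, here is a minimal Python sketch of the idea, assuming a hypothetical set of field checks and weights; the paper's actual validator uses a customized scripting language and statistical output models rather than the hard-coded tests shown.

```python
# Hedged sketch: combine several checks on an extracted metadata record into a
# single confidence score. Field names, checks, and weights are illustrative
# only, not the paper's actual scripting language or statistical tests.

def length_check(record):
    """Title length should fall in a plausible range."""
    return 1.0 if 5 <= len(record.get("title", "")) <= 300 else 0.0

def date_check(record):
    """Publication year should look like a 4-digit year."""
    year = record.get("date", "")
    return 1.0 if year.isdigit() and 1900 <= int(year) <= 2100 else 0.0

def author_check(record):
    """At least one author, none absurdly long."""
    authors = record.get("authors", [])
    return 1.0 if authors and all(len(a) < 80 for a in authors) else 0.0

# A "script" here is just an ordered list of (test, weight) pairs.
VALIDATION_SCRIPT = [(length_check, 0.4), (date_check, 0.3), (author_check, 0.3)]

def confidence(record, script=VALIDATION_SCRIPT):
    """Weighted combination of individual test outcomes in [0, 1]."""
    return sum(weight * test(record) for test, weight in script)

if __name__ == "__main__":
    rec = {"title": "Automated Metadata Extraction", "date": "2004",
           "authors": ["J. Smith", "A. Doe"]}
    print(f"confidence = {confidence(rec):.2f}")  # 1.00 for this record
```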
2025
In this paper, we report on our experience with the creation of an automated, human-assisted process to extract metadata from documents in a large (>100,000), dynamically growing collection. Such a collection may be expected to be heterogeneous, both statically heterogeneous (containing documents in a variety of formats) and dynamically heterogeneous (likely to acquire new documents in formats unlike any prior acquisitions). Eventually, we hope to be able to totally automate metadata extraction for 80% of the documents and to reduce the time needed to generate the metadata for the remaining documents by 80% as well. In this paper, we describe our process of first classifying documents into equivalence classes, for which we can then use a rule-based approach to extract metadata. Our rule-based approach differs from others insofar as it separates the rule-interpreting engine from a template of rules. The templates vary among classes but the engine is the same. We have evaluated our approach on a test bed of 7,413 randomly selected documents from the DTIC (Defense Technical Information Center) collection, with encouraging results. Finally, we describe how we can use this process to generate an OAI (Open Archives Initiative)-compliant digital library from a stream of incoming documents.
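A minimal Python sketch of the engine/template separation described above; the document classes, rule patterns, and field names are invented for illustration and are not the paper's actual templates.

```python
# Hedged sketch of the engine/template separation: one generic rule engine,
# with per-class templates supplying the rules. The regular expressions and
# class names are invented for illustration.
import re

TEMPLATES = {
    "dtic_report": {
        "title":  r"^TITLE[:\s]+(.+)$",
        "author": r"^AUTHOR[S]?[:\s]+(.+)$",
        "date":   r"^REPORT DATE[:\s]+(.+)$",
    },
    "thesis": {
        "title":  r"^(.+)\n\s*A Thesis",
        "author": r"by\s+(.+)$",
    },
}

def extract(text, doc_class, templates=TEMPLATES):
    """Apply the rule template for doc_class to the document text.
    The engine is identical for every class; only the template differs."""
    rules = templates[doc_class]
    metadata = {}
    for field, pattern in rules.items():
        match = re.search(pattern, text, flags=re.MULTILINE | re.IGNORECASE)
        if match:
            metadata[field] = match.group(1).strip()
    return metadata

sample = "TITLE: A Study of Metadata Extraction\nAUTHORS: J. Doe\nREPORT DATE: 2005"
print(extract(sample, "dtic_report"))
```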
2024, International Conference on Computational Linguistics
In this paper, we present a novel beam-search decoder for disfluency detection. We first propose node-weighted max-margin Markov networks (M3N) to boost the performance on words belonging to specific part-of-speech (POS) classes. Next, we show the importance of measuring the quality of cleaned-up sentences and performing multiple passes of disfluency detection. Finally, we propose using the beam-search decoder to combine multiple discriminative models such as M3N and multiple generative models such as language models (LM) and perform multiple passes of disfluency detection. The decoder iteratively generates new hypotheses from current hypotheses by making incremental corrections to the current sentence based on certain patterns as well as information provided by existing models. It then rescores each hypothesis based on features of lexical correctness and fluency. Our decoder achieves an edit-word F1 score higher than all previous published scores on the same data set, both with and without using external sources of information. This work is licensed under a Creative Commons Attribution 4.0 International Licence.
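The decoding loop can be illustrated with a small, generic beam-search sketch in Python; the candidate-edit generator and the two scoring functions below are toy placeholders, not the paper's M3N or language models.

```python
# Hedged sketch of a beam-search loop in the spirit of the decoder described
# above: iteratively propose incremental corrections to a sentence, rescore
# each hypothesis with a combined model score, and keep the top-k.

def candidate_edits(tokens):
    """Propose hypotheses that delete one token (a crude 'disfluency' edit)."""
    for i in range(len(tokens)):
        yield tokens[:i] + tokens[i + 1:]

def score(tokens, discriminative, generative, weight=0.5):
    """Linear combination of a discriminative and a generative score."""
    return weight * discriminative(tokens) + (1 - weight) * generative(tokens)

def beam_search(tokens, discriminative, generative, beam_size=5, passes=3):
    beam = [tokens]
    for _ in range(passes):
        candidates = list(beam)
        for hyp in beam:
            candidates.extend(candidate_edits(hyp))
        # Rescore all hypotheses and keep the best beam_size of them.
        candidates.sort(key=lambda h: score(h, discriminative, generative),
                        reverse=True)
        beam = candidates[:beam_size]
    return beam[0]

# Toy scorers: penalize immediate word repetitions and overly long sentences.
disc = lambda t: -sum(a == b for a, b in zip(t, t[1:]))
gen = lambda t: -0.1 * len(t)
print(beam_search("i i want want to to go home".split(), disc, gen))
```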
2024
The objective of the experiment described in the article was the evaluation of Polish digital resources searchable via the Europeana engine. Queries were created so as to represent proportional shares of common names, named entities, and their combinations. They were manually enriched by students, experienced professionals, and humanities-educated persons. The system responses were then evaluated according to their Mean Average Precision (MAP). The average efficiency of information retrieval for Polish monolingual queries was weak: only 26.6% of responses were highly relevant, and as many as 73.5% of queries produced unsatisfactory results. MAP was best for the automatic search (0.314), while enriched files contained less relevant results: MAP was 0.1795 for the expert users, 0.1529 for the educated users, and 0.1279 for students. The overall results proved that the IR process concerning Polish resources searchable by Europeana requires significant i...
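For reference, Mean Average Precision can be computed as in the following sketch; the relevance judgements are invented, and the averaging here is over retrieved relevant results only.

```python
# Illustrative computation of Mean Average Precision (MAP), the measure used
# in the evaluation above. Rankings and relevance judgements are made up.

def average_precision(ranked_relevance):
    """ranked_relevance: list of 0/1 flags in ranked order (1 = relevant)."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    # Average over the relevant documents actually retrieved.
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(all_queries):
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

# Three hypothetical queries with relevance flags for the top-5 results each.
queries = [[1, 0, 1, 0, 0], [0, 0, 0, 0, 1], [1, 1, 0, 0, 0]]
print(f"MAP = {mean_average_precision(queries):.3f}")
```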
2024, Journal of emerging technologies and innovative research
The last ten years have witnessed the emergence of all kinds of video content. At the same time, some individuals are deaf and cannot understand the meaning of such videos when no text transcription is available. Hence, it becomes important to make videos accessible to these people, and also to bridge the gaps between their native languages. This can best be done by providing subtitles for a video. However, downloading subtitles for an arbitrary video from the internet is a tedious process. The main concept of this paper is therefore to generate subtitles automatically through the software itself, without the use of the internet. The objective of this paper is to provide an overview of generating subtitles offline using the CMUSphinx4 Java API. The system first extracts the audio, then recognizes the extracted audio with the CMUSphinx4 Java API. It then writes the recognized text, with timestamps, to a text file and saves it with the .srt extension. This .srt file can then be opened in a media player to view the subtitles along with the video.
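The final step, writing recognized text with timestamps into a SubRip file, might look like the following Python sketch; the segment tuples stand in for whatever the recognizer actually returns and are not the CMUSphinx4 API.

```python
# Hedged sketch of the final step described above: writing recognized text with
# timestamps to a SubRip (.srt) file. The recognized segments are hard-coded
# stand-ins for the speech recognizer's real output.

def srt_timestamp(seconds):
    """Format seconds as the SubRip HH:MM:SS,mmm timestamp."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def write_srt(segments, path):
    """segments: iterable of (start_sec, end_sec, text) tuples."""
    with open(path, "w", encoding="utf-8") as out:
        for index, (start, end, text) in enumerate(segments, start=1):
            out.write(f"{index}\n")
            out.write(f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n")
            out.write(f"{text}\n\n")

segments = [(0.0, 2.5, "Hello and welcome."),
            (2.5, 5.0, "This subtitle was generated automatically.")]
write_srt(segments, "output.srt")
```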
2024, Lecture Notes in Computer Science
The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information. Major components include the user interface, geographical ontology, maintenance and retrieval functions for a test collection of web documents, textual and spatial indexes, relevance ranking and metadata extraction. Here we summarise the functionality and interaction between these components before focusing on the design of the geo-ontology and the development of spatio-textual indexing methods. The geo-ontology supports functionality for disambiguation, query expansion, relevance ranking and metadata extraction. Geographical place names are accompanied by multiple geometric footprints and qualitative spatial relationships. Spatial indexing of documents has been integrated with text indexing through the use of spatio-textual keys in which terms are concatenated with spatial cells to which they relate. Preliminary experiments demonstrate considerable performance benefits when compared with pure text indexing and with text indexing followed by a spatial filtering stage.
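A minimal sketch of spatio-textual keys of the kind described above: each term is concatenated with the identifier of a spatial grid cell covering the document footprint. The one-degree grid and key format are assumptions, not the SPIRIT implementation.

```python
# Hedged sketch of spatio-textual indexing keys: each term is concatenated with
# the identifier of a spatial grid cell covering the document's footprint.
# The 1-degree grid and the "term|cell" key format are assumptions.

def cell_id(lat, lon, cell_size=1.0):
    """Identifier of the grid cell containing (lat, lon)."""
    row = int(lat // cell_size)
    col = int(lon // cell_size)
    return f"cell_{row}_{col}"

def spatio_textual_keys(terms, footprint_points, cell_size=1.0):
    """Concatenate every term with every cell touched by the footprint."""
    cells = {cell_id(lat, lon, cell_size) for lat, lon in footprint_points}
    return {f"{term}|{cell}" for term in terms for cell in cells}

doc_terms = ["castle", "museum"]
doc_footprint = [(51.5, -0.1), (51.5, -0.2)]   # two points near London
print(sorted(spatio_textual_keys(doc_terms, doc_footprint)))
```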
2024, International Journal of Document Analysis and Recognition (IJDAR)
DEBORA (Digital AccEss to BOoks of the RenAissance) is a multidisciplinary European project aiming at digitizing and thus making rare sixteenth-century books more accessible. End-users, librarians, historians, researchers in book history and computer scientists participated in the development of remote and collaborative access to digitized Renaissance books, necessary because of the reduced accessibility to digital libraries in image mode through the Internet. The size of files for the storage of images, the lack of a standard file format exchange suitable for progressive transmission, and limited querying possibilities currently limit remote access to digital libraries. To improve accessibility, historical documents must be digitized and retro-converted to extract a detailed description of the image contents suited to users' needs. Specialists of the Renaissance have described the metadata generally required by end-users and the ideal functionalities of the digital library. The retro-conversion of historical documents is a complex process that includes image capture, metadata extraction, image storage and indexing, automatic conversion into a reusable electronic form, publication on the Internet, and data compression for faster remote access. The steps of this process cannot be developed independently. DEBORA proposes a global approach to retro-conversion, from digitization to the final functionalities of the digital library, centered on users' needs. The retro-conversion process is mainly based on a document image analysis system that simultaneously extracts the metadata and compresses the images. We also propose a file format to describe
2024, ACM/IEEE Joint Conference on Digital Libraries
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) began as an alternative to distributed searching of scholarly eprint repositories. The model embraced by the OAI-PMH is that of metadata harvesting, where value-added services (by a "service provider") are constructed on cached copies of the metadata extracted from the repositories of the harvester's choosing. While this model dispenses with the well-known problems of distributed searching, it introduces the problem of synchronization. Stated simply, this problem arises when the service provider's copy of the metadata does not match the metadata currently at the constituent repositories. We define some metrics for describing the synchronization problem in the OAI-PMH. Based on these metrics, we study the synchronization problem of the OAI-PMH framework and propose several approaches for harvesters to implement better synchronization. In particular, if a repository knows its update frequency, it can publish it in an OAI-PMH Identify response using an optional About container that borrows from the RDF Site Summary (RSS) format.
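An incremental harvest against an OAI-PMH repository might look like the following Python sketch, which uses the standard ListRecords verb with a from argument; the repository URL is hypothetical, and resumption-token handling and error recovery are omitted.

```python
# Hedged sketch of an incremental OAI-PMH harvest: re-request only records
# changed since the last harvest using the standard ListRecords verb with a
# 'from' argument. The repository URL is hypothetical.
import datetime
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"   # hypothetical endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_since(last_harvest: datetime.date):
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",
        "from": last_harvest.isoformat(),
    }
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    for record in root.iter(f"{OAI}record"):
        identifier = record.find(f"{OAI}header/{OAI}identifier")
        yield identifier.text if identifier is not None else None

if __name__ == "__main__":
    for oai_id in harvest_since(datetime.date(2024, 1, 1)):
        print(oai_id)
```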
2024, Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
We introduce Enlil, an information extraction system that discovers the institutional affiliations of authors in scholarly papers. Enlil consists of two steps: one that first identifies authors and affiliations using a conditional random field; and a second support vector machine that connects authors to their affiliations. We benchmark Enlil in three separate experiments drawn from three different sources: the ACL Anthology, the ACM Digital Library, and a set of cross-disciplinary scientific journal articles acquired by querying Google Scholar. Against a state-of-the-art production baseline, Enlil reports a statistically significant improvement in F1 of nearly 10% (p « 0.01). In the case of multidisciplinary articles from Google Scholar, Enlil is benchmarked over both clean input (F1 > 90%) and automatically-acquired input (F1 > 80%). We have deployed Enlil in a case study involving Asian genomics research publication patterns to understand how government sponsored collaborative links evolve. Enlil has enabled our team to construct and validate new metrics to quantify the facilitation of research as opposed to direct publication.
2024, 2016 IEEE International Carnahan Conference on Security Technology (ICCST)
MPEG media have been widely adopted and are very successful in promoting interoperable services that deliver video to consumers on a range of devices. However, media consumption is going beyond the mere playback of a media asset and is geared towards a richer user experience that relies on rich metadata and content description. This paper proposes a technique for extracting and analysing metadata from a video, followed by decision making related to the video content. The system uses sentiment analysis for such a classification. It is envisaged that the system, when fully developed, will be applied to determine the existence of illicit multimedia content on the web.
2024, Intelligent Communication Technologies and Virtual Mobile Networks
Algorithms are a crucial part of any research and development effort. They are usually published in scientific publications, journals, conference papers, or theses, and they play an especially important role in computational and research areas where researchers and developers look for innovations. There is therefore a need for a search system that automatically finds algorithms in scholarly big data. Algo_Seer has been proposed as part of the CiteSeer system to automatically search for pseudo-code and algorithmic procedures and to perform indexing, analysis, and ranking in order to extract algorithms. This work proposes a search system, Algo_Seer, which uses a novel combination of techniques, such as rule-based and machine learning methods, to recognize, separate, and extract algorithms from scholarly documents. In particular, hybrid ensemble machine learning techniques are used to obtain efficient results.
2024, International Journal on Digital Libraries
In response to the proposal of digitizing the entire back-run of several European audio archives, many research projects have been carried out in order to discover the technical issues involved in making prestigious audio documents digitally available, which are related to the A/D transfer process and supervised metadata extraction. This article presents an innovative approach to metadata extraction from such complex source material. It also describes the protocols defined, the processes undertaken, the results obtained in several audio document preservation projects, and the techniques used. In addition, a number of recommendations are given for the re-recording process, aimed at minimizing information loss and at automatically measuring the unintentional alterations introduced by the A/D equipment.
2024, arXiv (Cornell University)
In this article, we introduce a set of methods to naturalize text based on natural human speech. Voice-based interactions provide a natural way of interfacing with electronic systems and are seeing widespread adoption of late. These computerized voices can be naturalized to some degree by inserting pauses and filler words at appropriate positions. The first proposed text transformation method uses the frequency of bigrams in the training data to make appropriate insertions in the input sentence. It uses a probability distribution to choose the insertions from a set of all possible insertions. This method is fast and can be included before a Text-To-Speech module. The second method uses a Recurrent Neural Network to predict the next word to be inserted. It confirms the insertions given by the bigram method. Additionally, the degree of naturalization can be controlled in both these methods. Based on a blind survey, we conclude that the output of these text transformation methods is comparable to natural speech.
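A minimal sketch of the bigram-based insertion method, assuming made-up bigram counts and a small filler inventory; the sampling rule is only illustrative of choosing insertions from a probability distribution and is not the paper's exact formulation.

```python
# Hedged sketch of the first method described above: use bigram counts from
# training data to decide where to insert fillers, sampling insertions from a
# probability distribution. Counts and filler inventory are made up.
import random
from collections import Counter

# Hypothetical counts of (previous_word, filler) bigrams from "training data".
FILLER_BIGRAMS = Counter({
    ("i", "um"): 12, ("think", "uh"): 7, ("is", "like"): 9, ("so", "um"): 5,
})

def naturalize(sentence, degree=0.5, rng=random.Random(0)):
    """Insert fillers after words, with probability proportional to the bigram
    count (normalized by the largest count) scaled by 'degree' in [0, 1]."""
    max_count = max(FILLER_BIGRAMS.values())
    output = []
    for word in sentence.lower().split():
        output.append(word)
        candidates = [(f, c) for (prev, f), c in FILLER_BIGRAMS.items()
                      if prev == word]
        for filler, count in candidates:
            if rng.random() < degree * count / max_count:
                output.append(filler)
                break
    return " ".join(output)

print(naturalize("I think this is a good idea", degree=1.0))
```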
2024, Lecture Notes in Computer Science
Current efforts on the semantic web are mainly focused on the creation of recommendations and standards for adding semantic descriptions to web resources. This situation represents a huge challenge to content creators, who have to construct such descriptions manually, implying high costs in material and human resources. This paper presents a multi-agent system that partially automates this task, i.e. the authoring of web documents, reducing content creators' labor. This system automatically extracts descriptive information from a set of documents in the Spanish language and constructs two output (web) document collections from them. The first collection is a set of meta-information descriptions based on the Dublin Core specifications. The second output is a collection of XHTML documents for human visualization and browsing. In order to build the two output collections, the proposed multi-agent system applies several intelligent text processing approaches. This paper describes these approaches, as well as the methodology used to encode the extracted metadata. It also reports results from processing three document collections of about 45 MB of text, including their associated resources (descriptions and hypertext) generated by the system.
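Encoding the extracted descriptions as Dublin Core might look like the following sketch; only the Dublin Core element names and namespace are standard, while the extracted values and record layout are invented and not the system's actual output format.

```python
# Hedged sketch of encoding extracted descriptions as Dublin Core metadata,
# in the spirit of the first output collection described above. The extracted
# values are invented; only the Dublin Core element names are standard.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(metadata):
    """metadata: dict mapping DC element names (title, creator, ...) to values."""
    record = ET.Element("record")
    for element, value in metadata.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(record, encoding="unicode")

extracted = {
    "title": "Colección de documentos de prueba",
    "creator": "Sistema multi-agente",          # hypothetical creator value
    "language": "es",
    "subject": "procesamiento de texto",
}
print(dublin_core_record(extracted))
```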
2024, Proceedings of the 2007 ACM symposium on Document engineering
In spite of the high profile of media types such as video, audio and images, many multimedia presentations rely extensively on text content. Text can be used for incidental labels, or as subtitles or captions that accompany other media objects. In a multimedia document, text content is not only constrained by the need to support presentation styles and layout, it is also constrained by the temporal context of the presentation. This involves intra-text and extra-text timing synchronization with other media objects. This paper describes a new timed-text representation language that is intended to be embedded in a non-text host language. Our format, which we call aText (for the Ambulant Text Format), balances the need for text styling with the requirement for an efficient representation that can be easily parsed and scheduled at runtime. aText, which can also be streamed, is defined as an embeddable text format for use within declarative XML languages. The paper presents a discussion of the requirements for the format, a description of the format and a comparison with other existing and emerging text formats. We also provide examples for aText when embedded within the SMIL and MLIF languages and discuss our implementation experiences of aText with the Ambulant Player.
2024, Journal of Cosmology and Astroparticle Physics
Dark matter particles will be captured in neutron stars if they undergo scattering interactions with nucleons or leptons. These collisions transfer the dark matter kinetic energy to the star, resulting in appreciable heating that is potentially observable by forthcoming infrared telescopes. While previous work considered scattering only on nucleons, neutron stars contain small abundances of other particle species, including electrons and muons. We perform a detailed analysis of the neutron star kinetic heating constraints on leptophilic dark matter. We also estimate the size of loop induced couplings to quarks, arising from the exchange of photons and Z bosons. Despite having relatively small lepton abundances, we find that an observation of an old, cold, neutron star would provide very strong limits on dark matter interactions with leptons, with the greatest reach arising from scattering off muons. The projected sensitivity is orders of magnitude more powerful than current dark matter-electron scattering bounds from terrestrial direct detection experiments.
2024, Information Processing & Management
2024, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Tables are used to present, list, summarize, and structure important data in documents. In scholarly articles, they are often used to present the relationships among data and highlight a collection of results obtained from experiments and scientific analysis. In digital libraries, extracting this data automatically and understanding the structure and content of tables are very important to many applications. Automatic identification, extraction, and search for the contents of tables can be made more precise with the help of metadata. In this paper, we propose a set of medium-independent table metadata to facilitate table indexing, searching, and exchange. To extract the contents of tables and their metadata, an automatic table metadata extraction algorithm is designed and tested on PDF documents.
2024, Proceedings of the 16th international conference on World Wide Web
Tables are ubiquitous. Unfortunately, no search engine supports table search. In this paper, we propose a novel table-specific search engine, TableSeer, to facilitate table extraction, indexing, searching, and sharing. In addition, we propose an extensive set of medium-independent metadata to precisely represent tables. Given a query, TableSeer ranks the returned results using an innovative ranking algorithm, TableRank, with a tailored vector space model and a novel term weighting scheme. Experimental results show that TableSeer outperforms existing search engines on table search. In addition, incorporating multiple weighting factors can significantly improve the ranking results.
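A rough sketch of vector-space ranking over table metadata in the spirit of TableRank; the weighting here is plain TF-IDF rather than the paper's tailored scheme, and the table text is invented.

```python
# Hedged sketch: score each <query, table> pair by cosine similarity of TF-IDF
# vectors built from the table's textual metadata. Plain TF-IDF stands in for
# the paper's tailored term weighting scheme; the data is invented.
import math
from collections import Counter

def tf_idf_vectors(documents):
    df = Counter(term for doc in documents for term in set(doc.split()))
    n = len(documents)
    vectors = []
    for doc in documents:
        tf = Counter(doc.split())
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

tables = ["accuracy results classifier dataset",
          "rainfall temperature monthly station",
          "precision recall f1 classifier"]
query = "classifier precision results"
vecs = tf_idf_vectors(tables + [query])
table_vecs, query_vec = vecs[:-1], vecs[-1]
ranking = sorted(enumerate(cosine(query_vec, tv) for tv in table_vecs),
                 key=lambda pair: pair[1], reverse=True)
print(ranking)  # list of (table index, score), best first
```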
2024, Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatically extracting tables from un-tagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables in documents, extracts table metadata, indexes and ranks tables, and provides a user-friendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm, TableRank. TableRank rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to automatically examine tables. We demonstrate the value of TableSeer with empirical studies on scientific documents.
2024
This paper will present the contribution of the European PrestoSpace project to the study and development of a Metadata Access and Delivery (MAD) platform for multimedia and television broadcast archives. The MAD system aims at generating, validating and delivering to archive users metadata created by automatic and semi-automatic information extraction processes. The MAD publication platform employs audiovisual content analysis, speech recognition (ASR) and semantic analysis tools. It then provides intelligent facilities to access the imported and newly produced metadata. The possibilities opened by the PrestoSpace framework for intelligent indexing and retrieval of multimedia objects within large-scale archives apply as well to more general scenarios where semantic information is needed to cope with the complexity of the search process.
2024, Proceedings Of The Association For Information Science And Technology
2024
It has long been established that many workplace tasks are managed through email communication, and that these tasks involve the exchange of requests and commitments. Users would be better able to manage and monitor tasks in their email if systems could identify the utterances which place responsibility for action on themselves or others. Such systems require a robust understanding of which utterances convey requests and commitments. Previous attempts to classify similar phenomena in email have mostly been at the message level and have lacked detailed and robust category definitions that allow unambiguous classification at the utterance level. To address this gap, this paper presents precise definitions for classifying requests and commitments in email, based on concepts from Speech Act Theory, and informed by the results of two independent manual annotation experiments using data from the Enron email corpus. The specific surface realisations of requests and commitments in email are also considered, with the aim of clarifying how a range of potentially difficult cases should be dealt with. This paper thus contributes a well-grounded definitional basis for the classification of task-oriented speech acts in email.
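A crude utterance-level tagger along these lines is sketched below; the cue phrases are illustrative only and are not the paper's Speech Act Theory-based definitions, which a real system would learn from annotated data.

```python
# Hedged sketch of utterance-level request/commitment tagging. The cue
# patterns below are crude illustrations, NOT the paper's category definitions.
import re

REQUEST_CUES = [r"\bcould you\b", r"\bcan you\b", r"\bplease\b",
                r"\bwould you mind\b", r"\blet me know\b"]
COMMITMENT_CUES = [r"\bi will\b", r"\bi'll\b", r"\bwe will\b",
                   r"\bi can send\b", r"\bby (monday|tomorrow|friday)\b"]

def classify_utterance(utterance):
    text = utterance.lower()
    if any(re.search(p, text) for p in REQUEST_CUES):
        return "request"
    if any(re.search(p, text) for p in COMMITMENT_CUES):
        return "commitment"
    return "other"

email_body = ["Could you review the attached draft by Friday?",
              "I'll send the updated figures tomorrow.",
              "Thanks for the meeting yesterday."]
for sentence in email_body:
    print(f"{classify_utterance(sentence):10s} | {sentence}")
```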
2024, 2011 IEEE International Conference on Consumer Electronics (ICCE)
In this paper, we propose a novel image indexing platform called INVENIO (INdexing Visual ENvironment for multimedia Items and Objects). INVENIO offers professional users both 2D and 3D content re-use facilities. Concerning the 2D aspects, the system is entirely based on the ISO/MPEG-7 normative specification. INVENIO integrates a visual metadata extraction engine, annotation tools, image database management tools, as well as appropriate, ergonomic user interfaces. In the case of 3D graphical content, INVENIO makes it possible to exploit existing animation curves for generating new content, thus accelerating the content production process.
2024
The first contribution of this study is the description of the prosodic behavior of discourse markers present in two speech corpora of European Portuguese (EP) in different domains (university lectures, and map-task dialogues). The second contribution is a multiclass classification to verify, given their prosodic features, which words in both corpora are classified as discourse markers, which are disfluencies, and which correspond to words that are neither markers nor disfluencies (chunks). Our goal is to automatically predict discourse markers and include them in rich transcripts, along with other structural metadata events (e.g., disfluencies and punctuation marks) that are already encompassed in the language models of our in-house speech recognizer. Results show that the automatic classification of discourse markers is better for the lectures corpus (87%) than for the dialogue corpus (84%). Nonetheless, in both corpora, discourse markers are more easily confused with chunks than with disfluencies.
2023, arXiv: Nuclear Theory
We derive an equation of state for magnetized charge-neutral nuclear matter relevant for neutron star structure. The calculations are performed within an effective chiral model based on a generalization of the sigma model with nonlinear self-interactions of the sigma mesons along with vector mesons and a $\rho$-$\sigma$ cross-coupling term. The effective chiral model is extended by introducing the contributions of a strong magnetic field on the charged particles of the model. The contributions arising from the effects of the magnetic field on the Dirac sea of charged baryons are also included. The resulting equation of state for the magnetized dense matter is used to investigate neutron star properties such as the mass-radius relation and tidal deformability. The dimensionless tidal deformability of a $1.4\,M_\odot$ NS is found to be $\Lambda_{1.4}=526$, which is consistent with the recent observation of GW170817. The maximum mass of a neutron star in the presence of a strong magnetic field is consistent with ...
2023, Computational Linguistics
We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources.
2023, CiteSeer X (The Pennsylvania State University)
Information retrieval techniques for speech are based on those developed for text, and thus expect structured data as input. An essential task is to add sentence boundary information to the otherwise unannotated stream of words output by automatic speech recognition systems. We analyze sentence segmentation performance as a function of feature types and transcription (manual versus automatic) for news speech, meetings, and a new corpus of broadcast conversations. Results show that: (1) overall, features for broadcast news transfer well to meetings and broadcast conversations; (2) pitch and energy features perform similarly across corpora, whereas other features (duration, pause, turn-based, and lexical) show differences; (3) the effect of speech recognition errors is remarkably stable over feature types and corpora, with the exception of lexical features for meetings; and (4) broadcast conversations, a new type of data for speech technology, behave more like news speech than like meetings for this task. Implications for modeling of different speaking styles in speech segmentation are discussed.
2023
The practical availability of Audiovisual Processing tools to media scholars and heritage institutions remains limited, despite all the technical advancements of recent years. In this article we present the approach chosen in the CLARIAH project to increase this availability, we discuss the challenges encountered, and introduce the technical solutions we are implementing. Through three use cases focused on the enrichment of AV archives, Pose Analysis, and Automatic Speech Recognition, we demonstrate the potential and breadth of using Audiovisual Processing for archives and Digital Humanities research.
2023
This paper evaluates the performance of tools for the extraction of metadata from scientific articles. Accurate metadata extraction is an important task for automating the management of digital libraries. This comparative study is a guide for developers looking to integrate the most suitable and effective metadata extraction tool into their software. We shed light on the strengths and weaknesses of seven tools in common use. In our evaluation using papers from the arXiv collection, GROBID delivered the best results, followed by Mendeley Desktop. SciPlore Xtract, PDFMeat, and SVMHeaderParse also delivered good results depending on the metadata type to be extracted.
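Calling a locally running GROBID service for header extraction might look like the following sketch; the host and port are assumptions, and the endpoint name follows the commonly documented GROBID REST API (processHeaderDocument), which should be checked against the installed version.

```python
# Hedged sketch of calling a locally running GROBID service to extract header
# metadata from a PDF, as evaluated above. Host, port, and file name are
# assumptions; the endpoint returns TEI XML in current GROBID versions.
import requests
import xml.etree.ElementTree as ET

GROBID_URL = "http://localhost:8070/api/processHeaderDocument"  # assumed local install
TEI = "{http://www.tei-c.org/ns/1.0}"

def extract_header(pdf_path):
    with open(pdf_path, "rb") as pdf:
        response = requests.post(GROBID_URL, files={"input": pdf}, timeout=60)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    title = root.find(f".//{TEI}titleStmt/{TEI}title")
    authors = [a.text for a in root.iter(f"{TEI}surname") if a.text]
    return {"title": title.text if title is not None else None,
            "authors": authors}

if __name__ == "__main__":
    print(extract_header("paper.pdf"))
```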
2023
The information on the Web increases tremendously. A number of search engines have been developed for searching Web information and retrieving relevant documents that satisfy the inquirers' needs. Search engines return irrelevant documents among the search results, since the search is text-based rather than semantic-based. The information retrieval research area has presented a number of approaches and methodologies, such as profiling, feedback, query modification, and human-computer interaction, for improving search results. Moreover, information retrieval has employed artificial intelligence techniques and strategies, such as machine learning heuristics, tuning mechanisms, user and system vocabularies, and logical theory, for capturing users' preferences and using them to guide the search based on semantic rather than syntactic analysis. Although a valuable improvement has been recorded in search results, the survey has shown that search engine users are still...
2023, Conference of the International Speech Communication Association
With the dramatic improvement in automated speech recognition (ASR) accuracy, a variety of machine learning (ML) and natural language processing (NLP) algorithms are designed for human conversation data. Supervised machine learning and particularly deep neural networks (DNNs) require large annotated datasets in order to train high quality models. In this paper we describe Gecko, a tool for annotation of speech and language features of conversations. Gecko allows efficient and effective segmentation of the voice signal by speaker as well as annotation of the linguistic content of the conversation. A key feature of Gecko is the presentation of the output of automatic segmentation and transcription systems in an intuitive user interface for editing. Gecko allows annotation of Voice Activity Detection (VAD), Diarization, Speaker Identification and ASR outputs on a large scale. Both annotators and data scientists have reported improvement in the speed and accuracy of work.
2023
We examine GetInfoArt-X, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the WWW. From the perspective of automatic information extraction and alternative modes of search, we examine various functional aspects of this advanced system in order to investigate and explore current and future research developments. GetInfoArt-X aims to provide significant alternative means of exploring profound knowledge, beyond traditional author- or title-based queries. In order to facilitate such depth, other informative aspects of publications, specifically algorithmic pseudocode and scientific figures, should be treated as potential target metadata. While these pose a bigger challenge for content processing, extracting and indexing distinctive document components may yield intriguing ways of gathering related documents based on non-conventional criteria. It may prove to be a not...
2023
The goal of the Collaboratory for the Multi-scale Chemical Sciences (CMCS) [1] is to develop an informatics-based approach to synthesizing multi-scale chemistry information to create knowledge in the chemical sciences. CMCS is using a portal and metadata-aware content store as a base for building a system to support inter-domain knowledge exchange in chemical science. Key aspects of the system include configurable metadata extraction and translation, a core schema for scientific pedigree, and a suite of tools for managing data and metadata and visualizing pedigree relationships between data entries. CMCS metadata is represented using Dublin Core with metadata extensions that are useful to both the chemical science community and the science community in general. CMCS is working with several chemistry groups who are using the system to collaboratively assemble and analyze existing data to derive new chemical knowledge. In this paper we discuss the project's metadata-related requirements, the relevant software infrastructure, core metadata schema, and tools that use the metadata to enhance science.
2023
A new scheme for detecting fillers in the spontaneous speech recognition process was developed. When a filler hypothesis appears during the second-pass decoding of a speech recognizer with a two-pass configuration, a prosodic module checks the morpheme that is hypothesized as a filler and outputs the likelihood score of the morpheme being a filler. When the likelihood score exceeds a threshold,
2023
The emerging Semantic Web has attracted many researchers and developers. New applications have been developed on top of the Semantic Web, and many supporting tools have been introduced to improve its software development process. Metadata modeling is one part of the development process where supporting tools exist. The existing tools lack readability and ease of use for a domain knowledge expert who wants to graphically model a problem as a semantic model. In this paper, a metadata modeling tool called RDFGraph is proposed. This tool is meant to solve those problems. RDFGraph is also designed to work with modern database management systems that support RDF and to improve the performance of the query execution process. The testing results show that the rules used in RDFGraph follow the W3C standard and that the graphical model produced by this tool is properly translated and correct.
2023, Interdisciplinary Journal of e-Skills and Lifelong Learning
In recent years, the development of different Repositories of Learning Objects has increased. Users can retrieve these resources for reuse and personalization through searches in web repositories. High-quality metadata is key for successful retrieval. Learning Objects are described with metadata, usually in the IEEE LOM standard. We have designed and implemented a Learning Object Metadata ontology (LOM ontology) that establishes an intermediate layer offering a shared vocabulary that allows specifying restrictions and gives a common semantics for any application which uses Learning Object metadata. Thus, every change in the LOM ontology will be reflected in the different applications that use this ontology with no need to modify their code. In this work, as a proof of concept, we present an assistant prototype to help users load these Objects into repositories. This prototype automatically extracts, restricts and validates the Learning Object metadata using the LOM ontology.
2023
In the current age, retrieval of relevant information from massive amounts of data is a challenging job. Over the years, precise and relevant retrieval of information has attained high significance. There is a growing need in the market to build systems that can retrieve multimedia information which precisely meets the user's current needs. In this paper, we have introduced a framework for refining query results before showing them to the user, using ambient intelligence, user profile, group profile, user location, time, day, user device type and extracted features. A prototype tool was also developed to demonstrate the efficiency of the proposed approach.
2023
Syntactic parsing of speech transcriptions faces the problem of the presence of disfluencies that break the syntactic structure of the utterances. We propose in this paper two solutions to this problem. The first one relies on a disfluency predictor that detects disfluencies and removes them prior to parsing. The second one integrates the disfluencies in the syntactic structure of the utterances and trains a disfluency-aware parser.
2023, IEEE Access
JPEG 2000 is a popular image compression technique that uses the Discrete Wavelet Transform (DWT) for compression and subsequently provides many rich features for efficient storage and decompression. Though compressed images are preferred for archival and communication purposes, their processing becomes difficult due to the overhead of decompression and re-compression operations, which are needed every time the data must be operated on. Therefore, in this research paper, the novel idea of operating directly on JPEG 2000 compressed documents is proposed for extracting text and non-text regions without using any segmentation algorithm. The technique avoids full decompression of the compressed document, in contrast to conventional methods, which fully decompress and then process. Moreover, JPEG 2000 features are explored in this research work to partially and intelligently decompress only the selected regions of interest at different resolutions and bit-depths to accomplish segmentation-less extraction of text and non-text regions. Finally, the Maximally Stable Extremal Regions (MSER) algorithm is used to extract the layout of segmented text and non-text regions for further analysis. Experiments have been carried out on the standard PRImA Layout Analysis Dataset, leading to promising results and saving computational resources.
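The overall idea, decoding a JPEG 2000 page at reduced resolution and running MSER on the result, can be sketched as follows using the glymur and OpenCV packages; this is a stand-in for the paper's direct compressed-domain method (which avoids full decompression), and the file name is hypothetical.

```python
# Hedged sketch: decode a JPEG 2000 document at reduced resolution via glymur,
# then run MSER to find text-like regions. This approximates the idea above
# but is NOT the paper's segmentation-less compressed-domain method.
import cv2
import glymur
import numpy as np

def text_region_boxes(jp2_path, reduction=2):
    jp2 = glymur.Jp2k(jp2_path)
    # Striding by a power of two asks glymur for a lower-resolution decode.
    image = jp2[::reduction, ::reduction]
    if image.ndim == 3:
        image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    mser = cv2.MSER_create()
    _regions, boxes = mser.detectRegions(image.astype(np.uint8))
    return boxes  # each box is (x, y, width, height) at the reduced scale

if __name__ == "__main__":
    for x, y, w, h in text_region_boxes("scanned_page.jp2"):
        print(x, y, w, h)
```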