Metadata Extraction Research Papers - Academia.edu

Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method built on looking at documents from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with a characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments relating to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.

Effective human and automatic processing of speech requires recovery of more than just the words. It also involves recovering phenomena such as sentence boundaries, filler words, and disfluencies, referred to as structural metadata. We describe a metadata detection system that combines information from different types of textual knowledge sources with information from a prosodic classifier. We investigate maximum entropy and conditional random field models, as well as the predominant HMM approach, and find that discriminative models generally outperform generative models. We report system performance on both broadcast news and conversational telephone speech tasks, illustrating significant performance differences across tasks and as a function of recognizer performance. The results represent the state of the art, as assessed in the NIST RT-04F evaluation.

Text information in scanned documents becomes accessible only when extracted and interpreted by a text recognizer. For a recognizer to work successfully, it must have detailed location information about the regions of the document images that it is asked to analyse. It needs to focus on page regions containing text, skipping non-text regions such as illustrations or photographs. However, text recognizers do not work as logical analyzers. Logical layout analysis automatically determines the function of a document text region, that is, it labels each region as a title, paragraph, caption, and so on, and thus is an essential part of a document understanding system. In the past, rule-based algorithms were used to conduct logical layout analysis, using limited-size data sets. Here we instead focus on supervised learning methods for logical layout analysis. We describe LABA, a system based on multiple support vector machines that performs logical Layout Analysis of scanned Books pages in Arabic. The system detects the function of a text region based on the analysis of various image features and a voting mechanism. For a baseline comparison, we implemented an older but state-of-the-art neural network method. We evaluated LABA using a data set of scanned pages from illustrated Arabic books and obtained high recall and precision values. We also found that the F-measure of LABA is higher for five of the six tested classes compared to the state-of-the-art method.
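
To make the voting idea concrete, here is a minimal sketch, assuming scikit-learn, of one-vs-rest SVMs over region feature vectors with a score-based vote; the label set and feature handling are illustrative and not the LABA implementation.

```python
# Illustrative sketch (not the LABA implementation): one binary SVM per logical
# label, with a simple vote over the per-class probability scores.
import numpy as np
from sklearn.svm import SVC

LABELS = ["title", "paragraph", "caption", "header", "footnote", "page_number"]  # assumed labels

def train_per_class_svms(X, y):
    """Train one binary SVM per logical label (one-vs-rest)."""
    y = np.asarray(y)
    models = {}
    for label in LABELS:
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, (y == label).astype(int))
        models[label] = clf
    return models

def classify_region(models, region_features):
    """Each SVM 'votes' with its probability; the highest-scoring label wins."""
    x = np.asarray(region_features).reshape(1, -1)
    scores = {label: clf.predict_proba(x)[0, 1] for label, clf in models.items()}
    return max(scores, key=scores.get)
```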

The increasing availability of on-line music has motivated a growing interest in organizing, commercializing, and delivering this kind of multimedia content. For this, the use of metadata is of utmost importance. Metadata permit the organization, indexing, and retrieval of music content. They are, therefore, a subject of research both from the design perspective and from the automatic-extraction perspective. The present work focuses on the second issue, providing an open source tool for metadata extraction from standard MIDI files. The tool is presented, the metadata it uses are explained, and some applications and experiments are described as examples of its capabilities.
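
As a rough illustration of the kind of descriptive metadata that can be read from a standard MIDI file, the sketch below uses the third-party mido library (an assumption on our part; it is not the open source tool described in the abstract).

```python
# Minimal sketch of pulling descriptive metadata out of a standard MIDI file.
# Uses the third-party `mido` library as an assumption; this is not the tool
# described in the abstract.
import mido

def extract_midi_metadata(path):
    mid = mido.MidiFile(path)
    meta = {
        "format": mid.type,
        "tracks": len(mid.tracks),
        "ticks_per_beat": mid.ticks_per_beat,
        "track_names": [], "tempos_bpm": [],
        "key_signatures": [], "time_signatures": [],
    }
    for track in mid.tracks:
        for msg in track:
            if msg.type == "track_name":
                meta["track_names"].append(msg.name)
            elif msg.type == "set_tempo":
                meta["tempos_bpm"].append(mido.tempo2bpm(msg.tempo))
            elif msg.type == "key_signature":
                meta["key_signatures"].append(msg.key)
            elif msg.type == "time_signature":
                meta["time_signatures"].append(f"{msg.numerator}/{msg.denominator}")
    return meta
```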

Tables are ubiquitous in digital libraries. In scientific documents, tables are widely used to present experimental results or statistical data in a condensed fashion. However, current search engines do not support table search. The difficulty of automatically extracting tables from untagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. In this paper, we describe TableSeer, a search engine for tables. TableSeer crawls digital libraries, detects tables in documents, extracts table metadata, indexes and ranks tables, and provides a user-friendly search interface. We propose an extensive set of medium-independent metadata for tables that scientists and other users can adopt for representing table information. In addition, we devise a novel page box-cutting method to improve the performance of table detection. Given a query, TableSeer ranks the matched tables using an innovative ranking algorithm, TableRank, which rates each <query, table> pair with a tailored vector space model and a specific term weighting scheme. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to examine tables automatically. We demonstrate the value of TableSeer with empirical studies on scientific documents.
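
A minimal sketch of the vector-space idea behind rating <query, table> pairs, using plain TF-IDF cosine similarity as a stand-in for TableRank's tailored term weighting scheme (assumes scikit-learn).

```python
# Hedged sketch: tables are represented by their extracted metadata text and
# scored against the query with TF-IDF cosine similarity. The actual TableRank
# term weighting scheme is more specialised than plain TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_tables(query, table_metadata_texts):
    """Return (table_index, score) pairs, best match first."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(table_metadata_texts + [query])
    table_vecs, query_vec = matrix[:-1], matrix[-1]
    scores = cosine_similarity(table_vecs, query_vec).ravel()
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Example (made-up metadata strings):
# rank_tables("precision recall svm", ["Table 1: SVM precision and recall", "Table 2: dataset sizes"])
```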

The popularity of motion pictures in digital form has seen a dramatic increase in recent years, and the global entertainment market has driven demand for subtitles in multiple languages. This paper investigates the informational potential of aggregating a corpus of multilingual subtitles for a digital library. Subtitles are extracted from commercial DVD releases and downloaded from the internet. These subtitles and their bibliographic metadata are then incorporated into an XML-based database structure. A digital library prototype is developed to provide full-text search and browsing of the subtitle text with single- or parallel-language displays. The resulting product includes a set of tools for subtitle acquisition and a web browser-based digital library prototype that is portable, extensible, and interoperable across computing platforms. The functionalities of this prototype are discussed in comparison to another subtitle corpus created for computational linguistics studies. Several informational potentials of this digital library prototype are identified: as an educational tool for language learning, as a finding aid for citations, and as a gateway to additional temporal access points for video retrieval.

Many multimedia researchers have focused on the retrieval of images from indexed image collections. A number of spatial data structures based on Minimum Bounding Rectangles (MBRs) have been developed. Previously, we presented ...

Purpose of this paper - To support the automation of the annotation process for large corpora of digital content. Design/methodology/approach - In this paper we first present and discuss an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed. Findings - The proposed pipeline is implemented in a working prototype of an Autonomous Digital Library system, the ScienceTreks system, that: (1) supports a broad range of methods for document acquisition; (2) does not rely on any external information sources and is based solely on the information existing in the document itself and in the overall set in a given digital archive;

The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information. Major components include the user interface, a geographical ontology, maintenance and retrieval functions for a test collection of web documents, textual and spatial indexes, relevance ranking and metadata extraction. Here we summarise the functionality and interaction between these components before focusing on the design of the geo-ontology and the development of spatio-textual indexing methods. The geo-ontology supports functionality for disambiguation, query expansion, relevance ranking and metadata extraction. Geographical place names are accompanied by multiple geometric footprints and qualitative spatial relationships. Spatial indexing of documents has been integrated with text indexing through the use of spatio-textual keys in which terms are concatenated with spatial cells to which they relate. Preliminary experiments demonstrate considerable performance benefits when compared with pure text indexing and with text indexing followed by a spatial filtering stage.
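
The spatio-textual key construction can be illustrated with a short sketch; the grid resolution, cell ID format and key separator below are assumptions, not the SPIRIT implementation.

```python
# Illustrative sketch of spatio-textual indexing: each index key is a term
# concatenated with the ID of a spatial grid cell that the document's
# geographic footprint covers. Grid size and key format are assumed.
from typing import Iterable, List, Tuple

CELL_DEG = 0.5  # grid cell size in degrees (assumption)

def cells_for_footprint(min_lon, min_lat, max_lon, max_lat) -> List[str]:
    """Enumerate grid-cell IDs overlapping a bounding-box footprint."""
    cells = []
    lon = min_lon
    while lon <= max_lon:
        lat = min_lat
        while lat <= max_lat:
            cells.append(f"{int(lon // CELL_DEG)}_{int(lat // CELL_DEG)}")
            lat += CELL_DEG
        lon += CELL_DEG
    return cells

def spatio_textual_keys(terms: Iterable[str],
                        footprint: Tuple[float, float, float, float]) -> List[str]:
    """Concatenate each term with each overlapping cell ID to form index keys."""
    return [f"{term}|{cell}"
            for term in terms
            for cell in cells_for_footprint(*footprint)]
```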

One of the biggest problems in implementing a Spatial Data Infrastructure is the creation of metadata, because it requires considerable knowledge of cartography-related sciences and a good infrastructure of tools that can obtain information from the original Geographic Information, carry out coordinate transformations or conversions, transcribe multiple pieces of information, and so on. This fact has motivated a detailed study of the file formats of geographic information. The objective was to identify the information that can be acquired from the files themselves and to study how it relates to the metadata standard for geographic information. The study has shown the different techniques used by enterprises to store meta-information in the form of headers, directories, and labels. Homogeneous groups of information that can be retrieved from the different categories of formats have been identified. This study has provided a high range of conclusions and perspectives which should b...

Scholarly digital libraries increasingly provide analytics on information within the documents themselves. This includes information about the logical document structure that is of use to downstream components such as search, navigation, and summarization. In this paper, the authors describe SectLabel, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields. While previous work has assumed access only to the raw text ...
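
A minimal sketch of line-level logical structure labelling with a linear-chain CRF, in the spirit of SectLabel; the feature functions, label names and the sklearn-crfsuite dependency are assumptions, not the actual SectLabel model.

```python
# Hedged sketch: label each text line of a document with a logical-structure
# class using a linear-chain CRF. Features and labels are illustrative.
import sklearn_crfsuite

def line_features(lines, i):
    line = lines[i]
    feats = {
        "lower_prefix": line.lower()[:30],
        "is_upper": line.isupper(),
        "starts_with_number": line[:2].rstrip(".").isdigit(),
        "length_bucket": min(len(line) // 10, 10),
        "relative_position": i / max(len(lines) - 1, 1),
    }
    if i > 0:
        feats["prev_is_blank"] = (lines[i - 1].strip() == "")
    return feats

def train_crf(documents, labels):
    """documents: list of line lists; labels: per-line tags such as
    'title', 'author', 'sectionHeader', 'bodyText', 'figureCaption' (assumed)."""
    X = [[line_features(doc, i) for i in range(len(doc))] for doc in documents]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, labels)
    return crf
```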

In this paper we propose a method to automatically extract metadata (title, authors, affiliation, email, references, etc.) from scientific papers by combining the layout information of the papers with rules defined using the JAPE grammar of GATE. After metadata are extracted automatically from digital documents, users can review and correct them before they are exported to XML files. Developing a tool to extract metadata from digital documents is a very necessary and useful task for building collections and for organizing and searching documents in digital libraries. The extraction method is tested on computer science paper collections selected from international journals and proceedings downloaded from digital libraries such as ACM, IEEE, Springer and CiteSeer.
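
To illustrate the flavour of such rules, here is a hedged sketch written with plain Python regular expressions rather than GATE's JAPE syntax; the patterns and field names are assumptions for the example.

```python
# Illustrative pattern rules for header metadata; not JAPE and not the
# authors' rule set. Fields and heuristics are assumed for the example.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_header_metadata(first_page_lines):
    meta = {"title": None, "authors": None, "emails": []}
    title_idx = None
    # Rule 1: approximate the title as the first non-empty line of the page.
    for i, line in enumerate(first_page_lines):
        if line.strip():
            meta["title"], title_idx = line.strip(), i
            break
    # Rule 2: e-mail addresses anywhere in the header block.
    meta["emails"] = EMAIL_RE.findall("\n".join(first_page_lines[:40]))
    # Rule 3: the author line usually follows the title and looks like
    # capitalised names, possibly comma-separated.
    if title_idx is not None:
        for line in first_page_lines[title_idx + 1: title_idx + 4]:
            if re.match(r"^[A-Z][\w.-]+(\s+[A-Z][\w.-]+)+(,\s*.+)?$", line.strip()):
                meta["authors"] = line.strip()
                break
    return meta
```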

Both human and automatic processing of speech require recognition of more than just words. In this paper we provide a brief overview of research on structural metadata extraction in the DARPA EARS rich transcription program. Tasks include detection of sentence boundaries, filler words, and disfluencies. Modeling approaches combine lexical, prosodic, and syntactic information, using various modeling techniques for knowledge source integration. The performance of these methods is evaluated by task, by data source (broadcast news versus spontaneous telephone conversations) and by whether transcriptions come from humans or from an (errorful) automatic speech recognizer. A representative sample of results shows that combining multiple knowledge sources (words, prosody, syntactic information) is helpful, that prosody is more helpful for news speech than for conversational speech, that word errors significantly impact performance, and that discriminative models generally provide benefit over maximum likelihood models. Important remaining issues, both technical and programmatic, are also discussed.
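
As a simple illustration of knowledge-source combination for one of these tasks, the sketch below interpolates a lexical boundary posterior with a prosodic classifier posterior per word boundary; the weighting and threshold are placeholders, not the EARS systems' actual integration methods.

```python
# Hedged sketch: log-linear interpolation of two sentence-boundary posteriors
# (lexical model and prosodic classifier) at each word boundary.
import math

def combine_posteriors(lexical_p, prosodic_p, weight=0.5, eps=1e-9):
    """Log-linear interpolation of two boundary posteriors (weight is assumed)."""
    log_p = weight * math.log(lexical_p + eps) + (1 - weight) * math.log(prosodic_p + eps)
    return math.exp(log_p)

def detect_boundaries(lexical_posteriors, prosodic_posteriors, threshold=0.5):
    """Mark a sentence boundary after word i when the combined posterior is high."""
    return [i for i, (lp, pp) in enumerate(zip(lexical_posteriors, prosodic_posteriors))
            if combine_posteriors(lp, pp) >= threshold]
```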

Video editing is the work of producing final videos of a certain duration by finding and selecting appropriate cuts from the source material and connecting them. To produce excellent videos, this process is generally conducted according to special rules called "video grammar". The purpose of this study is to develop an intelligent support system for video editing in which metadata are extracted automatically and the video grammar is then applied to the extracted metadata. In this paper, we describe the extraction of metadata such as camera work, camera tempo, camera direction, faces, and shot size.

This paper describes a metadata extraction technique based on natural language processing (NLP) which extracts personalized information from email communications between financial analysts and their clients. Personalized means connecting users with content in a personally meaningful way to create, grow, and retain online relationships. Personalization often results in the creation of user profiles that store individuals' preferences regarding goods or services offered by various e-commerce merchants. With the introduction of e-commerce, it has become more difficult to develop and maintain personalized information due to larger transaction volumes. <!metaMarker> is an NLP and Machine Learning (ML)-based automatic metadata extraction system designed to process textual data such as emails, discussion group postings, or chat group transcriptions. <!metaMarker> extracts both explicit and implicit metadata elements, including proper names, numeric concepts, and topic/subject information. In addition, Speech Act Theory-inspired metadata elements, which represent the message creator's intention, mood, and urgency, are also extracted. In a typical dialogue between financial analysts and their clients, clients often discuss the items that they like or have an interest in. By extracting this information, <!metaMarker> constructs user profiles automatically. The system has been designed, implemented, and tested with real-world data. The overall accuracy and coverage of extracting explicit and implicit metadata is about 90%. In summary, the paper shows that an NLP-based metadata extraction system enables automatic user profiling with high effectiveness.

Tables are used to present, list, summarize, and structure important data in documents. In scholarly articles, they are often used to present relationships among data and to highlight collections of results obtained from experiments and scientific analysis. In digital libraries, automatically extracting this data and understanding the structure and content of tables are very important to many applications. Automatic identification, extraction, and search of table contents can be made more precise with the help of metadata. In this paper, we propose a set of medium-independent table metadata to facilitate table indexing, searching, and exchange. To extract the contents of tables and their metadata, an automatic table metadata extraction algorithm is designed and tested on PDF documents.

In this paper, a system, RitroveRAI, addressing the general problem of enriching a multimedia news stream with semantic metadata is presented. News metadata here are either explicitly derived from transcribed sentences or implicitly expressed by a topical category that is detected automatically. The enrichment process is accomplished by searching for the same news as reported by different agencies reachable over the Web. Metadata extraction from the alternative sources (i.e. Web pages) is applied in a similar way, and finally the sources are integrated according to a heuristic of pertinence. Performance evaluation of the current system prototype has been carried out on a large scale. It confirms the viability of the RitroveRAI approach for realistic (i.e. 24-hour) applications and for continuous monitoring and metadata extraction from multimedia news data.

In this article, we discuss the potential benefits, the requirements and the challenges involved in patent image retrieval, and subsequently we propose a framework that encompasses advanced image analysis and indexing techniques to address the need for content-based patent image search and retrieval. The proposed framework involves the application of document image pre-processing, image feature extraction and textual metadata extraction in order to effectively support content-based image retrieval in the patent domain. To evaluate the capabilities of our proposal, we implemented a patent image search engine. Results based on a series of interaction modes, a comparison with existing systems, and a quantitative evaluation of our engine provide evidence that image processing and indexing technologies are currently sufficiently mature to be integrated into real-world patent retrieval applications.

When a user performs a web search, the first query entered will frequently not return the required information. Thus, one needs to review the initial set of links and then modify the query or construct a new one. This incremental process is particularly frustrating and difficult to manage for a mobile user due to device limitations (e.g. keyboard, display). We present a query formulation architecture that employs the notion of context in order to automatically construct queries, where context refers to the article currently being viewed by the user. The proposed system uses semantic metadata extracted from the web page being consumed to automatically generate candidate queries. Novel methods are proposed to create and validate candidate queries. Furthermore, two variants of query expansion and a post-expansion validation technique are described. Finally, insights into the effectiveness of our system are provided based on evaluation tests of its individual components.

This paper describes a cooperative distributed system for outdoor surveillance based on fixed and mobile cameras. In order to continuously monitor the entire scene, a fixed unidirectional sensor mounted on the roof of a building in front of the guarded area has been used. To obtain higher-resolution images of a particular region in the scene, an active pan-tilt-zoom camera has been used. The low-resolution images are used to detect and locate moving objects in the scene. The estimated object position is used to evaluate the pan-tilt movements necessary to focus the attention of the mobile-head camera on the considered object at a higher zoom level. The implemented system provides automatic change detection at multiple zoom levels as its main feature. Video shot with a small zoom factor is used to monitor the entire scene from the fixed camera, while medium and high zoom factors are used to improve the interpretation of the scene. The use of a mobile camera allows one to overcome the limited field of view imposed by a fixed camera. In this case it is not possible to provide a priori knowledge of the background of the scene. The proposed method for solving the non-fixed background problem for mobile cameras consists of building a multilevel structure obtained from the acquisition of several images. The panoramic image of the whole scene is generated by using a mosaicing technique. Both sensors are used to detect and estimate the precise location of a given object at different zoom levels in order to obtain a better position estimate. The results presented in the paper show the validity of the proposed approach in terms of the system's false-alarm and misdetection probabilities, the algorithms' computational complexity, and the mean processing time.

Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a Support Vector Machine classification-based method for metadata extraction from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of neighboring lines from the previous round. Further metadata extraction is done by seeking the best chunk boundaries within each line. We found that discovering and using the structural patterns of the data and domain-based word clustering can improve metadata extraction performance. Appropriate feature normalization also greatly improves classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries CiteSeer and eBizSearch, and we believe it can be generalized to other digital libraries.
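
The iterative contextual refinement can be sketched as follows, assuming scikit-learn and illustrative feature names; this is a sketch of the idea, not the authors' implementation.

```python
# Hedged sketch of iterative line classification: classify lines, then
# re-classify with neighbours' predicted labels added as features, repeating
# until the labelling stops changing.
from sklearn.svm import LinearSVC
from sklearn.feature_extraction import DictVectorizer

def iterative_line_classification(line_feature_dicts, base_clf, vec, max_rounds=5):
    """line_feature_dicts: one dict of base features per header line.
    base_clf / vec: a LinearSVC and DictVectorizer already fitted on
    (base features + neighbour-label features); both are assumed here."""
    labels = ["<unknown>"] * len(line_feature_dicts)
    for _ in range(max_rounds):
        augmented = []
        for i, feats in enumerate(line_feature_dicts):
            f = dict(feats)
            f["prev_label"] = labels[i - 1] if i > 0 else "<start>"
            f["next_label"] = labels[i + 1] if i + 1 < len(labels) else "<end>"
            augmented.append(f)
        new_labels = list(base_clf.predict(vec.transform(augmented)))
        if new_labels == labels:   # converged
            break
        labels = new_labels
    return labels
```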

The integration of bibliographical information on scholarly publications available on the Internet is an important task in the academic community. Accurate reference metadata extraction from such publications is essential for the integration of metadata from heterogeneous reference sources. In this paper, we propose a hierarchical template-based reference metadata extraction method for scholarly publications. We adopt a hierarchical knowledge representation framework called INFOMAP, which automatically extracts metadata. The experimental results show that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference styles with a high degree of precision. The overall average accuracy is 92.39% for the six major reference styles compared in this study.
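
As an illustration of template-based parsing, the sketch below handles one common reference style with a single regular expression; it stands in for, and is much simpler than, the hierarchical INFOMAP templates.

```python
# Illustrative template for one reference style:
#   authors. title. journal, volume(number), pages, year.
# The pattern and field names are assumptions, not the INFOMAP framework.
import re

REF_TEMPLATE = re.compile(
    r"^(?P<authors>[^.]+)\.\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<journal>[^,]+),\s*"
    r"(?P<volume>\d+)(?:\((?P<number>\d+)\))?,\s*"
    r"(?P<pages>\d+\s*[-–]\s*\d+),\s*"
    r"(?P<year>\d{4})\.?$"
)

def parse_reference(ref_string):
    """Return a metadata dict, or None if the string does not match the template."""
    m = REF_TEMPLATE.match(ref_string.strip())
    return m.groupdict() if m else None

# Made-up example string for illustration only:
# parse_reference("Smith J and Jones K. A template approach to parsing. Journal of Testing, 12(3), 45-67, 2004.")
```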

Metadata are necessary to allow discovery and description of data and service resources within a Spatial Data Infrastructure; however, current manual metadata editing workflows are tedious and under-utilized. We discuss on-going developments for semi-automatic metadata extraction from well-known imagery and cartographic data sources, being implemented within an open source software project in Spain. Internal metadata are collected automatically, and the user can then choose to add external metadata and to publish the final metadata record to catalogues. The next step will be to extract implicit metadata using Google-like methods.
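
A minimal sketch of collecting "internal" metadata from a georeferenced raster, assuming the rasterio library; the abstract's open source project is not named here and this is not its code.

```python
# Hedged sketch: read format-internal metadata from a georeferenced raster.
# These fields map naturally onto metadata elements such as extent, spatial
# resolution and reference system.
import rasterio

def internal_raster_metadata(path):
    with rasterio.open(path) as ds:
        return {
            "driver": ds.driver,
            "crs": str(ds.crs),
            "bounds": tuple(ds.bounds),      # (left, bottom, right, top)
            "resolution": ds.res,            # (x_pixel_size, y_pixel_size)
            "size": (ds.width, ds.height),
            "bands": ds.count,
            "dtypes": list(ds.dtypes),
            "embedded_tags": ds.tags(),      # format-specific header tags
        }
```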

The goal of the Collaboratory for the Multi-scale Chemical Sciences (CMCS) [1] is to develop an informatics-based approach to synthesizing multi-scale chemistry information to create knowledge in the chemical sciences. CMCS is using a portal and metadata-aware content store as a base for building a system to support inter-domain knowledge exchange in chemical science. Key aspects of the system include configurable metadata extraction and translation, a core schema for scientific pedigree, and a suite of tools for managing data and metadata and visualizing pedigree relationships between data entries. CMCS metadata is represented using Dublin Core with metadata extensions that are useful to both the chemical science community and the science community in general. CMCS is working with several chemistry groups who are using the system to collaboratively assemble and analyze existing data to derive new chemical knowledge. In this paper we discuss the project's metadata-related requirements, the relevant software infrastructure, core metadata schema, and tools that use the metadata to enhance science.

With a very modest investment in computer hardware and the open-source Local Data Manager (LDM) software from the University Corporation for Atmospheric Research (UCAR) Unidata Program Center, a researcher can receive a variety of NEXRAD Level III rainfall products and the unprocessed Level II data in real time from most NEXRAD radars in the USA. Alternatively, one can receive such data from the National Climatic Data Center in Asheville, NC. Still, significant obstacles remain to unlocking the full potential of the data. One set of obstacles is related to the effective management of multi-terabyte datasets. A second set of obstacles, for hydrologists and hydrometeorologists in particular, is that the NEXRAD Level III products are not well suited for applications in hydrology. There is a strong need for the generation of high-quality products directly from the Level II data with well-documented steps that include quality control, removal of false echoes, rainfall estimation algorithms,...

New on-line courses are often created by using existing learning objects found on the net. However, those learning objects cannot easily be reused for the creation of a new didactic work because they are usually offered without information on their aims or the types of users for whom they are intended. Moreover, their contents are not clearly synthesized, so reading the whole object is often necessary to understand its relevance to the new course. To facilitate this task, we have created a system called SAXEF (System for Automatic eXtraction of lEarning object Features) which automatically extracts the basic indicators of any learning object (a sort of DNA) found on the Internet. It provides valuable help to a teacher who is in the process of creating a new on-line course, because he or she can easily choose the most appropriate learning objects from the net just by looking at their basic indicators. SAXEF has a modular structure; we have already developed some modules and are in the process of implementing the rest of the system. This paper presents the main architecture of SAXEF and the details of the text analysis module for extracting the main and secondary topics of a learning object.

Although oral culture has been part of our history for thousands of years, only fairly recently have we been able to record and preserve that part of our heritage. Over the past century, millions of hours of audiovisual data have been collected. Typically, audiovisual (A/V) archival institutes are the keepers of these collections, a significant part of which contains spoken word materials such as interviews, speeches and radio broadcasts.

We present a way of building ontologies that proceeds in a bottom-up fashion, defining concepts as clusters of concrete XML objects. Our rough bottom-up ontologies are based on simple relations like association and inheritance, as well as on value restrictions, and can be used to enrich and update existing upper ontologies. Then, we show how automatically generated assertions based on our bottom-up ontologies can be associated with a flexible degree of trust by non-intrusively collecting user feedback in the form of implicit and explicit votes. Dynamic trust-based views on assertions automatically filter out imprecisions and substantially improve metadata quality in the long run.

This paper builds on the work presented at ECDL 2006 in automated genre classification as a step toward automating metadata extraction from digital documents for ingest into digital repositories such as those run by archives, libraries and eprint services. We have previously proposed dividing the features of a document into five types (features for visual layout, language model features, stylometric features, features for semantic structure, and contextual features as an object linked to previously classified objects and other external sources) and have examined visual and language model features. The current paper compares results from testing classifiers based on image and stylometric features in a binary classification to show that certain genres have strong image features which enable effective separation of documents belonging to the genre from a large pool of other documents.

Enterprises provide professionally authored content about their products and services in different languages for use in web sites and customer care. For customer care, personalization (personalized information delivery) is becoming important, since it encourages users to return to the service provider. Personalization usually requires both contextual and descriptive metadata, but the metadata currently authored by content developers is usually quite simple. In this paper, we introduce an automatic metadata extraction framework which can extract multilingual metadata from enterprise content for a personalized information retrieval system. We introduce two new ontologies for metadata creation and a novel semi-automatic topic vocabulary extraction algorithm. We demonstrate and evaluate our approach on the English and German Symantec Norton 360 technical content. Evaluations indicate that the proposed approach produces rich and high-quality metadata for a personalized information retrieval system.

DEBORA (Digital AccEss to BOoks of the RenAissance) is a multidisciplinary European project aiming at digitizing rare sixteenth-century books and thus making them more accessible. End-users, librarians, historians, researchers in book history and computer scientists participated in the development of remote and collaborative access to digitized Renaissance books, necessary because of the reduced accessibility of image-mode digital libraries through the Internet. The size of image files, the lack of a standard exchange file format suitable for progressive transmission, and limited querying possibilities currently limit remote access to digital libraries. To improve accessibility, historical documents must be digitized and retro-converted to extract a detailed description of the image contents suited to users' needs. Specialists of the Renaissance have described the metadata generally required by end-users and the ideal functionalities of the digital library. The retro-conversion of historical documents is a complex process that includes image capture, metadata extraction, image storage and indexing, automatic conversion into a reusable electronic form, publication on the Internet, and data compression for faster remote access. The steps of this process cannot be developed independently. DEBORA proposes a global approach to retro-conversion, from digitization to the final functionalities of the digital library, centered on users' needs. The retro-conversion process is mainly based on a document image analysis system that simultaneously extracts the metadata and compresses the images. We also propose a file format to describe compressed books as heterogeneous data (images/text/links/annotation/physical layout and logical structure) suitable for progressive transmission, editing, and annotation. DEBORA is an exploratory project that aims at demonstrating the feasibility of the concepts by developing prototypes tested by end-users.

Event detection is a crucial part of soccer video searching and querying. Event detection can be done from the video content itself or from structured or semi-structured text files gathered from sports web sites. In this paper, we present an approach to metadata extraction from match reports for the soccer domain. The UEFA Cup and UEFA Champions League match reports are downloaded from the UEFA web site by a web crawler. Using regular expressions, we annotate these match reports and then extract events from the annotated reports. Extracted events are saved in an MPEG-7 file. We present an interface that is used to query the events in the MPEG-7 match corpus. If an associated match video is available, the video portions that correspond to the found events can be played.
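
A hedged sketch of the regular-expression annotation step; the report wording, patterns and field names below are assumptions, and the MPEG-7 serialisation is omitted.

```python
# Illustrative regular expressions for turning match-report lines into event
# records. The real report format is not reproduced here.
import re

EVENT_PATTERNS = {
    "goal": re.compile(
        r"(?P<minute>\d{1,3})'?\s*[:\-]?\s*Goal\s+by\s+(?P<player>[A-Z][\w' .-]+)", re.I),
    "yellow_card": re.compile(
        r"(?P<minute>\d{1,3})'?\s*[:\-]?\s*Yellow card\s+(?:for\s+)?(?P<player>[A-Z][\w' .-]+)", re.I),
    "substitution": re.compile(
        r"(?P<minute>\d{1,3})'?\s*[:\-]?\s*(?P<player_in>[A-Z][\w' .-]+)\s+replaces\s+(?P<player_out>[A-Z][\w' .-]+)", re.I),
}

def extract_events(report_lines):
    """Return a list of event dicts such as {'type': 'goal', 'minute': '23', 'player': '...'}."""
    events = []
    for line in report_lines:
        for event_type, pattern in EVENT_PATTERNS.items():
            m = pattern.search(line)
            if m:
                events.append({"type": event_type, **m.groupdict()})
    return events
```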

Both human and automatic processing of speech require recognizing more than just the words. We describe a state-of-the-art system for automatic detection of "metadata" (information beyond the words) in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include sentence boundary detection, filler word detection, and detection/correction of disfluencies. To achieve the best performance, we combine information from different types of language models (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. We use confusion networks to improve robustness to speech recognition errors. Most recently, we have investigated a maximum entropy approach for the sentence boundary detection task, yielding a gain over our st...

The World Wide Web is a continuously evolving network of contents (e.g. Web pages, images, sound files, etc.) and an interconnecting link structure. Hence, an archivist may never be sure if the contents collected so far are still consistent with those contents she ...

Today's digital libraries (DLs) archive vast amounts of information in the form of text, videos, images, data measurements, etc. User access to DL content can rely on similarity between metadata elements, or on similarity between the data itself (content-based similarity). We consider the problem of exploratory search in large DLs of time-oriented data. We propose a novel approach for overview-first exploration of data collections based on user-selected metadata properties. In a 2D layout, entities of the selected property are laid out based on their similarity with respect to the underlying data content. The display is enhanced by compact summarizations of the underlying data elements and forms the basis for exploratory navigation of users in the data space. The approach is proposed as an interface for visual exploration, leading the user to discover interesting relationships between data items relying on content-based similarity between data items and their respective metadata labels. We apply the method to real data sets from the earth observation community, showing its applicability and usefulness.

The main purpose of this book is to provide an overview of the current trends in the field of digitization of cultural heritage as well as to present recent research done within the framework of the project D002-308 funded by Bulgarian National Science Fund. The main contributions of the work presented are in organizing digital content, metadata generation, and methods for enhancing resource discovery. The parts of the book can be downloaded here:

In this paper, we address the problem of integrating Wikipedia, an online encyclopedia, and G-Portal, a web-based digital library, in the geography domain. The integration facilitates the sharing of data and services between the two web applications, which are of great value in learning. We first present an overall system architecture for supporting such an integration and address the metadata extraction problem associated with it. In metadata extraction, we focus on extracting and constructing metadata for geo-political ...

Mobile devices such as cellular phones are now capable of storing a significant amount of multimedia files and personal data. However, these devices still use traditional directory browsing, which offers little in terms of usability for searching and retrieving specific files. In this paper we design and implement a prototype media search engine for a mobile phone. We use modified

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) began as an alternative to distributed searching of scholarly eprint repositories. The model embraced by the OAI-PMH is that of metadata harvesting, where value-added services (by a "service provider") are constructed on cached copies of the metadata extracted from the repositories of the harvester's choosing. While this model dispenses with the well-known problems of distributed searching, it introduces the problem of synchronization. Stated simply, this problem arises when the service provider's copy of the metadata does not match the metadata currently at the constituent repositories. We define some metrics for describing the synchronization problem in the OAI-PMH. Based on these metrics, we study the synchronization problem of the OAI-PMH framework and propose several approaches for harvesters to implement better synchronization. In particular, if a repository knows its update frequency, it can publish it in an OAI-PMH Identify response using an optional About container that borrows from the RDF Site Summary (RSS) format.
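
As a simple illustration of how a harvester could act on an advertised update frequency, the sketch below decides when a re-harvest is due and builds an incremental ListRecords request; the staleness rule and helper names are assumptions, not the metrics defined in the paper.

```python
# Hedged sketch of a harvester-side freshness check and incremental OAI-PMH
# request. The simple "one interval elapsed" rule is an assumption.
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def reharvest_due(last_harvest: datetime, update_interval: timedelta) -> bool:
    """Treat the cached copy as stale once one update interval has elapsed."""
    return datetime.now(timezone.utc) - last_harvest >= update_interval

def incremental_list_records_url(base_url: str, last_harvest: datetime,
                                 metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request restricted to records changed since the last harvest."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_harvest.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return f"{base_url}?{urlencode(params)}"
```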

In this paper, we present Infoshare: a flexible and scalable information authoring and sharing architecture for multimedia digital signage systems. The proposed architecture internally analyzes and classifies content, design and system infrastructure into three different layers using metadata extracted from user input, and creates a highly scalable and easy-to-share digital signage environment. Furthermore, this architecture enables decentralized management by defining different user roles to handle each of the above-mentioned layers. We believe that the Infoshare architecture enables an efficient Creation, Distribution and Installation (CDI) cycle for multimedia digital signage content while allowing smooth scalability and management of the system. We applied this architecture to a prototype digital signage system. The paper describes the system implementation, key features and future development directions.
