Apache Solr Research Papers - Academia.edu (original) (raw)
- Apache Solr
In recent years, multiple solutions have become available that provide search over huge amounts of plain text and metadata. Scalable search over annotated text, however, still appears to be problematic. With Mtas, an acronym for Multi-Tier Annotation Search, we add annotation layers and structure to the existing Lucene approach of creating and searching indexes, and present an implementation as a Solr plugin that provides both searchability and scalability. We present a configurable indexing process that supports multiple document formats and provides extended search options on both metadata and annotated text, such as advanced statistics, faceting, grouping and keyword-in-context. Mtas is currently used in production environments, with up to 15 million documents and 9.5 billion words. Mtas is available from GitHub.
- by GRISHMA SHARMA
- Video, OCR, Search Engine, Searching
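The keyword-in-context output mentioned in the Mtas abstract can be illustrated with a minimal sketch in plain Python. This is not the Mtas or Solr API; the function name `kwic` and the `window` parameter are assumptions made for illustration, and the sketch works on a simple token list rather than on an annotated index.

```python
def kwic(tokens, keyword, window=3):
    """Return (left, hit, right) context triples for each match of keyword.

    Hypothetical helper for illustration only; matching is case-insensitive
    and operates on an in-memory token list, not on a Lucene index.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


if __name__ == "__main__":
    text = "Mtas adds annotation layers to Lucene and Mtas runs as a Solr plugin"
    for left, hit, right in kwic(text.split(), "mtas", window=2):
        print(f"{left:>20} [{hit}] {right}")
```

Each hit is returned with up to `window` tokens on either side, which is the basic shape of a keyword-in-context listing.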
The Nederlab project aims to bring together all digitized texts relevant to Dutch history and language, both in terms of metadata and full-text content. Given that the data comes from a plethora of data providers, we present a technical solution, which we call the Broker, for dealing with the heterogeneity of the datasets when accessing them. It is an additional pivotal layer between the back-end and front-end of the data infrastructure, used to query and retrieve massive amounts of humanities data. Moreover, extra services can be embedded in the Broker, such as a lexicon service for automated query expansion.
- by Matthijs Brouwer and +1
- PHP websites development, Query Expansion, Apache Solr, JSON
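The automated query expansion mentioned in the Nederlab abstract can be sketched as follows. This is an assumption-laden illustration, not the Broker's actual implementation: the `expand_query` function, the in-memory `LEXICON` dictionary, the sample spelling variants and the field name `text` are all hypothetical. The only part taken from Solr itself is the standard query parser syntax `field:("a" OR "b")`.

```python
def expand_query(field, term, lexicon):
    """Build a Solr standard-query-parser clause that ORs a term with its variants.

    `lexicon` is a hypothetical mapping from a term to its known variants;
    a real lexicon service would be queried over the network instead.
    """
    variants = [term] + [v for v in lexicon.get(term, []) if v != term]
    # Quote each variant as a phrase, escaping backslashes and double quotes.
    quoted = " OR ".join(
        '"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"' for v in variants
    )
    return f"{field}:({quoted})"


# Illustrative sample data only (historical spelling variants of a Dutch word).
LEXICON = {"huis": ["huys", "huijs"]}

if __name__ == "__main__":
    print(expand_query("text", "huis", LEXICON))
```

A term with no lexicon entry is passed through unchanged as a single quoted phrase, so the expansion step can be applied to every query without special cases.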
Data storage and information retrieval are among the most important aspects of developing a language corpus. Currently, most corpora use either relational databases or indexed file systems. When selecting a data storage system, the most important factors to consider are the speeds of data insertion and information retrieval. Beyond these two approaches, various other database systems now exist whose different strengths may make them more useful. This paper compares the performance of data storage and retrieval mechanisms based on relational databases, graph databases, column store databases and indexed file systems across steps such as inserting data into the corpus and retrieving information from it, and tries to suggest an optimal storage architecture for a language corpus.
- by Nisansa de Silva and +2
- Corpus Linguistics, Graph Database, NoSQL, Cassandra
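The kind of measurement described in the corpus storage abstract, timing data insertion and information retrieval separately, can be sketched for a single relational store. This is a minimal illustration under stated assumptions: SQLite (from the Python standard library) stands in for a relational database, the table layout and the `time_insert_and_lookup` function are hypothetical, and nothing here reproduces the paper's actual benchmark or its results.

```python
import sqlite3
import time


def time_insert_and_lookup(sentences, needle="corpus"):
    """Time bulk insertion and a substring lookup in an in-memory SQLite table.

    Returns (insert_seconds, lookup_seconds, match_count). The schema and the
    LIKE-based lookup are illustrative assumptions, not the paper's setup.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sentence (id INTEGER PRIMARY KEY, body TEXT)")

    start = time.perf_counter()
    conn.executemany(
        "INSERT INTO sentence (body) VALUES (?)", [(s,) for s in sentences]
    )
    conn.commit()
    insert_seconds = time.perf_counter() - start

    start = time.perf_counter()
    rows = conn.execute(
        "SELECT id FROM sentence WHERE body LIKE ?", (f"%{needle}%",)
    ).fetchall()
    lookup_seconds = time.perf_counter() - start

    conn.close()
    return insert_seconds, lookup_seconds, len(rows)


if __name__ == "__main__":
    data = [f"sentence {i} of the corpus" for i in range(10_000)]
    ins, look, n = time_insert_and_lookup(data)
    print(f"insert: {ins:.4f}s, lookup: {look:.4f}s, matches: {n}")
```

Repeating the same two timed steps against other storage back-ends would give comparable numbers per system, which is the general shape of the comparison the abstract describes.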