Paavo Arvola - Academia.edu

Papers by Paavo Arvola

Browsing patterns in retrieved documents

When is the Structural Context Effective

Contextualization using hyperlinks and internal hierarchical structure of Wikipedia documents

Count-Min Sketch

Encyclopedia of Database Systems

Targeted Query Expansions as a Method for Searching Mixed Quality Digitized Cultural Heritage Documents

Digitization of cultural heritage is a huge ongoing effort in many countries. In digitized historical documents, words may occur in different surface forms due to three types of variation: morphological variation, historical variation, and errors in optical character recognition (OCR). Because individual documents may differ significantly from each other regarding the level of such variations, digitized collections may contain documents of mixed quality. Such different types of documents may require different types of retrieval methods. We suggest using targeted query expansions (QE) to access documents in mixed-quality text collections. In QE the user-given search term is replaced by a set of expansion keys (search words); in targeted QE the selection of expansion terms is based on the type of surface level variation occurring in the particular text searched. We illustrate our approach in a highly inflectional compounding language, Finnish, while the variation occurs across all natu...

Path Expressions in SQL

Journal of Database Management, 2016

This article focuses on testing a path-oriented querying approach to hierarchical data in relational databases. The authors execute a user study to compare the path-oriented approach and traditional SQL from two perspectives: correctness of queries and time spent in querying. They also analyze what kinds of errors are typical in path-oriented SQL. Path-oriented query languages are popular in the context of object-orientation and XML. However, relational databases are the most common paradigm for storing data and SQL is most common for manipulating data. When querying hierarchical data in SQL, the user must specify join conditions explicitly between hierarchy levels. Path-oriented SQL is a new alternative for expressing hierarchical queries in relational databases. In the authors' study, the users spent significantly less time in writing path-oriented SQL queries and made fewer errors in query formulation.

Classification

Encyclopedia of Database Systems, 2009

Report on INEX 2010

ACM SIGIR Forum, 2011

INEX investigates focused retrieval from structured documents by providing large test collections of structured documents, uniform evaluation measures, and a forum for organizations to compare their results. This paper reports on the INEX 2010 evaluation campaign, which consisted of a wide range of tracks: Ad Hoc, Book, Data Centric, Interactive, QA, Link the Wiki, Relevance Feedback, Web Service Discovery and XML Mining.

Constraint Query Languages

Encyclopedia of Database Systems, 2009

Generating Variant Keyword Forms for a Morphologically Complex Language Leads to Successful Information Retrieval with Finnish

Lecture Notes in Computer Science, 2012

This paper discusses information retrieval of Finnish and keyword variation management by generating inflected variant keyword forms. Finnish is a highly inflectional language, and thus keyword variation management of queries and query indexes is of utmost importance for successful Finnish full-text retrieval. In the paper we show that generation of a quite small number of variant keyword forms leads to good retrieval performance using a probabilistic best-match retrieval system (Lemur). Generation of almost the full paradigm of inflected nominal forms improves the results slightly. We also have interesting results with regard to different index types: our evaluation shows that generated inflected queries behave extremely well in a lemmatized index, which is supposedly not suitable for this query type. We also show that in a research environment even inexact generation that produces lots of incorrect inflected forms achieves high precision-recall performance without considerable loss in query throughput effectiveness. We use two different word form generators and their variants and compare the results to commonly used reductive word form variation management methods, stemming and lemmatization. The paper also includes a short discussion about usage of the variant keyword method with Web search engines.

XML tiedonhaku (XML information retrieval)

When is the Structural Context Effective?

Focused access to sparsely and densely relevant documents

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, 2010

Overview of the INEX 2008 Book Track

Lecture Notes in Computer Science, 2009

This paper provides an overview of the INEX 2008 Book Track. Now in its second year, the track aimed at broadening its scope by investigating topics of interest in the fields of information retrieval, human-computer interaction, digital libraries, and eBooks. The main topics of investigation were defined around challenges for supporting users in reading, searching, and navigating the full texts of digitized books. Based on these themes, four tasks were defined: 1) the Book Retrieval task aimed at comparing traditional and book-specific retrieval approaches, 2) the Page in Context task aimed at evaluating the value of focused retrieval approaches for searching books, 3) the Structure Extraction task aimed to test automatic techniques for deriving structure from OCR and layout information, and 4) the Active Reading task aimed to explore suitable user interfaces for eBooks enabling reading, annotation, review, and summary across multiple books. We report on the setup and results of each of these tasks.

Kinship contextualization

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, 2013

Browsing patterns in retrieved documents

Proceedings of the 5th Information Interaction in Context Symposium, 2014

Contextualization from the bibliographic structure

Larsen et al.[8], Apr 1, 2012

Bibliographic or citation structure in a document contains a wealth of useful but implicit information. This rich source of information should be exploited not only to understand what the important documents are and where to find them, but also as contextual evidence surrounding the important and not so important documents. This paper measures the effects of contextual evidence accumulated from the bibliographic structure of documents on retrieval effectiveness. We propose a re-weighting model to contextualize ...

Contextualization using hyperlinks and internal hierarchical structure of Wikipedia documents

Proceedings of the 21st ACM international conference on Information and knowledge management, 2012

Irvilab

Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Information retrieval (IR) evaluation can be considered as a form of competition in matching documents and queries. This paper introduces a learning environment based on gamification of query construction for document retrieval, called IRVILAB (Information Retrieval Virtual Lab). The lab has modules for creating standard evaluation settings, one for topic creation including relevance assessments and another for performance evaluation of user queries. In addition, the multilingual Wikipedia online collection enables a module in which relevance assessments are translated to other languages. The underlying game utilizes IR performance metrics to measure and give feedback on participants' information retrieval performance. It aims to improve participants' search skills and subject knowledge, and it contributes to science education by introducing an experimental method. Distinctive features of the system include algorithmic relevance assessments and automatic recall base translation.
