Pavel Rychlý | Masaryk University
Papers by Pavel Rychlý
Zenodo (CERN European Organization for Nuclear Research), Dec 10, 2021
Cross-lingual word embeddings facilitate the transfer of lexical knowledge across languages, and they are mainly used for finding translation equivalents. Translation equivalents obtained in this way are usually evaluated with the help of ground truth dictionaries. However, the evaluation process, including the ground truth dictionaries, differs from model to model, impeding the correct interpretation of the results. Therefore, in this paper, we provide a thorough analysis of the English-Slovak ground truth dictionary and employ our analysis in evaluating two cross-lingual word embedding models. We show that the choice of word pairs is an important factor in accurately reflecting the model's performance.
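The dictionary-based evaluation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common precision-at-k setup, in which a source word counts as correct if any of its top-k predicted translations appears in the ground truth dictionary. The abstract does not state which metric the paper uses, and the function name and the toy English-Slovak pairs below are hypothetical.

```python
def precision_at_k(predictions, gold, k=1):
    """Share of evaluated source words whose top-k candidates hit the gold dictionary.

    predictions: dict mapping a source word to a ranked list of candidate translations
    gold: dict mapping a source word to a set of accepted translations
    Assumption: source words missing from `predictions` are left out of the evaluation.
    """
    evaluated = [word for word in gold if word in predictions]
    if not evaluated:
        return 0.0
    hits = sum(1 for word in evaluated if set(predictions[word][:k]) & gold[word])
    return hits / len(evaluated)


# Toy data, invented for illustration only.
gold = {"dog": {"pes"}, "cat": {"mačka"}}
predictions = {"dog": ["pes", "psík"], "cat": ["kocúr", "mačka"]}

print(precision_at_k(predictions, gold, k=1))  # 0.5
print(precision_at_k(predictions, gold, k=2))  # 1.0
```

Which word pairs end up in `gold` directly changes the score, which is the sensitivity the abstract points to.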
This paper presents tools for building large corpora and examples of their results. It also describes several tools for computer lexicography created at the Faculty of Informatics, Masaryk University.
DESAM is a morphologically tagged corpus of Czech texts comprising 2,689 documents (i.e. 48,687 sentences, 1,042,446 tokens).
RASLAN, Nov 25, 2015
The fast evaluation of complex queries on big text corpora is an important feature of corpus managers. The aim of this paper is to apply approaches of concurrent processing to query evaluation in the corpus management system Manatee. The work contains an evaluation of the query processing speed for various numbers of available cores, and also a comparison of the source code length of the original and the concurrent implementations.
RASLAN, 2008
Finding collocation candidates is one of the most important and widely used features of corpus linguistics tools. There are many statistical association measures used to identify good collocations. Most of these measures define a formula for an association score which indicates the amount of statistical association between two words. The score is computed for all possible word pairs, and the word pairs with the highest scores are presented as collocation candidates. The same scores are used in many other algorithms in corpus linguistics. The score values are usually meaningless and corpus-specific; they cannot be used to compare words (or word pairs) from different corpora. But end users want an interpretation of such scores and want the scores to be stable. This paper presents a modification of a well-known association score which has a reasonable interpretation and other good features.
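The abstract does not name the score. Assuming it is the logDice score (another entry on this page refers to log-dice statistics in Sketch Engine), it can be sketched as 14 plus the base-2 logarithm of the Dice coefficient. The function name and the frequency values below are illustrative only.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of the word pair
    f_x, f_y: frequencies of the two individual words
    The corpus size does not appear in the formula, so scores stay
    comparable across corpora of different sizes.
    """
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


print(log_dice(10, 10, 10))  # 14.0, the theoretical maximum: the words always co-occur
print(log_dice(1, 2, 2))     # 13.0: halving the Dice coefficient costs one point
```

Under this assumption the interpretation is straightforward: a difference of one point means the pair is twice as strongly associated.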
Proceedings of the First Workshop on Text, Speech, …
On selecting a constituent part of MU the "Overview of publishing activities" page will be displayed with information relevant to the selected constituent part. The "Overview of publishing activities" page is not available for non-activated items. ... RYCHLÝ, Pavel. The ...
Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
A set of 5 corpora for 4 Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed existing corpus with part-of-speech annotation. The released version includes cleaning (especially of numeric expressions) and unification of two versions with different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range from 2.5 million words (Tigrinya) to 80 million words (Somali).
RASLAN, Nov 25, 2015
This report describes the tools and resources developed to support Corpus Pattern Analysis (CPA), a corpus-based method for building pattern dictionaries. The tools comprise concordance annotation in Sketch Engine, a special CPA editor for editing the Pattern Dictionary of English Verbs (PDEV), dedicated servlets based on the Dictionary Editing and Browsing platform, and a public interface for browsing the PDEV. The resources are the SemEval 2015 Task 15 dataset and the LEMON API.
Zenodo (CERN European Organization for Nuclear Research), Dec 10, 2021
Since 2014, the Teiresiás Centre at Masaryk University has been coordinating a project to create a multilingual sign language dictionary. The Natural Language Processing Centre is developing the editing and browsing web application for the dictionary. Originally, the application was based on the DEB dictionary platform with the Sedna XML database for data storage. In the course of the project, more languages were added, the entry structure became more complex, larger teams from several countries began working on the dictionary, and the website design did not work well with modern web browsers. We realized that in order to increase the response speed of the application we needed to refactor the whole technology platform. In 2020 and 2021, a completely new application was designed and developed. This paper describes the overall structure of the platform, the technologies used to build the application, and the process of data migration to the new database system.
Zenodo (CERN European Organization for Nuclear Research), 2019
Nearest neighbor queries in high-dimensional spaces are expensive. In this article, we propose a method of building and querying a stand-alone data structure, the SiLi (Similarity List) Index, which supports approximating the results of k-NN queries in high-dimensional spaces while using significantly less system memory and processor time than the usual brute-force search methods.
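For context, the brute-force baseline that the abstract compares against can be sketched as follows. This is not the SiLi Index, whose structure the abstract does not describe. The sketch assumes Euclidean distance and plain Python sequences as vectors, and the function name is hypothetical.

```python
import heapq
import math


def brute_force_knn(query, points, k):
    """Return the indices of the k points nearest to `query`, closest first.

    Every point is compared with the query, so one lookup costs
    O(n * d) distance work for n points of dimension d. This is the
    cost that approximate indexes try to avoid.
    """
    return heapq.nsmallest(
        k, range(len(points)), key=lambda i: math.dist(query, points[i])
    )


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.1, 0.0)]
print(brute_force_knn((0.0, 0.0), points, k=2))  # [0, 3]
```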
The paper describes automatic definition finding implemented within the leading corpus query and management tool, Sketch Engine. The implementation exploits complex pattern-matching queries in the corpus query language (CQL) and the indexing mechanism of word sketches for finding and storing definition candidates throughout the corpus. The approach is evaluated for Czech and English corpora, showing that the results are usable in practice: precision of the tool ranges between 30 and 75 percent (depending on the major corpus text types) and we were able to extract nearly 2 million definition candidates from an English corpus with 1.4 billion words. The feature is embedded into the interface as a concordance filter, so that users can search for definitions of any query to the corpus, including very specific multi-word queries. The results also indicate that ordinary texts (unlike explanatory texts) contain a rather low number of definitions, which is perhaps the most important problem w...
Many state-of-the-art solutions in language technology owe their success to the right balance between a wide range of linguistic introspection and theory-neutral computer engineering. Sketch Engine is undoubtedly one of them. In this chapter we elaborate on both the theoretical and practical issues we have faced in the thirteen years of Sketch Engine development and argue for both the linguistic and the computer-science-oriented decisions we have taken. We also discuss Sketch Engine's current challenges, many of which can be extrapolated to any language technology software aiming at industrial-strength impact.
Finding two-word collocations is a well-studied task within natural language processing. The result of this task for a given headword is usually a list of collocations sorted by a salience score. In the corpus manager Sketch Engine, these pairs are extracted from data using word sketch grammar relation rules and log-dice statistics, resulting in a sorted list of triples 'headword, grammar-relation, collocate'. The longest–commonest match is a straightforward extension of these two-word collocations into multiword expressions. The resulting expressions are also very useful for representing the most common realisation of the collocational pair and for facilitating the interpretation of the raw triple, because sometimes it is not clear from what texts such a triple comes. We present here the algorithm behind the longest–commonest match together with a simple evaluation. The longest–commonest match is already implemented in Sketch Engine.