Tommi A Pirinen | University of Hamburg

Papers by Tommi A Pirinen

Guest editors’ note

Acta Linguistica Academica, 2017

Guest editors' note in the special issue of Acta Linguistica Academica on computational linguistics for Uralic languages.

Weighting finite-state morphological analyzers using hfst tools

In a language with very productive compounding and a rich inflectional system, e.g. Finnish, new words are to a large extent formed by compounding. In order to disambiguate between the possible compound segmentations, a probabilistic strategy has been found effective by Lindén and Pirinen. In this article, we present a method for implementing the probabilistic framework as a separate process which can be combined through composition with a lexical transducer to create a weighted morphological analyzer. To implement the analyzer, we use the HFST-LexC and related command line tools which are part of the open source Helsinki Finite-State Technology package. Using Finnish as a test language, we show how to use the weighted finite-state lexicon for building a simple unigram tagger with 96-98% precision for Finnish words and word segments belonging to the vocabulary.
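The idea of ranking competing compound segmentations by unigram probability can be illustrated with a minimal plain-Python sketch. It does not use HFST and is not the authors' implementation: the toy lexicon, its counts, and the function names are all hypothetical, and weights are taken as negative log probabilities so that the lowest total weight wins, as is customary in weighted finite-state work.

```python
import math

# Hypothetical toy unigram counts; not data from the paper.
COUNTS = {"car": 30, "pet": 20, "carpet": 10, "rug": 40}
TOTAL = sum(COUNTS.values())


def segmentations(word):
    """Yield every way to split `word` into lexicon entries."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in COUNTS:
            for rest in segmentations(word[i:]):
                yield [prefix] + rest


def weight(segments):
    """Sum of negative log probabilities: lower means more likely."""
    return sum(-math.log(COUNTS[s] / TOTAL) for s in segments)


def best_segmentation(word):
    """Return the lowest-weight segmentation, or None if none exists."""
    return min(segmentations(word), key=weight, default=None)
```

With these made-up counts, the single-word reading of "carpet" outweighs the compound reading "car" + "pet", because two unigram weights are added for the compound.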

Programme committee

Septentrio Conference Series, 2015

Organisers

Septentrio Conference Series, 2015

Weighted Finite-State Morphological Analysis of Finnish Inflection and Compounding

Finnish has a very productive compounding and a rich inflectional system, which causes ambiguity in the morphological segmentation of compounds made with finite state transducer methods.

Weighted Finite-State Morphological Analysis of Finnish Compounding with HFST-LEXC

Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 89-95. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206 .

State-of-the-Art in Weighted Finite-State Spell-Checking

Lecture Notes in Computer Science, 2014

The following claims can be made about finite-state methods for spell-checking: 1) Finite-state language models provide support for morphologically complex languages that word lists, affix stripping and similar approaches do not provide; 2) Weighted finite-state models have expressive power equal to other, state-of-the-art string algorithms used by contemporary spell-checkers; and 3) Finite-state models are at least as fast as other string algorithms for lookup and error correction. In this article, we use some contemporary non-finite-state spell-checking methods as a baseline and perform tests in light of the claims, to evaluate state-of-the-art finite-state spell-checking methods. We verify that finite-state spell-checking systems outperform the traditional approaches for English. We also show that the models for morphologically complex languages can be made to perform on par with English systems.

HFST — A System for Creating NLP Tools

Communications in Computer and Information Science, 2013

Weighting finite-state morphological analyzers using hfst tools

Proceedings of the Finite-State Methods and Natural Language Processing, 2009

In a language with very productive compounding and a rich inflectional system, e.g. Finnish, new words are to a large extent formed by compounding. In order to disambiguate between the possible compound segmentations, a probabilistic strategy has been found effective by Lindén and Pirinen. In this article, we present a method for implementing the probabilistic framework as a separate process which can be combined through composition with a lexical transducer to create a weighted morphological analyzer. To implement the analyzer, we use the HFST-LexC and related command line tools which are part of the open source Helsinki Finite-State Technology package. Using Finnish as a test language, we show how to use the weighted finite-state lexicon for building a simple unigram tagger with 96-98% precision for Finnish words and word segments belonging to the vocabulary.

Report on the Second International Workshop on Computational Linguistics for Uralic Languages

The Second International Workshop on Computational Linguistics for Uralic Languages (SIWCLUL) was held in Szeged in January 2016. The goals of the conference series include increased co-operation between the researchers, universities and research centres working on Uralic languages. The event gathered a number of participants from all over Eurasia, including Finland, Hungary, Estonia, Ireland, Germany, Austria and Norway among others. The conference also marked the start of an Association for Computational Linguistics’ Special Interest Group for Uralic Languages (ACLSIGUR).

Building an open-source development infrastructure for language technology projects

The article presents the Giellatekno & Divvun language technology resources, more specifically the effort to utilise open-source tools to improve the build infrastructure, and the solutions to help adapt to best practices for software development. The article especially discusses how the infrastructure has been remade to cope with an increasing number of languages without incurring extra overhead for the maintainers, and at the same time let the linguists concentrate on the linguistic work. Finally, the article discusses how a uniform infrastructure like the one presented can be used to easily compare languages in terms of morphological or computational complexity, coverage or for cross-lingual applications.

Building an open-source development infrastructure for language technology projects

This article presents a novel way of combining finite-state transducers (FSTs) with electronic dictionaries, thereby creating efficient reading comprehension dictionaries. We compare a North Saami - Norwegian and a South Saami - Norwegian dictionary, both enriched with an FST, with existing, available dictionaries containing pre-generated paradigms, and show the advantages of our approach. Being more flexible, the FSTs may also adjust the dictionary to different contexts. The finite state transducer analyses the word to be looked up, and the dictionary itself conducts the actual lookup. The FST part is crucial for morphology-rich languages, where as little as 10% of the wordforms in running text actually consist of lemma forms. If a compound or derived word, or a word with an enclitic particle is not found in the dictionary, the FST will give the stems and derivation affixes of the wordform, and each of the stems will be given a separate translation. In this way, the coverage of th...

Improving Predictive Entry of Finnish Text Messages using IRC Logs

We describe a predictive text entry system for Finnish combining an open source morphological analyzer Omorfi and a lexical model compiled from Internet Relay Chat (IRC) logs. The system is implemented as a weighted finite-state transducer (WFST) using the freely available WFST library HFST. We show that using IRC logs to train the system gives substantial improvement in recall from a baseline system using word frequencies computed from the Finnish Wikipedia.

Finite-state spell-checking with weighted language and error models

In this paper we present simple methods for construction and evaluation of finite-state spell-checking tools using an existing finite-state lexical automaton, freely available finite-state tools and Internet corpora acquired from projects such as Wikipedia. As an example, we use a freely available open-source implementation of Finnish morphology, made with traditional finite-state morphology tools, and demonstrate rapid building of Northern Sámi and English spell checkers from tools and resources available from the Internet.
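How a weighted language model and a weighted error model combine in spell-checking can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's finite-state implementation: the word counts, the flat per-edit penalty `EDIT_WEIGHT`, and all function names are hypothetical, and the error model is a simple single-edit generator rather than a transducer.

```python
import math
import string

# Hypothetical corpus counts standing in for a weighted language model.
WORD_COUNTS = {"the": 80, "then": 15, "ten": 5}
TOTAL = sum(WORD_COUNTS.values())
EDIT_WEIGHT = 2.0  # hypothetical flat penalty for one edit (error model)


def language_weight(word):
    """Negative log probability of a word: lower means more frequent."""
    return -math.log(WORD_COUNTS[word] / TOTAL)


def edits1(word):
    """All strings one deletion, swap, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def suggest(word):
    """Rank in-lexicon candidates by error weight plus language weight."""
    scored = {}
    if word in WORD_COUNTS:
        scored[word] = language_weight(word)  # zero edits needed
    for cand in edits1(word):
        if cand in WORD_COUNTS:
            total = EDIT_WEIGHT + language_weight(cand)
            scored[cand] = min(total, scored.get(cand, math.inf))
    return sorted(scored, key=lambda c: (scored[c], c))
```

Calling `suggest("teh")` on this toy lexicon ranks "the" ahead of "ten": both are one edit away, so the language-model weight decides the order.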

HFST tool for morphology; An efficient open-source

Morphological analysis of a wide range of languages can be implemented efficiently using finite-state transducer technologies. Over the last 30 years, a number of attempts have been made to create tools for computational morphologies. The two main competing approaches have been parallel vs. cascaded rule application. The parallel rule application was originally introduced by Koskenniemi [7] and implemented in tools like TwolC and LexC. Currently many applications of morphologies could use dictionaries encoding the a priori likelihoods of words and expressions as well as the likelihood of relations to other representations or languages. We have made the choice to create open-source tools and language descriptions in order to let as many as possible participate in the effort. The current article presents some of the main tools that we have created such as HFST-LexC, HFST-TwolC and HFST-Compose-Intersect. We evaluate their efficiency in comparison to some similar tools and libraries. In particular, we evaluate them using several full-fledged morphological descriptions. Our tools compare well with similar open source tools, even if we still have some challenges ahead before we can catch up with the commercial tools. We demonstrate that for various reasons a parallel rule approach still seems to be more efficient than a cascaded rule approach when developing finite-state morphologies.

HFST tool for morphology

HFST Library

HFST - an Environment for Creating Language Technology Applications

Using hfst for creating computational linguistic applications

… Applications, Studies in …, 2012

HFST, the Helsinki Finite-State Technology (hfst.sf.net), is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently collects some of the most important finite-state tools for creating morphologies and spellcheckers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications. In this article, we focus on aspects of HFST that are new to the end user, i.e. new tools, new features in existing tools, or new language applications, in addition to some revised algorithms that increase performance.

Modularisation of Finnish finite-state language description—towards wide collaboration in open source development of a morphological analyser

Pedersen, B. S., Nešpore, G., Skadiņa, I. (eds.) …, 2011

In this paper we present an open source implementation of a Finnish morphological parser. We briefly evaluate it against contemporary criticism of monolithic and unmaintainable finite-state language descriptions. We use it to demonstrate a way of writing a finite-state language description that serves a varying set of projects that typically need a morphological analyser, such as POS tagging, morphological analysis, hyphenation, spell checking and correction, rule-based machine translation and syntactic analysis. The language description is built using available open source methods for building finite-state descriptions, coupled with an autotools-style build system, which is the de facto standard in open source projects.
