Mika Hämäläinen - Profile on Academia.edu

Thesis Chapters by Mika Hämäläinen

Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2019

This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated. All of the methods have their own strengths in word normalization. This calls for finding ways of combining the results from these methods to leverage their individual strengths.
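
As a rough illustration of the edit-distance method named in the abstract above, the following minimal sketch picks the closest modern form from a word list. The function names, the toy lexicon and the distance threshold are assumptions made for illustration only; the abstract does not describe how the paper's edit-distance normalizer is configured.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(spelling: str, lexicon: list[str], max_distance: int = 2) -> str:
    """Return the nearest lexicon word, or the input if nothing is close enough.

    `lexicon` and `max_distance` are hypothetical parameters for this sketch.
    """
    best = min(lexicon, key=lambda word: levenshtein(spelling, word))
    return best if levenshtein(spelling, best) <= max_distance else spelling


# Toy usage with an invented lexicon of modern spellings.
modern_words = ["never", "river", "newer"]
print(normalize("nevir", modern_words))
```

A plain nearest-neighbour lookup like this ignores word frequency and context, which is one reason combining it with other methods can be attractive.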

Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts, 2018

We present our ongoing development of a synchronized XML-MediaWiki dictionary to solve the problem of XML dictionaries in the context of small Uralic languages. XML is good at representing structured data, but it does not fare well in a situation where multiple users are editing the dictionary simultaneously. Furthermore, XML is overly complicated for non-technical users because its strict syntax has to be kept valid at all times. Our system solves these problems by making synchronized editing of the same dictionary data easy both in a MediaWiki environment and in XML files. In addition, we describe how the dictionary knowledge in the MediaWiki-based dictionary can be enhanced by an additional Semantic MediaWiki layer for more effective searches in the data. Finally, we present API access to the lexical information in the dictionary and morphological tools in the form of an open-source Python library.

Proceedings of the Ninth International Conference on Computational Creativity, 2018

This paper presents a new, NLG-based approach to poetry generation in Finnish for use as a part of a larger Poem Machine system, whose objective is to provide a platform for human-computer co-creativity. The approach divides generation into a linguistically solid system for producing grammatical Finnish and higher-level systems for producing a poem structure and choosing the lexical items used in the poems. An automatically extracted open-access semantic repository tailored for poem generation is developed for the system. Finally, the resulting poems are evaluated and compared with the state of the art in Finnish poem generation.

Papers by Mika Hämäläinen

Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages, 2021

There are a lot of tools and resources available for processing Finnish. In this paper, we survey recent papers focusing on Finnish NLP related to many different subcategories of NLP such as parsing, generation, semantics and speech. NLP research is conducted in many different research groups in Finland, and it is frequently the case that NLP tools and models resulting from academic research are made available for others to use on platforms such as GitHub.

Linha D’Água, 2021

We present our infrastructure for documenting Uralic languages, which consists of tools for writing dictionaries in such a way that the entries are structured in XML (Extensible Markup Language). From the XML dictionaries we can generate code for morphological analyzers, which are useful for all kinds of NLP tasks. In this article we show the advantages of digital, machine-readable documentation. We also describe the system in the context of endangered Uralic languages.

Proceedings of the Digital Humanities in the Nordic Countries 5th Conference, 2020

Research in the digital humanities and computational social sciences requires overcoming complexity in research data, methodology, and research questions. In this article, we show through case studies of three different digital humanities and computational social science projects that these problems are prevalent, multiform, and laborious to counter. Yet, without facilities for acknowledging, detecting, handling and correcting for such bias, any results based on the material will be faulty. Therefore, we argue for the need for a wider recognition and acknowledgement of the problematic nature of many DH/CSS datasets, and correspondingly of the amount of work required to render such data usable for research. These arguments have implications both for evaluating feasibility and allocating funding with respect to project proposals and for assigning academic value and credit to the labour of cleaning up and documenting datasets of interest.

3rd International Workshop for Computational Linguistics of Uralic Languages (IWCLUL 2017), 2017

Open-source analyzer dictionary development is being implemented for Skolt Sami, Ingrian, Moksha-Mordvin, etc. in the Helsinki CSC infrastructure, home of the Finnish Kielipankki 'Language Bank' and Termipankki 'Term Bank'. The proximity of minority-language corpora in need of annotation and the multiple uses of controlled Wikimedia-type dictionaries make CSC an attractive site for synchronized transducer dictionary development. The open-source FST development of Uralic and other minority languages at Giellatekno-Divvun in Tromsø demonstrates a vast potential for reuse of FSTs, only augmented by open-source work in Omorfi, Apertium and Universal Dependencies <http://universaldependencies.org/#language-urj>. The initial idea is to allow synchronized editing of Giellatekno XML and CSC Wiki structures via GitHub. In addition to allowing for simple lexc line exports of the form LEMMA:STEM CONTINUATION_LEXICON "TRANSLATION" ; the parallel dictionaries will provide for documentation of derivation, morpho-syntactic information on valency and government, semantics and etymology.
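
The lexc line export mentioned in the abstract above can be sketched as a small formatting function. This is only an illustration of the line pattern quoted there: the function name, the field names and the sample entry are hypothetical, and the sketch does not escape characters that have special meaning in lexc.

```python
def lexc_line(lemma: str, stem: str, continuation_lexicon: str, translation: str) -> str:
    """Format one entry as: LEMMA:STEM CONTINUATION_LEXICON "TRANSLATION" ;

    Assumes the inputs contain no characters that need lexc escaping.
    """
    return f'{lemma}:{stem} {continuation_lexicon} "{translation}" ;'


# Invented placeholder entry, not taken from any real dictionary.
print(lexc_line("lemma", "stem", "N_EXAMPLE", "gloss"))
```

In a synchronized setup, a function of this kind would be applied to every dictionary entry after each edit, so the transducer source always reflects the current dictionary state.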

Proceedings of the 8th International Conference on Computational Creativity (ICCC'17), 2017

Many linguistic creativity applications rely heavily on knowledge of nouns and their properties. However, such knowledge sources are scarce and limited. We present a graph-based approach for expanding and weighting properties of nouns with given initial, non-weighted properties. In this paper, we focus on famous characters, either real or fictional, and categories of people, such as Actor, Hero, Child etc. In our case study, we started with an average of 11 and 25 initial properties for characters and categories, for which the method found 63 and 132 additional properties, respectively. An empirical evaluation shows that the expanded properties and weights are consistent with human judgement. The resulting knowledge base can be utilized in creation of figurative language. For instance, metaphors based on famous characters can be used in various applications including story generation, creative writing, advertising and comic generation.

The 11th International Conference on Natural Language Generation : Proceedings of the Conference, 2018

We present Poem Machine, an interactive online tool for co-authoring Finnish poetry with a computationally creative agent. Poem Machine can produce poetry of its own and assist the user in authoring poems. The main target group for the system is primary school children, and its use as a part of teaching is currently under study.

The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15), 2018

This paper introduces the second version of SemFi, a semantic database for Finnish with syntactic relations. The previous version of SemFi has been used in poem generation, and thus it has applications in NLG. In addition to extending SemFi, this paper describes and evaluates its translation into four endangered Uralic languages, Skolt Sami, Erzya, Moksha and Komi-Zyrian, all of which are greatly under-resourced. The translated dataset is known as SemUr.

Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, 2018

We present an open-source Python library to automatically produce syntactically correct Finnish sentences when only lemmas and their relations are provided. The tool automatically resolves morphosyntax in the sentence, such as agreement and government rules, and uses Omorfi to produce the correct morphological forms. In this paper, we discuss how case government can be learned automatically from a corpus and incorporated as a part of the natural language generation tool. We also present how agreement rules are modelled in the system and discuss the use cases of the tool, such as its initial use as part of a computational creativity system called Poem Machine.

The 11th International Conference on Natural Language Generation : Proceedings of the Conference, 2018

Satire has played a role in indirectly expressing critique towards an authority or a person from time immemorial. We present an autonomously creative master-apprentice approach consisting of a genetic algorithm and an NMT model to produce humorous and culturally apt satire out of movie titles automatically. Furthermore, we evaluate the approach in terms of its creativity and its output. We provide a solid definition for creativity to maximize the objectiveness of the evaluation.

Journal of Open Source Software, 2019

This paper presents UralicNLP, a Python library, the goal of which is to mask the actual implementation behind a Python interface. This not only lowers the threshold to use the tools provided in the Giellatekno infrastructure but also makes it easier to incorporate them as a part of research code written in Python.

Электронная Письменность Народов Российской Федерации : Опыт, Проблемы И Перспективы, 2019

We present an open online infrastructure for editing and visualizing dictionaries of various Uralic languages (for example, Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully with the existing Giellatekno infrastructure in terms of XML dictionaries and FST morphology. Our code is open source.

Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2019

This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only the less frequent deviant forms are left unnormalized. This paper discusses different approaches for improving the normalization of these deviant forms. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model together with lemmatization improves results.

AsiaLex 2019 : Proceedings of the 13th Conference of the Asian Association for Lexicography, 2019

In this paper, we identify the need for a standardized formalism for the structured XML dictionaries of endangered Uralic languages in the Giella infrastructure. For this purpose, we have decided to use the TEI formalism, as it is a standardized way of representing data and it is commonly used in the field of lexicography. This paper focuses on describing the issues and challenges faced in the conversion of the Giella XML into TEI. A full conversion scheme is introduced in this paper contrasting the peculiarities of the two XML formalisms. We incorporate the new TEI-based XML structure into our existing online dictionary system as an output format.

22nd Nordic Conference on Computational Linguistics (NoDaLiDa) : Proceedings of the Conference, 2019

Endangered Uralic languages present a high variety of inflectional forms in their morphology. This results in a high number of homonyms in inflections, which introduces a lot of morphological ambiguity in sentences. Previous research has employed constraint grammars (CGs) to address this problem; however, CGs are often unable to fully disambiguate a sentence, and their development is labour intensive. We present an LSTM-based model for automatically ranking morphological readings of sentences based on their quality. This ranking can be used to evaluate the existing CG disambiguators or to directly morphologically disambiguate sentences. Our approach works on a morphological abstraction and it can be trained with a very small dataset.

Proceedings of the 10th International Conference on Computational Creativity, 2019

This paper presents work on modelling the social psychological aspect of socialization in the case of a computationally creative master-apprentice system. In each master-apprentice pair, the master, a genetic algorithm, is seen as a parent for its apprentice, which is an NMT-based sequence-to-sequence model. The effect of different parenting styles on the creative output of each pair is the focus of this study. This approach brings a novel viewpoint to computational social creativity, which has mainly focused in the past on computationally creative agents being on a socially equal level, whereas our approach studies the phenomenon in the context of a social hierarchy.

12th International Conference on Natural Language Generation : Proceedings of the Conference, 2019

We present a creative poem generator for the morphologically rich Finnish language. Our method falls into the master-apprentice paradigm, where a computationally creative genetic algorithm teaches a BRNN model to generate poetry. We model several parts of poetic aesthetics in the fitness function of the genetic algorithm, such as sonic features, semantic coherence, imagery and metaphor. Furthermore, we justify the creativity of our method based on the FACE theory on computational creativity and take additional care in evaluating our system by automatic metrics for concepts together with human evaluation for aesthetics, framing and expressions.

2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing : Proceedings of the Conference, 2019

We present a novel approach for generating poetry automatically for the morphologically rich Finnish language by using a genetic algorithm. The approach improves the state of the art of the previous Finnish poem generators by introducing a higher degree of freedom in terms of structural creativity. Our approach is evaluated and described within the paradigm of computational creativity, where the fitness functions of the genetic algorithm are assimilated with the notion of aesthetics. The output is considered to be a poem 81.5% of the time by human evaluators.

Proceedings of Recent Advances in Natural Language Processing, 2019

Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process, and many of the automatic approaches have relied on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.

Proceedings of the 3rd Workshop on Computational Methods in the Study of Endangered Languages : (Volume 1) Papers, 2019

We approach the problem of expanding the set of cognate relations with a sequence-to-sequence NMT model. The language pair of interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model as such. We solve this problem, on the one hand, by training the model with North Sami cognates with other Uralic languages and, on the other, by generating more synthetic training data with an SMT model. The cognates found using our method are made publicly available in the Online Dictionary of Uralic Languages.

The Fifth Workshop on Noisy User-generated Text (W-NUT 2019) : Proceedings of the Workshop, 2019

We compare different LSTMs and transformer models in terms of their effectiveness in normalizing dialectal Finnish into the normative standard Finnish. As dialect is the common way of communication for people online in Finnish, such a normalization is a necessary step to improve the accuracy of the existing Finnish NLP tools that are tailored for normative Finnish text. We work on a corpus consisting of dialectal data from 23 distinct Finnish dialect varieties. The best-performing BRNN approach lowers the initial word error rate of the corpus from 52.89 to 5.73.
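
For readers unfamiliar with the metric reported above, word error rate is commonly computed as the word-level edit distance between a system output and a reference, divided by the reference length. The sketch below follows that common definition and reports a percentage; the function name and the toy sentences are assumptions, and the abstract does not state the exact evaluation script used in the paper.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level edit distance divided by reference length, as a percentage."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # word missing from hypothesis
                               current[j - 1] + 1,       # extra word in hypothesis
                               previous[j - 1] + cost))  # substituted word
        previous = current
    return 100.0 * previous[-1] / len(reference)


# Toy example: one dialectal form left unnormalized out of four words.
reference = "minä olen täällä nyt".split()
hypothesis = "mä olen täällä nyt".split()
print(word_error_rate(reference, hypothesis))
```

Under this definition, an unnormalized dialectal corpus is scored by treating the raw dialectal text as the hypothesis and the standard-language version as the reference.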