Statistical models for unsupervised learning of morphology and POS tagging
Related papers
Unsupervised Acquiring of Morphological Paradigms from Tokenized Text
Lecture Notes in Computer Science, 2008
This paper describes a rather simplistic method of unsupervised morphological analysis of words in an unknown language. All that is needed is a raw text corpus in the given language. The algorithm looks at words, identifies repeatedly occurring stems and suffixes, and constructs probable morphological paradigms. The paper also describes how this method was applied to the Morpho Challenge 2007 task and gives the Morpho Challenge results. Although the present work was originally a student project without any connection to, or even knowledge of, related work, its simple approach outperformed, to our surprise, several others in most morpheme segmentation subcompetitions. We believe that there is enough room for improvements that can put the results even higher. Errors are discussed in the paper, together with suggested adjustments for future research.
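The general idea of collecting recurring stems and suffixes into paradigms can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `induce_paradigms` and the parameters `min_stem_len`, `max_suffix_len`, and `min_stems` are hypothetical, and the grouping rule (stems that share exactly the same suffix set form one paradigm) is an assumption made for illustration.

```python
from collections import defaultdict


def induce_paradigms(words, min_stem_len=3, max_suffix_len=4, min_stems=2):
    """Group candidate stems by the exact set of suffixes they occur with.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    vocab = set(words)

    # Step 1: split every word at each allowed position and record
    # which suffixes (including the empty suffix) follow each stem.
    suffixes_of = defaultdict(set)
    for word in vocab:
        first_cut = max(min_stem_len, len(word) - max_suffix_len)
        for cut in range(first_cut, len(word) + 1):
            suffixes_of[word[:cut]].add(word[cut:])

    # Step 2: stems seen with at least two suffixes and sharing the same
    # suffix set are collected into one candidate paradigm.
    paradigms = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        if len(suffixes) >= 2:
            paradigms[frozenset(suffixes)].add(stem)

    # Step 3: keep only suffix sets supported by several stems.
    return {sufs: stems for sufs, stems in paradigms.items()
            if len(stems) >= min_stems}


corpus = ["walk", "walks", "walked", "walking",
          "talk", "talks", "talked", "talking"]
paradigms = induce_paradigms(corpus)
for suffix_set, stems in paradigms.items():
    print(sorted(suffix_set), sorted(stems))
```

On this toy corpus the sketch finds the suffix set {"", "s", "ed", "ing"} for the stems "walk" and "talk", but it also proposes the competing split with stems "wal" and "tal". A practical system would need some way to score or filter such candidates, which this sketch deliberately leaves out.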
FROM PARADIGM STRUCTURE TO NATURAL LANGUAGE MORPHOLOGY INDUCTION
2000
Most of the world's natural languages have complex morphology. But the expense of building morphological analyzers by hand has prevented the development of morphological analysis systems for the large majority of languages. Unsupervised induction techniques, which learn from unannotated text data, can facilitate the development of computational morphology systems for new languages. Such unsupervised morphological analysis systems have been shown to help natural language processing tasks including speech recognition and information retrieval. This thesis describes ParaMor, an unsupervised induction algorithm for learning morphological paradigms from large collections of words in any natural language. Paradigms are sets of mutually substitutable morphological operations that organize the inflectional morphology of natural languages. ParaMor focuses on the most common morphological process, suffixation.
Inducing the morphological lexicon of a natural language from unannotated text
2005
This work presents an algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. A probabilistic maximum a posteriori model is utilized, which builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora. The induced morph lexicon stores parameters related to both the "meaning" and "form" of the morphs it contains. These parameters affect the role of the morphs in words. The model is implemented in a task of unsupervised morpheme segmentation of Finnish and English words. Very good results are obtained for Finnish and almost as good results are obtained in the English task.
Probabilistic hierarchical clustering of morphological paradigms
We propose a novel method for learning morphological paradigms that are structured within a hierarchy. The hierarchical structuring of paradigms groups morphologically similar words close to each other in a tree structure. This allows morphological similarities to be detected easily, leading to improved morphological segmentation. Our evaluation using the dataset of Kurimo et al. (2011a; 2011b) shows that our method performs competitively when compared with current state-of-the-art systems.
A paradigm-based morphological analyzer
1985
Computational morphology has advanced by leaps in the past few years. Since the pioneering work of Kay (e.g. Kay 1977), major contributions have been submitted especially by Karttunen (Karttunen & al. 1981) and Koskenniemi (1983). A common linguistic trait of this line of work has been a fairly strict adherence to the basic principles of generative phonology and morphology (especially of the IP type). The theories and models proposed have been decisively based on the notion of rules relating different levels of representation. Typically, the rules describe morphophonological alternations by which surface-level word-forms deviate from postulated lexical or underlying forms. Central concepts have also been the representation of lexicons as tree structures, minilexicons for describing morphotactic structure in terms of pointers to subsequent classes of allowed morphological categories (e.g. Karttunen & al....
Clustering Morphological Paradigms Using Syntactic Categories
Lecture Notes in Computer Science, 2010
We propose a new clustering algorithm for the induction of morphological paradigms. Our method is unsupervised and exploits the syntactic categories of words acquired by an unsupervised syntactic category induction algorithm [1]. Previous research [2,3] on joint learning of morphology and syntax has shown that both types of knowledge affect each other, making it possible to use one type of knowledge to help learn the other.
Unsupervised learning of morphology by using syntactic categories
2009
This paper presents a method for unsupervised learning of morphology that exploits the syntactic categories of words. Previous research [12] on learning of morphology and syntax has shown that both kinds of knowledge affect each other, making it possible to use one type of knowledge to help learn the other. In this work, we make use of syntactic information, i.e. part-of-speech (PoS) tags of words, to aid morphological analysis. We employ an existing unsupervised PoS tagging algorithm for inducing the PoS categories. A distributional clustering algorithm is developed for inducing morphological paradigms.
Unsupervised morphology induction for part-of-speech-tagging
Proceedings of the 29th Annual Penn Linguistics Colloquium, 2006
In this paper we discuss a specific approach and the role of unsupervised morphology induction for induction of lexical properties and part-of-speech (PoS) tagging. There is a clear intuition among native speakers that PoS classification of words depends on various factors, e.g. distributional properties of the words in context, as well as morphological structure of the particular tokens. Induction of word types was modeled in various approaches by mapping contextual and distributional properties (e.g. Mintz et al. 2002, Lee 1997) on vector space models and clustering on the basis of vector similarities. Various PoS tagging algorithms either make use of manually coded contextual and morphological rules, or use learning and training approaches to exploit such information contained in large corpora via n-gram models and morphological classification, cf. Brants (2000), Lee et al. (2002)....
Dirichlet Processes for Joint Learning of Morphology and PoS Tags
This paper presents a joint model for learning morphology and part-of-speech (PoS) tags simultaneously. The proposed method adopts a finite mixture model that groups words having similar contextual features thereby assigning the same PoS tag to those words. While learning PoS tags, words are analysed morphologically by exploiting similar morphological features of the learned PoS tags. The results show that morphology and PoS tags can be learned jointly in a fully unsupervised setting.
Unsupervised morphological segmentation and clustering with document boundaries
Many approaches to unsupervised morphology acquisition incorporate the frequency of character sequences with respect to each other to identify word stems and affixes. This typically involves heuristic search procedures and calibrating multiple arbitrary thresholds. We present a simple approach that uses no thresholds other than those involved in standard application of χ² significance testing. A key part of our approach is using document boundaries to constrain generation of candidate stems and affixes and clustering morphological variants of a given word stem. We evaluate our model on English and the Mayan language Uspanteko; it compares favorably to two benchmark systems which use considerably more complex strategies and rely more on experimentally chosen threshold values.
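As a rough illustration of what a standard χ² significance test involves, the sketch below computes the Pearson statistic for a 2×2 contingency table and compares it with the conventional critical value for one degree of freedom at the 0.05 level. How the abstract's method builds its tables is not stated there; reading the table as document co-occurrence counts for two word forms is an assumption, and the names `chi_square_2x2` and `is_associated` are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumed reading (not from the abstract):
      a = documents containing both word forms
      b = documents containing only the first form
      c = documents containing only the second form
      d = documents containing neither form
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # A zero margin makes the statistic undefined; treat it as no evidence.
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05.
CRITICAL_05_DF1 = 3.841


def is_associated(a, b, c, d, critical=CRITICAL_05_DF1):
    """Return True if the table departs significantly from independence."""
    return chi_square_2x2(a, b, c, d) > critical


print(chi_square_2x2(20, 5, 5, 20))   # strong association between the two forms
print(is_associated(10, 10, 10, 10))  # counts exactly as expected under independence
```

The only threshold here is the significance level of the test itself, which matches the abstract's point that no additional experimentally tuned cutoffs are needed.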