A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek
Related papers
Part of Speech Tagging for Ancient Greek
In this article we report results for five POS taggers (the Mate tagger, the Hunpos tagger, RFTagger, the OpenNLP tagger, and the NLTK Unigram tagger) tested on the data of the Ancient Greek Dependency Treebank. The aim is to find the most efficient POS tagger for pre-annotating new treebank data. A corrected 1-run 10-fold cross-validation t-test shows that the Mate tagger outperforms all the other taggers, with an accuracy score of 88%.
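The significance test named in the abstract above can be sketched as follows. This is a minimal illustration, assuming the test is the corrected resampled t-test commonly used with k-fold cross-validation, in which the variance of the per-fold accuracy differences is inflated by the test/train size ratio. The function name and its inputs are hypothetical and do not come from the paper.

```python
import math
from statistics import mean, variance


def corrected_cv_t_statistic(scores_a, scores_b):
    """Corrected resampled t statistic for ONE run of k-fold cross-validation.

    scores_a, scores_b: per-fold accuracies of two taggers on the same folds.
    Assumption: folds are of equal size, so the test/train ratio is 1 / (k - 1).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 folds")
    k = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    var = variance(diffs)  # sample variance, k - 1 in the denominator
    if var == 0:
        raise ValueError("per-fold differences have zero variance")
    # The uncorrected paired t-test would use 1 / k alone; the added term
    # compensates for the overlap between training sets across folds.
    correction = 1.0 / k + 1.0 / (k - 1)
    return mean(diffs) / math.sqrt(correction * var)
```

The resulting statistic would be compared against a t distribution with k - 1 degrees of freedom (9 for 10-fold cross-validation).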
AMP: A SYSTEM FOR AUTOMATED MORPHOLOGICAL PROCESSING OF ANCIENT GREEK
linguist-uoi.gr
The present article describes AMP, a system for automated morphological processing of Ancient Greek word forms. It is considered a hybrid approach, combining pattern recognition techniques with limited linguistic knowledge to achieve accurate segmentation into stem and ending, and is expected to substantially contribute to the creation and/or enrichment of Greek morphological lexica. Though the current implementation concerns Attic dialect word forms, its modularity ensures its extensibility to other dialects and/or synchronies of the Greek language with minor modifications.
A Neural Network Approach to Ellipsis Detection in Ancient Greek
Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), 2023
In the present article, five neural network models for predicting the number of elliptical nodes in Ancient Greek sentences are compared. The models are trained on dependency treebank data, where elliptical nodes are introduced if and only if they govern nodes that would otherwise become orphans. As the exact word forms of elliptical nodes often cannot be identified (and therefore annotated) in Ancient Greek, the task is modeled as multiclass classification, where each sentence is associated with zero, one, two, or more than two elliptical nodes. The study shows that pretrained BERT token embeddings yield the best performance. A model, the first of its kind, is made available for further research.
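The four-way labelling described in the abstract above can be sketched as a simple binning step. This is an illustration only: the function name, the integer class ids and the label strings are assumptions, not the authors' code.

```python
# Hypothetical class labels for the four sentence-level categories.
CLASS_NAMES = ["zero", "one", "two", "more than two"]


def ellipsis_class(n_elliptical_nodes: int) -> int:
    """Map a sentence's count of elliptical nodes to one of four class ids.

    Counts of 0, 1 and 2 keep their own class; any count above 2 is
    collapsed into the final "more than two" class (id 3).
    """
    if n_elliptical_nodes < 0:
        raise ValueError("node count cannot be negative")
    return min(n_elliptical_nodes, 3)
```

Collapsing larger counts into one class keeps the label set small, which suits a classifier trained on treebank data where sentences with many elliptical nodes are presumably rare.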
A Greek Morphological Lexicon and Its Exploitation by Natural Language Processing Applications
Advances in Informatics, 2003
This paper presents a large-scale Greek morphological lexicon, developed at the Software & Knowledge Engineering Laboratory (SKEL) of NCSR "Demokritos". The paper describes the lexicon architecture and the procedure to develop and update it. The morphological lexicon was used to develop a lemmatiser and a morphological analyser that were exploited in various natural language processing applications for Greek. The paper presents these applications (controlled language checker, information extraction, information filtering) and discusses further research issues and how we plan to address them.
A Probabilistic Morphological Analyzer for Syriac
2010
We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linkage, and morphological tagging and connect them in a pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of models with varying amounts of training data and find that with about 34,500 labeled tokens, we can outperform a reasonable baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, our joint model achieves 86.47% accuracy, a 29.7% reduction in error rate over the baseline.
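The relation between the two figures in the last sentence of the abstract above can be worked through with a short calculation. This sketch assumes that the 29.7% is a relative reduction in error rate (the usual reading) and that both figures refer to the same test data. The function name is hypothetical.

```python
def baseline_error_from_reduction(model_accuracy: float,
                                  relative_reduction: float) -> float:
    """Return the baseline error rate (in percent) implied by a model's
    accuracy (in percent) and its relative error-rate reduction.

    Assumes: model_error = baseline_error * (1 - relative_reduction).
    """
    if not 0.0 <= relative_reduction < 1.0:
        raise ValueError("relative_reduction must be in [0, 1)")
    model_error = 100.0 - model_accuracy
    return model_error / (1.0 - relative_reduction)


# Figures quoted in the abstract: 86.47% accuracy, 29.7% error reduction.
implied_baseline_error = baseline_error_from_reduction(86.47, 0.297)
```

Under these assumptions the model's error rate is 13.53%, and the implied baseline error rate is about 19.2%.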
Preprocessing Greek Papyri for Linguistic Annotation
J. Data Min. Digit. Humanit., 2016
Greek documentary papyri form an important direct source for Ancient Greek. They have been exploited surprisingly little in Greek linguistics due to a lack of good tools for searching linguistic structures. This article presents a new tool and digital platform, “Sematia”, which converts the digital texts available in TEI EpiDoc XML format into a format that can be morphologically and syntactically annotated (treebanked), and in which the user can add new metadata concerning the text type, writer and handwriting of each act of writing. An important aspect of this process is distinguishing the original surviving writing from the standardization of language and the supplements made by the editors. This is done by creating two different layers of the same text. The platform is in its early development phase. Ongoing and future developments, such as tagging linguistic variation phenomena as well as queries performed within Sematia, are discussed at the end of the article.
Medieval Social Media: Manual and Automatic Annotation of Byzantine Greek Marginal Writing
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)
In this paper, we present the interim results of a transformer-based annotation pipeline for Ancient and Medieval Greek. As the texts in the Database of Byzantine Book Epigrams have not been normalised, they pose more challenges for manual and automatic annotation than normalised Ancient Greek texts do. As a result, the existing annotation tools perform poorly. We compiled three data sets for the development of an automatic annotation tool and carried out an inter-annotator agreement study, with a promising agreement score. The experimental results show that our part-of-speech tagger yields accuracy scores that are almost 50 percentage points higher than those of the widely used rule-based system Morpheus. In addition, error analysis revealed problems related to phenomena that also occur in current social media language.