Mirella De Sisto | Tilburg University
Conference Presentations by Mirella De Sisto
LREC2022 Proceedings, 2022
Sign Languages (SLs) are the primary means of communication for at least half a million people in Europe alone. However, the development of SL recognition and translation tools is slowed down by a series of obstacles concerning resource scarcity and standardization issues in the available data. The former challenge relates to the volume of data available for machine learning as well as the time required to collect and process new data. The latter obstacle is linked to the variety of the data, i.e., annotation formats are not unified and vary amongst different resources. The available data formats are often not suitable for machine learning, obstructing the provision of automatic tools based on neural models. In the present paper, we give an overview of these challenges by comparing various SL corpora and SL machine learning datasets. Furthermore, we propose a framework to address the lack of standardization at format level, unify the available resources and facilitate SL research for different languages. Our framework takes ELAN files as input and returns textual and visual data ready to train SL recognition and translation models. We present a proof of concept, training neural translation models on the data produced by the proposed framework.
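As a rough illustration of the first step such a pipeline needs, the sketch below reads time-aligned annotations out of an ELAN (.eaf) document using only the Python standard library. It is not the framework described in the abstract. The names `Segment` and `read_eaf_segments` are hypothetical, and the sketch assumes the usual EAF layout of `TIME_SLOT`, `TIER` and `ALIGNABLE_ANNOTATION` elements; reference annotations and video handling are left out.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One time-aligned annotation (hypothetical container for this sketch)."""
    tier: str
    start_ms: int
    end_ms: int
    text: str


def read_eaf_segments(eaf_xml: str) -> List[Segment]:
    """Extract time-aligned annotations from the XML text of an .eaf file."""
    root = ET.fromstring(eaf_xml)

    # Map each time slot id to its value in milliseconds.
    # Slots without a TIME_VALUE are skipped.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }

    segments = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID", "")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # annotation is not anchored to known time slots
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            segments.append(Segment(tier_id, start, end, text))

    # Order by start time so segments can be paired with video frames later.
    return sorted(segments, key=lambda s: (s.start_ms, s.tier))
```

Each returned segment carries a tier name, a start and end time in milliseconds, and the annotation text, which is the minimum needed to cut a video clip and pair it with a gloss or translation.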
Entre las montañas que alumbra la luna traza un aquelarre su ronda nocturna, A
24/06/2015. Talk at the Dialect Meeting 2015 and CIDSM X, Leiden University Centre for Linguistics.
13/06/2014. Talk at the RomTin Workshop, Universiteit Leiden.
02/06/2014. Talk at the Workshop on Raddoppiamento Fonosintattico, Universiteit Leiden.
Papers by Mirella De Sisto
Journal of the Association for Information Science and Technology, 2021
The use of artificial intelligence and natural language processing techniques has increased considerably in the last few decades. Historically, the focus has been primarily on texts expressed in prose form, largely leaving aside figurative or poetic expressions of language due to their rich semantics and syntactic complexity. The creation and analysis of poetry have been commonly carried out by hand, with a few computer-assisted approaches. In the Spanish context, the promise of machine learning is starting to pan out in specific tasks such as metrical annotation and syllabification. However, there is a task that remains unexplored and underdeveloped: stanza classification. This classification of the inner structures of verses on which a poem is built is an especially relevant task for poetry studies since it complements the structural information of a poem. In this work, we analyzed different computational approaches to stanza classification in the Spanish poetic tradition. Th...
Linköping electronic conference proceedings, Jul 9, 2024
arXiv (Cornell University), Apr 26, 2024
All poetic forms come from somewhere. Prosodic templates can be copied for generations, altered by individuals, imported from foreign traditions, or fundamentally changed under the pressures of language evolution. Yet these relationships are notoriously difficult to trace across languages and times. This paper introduces an unsupervised method for detecting structural similarities in poems using local sequence alignment. The method relies on encoding poetic texts as strings of prosodic features using a four-letter alphabet; these sequences are then aligned to derive a distance measure based on weighted symbol (mis)matches. Local alignment allows poems to be clustered according to emergent properties of their underlying prosodic patterns. We evaluate method performance on a meter recognition task against strong baselines and show its potential for cross-lingual and historical research using three short case studies: 1) mutations in quantitative meter in classical Latin, 2) European diffusion of the Renaissance hendecasyllable, and 3) comparative alignment of modern meters in 18th-19th century Czech, German and Russian. We release an implementation of the algorithm as a Python package with an open license.
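The general idea of aligning prosodic strings locally can be sketched with a plain Smith-Waterman scorer. This is not the released package. The function names, the symbols (`S` for a strong position, `w` for a weak one), the default weights and the normalization used to turn a score into a distance are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, Optional, Tuple

Weights = Dict[Tuple[str, str], float]


def local_alignment_score(a: str, b: str, match: float = 2.0,
                          mismatch: float = -1.0, gap: float = -1.0,
                          weights: Optional[Weights] = None) -> float:
    """Best Smith-Waterman local alignment score between two symbol strings.

    `weights` may override the score of specific symbol pairs (assumed scheme).
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = (a[i - 1], b[j - 1])
            if weights is not None and pair in weights:
                sub = weights[pair]
            else:
                sub = match if pair[0] == pair[1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0.0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def prosodic_distance(a: str, b: str, **kwargs) -> float:
    """Distance in [0, 1]: one minus the score normalized by the smaller self-score.

    The normalization is an assumption of this sketch.
    """
    if not a or not b:
        return 1.0
    denom = min(local_alignment_score(a, a, **kwargs),
                local_alignment_score(b, b, **kwargs))
    if denom <= 0:
        return 1.0
    return 1.0 - local_alignment_score(a, b, **kwargs) / denom
```

For example, `prosodic_distance("wSwS", "wSwS")` is 0.0, while two lines that share only a stretch of their pattern get a value between 0 and 1. A matrix of such pairwise distances could then be passed to any standard clustering routine.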
Zenodo (CERN European Organization for Nuclear Research), Oct 16, 2023
Digital scholarship in the humanities, Feb 7, 2024
Studia Metrica et Poetica, Sep 10, 2023
The present paper introduces a corpus of Dutch Renaissance poetry which was automatically annotated by using neural networks. The analysis of the annotations provides a clear picture of the process of implementing the new poetic form into Dutch poetic tradition, and of its different stages. The development of iambic metre was a gradual process that required various attempts; this can be well observed when comparing Dutch poems from a 100-year time window. While syllabic instances can be observed among the first attempts, most of the earlier poems are not isosyllabic and have a rather varied syllable length. This study shows that isosyllabicity developed together with iambicity. Finally, automatic poetry annotation allows for testing and validating theoretical hypotheses and for investigating literary questions with the aid of large amounts of data.
Zenodo (CERN European Organization for Nuclear Research), Jun 9, 2023
arXiv (Cornell University), Apr 18, 2023
While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking due to the high cost and effort associated with labeling such data. Aside from the data scarcity challenge, QE models should also be generalizable; i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues (data scarcity and domain mismatch), this paper combines domain adaptation and data augmentation in a robust QE system. Our method first trains a generic QE model and then fine-tunes it on a specific domain while retaining generic knowledge. Our results show a significant improvement for all the language pairs investigated, better cross-lingual inference, and a superior performance in zero-shot learning scenarios as compared to state-of-the-art baselines.
Isogloss, Mar 14, 2024
Campanian dialects such as Neapolitan feature a so-called 'second form of the infinitive' (SFI), a form consisting of the bare verbal stem, which can be used after functional verbs. This paper addresses the microvariation concerning the construction by analysing novel data from the Valle Caudina, located to the northeast of Naples. The SFI is frequently found specifically with the imperative va 'go!'. In Neapolitan, the form has been reanalysed as an imperatival form in this context, yielding an asyndetic imperative. At first glance, the use of the SFI in Valle Caudina looks very similar to its Neapolitan counterpart, but unlike Neapolitan, the SFI in these varieties has remained non-finite and has not been reanalysed as an imperative. These dialects can thus be considered a previous stage of the development described for Neapolitan by Ledgeway (1997, 2007, 2009). This claim finds support in the absence of metaphonetic forms (which have appeared in Neapolitan as a consequence of the reanalysis) as well as the presence of clitic climbing. Finally, unlike Neapolitan, the SFI is becoming less productive in the varieties of the Valle Caudina.
ILLA - Nuove Ricerche Umanistiche, 2021
28th Manchester Phonology Meeting, May 1, 2021
Moderna Sprak, 2020
In southern Italian dialects, possessives have an enclitic variant typically associated with kinship nouns (Rohlfs 1967, Sotiri 2007, Ledgeway 2009, D’Alessandro & Migliori 2017) (e.g. [ˈfratə-mə] 'brother my'). The most common strategy to avoid violations of the three-syllable window is either to avoid the enclitic form of the possessive or to shift stress, as in Lucanian (e.g. [ˌiennəˈru-mə] cf. [ˈiennərə], Lüdtke 1979:31). In the dialects of Airola and Boiano, a different strategy is attested: with proparoxytonic nouns (e.g. [ˈjennərə] 'son-in-law' in both varieties and [ˈsɔtʃəra]/[ˈswotʃəra] 'mother/father-in-law' in Boiano), the last unstressed syllable of the host is deleted (e.g. [ˈjennə-mə], [ˈsɔtʃə-mə], [ˈswotʃə-mə]). We claim that possessive enclitics in Airola and Boiano are internal clitics, that is, they amalgamate with the prosodic word that contains the host noun. We further propose that both proparoxytonic stress and the three-syllable window derive from internally layered ternary feet (Martínez-Paricio 2013). These feet need to be aligned with the right edge of their containing prosodic word. When a possessive enclitic is incorporated, the optimal strategy to comply with this alignment requirement is to build an internally layered ternary foot and delete the last syllable of the host noun, stress shift being excluded.
The idiosyncrasy of literary studies has for years been an obstacle to their technological development, especially when it comes to representing their knowledge in a machine-readable format. The richness, variety, and differing perspectives that scholars bring to their studies make this task a highly complex challenge. This complexity is even more noticeable in the poetry genre, where each poetic tradition has independently developed its analytical terminology and methodology. In this work, we have addressed the construction of a poetry ontology to express the scholars' knowledge spread out in isolated databases or works. The Ontopoetry ontology has been developed following the NeOn methodology, and it has been structured in three modules: a) core, b) poetic analysis and c) transmission, covering the essential aspects of a literary study of poetry. The Ontopoetry core module has been aligned with the FRBRoo ontology, guaranteeing its interoperability. This paper is focused on the description of the core module, its ...
The main aim of the Poetry Standardization and Linked Open Data project (POSTDATA) is to provide means for researchers on European poetry to publish and consume semantically enriched data. Thus, developing a poetry ontology is a pillar of its semantic domain. This ontology aims to enhance interoperability in the European poetry community and capture the European poetry domain knowledge.
The study of the poetic features of text, especially its rhythmic structure when forming verses, pertains to the different traditions, whose scholars established the rules that might govern poetry. Within this context, the POSTDATA Project formalized a network of ontologies able to express any poetic expression and its analysis at the European level, enabling scholars all over Europe to interchange their data using Linked Open Data. However, varied research interests result in corpora that might not share the same facets of an analysis. To alleviate this concern and foster the completeness of the interchanged corpora, our team set out to build a software toolkit to assist in the analysis of poetry. This paper introduces PoetryLab, an extensible open source toolkit for syllabification, scansion (extraction of stress patterns), enjambment detection (syntactical units split across two lines), rhyme detection, and historical named entity recognition for Spanish poetry. Our toolkit achieve...
The creation and analysis of poetry have been commonly carried out by hand, with only a few computer-assisted approaches appearing over the years. In the Spanish context, the promise of machine learning is starting to pan out in specific tasks such as metrical annotation and rhythm extraction. Among the possible tasks that comprise the analysis of a poem, identifying the type of a stanza remains underexplored. The classification of the inner structures of verses on which a poem is built is an especially relevant task for poetry studies since it complements the structural information of a poem. In this work, we analyzed different computational approaches to stanza classification in the Spanish poetic tradition. We collected a corpus of 5005 stanzas of 46 different types, and created a baseline expert system on a set of rules defined by poetry scholars. We show that this task continues to be hard for computer systems even when leveraging the best performing embeddings. However, ...
Approaches to Metaphony in the Languages of Italy
LOT Publications, 2020
This dissertation investigates the interface between phonological and metrical structure. The interaction between phonology and metrics is explored from two perspectives: one looks at poetic aspects as evidence for phonological characteristics; the other explores to what extent phonology conditions the development of poetic tradition and by what means the metrical template is filled by phonological material. The case study is Renaissance metre and its implementation in a set of Romance and West-Germanic languages. A comparison of the different ways in which the same source metre was incorporated in various European poetic traditions sheds light on the role played by phonology in the process of adaptation. When a metre is borrowed, it needs to be adapted to the metrical structure which mirrors the phonology of the recipient language. In particular, the metrical template selects a macroparameter based on the macroparameter selected by phonology. The phonological macroparameter defines which prosodic domain (i.e. phrase or word) plays a prominent role in the language; consequently, metrics selects which of its layers (i.e. colon or foot) is going to play a prominent role in the poetic form. In addition, this work argues that the relationship between the two structures is bidirectional: on the one hand, phonology sees metrical structure and fills it with its elements; on the other hand, the metrical structure can stretch the possibilities of phonological material. The interaction is based on a series of matches and mismatches between the two structures, in a game of tension managed by metrics.
by Petr Plecháč, Robert Kolár, Anne-Sophie Bories, Jakub Říha, Jan Macutek, Helena Bermúdez Sabel, Laura Hernández-Lorenzo, Mirella De Sisto, Szilvia Maróthy, Levente Selaf, and Anastasia Belousova
In Tackling the Toolkit, we focus on the methodological innovations, challenges, obstacles and even shortcomings associated with applying quantitative methods to poetry specifically and poetics more broadly. Using tools including natural language processing, web ontologies, similarity detection devices and machine learning, our contributors explore not only metres, stanzas, stresses and rhythms but also genres, subgenres, lexical material and cognitive processes. Whether they are testing old theories and laws, making complex concepts machine-readable or developing new lines of textual analysis, their works challenge standard descriptions of norms and variations.
by Petr Plecháč, Helena Bermúdez Sabel, Robert Kolár, Anastasia Belousova, James K Tauber, Mirella De Sisto, Kristina V Litvintseva, Andrew Cooper, Vera Polilova (Вера Полилова), Ksenia Tveryanovich, Александр Костюк, and Igor Pilshchikov
This volume presents a wide range of quantitative approaches to versification. It comprises various methodological perspectives ranging from simple descriptive statistics to advanced machine learning methods (such as support vector machines, random forests or neural networks) as well as material covering a large span of time and languages: from very ancient versifications (Sumerian, Akkadian, Hittite; Ancient Greek), through medieval (Old English, Old Icelandic, Old Saxon) and Renaissance verse to modern experiments (free verse, concrete poetry); from English and Russian through Spanish and German to Portuguese and Catalan. Not only written, but also spoken poetry has been analyzed.
In the southern Italian dialect of Airola (Campania), feminine plural and masculine plural are distinguished by means of two phonological processes: metaphony and Raddoppiamento Fonosintattico (RF henceforth). They appear to be in complementary distribution and to create gender distinction in the plural of nouns; in fact, metaphony takes place in masculine plural forms, while RF marks feminine plural ones. Therefore, two distinct phenomena, one being phonological, namely metaphony, and one being phono-syntactic, namely RF, happen to interact within plural noun formation. These two processes, which developed separately, acquired, synchronically speaking, a value of gender distinction.
Metaphony is a well-known phenomenon of Italian dialects, which consists in the raising or diphthongization of a stressed vowel under the influence of a non-adjacent following high vowel (Rohlfs 1966, Fanciullo 1994, Ledgeway 2009, Maiden 2010). In the dialect of Airola, it only affects mid vowels, namely /ɔ, o, e, ɛ/, and its attestation is not limited to the nominal class; it occurs, in fact, in various word categories, such as adjectives, verbs and possessive pronouns.
RF is an external sandhi phenomenon which consists in the gemination of a word-initial consonant under the influence of a preceding word (Rohlfs 1970, Leone 1984, Loporcaro 1997, Borrelli 2002). In Airolano, RF is lexically triggered, unlike the RF attested in Standard Italian, which is stress-induced.
The aim of this thesis is to describe the two phenomena, metaphony and RF, in Airolano and to give an analysis of them in order to explain their division of labor. To do so, the processes are first analyzed separately. Then, a unified analysis is elaborated aiming to shed some light on the difference between genders in the plural of nouns.
The analysis of the two phenomena will be based on data from Airolano that were collected in December 2013 and April 2014 by the author. Ten informants were selected and classified into four different age groups. All the recordings were subsequently transcribed in IPA and they appear in this form in the text. The full set of data is stored in the Italian Dialect archive of Leiden University.