Nilo Pedrazzini | The Alan Turing Institute

Papers by Nilo Pedrazzini

Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology

Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), 2024

Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is instead much harder to tackle and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the many world's languages that exclusively use lexified connectors, incorporating associations between character n-grams and English when. The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.

The semantic map of when and its typological parallels

Frontiers in Communication, 2023

In this paper, we explore the semantic map of the English temporal connective when and its parallels in more than 1,000 languages drawn from a parallel corpus of New Testament translations. We show that there is robust evidence for a cross-linguistic distinction between universal and existential WHEN. We also see tentative evidence that innovation in this area involves recruiting new items for universal WHEN which gradually can take over the existential usage. Another possible distinction that we see is between serialized events, which tend to be expressed with non-lexified constructions, and framing/backgrounding constructions, which favor an explicit subordinator.

Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work

Proceedings of the Ancient Language Processing Workshop associated with RANLP-2023, 2023

We evaluate four count-based and predictive distributional semantic models of Ancient Greek against AGREE, a composite benchmark of human judgements, to assess their ability to retrieve semantic relatedness. On the basis of the observations deriving from the analysis of the results, we design a procedure for a larger-scale intrinsic evaluation of count-based and predictive language models, including syntactic embeddings. We also propose possible ways of exploiting the different layers of the whole AGREE benchmark (including both human and machine-generated data) and different evaluation metrics.

Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers

Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, 2022

The industrialization process associated with the so-called Industrial Revolution in 19th-century Great Britain was a time of profound changes, including in the English lexicon. An important yet understudied phenomenon is the semantic shift in the lexicon of mechanization. In this paper we present the first large-scale analysis of terms related to mechanization over the course of the 19th century in English. We draw on a corpus of historical British newspapers comprising 4.6 billion tokens and train historical word embedding models. We test existing semantic change detection techniques and analyse the results in light of previous historical linguistic scholarship.

Deep Impact: A study on the impact of data papers and datasets in the Humanities and Social Sciences

Publications, 2022

The humanities and social sciences (HSS) have recently witnessed an exponential growth in data-driven research. In response, attention has been afforded to datasets and accompanying data papers as outputs of the research and dissemination ecosystem. In 2015, two data journals dedicated to HSS disciplines appeared in this landscape: Journal of Open Humanities Data (JOHD) and Research Data Journal for the Humanities and Social Sciences (RDJ). In this paper, we analyse the state of the art in the landscape of data journals in HSS using JOHD and RDJ as exemplars by measuring performance and the deep impact of data-driven projects, including metrics (citation count, Altmetrics, views, downloads, tweets) of data papers in relation to associated research papers and the reuse of associated datasets. Our findings indicate that data papers are published following the deposit of datasets in a repository and usually following research articles; that data papers have a positive impact on both the metrics of research papers associated with them and on data reuse; and that Twitter hashtags targeted at specific research campaigns can lead to increases in data papers’ views and downloads. HSS data papers improve the visibility of datasets they describe, support accompanying research articles, and add to transparency and the open research agenda.

One question, different annotation depths: A case study in Early Slavic

Journal of Historical Syntax, 2022

This paper addresses some of the challenges of carrying out corpus-based linguistic analyses on historical corpora of different sizes and annotation depths. Data from the TOROT Treebank is collected to carry out a case study on Early Slavic dative absolutes, showing the extent to which methodology and results may change depending on the amount of data and the levels of linguistic annotation available. The analysis indicates that deeply-annotated treebanks of limited size can be exploited to establish a solid guideline to analyze a phenomenon in shallowly-annotated corpora and even new, unannotated texts. This is particularly encouraging for historical languages, such as Early Slavic, showing very high diatopic and diachronic variation, which significantly undermines corpus-annotation automation and therefore calls for alternative strategies to counteract data scarcity.

* This work was supported by the Economic and Social Research Council [grant number ES/P000649/1]. I am indebted to Hanne Eckhoff for the constant support and detailed critical feedback. I am also grateful to Marieke Meelen for the useful comments and Mary MacRobert for the feedback on an earlier draft. All shortcomings and errors remain, of course, my own.

OldSlavNet: A scalable Early Slavic dependency parser trained on modern language data

Software Impacts, 2021

Historical languages are increasingly being modelled computationally. Syntactically annotated texts are often a sine qua non in their modelling, but parsing of pre-modern language varieties faces great data sparsity, intensified by high levels of orthographic variation. In this paper we present a good-quality Early Slavic dependency parser, attained via manipulation of modern Slavic data to resemble the orthography and morphosyntax of pre-modern varieties. The tool can be deployed to expand historical treebanks, which are crucial for data collection and quantification, and beneficial to downstream NLP tasks and historical text mining.

Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages: Towards a Generic Parser for Pre-Modern Slavic

Proceedings of the Workshop on Computational Humanities Research (CHR 2020), Nov 2020

This paper explores the possibility of improving the performance of specialized parsers for pre-modern Slavic by training them on data from different related varieties. Because of their linguistic heterogeneity, pre-modern Slavic varieties are treated as low-resource historical languages, whereby cross-dialectal treebank data may be exploited to overcome data scarcity and attempt the training of a variety-agnostic parser. Previous experiments on Early Slavic dependency parsing are discussed, particularly with regard to their ability to tackle different orthographic, regional and stylistic features. A generic pre-modern Slavic parser and two specialized parsers (one for East Slavic and one for South Slavic) are trained using jPTDP [8], a neural network model for joint part-of-speech (POS) tagging and dependency parsing which had shown promising results on a number of Universal Dependencies (UD) treebanks, including Old Church Slavonic (OCS). With these experiments, a new state of the art is obtained for both OCS (83.79% unlabelled attachment score (UAS) and 78.43% labelled attachment score (LAS)) and Old East Slavic (OES) (85.7% UAS and 80.16% LAS).

Talks by Nilo Pedrazzini

Historic machines from 'prams' to 'Parliament': new avenues for collaborative linguistic research

DH Benelux 2022 - ReMIX: Creation and alteration in DH, 2022

Research in computational linguistics has made successful attempts at modelling word meaning at scale, but much remains to be done to put these computational models to the test of historical scholarship (see e.g. Beelen et al. 2021). More importantly, a lot of computational research looks at texts in a historical vacuum, 'synchronically', as linguists would say. Living with Machines is an interdisciplinary research project that rethinks the impact of technology on the lives of ordinary people during the Industrial Revolution (Ahnert et al. 2021). During this project, we decided to address a fundamental question: what did people mean by 'machine' and how has this meaning changed over time? This paper outlines how a simple research question like 'what was a machine?' can provide an opportunity to engage the public with our work while also generating data for analysis and new avenues of research in a radically collaborative way. Turning to a diachronic perspective, we wanted to capture how changes in the usage of this word in nineteenth-century texts can help us understand the role of machines in nineteenth-century imaginations. An earlier crowdsourcing task on the project defined machines as 'devices or equipment not powered by people or animals'. As a result of that task, we discovered that this definition did not reflect how 'machine' was used in contemporary newspaper articles. Accordingly, we designed the 'What's that machine?' citizen science tasks to find out what a 'machine' was in the 19th century as part of our linguistic and historical research. As engaging the public with our research is a key goal of the project, crowdsourcing, rather than internal annotation, was a natural fit. It also allowed us to tackle classification challenges at scale. We set up two related 'What's that machine?' tasks on the Zooniverse platform: Describe it! and Classify it! (Ridge, 2020). The former asked the public to transcribe excerpts from newspaper articles

Thesis Chapters by Nilo Pedrazzini

A quantitative and typological study of Early Slavic participle clauses and their competition

PhD Thesis, University of Oxford, 2023

This thesis investigates the semantic and pragmatic properties of Early Slavic participle constructions (conjunct participles and dative absolutes) to understand the principles motivating their selection over one another and over their main finite competitor (jegda-clauses). The issue is tackled by adopting two broadly different approaches, which inform the division of the thesis into two parts.

The first part of the thesis uses detailed linguistic annotation on Early Slavic corpora at the morphosyntactic, dependency, information-structural, and lexical levels to obtain indirect evidence for different potential functions of participle clauses and their main finite competitor. The goal of this part of the thesis is to understand the roles of compositionality and default discourse reasoning as explanations for the distribution of participle constructions and jegda-clauses in the Early Slavic corpus. The investigation shows that the competition between conjunct participles, absolute constructions, and jegda-clauses occurs at the level of discourse organization, where the main determining factor in their distribution is the distinction between background and foreground content of an (elementary or complex) discourse unit. The analysis also shows that the major common denominator between the three constructions is that all of them can function as frame-setting devices (i.e. background clauses), albeit to very different extents. In fact, conjunct participles are more typically associated with the foreground constituent of a discourse unit, whereas dative absolutes and jegda-clauses are typically associated with the background content.

The second part of the thesis uses massively parallel data, including Old Church Slavonic and Ancient Greek, and analyses typological variation in how languages express the semantic space of English when, whose scope encompasses that of Early Slavic participle constructions and jegda-clauses. To do so, probabilistic semantic maps are generated and statistical methods (including Kriging, Gaussian Mixture Modelling, precision and recall analysis) are used to induce cross-linguistically salient dimensions from the parallel corpus and to study conceptual variation within the semantic space of the hypothetical concept when. Clear typological correspondences and differences with Early Slavic from linguistic phenomena in other languages are then exploited to corroborate and refine observations made on the core semantic-pragmatic properties of participle constructions and jegda-clauses on the basis of annotated Early Slavic data.

The analysis shows that 'null' constructions (juxtaposed clauses such as participles and converbs, or independent clauses) consistently cluster in particular regions of the semantic map cross-linguistically, which clearly indicates that participle clauses are not equally viable as alternatives to any use of when, but carry particular meanings that make them less suitable for some of its functions. The investigation helped identify genealogically and areally unrelated languages that seem typologically very similar to Old Church Slavonic in the way they divide the semantic space of when between overtly subordinated and 'null' constructions. Comparison with these languages reveals great similarities between the functions of Early Slavic participle constructions and of linguistic phenomena in some of these languages (particularly clause chaining, bridging, insubordination, and switch reference). Crucially, new clear correspondences are found between these phenomena and 'non-canonical' usages of participle constructions (i.e. coreferential dative absolutes, syntactically independent absolutes and conjunct participles, and participle constructions with no apparent matrix clause), which had often been written off as 'aberrations' by previous literature on Early Slavic.
