Barbara McGillivray | King's College London

Papers by Barbara McGillivray

Language of Mechanisation Crowdsourcing Datasets from the Living with Machines Project

Journal of Open Humanities Data, 2024

We present the ‘Language of Mechanisation’ datasets with examples of re-use in visualisations and analysis. These reusable CSV files, published on the British Library’s Research Repository, contain automatically-transcribed text from 19th century British newspaper articles. Volunteers on the Zooniverse crowdsourcing platform took part in tasks that asked ‘How did the word x change over time and place?’ They annotated articles with pre-selected meanings (senses) for the words coach, car, trolley and bike.

The datasets can support scholarship on a range of historical and linguistic research areas, including research on crowdsourcing and online volunteering behaviours, data processing and data visualisations methodologies.

Semantic change and socio-semantic variation: the case of COVID-related neologisms on Reddit

Linguistics Vanguard, 2024

COVID-19 has triggered innovations in science and society globally, leading to the emergence or establishment of formal neologisms such as infodemic and working from home (WFH). While previous work on COVID-related lexical innovation has focused on such formal neologisms, this paper uses data from Reddit to study semantic neologisms like lockdown and mask, which have changed in meaning due to the pandemic. First, we identify words that have undergone meaning changes since the start of the pandemic. Our approach, based on word embeddings, successfully detects a variety of COVID-related terms that dominate the resulting list of semantic neologisms. Next, we generate community-specific semantic representations for the communities r/Coronavirus and r/conspiracy, which are both highly engaged in COVID-related discourse. We analyse socio-semantic variation along two dimensions: an evaluative dimension, based on amelioration/pejorization, and the loyalty/betrayal dimension of Moral Foundations Theory. Our findings reveal that the detected semantic neologisms exhibit more negative and betrayal-related associations in r/conspiracy, a subreddit critical of COVID-related sociopolitical measures. Mapping the community-specific representations for the term vaccines on a shared semantic space confirms these differences and reveals more fine-grained denotational and connotational differences between the two communities.

The Living Machine: A Computational Approach to the Nineteenth-Century Language of Technology

Technology and Culture

This article examines a long-standing question in the history of technology concerning the trope of the living machine. The authors do this by using a cutting-edge computational method, which they apply to large collections of digitized texts. In particular, they demonstrate the affordances of a neural language model for historical research. In a deliberate maneuver, the authors use a type of model, often portrayed as sentient today, to detect figures of speech in nineteenth-century texts that portrayed machines as self-acting, automatic, or alive. Their masked language model detects unusual or surprising turns of phrase, which could not be discovered using simple keyword search. The authors collect and close read such sentences to explore how figurative language produced a context that conceived humans and machines as interchangeable in complicated ways. They conclude that, used judiciously, language models have the potential to open up new avenues of historical research.

Release for code underlying paper for "Artificial Intelligence and Digital Heritage: Challenges and Opportunities - ARTIDGH 2020"

Quantifying the quantitative (re-)turn in historical linguistics

Humanities & Social Sciences Communications, Jan 30, 2023

Historical linguistics is the study of language change and stability, of the history of individual languages, and of the relatedness between languages. In spite of numerous acknowledgements, the adoption of quantitative methods in historical linguistics is still far from being mainstream and it falls below the level of other branches of linguistics. This comment considers the adoption of quantitative methods in recent historical linguistics research, and compares a study on 2012 publications with a similar study conducted seven years later. This comment argues for the advantages of a wider adoption of quantitative methods among historical linguists, and considers various reasons for the relatively slow progress in this direction. It also clarifies when quantitative methods are not the preferred route.

The Living Machine: A Computational Approach to the Nineteenth-Century Language of Technology

Technology and Culture, 2023

This article examines a long-standing question in the history of technology concerning the trope of the living machine. The authors do this by using a cutting-edge computational method, which they apply to large collections of digitized texts. In particular, they demonstrate the affordances of a neural language model for historical research. In a self-conscious maneuver, the authors use a type of model, often portrayed as sentient today, to detect figures of speech in nineteenth-century texts that portrayed machines as self-acting, automatic, or alive. Their method uses a masked language model to detect unusual or surprising turns of phrase, which could not be discovered using simple keyword search. The authors collect and close read such sentences to explore how figurative language produced a context in which humans and machines were conceived as interchangeable in complicated ways. They conclude that, used judiciously, language models have the potential to open new avenues of historical research.

Ancient Greek semantic change - annotated datasets and code

This collection contains five objects: two Python scripts and three datasets. The Python scripts create the datasets for semantic annotation of the Ancient Greek words mus, harmonia, and kosmos. The datasets contain the manual annotation of the sentences containing the words mus, harmonia, and kosmos in the Diorisis Ancient Greek corpus (Vatri & McGillivray 2018). The dataset for kosmos refers to sentences up to the year 142 AD.

References:

Vatri, A. and McGillivray, B. (2018). The Diorisis Ancient Greek Corpus. Research Data Journal for the Humanities and Social Sciences. https://brill.com/view/journals/rdj/aop/article-10.1163-24523666-01000013.xml

McGillivray, B., Hengchen, S., Lähteenoja, V., Palma, M., and Vatri, A. (2019). A computational approach to lexical polysemy in Ancient Greek. Digital Scho...

Deep Impact: A Study on the Impact of Data Papers and Datasets in the Humanities and Social Sciences

Publications

The humanities and social sciences (HSS) have recently witnessed an exponential growth in data-driven research. In response, attention has been afforded to datasets and accompanying data papers as outputs of the research and dissemination ecosystem. In 2015, two data journals dedicated to HSS disciplines appeared in this landscape: Journal of Open Humanities Data (JOHD) and Research Data Journal for the Humanities and Social Sciences (RDJ). In this paper, we analyse the state of the art in the landscape of data journals in HSS using JOHD and RDJ as exemplars by measuring performance and the deep impact of data-driven projects, including metrics (citation count; Altmetrics, views, downloads, tweets) of data papers in relation to associated research papers and the reuse of associated datasets. Our findings indicate: that data papers are published following the deposit of datasets in a repository and usually following research articles; that data papers have a positive impact on both t...

A new corpus annotation framework for Latin diachronic lexical semantics

Journal of Latin Linguistics, 2022

McGillivray, Barbara, Kondakova, Daria, Burman, Annie, Dell’Oro, Francesca, Bermúdez Sabel, Helena, Marongiu, Paola and Márquez Cruz, Manuel. "A new corpus annotation framework for Latin diachronic lexical semantics." Journal of Latin Linguistics, vol. 21, no. 1, 2022, pp. 47-105. https://doi.org/10.1515/joll-2022-2007

We present a new corpus-based resource and methodology for the annotation of Latin lexical semantics, consisting of 2,399 annotated passages of 40 lemmas from the Latin diachronic corpus LatinISE. We also describe how the annotation was designed, analyse annotators’ styles, and present the preliminary results of a study on the lexical semantics and diachronic change of the 40 lemmas. We complement this analysis with a case study on semantic vagueness. As the availability of digital corpora of ancient languages increases, and as computational research develops new methods for large-scale analysis of diachronic lexical semantics, building lexical semantic annotation resources can shed new light on large-scale patterns in the semantic development of lexical items over time. We share recommendations for designing the annotation task that will hopefully help similar research on other less-resourced or historical languages.

Investigating patterns of change, stability, and interaction among scientific disciplines using embeddings

Humanities and Social Sciences Communications

Multi-disciplinary and inter-disciplinary collaboration can be an appropriate response to tackling the increasingly complex problems faced by today’s society. Scientific disciplines are not rigidly defined entities and their profiles change over time. No previous study has investigated multiple disciplinarity (i.e. the complex interaction between disciplines, whether of a multidisciplinary or an interdisciplinary nature) at scale with quantitative methods, and the change in the profile of disciplines over time. This article explores a dataset of over 21 million articles published in 8400 academic journals between 1990 and 2019 and proposes a new scalable data-driven approach to multiple disciplinarity. This approach can be used to study the relationship between disciplines over time. By creating vector representations (embeddings) of disciplines and measuring the geometric closeness between the embeddings, the analysis shows that the similarity between disciplines has increased over...

Embedding Structured Dictionary Definitions

Workshop on Insights from Negative Results in NLP, Nov 19, 2020

Previous work has shown how to effectively use external resources such as dictionaries to improve English-language word embeddings, either by manipulating the training process or by applying post-hoc adjustments to the embedding space. We experiment with a multitask learning approach for explicitly incorporating the structured elements of dictionary entries, such as user-assigned tags and usage examples, when learning embeddings for dictionary headwords. Our work generalizes several existing models for learning word embeddings from dictionaries. However, we find that the most effective representations overall are learned by simply training with a skip-gram objective over the concatenated text of all entries in the dictionary, giving no particular focus to the structure of the entries.

Collocation

Frequency

Code for the Hartlib Papers

Code for processing the Hartlib Papers, extracting topics, and plotting them over time. The project was part of Simon Hengchen's short-term scientific mission funded by COST Action IS1310 "Reassembling the Republic of Letters".

LatinISE corpus (version 4)

The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre, title, century or specific date. This Latin corpus was built by Barbara McGillivray. In version 4 of the corpus, the high-frequency lemmas have been manually corrected and sentence boundaries have been added.

LL(O)D and NLP perspectives on semantic change for humanities research

Semantic Web

This paper presents an overview of the LL(O)D and NLP methods, tools and data for detecting and representing semantic change, with its main application in humanities research. The paper’s aim is to provide the starting point for the construction of a workflow and set of multilingual diachronic ontologies within the humanities use case of the COST Action Nexus Linguarum, European network for Web-centred linguistic data science, CA18209. The survey focuses on the essential aspects needed to understand the current trends and to build applications in this area of study.

D7.4 How to be FAIR with your data. A teaching and training handbook for higher education institutions

This handbook aims to support higher education institutions with the integration of FAIR-related content in their curricula and teaching. It was written and edited by a group of about 40 collaborators in a series of six book sprint events that took place between 1 and 10 June 2021. The document provides practical material, such as competence profiles, learning outcomes and lesson plans, and supporting information. It incorporates community feedback received during the public consultation which ran from 27 July to 12 September 2021.

Quantitative historical linguistics: A corpus framework by Gard B. Jenset and Barbara McGillivray

A computational approach to lexical polysemy in Ancient Greek

Digital Scholarship in the Humanities, 2019

Language is a complex and dynamic system. If we consider word meaning, which is the scope of lexical semantics, we observe that some words have several meanings, thus displaying lexical polysemy. In this article, we present the first phase of a project that aims at computationally modelling Ancient Greek semantics over time. Our system is based on Bayesian learning and on the Diorisis Ancient Greek corpus, which we have built for this purpose. We illustrate preliminary results in light of expert annotation, and take this opportunity to discuss the role of computational systems and human analysis in a complex research area like historical semantics. On the one hand, computational approaches allow us to model large corpora of texts. On the other hand, a long and rich scholarly tradition in Ancient Greek has provided us with valuable insights into the mechanisms of semantic change (cf. e.g. Leiwo, M. (2012). Introduction: variation with multiple faces. In Leiwo, M., Halla-aho, H., and ...

Research paper thumbnail of Language of Mechanisation Crowdsourcing Datasets from the Living with Machines Project

Journal of Open Humanities Data, 2024

We present the ‘Language of Mechanisation’ datasets with examples of re-use in visualisations and... more We present the ‘Language of Mechanisation’ datasets with examples of re-use in visualisations and analysis. These reusable CSV files, published on the British Library’s Research Repository, contain automatically-transcribed text from 19th century British newspaper articles. Volunteers on the Zooniverse crowdsourcing platform took part in tasks that asked ‘How did the word x change over time and place?’ They annotated articles with pre-selected meanings (senses) for the words coach, car, trolley and bike.

The datasets can support scholarship on a range of historical and linguistic research areas, including research on crowdsourcing and online volunteering behaviours, data processing and data visualisations methodologies.

Research paper thumbnail of Semantic change and socio-semantic variation: the case of COVID-related neologisms on Reddit

Linguistics Vanguard, 2024

COVID-19 has triggered innovations in science and society globally, leading to the emergence or e... more COVID-19 has triggered innovations in science and society globally, leading to the emergence or establishment of formal neologisms such as infodemic and working from home (WFH). While previous work on COVID-related lexical innovation has focused on such formal neologisms, this paper uses data from Reddit to study semantic neologisms like lockdown and mask, which have changed in meaning due to the pandemic. First, we identify words that have undergone meaning changes since the start of the pandemic. Our approach, based on word embeddings, successfully detects a variety of COVID-related terms that dominate the resulting list of semantic neologisms. Next, we generate community-specific semantic representations for the communities r/Coronavirus and r/conspiracy, which are both highly engaged in COVID-related discourse. We analyse socio-semantic variation along two dimensions: an evaluative dimension, based on amelioration/pejorization, and the loyalty/betrayal dimension of Moral Foundations Theory. Our findings reveal that the detected semantic neologisms exhibit more negative and betrayal-related associations in r/conspiracy, a subreddit critical of COVID-related sociopolitical measures. Mapping the community-specific representations for the term vaccines on a shared semantic space confirms these differences and reveals more fine-grained denotational and connotational differences between the two communities.

Research paper thumbnail of The Living Machine: A Computational Approach to the Nineteenth-Century Language of Technology

Technology and Culture

abstract: This article examines a long-standing question in the history of technology concerning ... more abstract: This article examines a long-standing question in the history of technology concerning the trope of the living machine. The authors do this by using a cutting-edge computational method, which they apply to large collections of digitized texts. In particular, they demonstrate the affordances of a neural language model for historical research. In a deliberate maneuver, the authors use a type of model, often portrayed as sentient today, to detect figures of speech in nineteenth-century texts that portrayed machines as self-acting, automatic, or alive. Their masked language model detects unusual or surprising turns of phrase, which could not be discovered using simple keyword search. The authors collect and close read such sentences to explore how figurative language produced a context that conceived humans and machines as interchangeable in complicated ways. They conclude that, used judiciously, language models have the potential to open up new avenues of historical research.

Research paper thumbnail of Release for code underlying paper for "Artificial Intelligence and Digital Heritage: Challenges and Opportunities - ARTIDGH 2020"

Release for code underlying paper for "Artificial Intelligence and Digital Heritage: Challenges a... more Release for code underlying paper for "Artificial Intelligence and Digital Heritage: Challenges and Opportunities - ARTIDGH 2020"

Research paper thumbnail of Quantifying the quantitative (re-)turn in historical linguistics

Humanities & social sciences communications, Jan 30, 2023

Historical linguistics is the study of language change and stability, of the history of individua... more Historical linguistics is the study of language change and stability, of the history of individual languages, and of the relatedness between languages. In spite of numerous acknowledgements, the adoption of quantitative methods in historical linguistics is still far from being mainstream and it falls below the level of other branches of linguistics. This comment considers the adoption of quantitative methods in recent historical linguistics research, and compares a study on 2012 publications with a similar study conducted seven years later. This comment argues for the advantages of a wider adoption of quantitative methods among historical linguists, and considers various reasons for the relatively slow progress in this direction. It also clarifies when quantitative methods are not the preferred route.

Research paper thumbnail of The Living Machine: A Computational Approach to the Nineteenth-Century Language of Technology

Technology and Culture, 2023

This article examines a long-standing question in the history of technology concerning the trope ... more This article examines a long-standing question in the history of technology concerning the trope of the living machine. The authors do this by using a cutting-edge computational method, which they apply to large collections of digitized texts. In particular, they demonstrate the affordances of a neural language model for historical research. In a self-conscious maneuver, the authors use a type of model, often portrayed as sentient today, to detect figures of speech in nineteenthcentury texts that portrayed machines as self-acting, automatic, or alive. Their method uses a masked language model to detect unusual or surprising turns of phrase, which could not be discovered using simple keyword search. The authors collect and close read such sentences to explore how figurative language produced a context in which humans and machines were conceived as interchangeable in complicated ways. They conclude that, used judiciously, language models have the potential to open new avenues of historical research.

Research paper thumbnail of Ancient Greek semantic change - annotated datasets and code

This collection contains five objects: two Python scripts and three datasets.<br> The Pytho... more This collection contains five objects: two Python scripts and three datasets.<br> The Python scripts create the datasets for semantic annotation of the Ancient Greek words <i>mus</i>, <i>harmonia</i>, and <i>kosmos</i>. <br>The datasets contain the manual annotation of the sentences containing words <i>mus</i>, <i>harmonia</i>, and <i>kosmos </i>in the Diorisis Ancient Greek corpus (Vatri & McGillivray 2018). The dataset for <i>kosmos</i> refers to sentences up to the year 142AD.<br>References:<br>Vatri, A. and McGillivray, B. (2018). The Diorisis Ancient Greek Corpus. Research Data Journal for the Humanities and Social Sciences. https://brill.com/view/journals/rdj/aop/article-10.1163-24523666-01000013.xml<br>McGillivray, B., Hengchen, S., Lähteenoja, V., Palma, M., and Vatri, A. (2019). A computational approach to lexical polysemy in Ancient Greek. <i>Digital Scho...

Research paper thumbnail of Deep Impact: A Study on the Impact of Data Papers and Datasets in the Humanities and Social Sciences

Publications

The humanities and social sciences (HSS) have recently witnessed an exponential growth in data-dr... more The humanities and social sciences (HSS) have recently witnessed an exponential growth in data-driven research. In response, attention has been afforded to datasets and accompanying data papers as outputs of the research and dissemination ecosystem. In 2015, two data journals dedicated to HSS disciplines appeared in this landscape: Journal of Open Humanities Data (JOHD) and Research Data Journal for the Humanities and Social Sciences (RDJ). In this paper, we analyse the state of the art in the landscape of data journals in HSS using JOHD and RDJ as exemplars by measuring performance and the deep impact of data-driven projects, including metrics (citation count; Altmetrics, views, downloads, tweets) of data papers in relation to associated research papers and the reuse of associated datasets. Our findings indicate: that data papers are published following the deposit of datasets in a repository and usually following research articles; that data papers have a positive impact on both t...

Research paper thumbnail of A new corpus annotation framework for Latin diachronic lexical semantics

Journal of Latin Linguistics, 2022

McGillivray, Barbara, Kondakova, Daria, Burman, Annie, Dell’Oro, Francesca, Bermúdez Sabel, Helen... more McGillivray, Barbara, Kondakova, Daria, Burman, Annie, Dell’Oro, Francesca, Bermúdez Sabel, Helena, Marongiu, Paola and Márquez Cruz, Manuel. "A new corpus annotation framework for Latin diachronic lexical semantics" Journal of Latin Linguistics, vol. 21, no. 1, 2022, pp. 47-105. https://doi.org/10.1515/joll-2022-2007

We present a new corpus-based resource and methodology for the annotation of Latin lexical semantics, consisting of 2,399 annotated passages of 40 lemmas from the Latin diachronic corpus LatinISE. We also describe how the annotation was designed, analyse annotators’ styles, and present the preliminary results of a study on the lexical semantics and diachronic change of the 40 lemmas. We complement this analysis with a case study on semantic vagueness. As the availability of digital corpora of ancient languages increases, and as computational research develops new methods for large-scale analysis of diachronic lexical semantics, building lexical semantic annotation resources can shed new light on large-scale patterns in the semantic development of lexical items over time. We share recommendations for designing the annotation task that will hopefully help similar research on other less-resourced or historical languages.

Research paper thumbnail of Investigating patterns of change, stability, and interaction among scientific disciplines using embeddings

Humanities and Social Sciences Communications

Multi-disciplinary and inter-disciplinary collaboration can be an appropriate response to tacklin... more Multi-disciplinary and inter-disciplinary collaboration can be an appropriate response to tackling the increasingly complex problems faced by today’s society. Scientific disciplines are not rigidly defined entities and their profiles change over time. No previous study has investigated multiple disciplinarity (i.e. the complex interaction between disciplines, whether of a multidisciplinary or an interdisciplinary nature) at scale with quantitative methods, and the change in the profile of disciplines over time. This article explores a dataset of over 21 million articles published in 8400 academic journals between 1990 and 2019 and proposes a new scalable data-driven approach to multiple disciplinarity. This approach can be used to study the relationship between disciplines over time. By creating vector representations (embeddings) of disciplines and measuring the geometric closeness between the embeddings, the analysis shows that the similarity between disciplines has increased over...

Research paper thumbnail of A new corpus annotation framework for Latin diachronic lexical semantics

Journal of Latin Linguistics

We present a new corpus-based resource and methodology for the annotation of Latin lexical semant... more We present a new corpus-based resource and methodology for the annotation of Latin lexical semantics, consisting of 2,399 annotated passages of 40 lemmas from the Latin diachronic corpus LatinISE. We also describe how the annotation was designed, analyse annotators’ styles, and present the preliminary results of a study on the lexical semantics and diachronic change of the 40 lemmas. We complement this analysis with a case study on semantic vagueness. As the availability of digital corpora of ancient languages increases, and as computational research develops new methods for large-scale analysis of diachronic lexical semantics, building lexical semantic annotation resources can shed new light on large-scale patterns in the semantic development of lexical items over time. We share recommendations for designing the annotation task that will hopefully help similar research on other less-resourced or historical languages.

Research paper thumbnail of Embedding Structured Dictionary Definitions

Workshop on Insights from Negative Results in NLP, Nov 19, 2020

Previous work has shown how to effectively use external resources such as dictionaries to improve... more Previous work has shown how to effectively use external resources such as dictionaries to improve English-language word embeddings, either by manipulating the training process or by applying post-hoc adjustments to the embedding space. We experiment with a multitask learning approach for explicitly incorporating the structured elements of dictionary entries, such as user-assigned tags and usage examples, when learning embeddings for dictionary headwords. Our work generalizes several existing models for learning word embeddings from dictionaries. However, we find that the most effective representations overall are learned by simply training with a skip-gram objective over the concatenated text of all entries in the dictionary, giving no particular focus to the structure of the entries.

Collocation

Frequency

Code for the Hartlib Papers

Code for processing the Hartlib Papers, extracting topics and plotting them over time. The project was part of Simon Hengchen's short-term scientific mission funded by COST Action IS1310 "Reassembling the Republic of Letters".

LatinISE corpus (version 4)

The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre, title, century or specific date. This Latin corpus was built by Barbara McGillivray. In version 4 of the corpus, the high-frequency lemmas have been manually corrected and sentence boundaries have been added.

LL(O)D and NLP perspectives on semantic change for humanities research

Semantic Web

This paper presents an overview of the LL(O)D and NLP methods, tools and data for detecting and representing semantic change, with its main application in humanities research. The paper’s aim is to provide the starting point for the construction of a workflow and set of multilingual diachronic ontologies within the humanities use case of the COST Action Nexus Linguarum, European network for Web-centred linguistic data science, CA18209. The survey focuses on the essential aspects needed to understand the current trends and to build applications in this area of study.

Quantitative historical linguistics: A corpus framework by Gard B. Jenset and Barbara McGillivray

A computational approach to lexical polysemy in Ancient Greek

Digital Scholarship in the Humanities, 2019

Language is a complex and dynamic system. If we consider word meaning, which is the scope of lexical semantics, we observe that some words have several meanings, thus displaying lexical polysemy. In this article, we present the first phase of a project that aims at computationally modelling Ancient Greek semantics over time. Our system is based on Bayesian learning and on the Diorisis Ancient Greek corpus, which we have built for this purpose. We illustrate preliminary results in light of expert annotation, and take this opportunity to discuss the role of computational systems and human analysis in a complex research area like historical semantics. On the one hand, computational approaches allow us to model large corpora of texts. On the other hand, a long and rich scholarly tradition in Ancient Greek has provided us with valuable insights into the mechanisms of semantic change (cf. e.g. Leiwo, M. (2012). Introduction: variation with multiple faces. In Leiwo, M., Halla-aho, H., and ...

D7.4 How to be FAIR with your data. A teaching and training handbook for higher education institutions

Zenodo, 2022

This handbook aims to support higher education institutions with the integration of FAIR-related content in their curricula and teaching. It was written and edited by a group of about 40 collaborators in a series of six book sprint events that took place between 1 and 10 June 2021. The document provides practical material, such as competence profiles, learning outcomes and lesson plans, and supporting information. It incorporates community feedback received during the public consultation which ran from 27 July to 12 September 2021.

Applying Language Technology in Humanities Research. Design, Application, and the Underlying Logic

Palgrave Macmillan, 2020

This book presents established and state-of-the-art methods in Language Technology (including text mining, corpus linguistics, computational linguistics, and natural language processing), and demonstrates how they can be applied by humanities scholars working with textual data. The landscape of humanities research has recently changed thanks to the proliferation of big data and large textual collections such as Google Books, Early English Books Online, and Project Gutenberg. These resources have yet to be fully explored by new generations of scholars, and the authors argue that Language Technology has a key role to play in the exploration of large-scale textual data. The authors use a series of illustrative examples from various humanistic disciplines (mainly but not exclusively from History, Classics, and Literary Studies) to demonstrate basic and more complex use-case scenarios. This book will be useful to graduate students and researchers in humanistic disciplines working with textual data, including History, Modern Languages, Literary studies, Classics, and Linguistics. This is also a very useful book for anyone teaching or learning Digital Humanities and interested in the basic concepts from computational linguistics, corpus linguistics, and natural language processing.

Methods in Latin Computational Linguistics

Quantitative Historical Linguistics. A corpus framework

This book is an innovative guide to quantitative, corpus-based research in historical and diachronic linguistics. Gard B. Jenset and Barbara McGillivray argue that, although historical linguistics has been successful in using the comparative method, the field lags behind other branches of linguistics with respect to adopting quantitative methods. Here they provide a theoretically agnostic description of a new framework for quantitatively assessing models and hypotheses in historical linguistics, based on corpus data and using case studies to illustrate how this framework can answer research questions in historical linguistics. The authors offer an in-depth explanation and discussion of the benefits of working with quantitative methods, corpus data, and corpus annotation, and the advantages of open and reproducible research. The book will be a valuable resource for graduate students and researchers in historical linguistics, as well as for all those working with linguistic corpora.

Digital Humanities and Natural Language Processing: Je t’aime... Moi non plus

Digital Humanities Quarterly, 2020

In spite of the increasingly large textual datasets humanities researchers are confronted with, and the need for automatic tools to extract information from them, we observe a lack of communication and diverging goals between the communities of Natural Language Processing (NLP) and Digital Humanities (DH). This contrasts with the wealth of potential opportunities that could arise from closer collaborations. We argue that more efforts are needed to make NLP tools work for DH datasets so that NLP research applied to humanities data receives more attention, leading to the development of evaluation approaches tailored towards relevant research questions. This has the potential to bring methodological advances to NLP, while at the same time confronting DH datasets with powerful state-of-the-art techniques.

Historic machines from 'prams' to 'Parliament': new avenues for collaborative linguistic research

DH Benelux 2022 - ReMIX: Creation and alteration in DH (hybrid), Belval Campus, Esch-sur-Alzette, Luxembourg and online, 2022

Abstract for long paper, DH Benelux 2022: RE-MIX. Creation and alteration in DH (Hybrid), 1-3 June 2022.

Research in computational linguistics has made successful attempts at modelling word meaning at scale, but much remains to be done to put these computational models to the test of historical scholarship (see e.g. Beelen et al. 2021). More importantly, a lot of computational research looks at texts in a historical vacuum, 'synchronically', as linguists would say. Living with Machines is an interdisciplinary research project that rethinks the impact of technology on the lives of ordinary people during the Industrial Revolution (Ahnert et al. 2021). During this project, we decided to address a fundamental question: what did people mean by ‘machine’ and how has this meaning changed over time?

This paper outlines how a simple research question like 'what was a machine?' can provide an opportunity to engage the public with our work while also generating data for analysis and new avenues of research in a radically collaborative way.