Natchanun Sanitdee | University of Helsinki

Papers by Natchanun Sanitdee

Helsinki Digital Humanities Hackathon 2024 Portfolio

It was a wonderful opportunity to participate in the Helsinki Digital Humanities Hackathon 2024. The eight-day workshop was full of collaboration and ideas. I enjoyed the teamwork among students and professionals from different backgrounds. Our Civility in an Online Discourse group consisted of 12 collaborators, including two team leaders. We attempted to examine civility in Reddit datasets.

The online orientation sessions significantly facilitated group dynamics, allowing team members to get to know each other and familiarize themselves with the datasets. I particularly appreciated the provided pre-readings, which offered a comprehensive overview of Reddit's structure and components. These readings also inspired ideas on which features could be extracted from the data, helping team members formulate relevant research questions. Based on our topics of interest, we divided our group into two teams: one focusing on structural cues and the other on linguistic cues.

During the first three days, the structural cue team familiarized themselves with the data and produced various visualizations. Meanwhile, the linguistic cue team worked on defining and conceptualizing civility, turning theory into practice. By the end of the first week, our group had a clearer direction: to train an annotation model that could predict civility scores. We summarized our progress and presented it in a preliminary presentation that Friday.

To train the annotation model, we needed annotated datasets. Therefore, we wrote an annotation codebook and designed an annotation scheme based on our definitions and concepts of civility (Annotation codebook). We spent the next few days annotating the data.

Concurrently, the structural cue team delved deeper into the relationship between different elements within the datasets, conducting more in-depth analyses.

At the beginning of the second week, the linguistic cue team focused on adjusting the annotation scheme and environment and calculating an inter-annotator agreement score. Although this stage was time-consuming and required significant effort, it was crucial for training an efficient model. We reported our progress in an interim presentation on Wednesday.

In the meantime, I integrated comments, replies, and authors from Reddit into a network analysis framework. Network analysis can identify influential users and their impact on discourse civility. In addition, community detection can be used to find clusters of civil or uncivil behavior. Lastly, close reading of the comments can identify common topics and their relation to civility. When combined with civility scores derived from the model predictions, this network makes an appropriate framework for studying civility in Reddit discourse because it allows the datasets to be examined as a whole, revealing the relationships between authors, comments, and replies. Additionally, network analysis enables a zoom-in approach in which an individual or a set of comments and replies can be closely investigated.
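As a rough illustration of the framework described above, the sketch below builds author-to-author reply edges from a handful of comments and attaches a mean civility score to each author. The record layout, the author names, and the scores are all hypothetical, and counting replies received is only one simple stand-in for "influence"; it is not the code used in the project.

```python
from collections import defaultdict

# Hypothetical records: (comment_id, author, parent_id, civility_score).
# parent_id is None for a top-level comment; the scores are made up.
COMMENTS = [
    ("c1", "alice", None, 0.9),
    ("c2", "bob", "c1", 0.4),
    ("c3", "carol", "c1", 0.8),
    ("c4", "alice", "c2", 0.7),
]


def reply_edges(comments):
    """Return (replier, replied_to) author pairs, one per reply."""
    author_of = {cid: author for cid, author, _, _ in comments}
    return [
        (author, author_of[parent])
        for _, author, parent, _ in comments
        if parent in author_of
    ]


def replies_received(edges):
    """Count how many replies each author attracted (a crude influence proxy)."""
    counts = defaultdict(int)
    for _, target in edges:
        counts[target] += 1
    return dict(counts)


def mean_civility(comments):
    """Average the civility scores of each author's comments."""
    scores = defaultdict(list)
    for _, author, _, score in comments:
        scores[author].append(score)
    return {author: sum(vals) / len(vals) for author, vals in scores.items()}


edges = reply_edges(COMMENTS)
influence = replies_received(edges)
civility = mean_civility(COMMENTS)

# Each author becomes a node carrying two attributes.
for author in sorted(civility):
    print(author, influence.get(author, 0), round(civility[author], 2))
```

With a graph library, the same edges and per-author attributes could be loaded into a directed graph so that community detection and centrality measures can be run on it.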

To examine the unique words in each of the top 20 communities, I employed the tf-idf method. The distinctive words it returned matched the topic of each post, since one community corresponds to one post together with its comments and replies.
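The idea behind tf-idf can be sketched in a few lines, treating each community as one "document". The version below uses the plain textbook weighting (relative term frequency times log(N / document frequency)); the community names and tokens are invented, and the weighting variant actually used in the project is not stated, so this is an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a community name to its list of tokens.

    Returns {community: {word: score}} using tf * log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per community

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical token lists for two communities (one post thread each).
communities = {
    "post_a": ["tax", "policy", "people", "tax"],
    "post_b": ["climate", "policy", "people", "climate"],
}

scores = tf_idf(communities)
for name, word_scores in scores.items():
    top_word = max(word_scores, key=word_scores.get)
    print(name, top_word)
```

Words that occur in every community, such as "policy" here, receive a score of zero, which is why the top-ranked words tend to be the ones specific to a single post.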

To compare the unique words from each community with topics discussed, I applied a topic modeling method on the topics and topic texts of r/ChangeMyView and a random channel. This revealed that all topics are discussed more in r/ChangeMyView when compared to a random channel, and each community is composed of unique words corresponding to different topics. (Network analysis, tf-idf, and topic modeling codes)

During this time, I received many useful tips and comments that I can use in the future. Some of those include how to divide the workforce according to skills and interests and how to run subtasks to achieve a bigger goal.

After the interim presentation, realizing we had only one more day, our group spent the whole day and night producing a data-driven dataset with manually annotated civility scores and successfully used it to train the annotation model. After the main building closed, we moved to the Digital Humanities Lab in the Metsätalo building to continue working on our project. Meanwhile, the poster team was compiling all the results. By 1 a.m. on the following Friday, the last day of DHH24, we had finished our project and handed in the poster.

We spent the following morning preparing the final presentation slides, and we delivered our findings on that Friday afternoon. The subsequent poster exhibition went smoothly, with many participants and sponsors eager to learn more about our projects.

Despite the time-consuming annotation task and the incomplete network analysis, I was satisfied with our group's outcomes. We achieved nearly all our goals and made significant progress in understanding civility in online discourse. To further develop the project, the annotation model should be integrated with the network analysis method. The model can assign civility scores to all posts, comments, and replies, which will be mapped as attributes to each node in the network. This approach offers extensive possibilities for exploring civility in Reddit discourse.

Overall, participating in the Helsinki Digital Humanities Hackathon 2024 was a fulfilling experience. I learned a great deal from the collaboration and teamwork within our group and from other groups as well.

Who was done what? A Parser-based Study of Passive Voice Constructions in Media Discourse on the Russo-Ukrainian War

The Russo-Ukrainian War is considered the biggest attack on Europe and the first full-scale war in Europe since the Second World War. Media and press do not merely communicate events; they have also been used as a tool to propagate agendas and play a vital role in shaping the audience's perspectives on those events. This prompts scholars to investigate language use in news to examine linguistic phenomena underlain by political and ideological currents.

This project analyzes subject-verb pairs occurring in passive voice constructions in news corpora. Passive voice constructions play a crucial role in revealing how agency and responsibility are represented, especially in the discourse surrounding war and conflict, and this aspect has been underexplored in previous studies. Datasets used in this study include the Ukraine War corpus (170K), Leipzig news corpora (17M), and English-language Russian news corpora (210K). By examining passive pairs, proper noun subjects, and by-agent phrases, and by conducting diachronic analysis of active and passive voice, short and long passives, and be- and get-passives, the study aims to answer the following research questions: i) How are Russia and Ukraine represented in Western news media compared to Russian news media? ii) How are their agency and responsibility delegated? and iii) How does the usage of passive voice change over time?

A combination of theoretical linguistic frameworks on passive voice and semantic roles, corpus linguistic methodologies, statistical measures, and natural language processing (NLP) methods is employed. The spaCy parser is used to extract passive subject-verb pairs. A semantic map aids the categorization of words. Topic modeling reveals different topics within a corpus, and t-score and p-value are used to measure collocational strength.
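To make the collocation measure concrete, here is a minimal sketch of the t-score as it is commonly defined in corpus linguistics: the observed co-occurrence count minus the count expected under independence, divided by the square root of the observed count. The function names and all counts are hypothetical and are not taken from the study's corpora.

```python
import math


def expected_cooccurrence(freq_a, freq_b, corpus_size):
    """Co-occurrences expected if the two words were independent."""
    return freq_a * freq_b / corpus_size


def t_score(observed, freq_a, freq_b, corpus_size):
    """Collocation t-score: (O - E) / sqrt(O)."""
    expected = expected_cooccurrence(freq_a, freq_b, corpus_size)
    return (observed - expected) / math.sqrt(observed)


# Made-up counts for one passive subject-verb pair.
subject_freq = 50      # occurrences of the subject noun
verb_freq = 200        # occurrences of the passive verb
pair_freq = 12         # times they occur together as a passive pair
corpus_size = 100_000  # total number of tokens

score = t_score(pair_freq, subject_freq, verb_freq, corpus_size)
print(round(score, 3))
```

A pair that co-occurs far more often than chance predicts receives a high t-score, while a pair seen only once or twice stays low even if the expected count is tiny.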

The main findings include the representation of Russia in Western media as a malevolent actor towards both its own people and foreign citizens, while Ukraine is represented as a victim. The use of euphemism with Russia signals mechanistic dehumanization. In the Russian news, Russia is found to be associated with neutral and positive verbs. Regarding the assignment of responsibility, weapons are found to be used to perform violent actions, while human agents are more often associated with restrictive or non-violent actions. The diachronic analysis indicates an increased usage of short and get-passives but a gradual decrease of passive voice in general.

Analyzing Persuasive Techniques in TED Talks

This report discusses a project involving a persuasive technique annotation scheme applied to a TEDTalk dataset. As TEDTalks are by convention persuasive texts with easily available transcripts, they are apt for a speculative exploration into the design and implementation of a corpus using TED as a dataset and its annotation for persuasive techniques. The 10 most watched TEDTalk videos are selected for the annotation tasks. The annotation is completed in the INCEpTION environment using 12 categories (tags) and 4 annotators. The datasets are categorized based on the persuasive technique framework (ethos, logos, pathos) (Higgins & Walker, 2012). Fleiss's kappa (κ) agreement coefficient (Fleiss, 1971) is used to measure the inter-annotator agreement (IAA). The score for each category is also reported. The overall score of 0.271 indicates a fair agreement level among annotators, with the highest agreement on the Call-To-Action (CTA) category at 0.653 and the lowest agreement on Pathos-Emotion (PAT-EMO) and Pathos-Rhetorical Question (PAT-RH) at 0.066 and 0.058, respectively. The categorical scores imply that anecdote-related categories (PAT-ANE, ETH-ANE, and LOG-ANE) are relatively challenging to annotate, and emotion-related categories (especially PAT-EMO and PAT-RH) are the most difficult ones.
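For readers unfamiliar with the agreement measure, the following is a small sketch of Fleiss's kappa computed from a table of category counts. The toy table (four items, four annotators, three categories) is invented for illustration and has no connection to the reported scores.

```python
def fleiss_kappa(table):
    """Fleiss's kappa for a count table.

    table[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same number of ratings.
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: share of agreeing annotator pairs per item.
    per_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(table[0])
    proportions = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy table: 4 items, 4 annotators, 3 categories (made-up counts).
toy = [
    [4, 0, 0],
    [2, 2, 0],
    [0, 3, 1],
    [1, 1, 2],
]
kappa = fleiss_kappa(toy)
print(round(kappa, 3))
```

Computing the same statistic on the rows belonging to a single tag is one way to obtain per-category scores, although other per-category formulations exist.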

Exploring Modality in 19th-Century Diplomatic Treaties: A Parser-based Linguistic Analysis of the Burney, Robert, and Bowring Treaties between Siam (now Thailand) and Western Powers

This research proposal aims to analyze modal verb phrases (modal verb + infinitive verb) in three treaties, the Burney, Roberts, and Bowring Treaties, which differ in their objectives and periods. Modal verbs like shall, should, can, or could convey different attitudes, and in the discourse of international relations and politics, modal verbs can reveal hidden intentions and underlying power dynamics. Although the three diplomatic treaties have been well studied in terms of their societal, political, and economic consequences, the use of modal verbs in them has been underexplored. This study attempts to answer the following research questions: i) what are the most common modal verb phrases used in the three treaties? ii) what do the modal verb phrases represent? and iii) how do the occurrence frequencies compare between the three treaties and over time?

A combination of qualitative and quantitative methods and tools is implemented. While the Burney and the Bowring Treaties can be downloaded from Wikisource, the Roberts Treaty needs to be digitized. Therefore, the study also employs Transkribus as an image-to-text digitization tool. To process the treaty texts, the spaCy parser is used to extract the modal verb phrases. A set of nine modal verbs is used as keywords together with the dependency tag "aux" to identify the modal verb phrases. The context words in the same sentence as each modal verb phrase are also extracted into sub-corpora for further analysis. The occurrences of modal verb phrases are counted and categorized based on Someya's 13 categories (2010). The significance of the phrases is evaluated using frequency analysis, O/E score, t-score, and p-value. Semantic shift analysis is conducted based on Millar's framework of epistemic and deontic sense (2009). Close reading, a semantic map, and Critical Discourse Analysis (CDA) are also employed for a local-scale analysis.
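The extraction step described above can be sketched without running a parser by working on tokens that already carry a dependency label and a head index. The `Token` class below is a hypothetical stand-in for a parsed token (in spaCy the corresponding attributes are `token.dep_` and `token.head`), the example parses are written by hand, and the list of nine modals is an assumption, since the proposal does not enumerate them.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed list: the nine central English modals.
MODALS = {"can", "could", "may", "might", "shall", "should",
          "will", "would", "must"}


@dataclass
class Token:
    text: str
    dep: str   # dependency label of this token
    head: int  # index of the token's syntactic head in the sentence


def modal_phrases(tokens):
    """Return (modal, main_verb) pairs for modals attached as 'aux'."""
    return [
        (tok.text.lower(), tokens[tok.head].text.lower())
        for tok in tokens
        if tok.dep == "aux" and tok.text.lower() in MODALS
    ]


# Hand-written parse of "The merchants shall pay the duties".
sentence_1 = [
    Token("The", "det", 1),
    Token("merchants", "nsubj", 3),
    Token("shall", "aux", 3),
    Token("pay", "ROOT", 3),
    Token("the", "det", 5),
    Token("duties", "dobj", 3),
]

# Hand-written parse of "They have agreed": 'have' is an auxiliary
# but not a modal, so it should be skipped.
sentence_2 = [
    Token("They", "nsubj", 2),
    Token("have", "aux", 2),
    Token("agreed", "ROOT", 2),
]

counts = Counter(modal_phrases(sentence_1) + modal_phrases(sentence_2))
print(counts.most_common())
```

Filtering on both the keyword list and the "aux" label keeps non-modal auxiliaries such as "have" out of the counts while still linking each modal to the verb it modifies.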

Navigating the Evolution of Language and Brain: A Journey through Time and Technology

Embarking on a historical voyage through the intricate landscape of language and the brain, this essay unfolds the chronology of discoveries and technologies that have shaped our understanding. From the classic models of Broca and Wernicke and the cutting-edge neuroimaging tools to the Dual Stream model and computational modeling, the narrative seeks to weave a coherent tapestry of the journey into the neural bases of language.

Truths to be Told: Protein, Veggies, and Carbs for Culinary Sustainability

Zenodo (CERN European Organization for Nuclear Research), Jun 21, 2023

The intricate relationship between personal well-being and the health of our planet calls for mindful choices that can shape a healthier future. This essay delves into the impact of meat consumption, the benefits of vegetables, the misconceptions surrounding carbohydrates, and the path to sustainable living.

Exploring Vocabulary Development of European and Latin American Spanish-Speaking Children: Insights from the Wordbank Dataset

Zenodo (CERN European Organization for Nuclear Research), May 22, 2023

1 Introduction

Vocabulary development plays a crucial role in children's language acquisition, serving as a foundation for communication and cognitive growth. Understanding how vocabulary development varies across different populations is essential for gaining insights into language acquisition processes and identifying potential linguistic and cultural influences. This study aims to explore the vocabulary development of European and Latin American Spanish-speaking children and investigate the variations that may exist across different age groups. Specifically, we seek to answer the research question: How does vocabulary development vary across age groups between European Spanish and Latin American Spanish-speaking children?

2 Datasets

To investigate this research question, we employ the Wordbank dataset, a valuable resource that provides comprehensive data on children's language development across various linguistic contexts. The Wordbank dataset comprises a large collection of parental reports, offering insights into children's vocabulary acquisition, language exposure, and linguistic milestones. For this study, we extract data from the Wordbank dataset specifically for children growing up in European Spanish-speaking countries, i.e. Spain, and Latin American Spanish-speaking countries, i.e. Mexico. We aim to capture potential variations in vocabulary development influenced by linguistic, cultural, and environmental factors.

Wordbank is a site for archiving, sharing, and exploring anonymized MacArthur-Bates Communicative Development Inventory (CDI) data from the original English form and from CDI adaptations in many languages (such as Croatian, Danish, English, German, Italian, Norwegian, Russian, Spanish, Swedish, and Turkish). Wordbank compiles responses from norming studies but also includes data that individual researchers have contributed from various research projects, large and small (Frank et al. 2021). Out of 16,868 entries in the admins dataframe, we keep only the Spanish-language entries from Europe and Mexico, which leaves 2,939 entries.

3 Methods

After obtaining the datasets, we employ several statistical analyses and data visualization methods in R to compare, for example, vocabulary size, growth trajectories, or specific word types between these varieties of Spanish.

4 Results

The dataset contains several variables that could affect the vocabulary development of Spanish-speaking babies in Europe and Mexico: age, gender, maternal education, and birth order.

4.1 Age

4.2 Age and Gender

The charts in sections 4.1 and 4.2 reveal that during the first 20 months, Spanish-speaking babies comprehend more words than they produce. They continue to build up their vocabularies and produce more, reaching about 300 words at 25 months, except Spanish-speaking baby girls from both Spain and Mexico, who average only around 200 words. By the age of 30 months, they all reach around 400 words in both comprehension and production.

4.3 Mother Education

We can examine whether the mother's educational level correlates with the child's vocabulary development. The bar chart shows an overview of the education of the mothers of Spanish-speaking children in Spain and Mexico, ranging from the lowest level "None" to the highest level "Graduate." A majority of mothers in Spain possess a graduate or a college degree (514 and 263 respectively), while in Mexico, most mothers have some college education (361) or a degree lower than college (1879). It can be concluded that mothers in Spain have higher education than those in Mexico.

4.3.1 Mother's Education Level and Comprehended Words

To see the correlation between the mother's level of education and the child's vocabulary development, we make two types of visualizations. First, we make a box plot. Then, we draw a correlation plot and calculate a correlation coefficient.

4.3.2 Mother's Education Level and Produced Words

The charts in sections 4.3.1 and 4.3.2 show that babies raised by mothers with a primary education in Spain understand and produce more words than those in Mexico (199.87 and 163.61 vs. 188.82 and 132.44, respectively). On the other hand, babies who grow up with mothers with a secondary school degree in Spain comprehend and produce fewer words than those in Mexico (188.59 and 154.49 vs. 254.24 and 254.24, respectively). The same observation holds at the college level, with 182.27 and 152.65 words in Spain vs. 525.17 and 525.17 words in Mexico. It is worth noting that the above findings might be biased because there are many more entries in the Mexico dataset than in the Spain dataset (391,419 vs. 169,446 for comprehension and 312,395 vs. 136,242 for production), as in the table below.

Comprehension and Production Counts by Language

A brief Overview of Case System of Moroccan Arabic, Spanish, and Quechua

Zenodo (CERN European Organization for Nuclear Research), Mar 26, 2023

“Case marking is one of the most important areas of linguistic typology and universals” (Croft 2003, p. 214). Case marking signifies the grammatical relationship between a noun or pronoun and other sentence elements. Languages mark cases in various ways, including through inflectional endings on nouns, pronouns, and adjectives; through prepositions or postpositions; through word order; or through a combination of these methods. Chapter 49 in the World Atlas of Language Structures (WALS) website focuses on the morphological case marking (Iggesen 2003). This essay discusses the case marking of Moroccan Arabic, Spanish, and Quechua (Imbabura).

Collocation and colligation analysis of the verb "wax"

Zenodo (CERN European Organization for Nuclear Research), Dec 23, 2022

Start to do VS Start doing: A Diachronic Corpus-based Analysis

Zenodo (CERN European Organization for Nuclear Research), Dec 23, 2022

Thai Script: The Romanization of Thai Script and Thai Magic Tattoos

Zenodo (CERN European Organization for Nuclear Research), Dec 14, 2022

Investigating the Relationship between Case Marking and Word Order in Languages: A Cross-Linguistic Analysis using WALS Data

Zenodo (CERN European Organization for Nuclear Research), May 27, 2023

A diachronic corpus-based analysis: The rise and fall of conjunctions for, as, and because

Zenodo (CERN European Organization for Nuclear Research), Dec 23, 2022

Overview of Case Marking of Moroccan Arabic, Spanish, and Quechua (Imbabura) in WALS (the World Atlas of Language Structures)

Zenodo (CERN European Organization for Nuclear Research), Apr 4, 2023

WALS Exercise (Chapter 49: Number of Cases)

"Case marking is one of the most important areas of linguistic typology and universals" (Croft 2003, p. 214). Case marking signifies the grammatical relationship between a noun or pronoun and other sentence elements. Languages mark cases in various ways, including through inflectional endings on nouns, pronouns, and adjectives; through prepositions or postpositions; through word order; or through a combination of these methods. Chapter 49 in the World Atlas of Language Structures (WALS) website focuses on the morphological case marking (Iggesen 2003). This essay discusses the case marking of Moroccan Arabic, Spanish, and Quechua (Imbabura).

In Moroccan Arabic, prepositions and particles are used extensively as case markers (Harrell 1962). Taha (1993) specifies that the language has five cases: nominative, accusative, genitive, dative, and locative. Beyond these basic grammatical categories, Fassi Fehri (2011) mentions that prepositions and particles in Moroccan Arabic can be used to convey a wide range of semantic relationships. Harrell (1962) exemplifies other usages: the preposition mā (with) is used to show instrumental relationships, while the particle min (from) is used to indicate an ablative relationship. Reviewing the above reference grammar, I would assign the feature value 7 to Moroccan Arabic.

Similar to Moroccan Arabic, for Spanish, although Butt and Benjamin (2011) do not use the traditional labels, they identify five comparable cases: nominative, accusative, dative, genitive, and locative. Moreover, Bosque and Demonte (1999), De Bruyne, Pountain, and Kattán-Ibarra (2013), and Kattán-Ibarra and Howkins (2014) discuss markers that indicate instrumental and ablative cases like con (with) and de (from), respectively. With a total of 7 cases from this observation, I believe the assigned value of "No morphological case-marking" in WALS is inaccurate.

Quechua (Imbabura) has a complex case system. Quechua marks cases through suffixes on nouns and adjectives (Sánchez-Moreno 2019). Cole (1982) identifies eight cases: nominative, accusative, dative, genitive, ablative, allative, instrumental, and comitative, while Cerrón-Palomino (1994) and Adelaar and Muysken (2004) identify a ninth case: locative. Case markers in Quechua (Imbabura) are, for example, the suffix -wan (with) used for instrumental case, and the suffix -pi (from) used for ablative case. In my opinion, the feature value "8-9 cases" assigned to Quechua (Imbabura) is accurate.

In conclusion, case marking plays a crucial role in indicating the grammatical relationship between a noun or pronoun and other sentence elements. In this essay, the case marking systems of Moroccan Arabic, Spanish, and Quechua (Imbabura) were discussed. Moroccan Arabic and Spanish have five cases each, and both languages use prepositions and particles to convey different semantic relationships. On the other hand, Quechua (Imbabura) has a complex case system, with eight to nine cases marked through suffixes on nouns and adjectives. WALS assigns values to the number of cases in a language, and the observations in this essay suggest that the values for Moroccan Arabic and Spanish need to be updated.

Coffee and Tea? A diachronic corpus-based, collocation and colligation analysis of the words coffee and tea

Zenodo (CERN European Organization for Nuclear Research), Mar 20, 2023

Coffee and Tea? A corpus-based collocation and colligation analysis of the words coffee and tea

Zenodo (CERN European Organization for Nuclear Research), Mar 20, 2023

Coffee and tea are among the most popular beverages in the world. Coffee and tea have been an integral part of many cultures around the world for centuries, and the words are used widely in everyday life. This paper aims to compare different methods to analyze the words coffee and tea in a corpus. The methods employed in this study are frequency analysis,

East Meets West: A collocation and colligation analysis of "vaccine" with the keywords: Pfizer, Moderna, Sinovac, and Novavax

Zenodo (CERN European Organization for Nuclear Research), Dec 14, 2022

The study of the Northern Thai dialect: Phonetic variations of Sao Wa sub-variant in Chiang Rai province

Zenodo (CERN European Organization for Nuclear Research), Apr 10, 2023

Coffee and Tea? A comparison of different methods in corpus-based natural language processing on "coffee" and "tea"

Zenodo (CERN European Organization for Nuclear Research), Mar 20, 2023

Coffee and tea have been an integral part of many cultures around the world for centuries, and the words are used widely in everyday life. This paper aims to compare different methods to analyze the words coffee and tea in a corpus. The methods employed in this study are frequency analysis,

"What did they say?" Network analysis of Twitter quotes @JoeBiden and @realDonaldTrump during 2020 United States presidential election (second debate)

Zenodo (CERN European Organization for Nuclear Research), Jan 17, 2023

Research paper thumbnail of Helsinki Digital Humanities Hackathon 2024 Portfolio

It was a wonderful opportunity to have participated in the Helsinki Digital Humanities Hackathon ... more It was a wonderful opportunity to have participated in the Helsinki Digital Humanities Hackathon 2024. The 8 days workshop was full of collaborations and ideas. I enjoyed the teamwork from students and professionals from different backgrounds. Our Civility in an Online Discourse group consisted of 12 collaborators including two team leaders. We attempted to examine civility in Reddit datasets.

The online orientation sessions significantly facilitated group dynamics, allowing team members to get to know each other and familiarize themselves with the datasets. I particularly appreciated the provided pre-readings, which offered a comprehensive overview of Reddit's structure and components. These readings also inspired ideas on which features could be extracted from the data, helping team members formulate relevant research questions. Based on our topics of interest, we divided our group into two teams: one focusing on structural cues and the other on linguistic cues.

During the first three days, the structural cue team familiarized themselves with the data and produced various visualizations. Meanwhile, the linguistic cue team worked on defining and conceptualizing civility, turning theory into practice. By the end of the first week, our group had a clearer direction: to train an annotation model that could predict civility scores. We summarized our progress and presented it in a preliminary presentation that Friday.

To train the annotation model, we needed annotated datasets. Therefore, we wrote an annotation codebook and designed an annotation scheme based on our definitions and concepts of civility (Annotation codebook). We spent the next few days annotating the data.

Concurrently, the structural cue team delved deeper into the relationship between different elements within the datasets, conducting more in-depth analyses.

At the beginning of the second week, the linguistic cue team focused on adjusting the annotation scheme and environment and calculating an inter-annotator agreement score. Although this stage was time-consuming and required significant effort, it was crucial for training an efficient model. We reported our progress in an interim presentation on Wednesday.

In the meantime, I integrated comments, replies, and authors from Reddit into a network analysis framework. Network analysis can identify influential users and their impact on discourse civility. In addition, community detection can be used to find clusters of civil or uncivil behavior. Lastly, performing close-reading on the comments can identify common topics and their relation to civility. This network when incorporated with civility scores derived from the model predictions makes an appropriate framework to study civility in Reddit discourse because it allows the examination of the datasets as a whole revealing the relationships between authors, comments, and replies. Additionally, network analysis enables a zoom-in approach where each individual or a set of comments and replies can be closely investigated.

To examine the unique words in each of the top 20 communities, I employed the tf-idf method. This finding is in accordance with the topic of each post, together with its comments and replies as one community represents one post.

To compare the unique words from each community with topics discussed, I applied a topic modeling method on the topics and topic texts of r/ChangeMyView and a random channel. This revealed that all topics are discussed more in r/ChangeMyView when compared to a random channel, and each community is composed of unique words corresponding to different topics. (Network analysis, tf-idf, and topic modeling codes)

During this time, I received many useful tips and comments that I can use in the future. Some of those include how to divide the workforce according to skills and interests and how to run subtasks to achieve a bigger goal.

After the interim presentation, realizing we only had one more day, our group spent the whole day and whole night to finally come up with a data-driven dataset with manually annotated civility scores and successfully used it to train the annotation model. After the main building was closed, we moved to the Digital Humanities Lab in Metsätalo building to continue working on our project. Simultaneously, the poster team was compiling and binding all the results together. By 1 a.m. on the following Friday, the last day of DHH24, we finished our project and handed in the poster.

We spent the following morning preparing the final presentation slides, and we delivered our findings on that Friday afternoon. The subsequent poster exhibition went smoothly, with many participants and sponsors eager to learn more about our projects.

Despite the time-consuming annotation task and the incomplete network analysis, I was satisfied with our group's outcomes. We achieved nearly all our goals and made significant progress in understanding civility in online discourse. To further develop the project, the annotation model should be integrated with the network analysis method. The model can assign civility scores to all posts, comments, and replies, which will be mapped as attributes to each node in the network. This approach offers extensive possibilities for exploring civility in Reddit discourse.

Overall, participating in the Digital Humanities Hackathon Helsinki 2024 was a fulfilling experience. I learned a great deal from the collaboration and teamwork within our group and from other groups as well.

Research paper thumbnail of Who was done what? A Parser-based Study of Passive Voice Constructions in Media Discourse on the Russo-Ukrainian War

The Russo-Ukrainian War is considered the biggest attack on Europe and the first full-scale war in Europe since the Second World War. Media and the press do not merely communicate events; they have also been used as a tool to propagate agendas, and they play a vital role in shaping audiences' perspectives on those events. This prompts scholars to investigate language use in news in order to examine linguistic phenomena shaped by political and ideological currents.

This project analyzes subject-verb pairs occurring in passive voice constructions in news corpora. Passive voice constructions play a crucial role in revealing how agency and responsibility are represented, especially in the discourse surrounding war and conflict, and this aspect has been underexplored in previous studies. The datasets used in this study include the Ukraine War corpus (170K), the Leipzig news corpora (17M), and English-language Russian news corpora (210K). By examining passive pairs, proper noun subjects, and by-agent phrases, and by conducting diachronic analyses of active and passive voice, short and long passives, and be- and get-passives, the study aims to answer the following research questions: i) How are Russia and Ukraine represented in Western news media compared with Russian news media? ii) How are their agency and responsibility delegated? iii) How does the use of passive voice change over time?

A combination of theoretical linguistic frameworks on passive voice and semantic roles, corpus linguistic methodologies, statistical measures, and natural language processing (NLP) methods is employed. The spaCy parser is used to extract passive subject-verb pairs. A semantic map aids the categorization of words. Topic modeling reveals the different topics within a corpus, and t-scores and p-values are used to measure collocational strength.
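To make the collocational measure concrete, the sketch below computes a t-score for a subject-verb pair using the common corpus-linguistic formula t = (O - E) / sqrt(O), where O is the observed pair count and E = f(subject) * f(verb) / N is the count expected by chance. The abstract does not state which variant the study used, so the formula, the function name `t_score`, and the toy pairs are assumptions for illustration only.

```python
import math
from collections import Counter

def t_score(pair_count, subject_count, verb_count, total_pairs):
    """Collocation t-score: (observed - expected) / sqrt(observed)."""
    expected = subject_count * verb_count / total_pairs
    return (pair_count - expected) / math.sqrt(pair_count)

# Hypothetical passive subject-verb pairs, standing in for parser output.
pairs = (
    [("city", "shelled")] * 4
    + [("city", "rebuilt")]
    + [("bridge", "shelled")]
    + [("bridge", "rebuilt")] * 4
)

pair_counts = Counter(pairs)
subject_counts = Counter(subject for subject, _ in pairs)
verb_counts = Counter(verb for _, verb in pairs)

score = t_score(
    pair_counts[("city", "shelled")],
    subject_counts["city"],
    verb_counts["shelled"],
    len(pairs),
)
```

A positive score means the pair occurs more often than the individual frequencies of subject and verb would predict; a negative score means it occurs less often.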

The main findings include the representation of Russia in Western media as a malevolent actor towards both its own people and foreign citizens, while Ukraine is represented as a victim. The use of euphemism with Russia signals mechanistic dehumanization. In the Russian news, Russia is associated with neutral and positive verbs. Regarding the assignment of responsibility, weapons are found to be used to perform violent actions, while human agents are more often associated with restrictive or non-violent actions. The diachronic analysis indicates increased use of short and get-passives but a gradual decrease in passive voice overall.

Analyzing Persuasive Techniques in TED Talks

This report discusses a project that applies a persuasive technique annotation scheme to a TED Talk dataset. As TED Talks are by convention persuasive texts with easily available transcripts, they are well suited to an exploratory study of the design and implementation of a corpus annotated for persuasive techniques. The 10 most watched TED Talk videos are selected for the annotation tasks. The annotation is completed in the INCEpTION environment using 12 categories (tags) and 4 annotators. The datasets are categorized based on the persuasive technique framework (ethos, logos, pathos) (Higgins & Walker, 2012). Fleiss's κ agreement coefficient (Fleiss, 1971) is used to measure the inter-annotator agreement (IAA). The score for each category is also reported. The overall score of 0.271 indicates a fair level of agreement among annotators, with the highest agreement on the Call-To-Action (CTA) category at 0.653 and the lowest on Pathos-Emotion (PAT-EMO) and Pathos-Rhetorical Question (PAT-RH) at 0.066 and 0.058, respectively. The categorical scores imply that anecdote-related categories (PAT-ANE, ETH-ANE, and LOG-ANE) are relatively challenging to annotate, and that emotion-related categories (especially PAT-EMO and PAT-RH) are the most difficult.
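For readers unfamiliar with the agreement measure, Fleiss's κ compares the observed per-item agreement with the agreement expected from the overall category proportions. The sketch below follows the standard definition (Fleiss, 1971); the function name `fleiss_kappa` and the toy rating tables are assumptions for illustration and are not the project's data.

```python
def fleiss_kappa(table):
    """Fleiss's kappa for a table of counts.

    table[i][j] is the number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])
    total = n_items * n_raters

    # Share of all assignments that went to each category.
    p_category = [sum(row[j] for row in table) / total for j in range(n_categories)]
    # Agreement on each item: proportion of annotator pairs that agree.
    p_item = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings: 4 annotators, 2 categories, 2 items each.
perfect = [[4, 0], [0, 4]]   # all annotators agree on every item
split = [[2, 2], [2, 2]]     # annotators are evenly divided on every item

kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

Perfect agreement yields a κ of 1, while agreement below chance level yields a negative value. Per-category scores such as those reported above require a category-wise variant of the statistic, which this sketch does not cover.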

Exploring Modality in 19th-Century Diplomatic Treaties: A Parser-based Linguistic Analysis of the Burney, Robert, and Bowring Treaties between Siam (now Thailand) and Western Powers

This research proposal aims to analyze modal verb phrases (modal verb + infinitive verb) in three treaties, the Burney, Roberts, and Bowring Treaties, which differ in their objectives and periods. Modal verbs like shall, should, can, or could convey different attitudes, and in the discourse of international relations and politics they can reveal hidden intentions and underlying power dynamics. Although the three diplomatic treaties have been well studied in terms of their societal, political, and economic consequences, their use of modal verbs has been underexplored. This study attempts to answer the following research questions: i) what are the most common modal verb phrases used in the three treaties? ii) what do the modal verb phrases represent? and iii) how do the occurrence frequencies compare between the three treaties and over time?

A combination of qualitative and quantitative methods and tools is used. While the Burney and Bowring Treaties can be downloaded from Wikisource, the Roberts Treaty needs to be digitized, so the study also employs Transkribus as an image-to-text digitization tool. To process the treaty texts, the spaCy parser is used to extract the modal verb phrases. A set of nine modal verbs is used as keywords together with the dependency tag "aux" to identify the modal verb phrases. The context words in the same sentence as each modal verb phrase are also extracted into sub-corpora for further analysis. The occurrences of modal verb phrases are counted and categorized based on Someya's (2010) 13 categories. The significance of the phrases is evaluated using frequency analysis, O/E score, t-score, and p-value. Semantic shift analysis is conducted based on Millar's (2009) framework of epistemic and deontic senses. Close reading, a semantic map, and Critical Discourse Analysis (CDA) are also employed for local-scale analysis.
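The extraction rule described above (a modal keyword carrying the dependency tag "aux", paired with the verb it attaches to) can be sketched on pre-parsed tokens. To stay self-contained, the example uses a small hypothetical `Token` tuple in place of a real spaCy parse. The nine modals listed in `MODALS` are an assumption, since the abstract does not name them, and the sentences are invented rather than taken from the treaties.

```python
from collections import Counter
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    dep: str   # dependency label, e.g. "aux" for auxiliaries
    head: int  # index of the head token within the sentence

# Assumed keyword set; the study's actual nine modals are not listed in the abstract.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

def modal_phrases(sentence):
    """Return (modal, head verb) pairs for modal auxiliaries in a parsed sentence."""
    return [
        (token.text.lower(), sentence[token.head].text.lower())
        for token in sentence
        if token.dep == "aux" and token.text.lower() in MODALS
    ]

# Hypothetical parses, written by hand for illustration.
sentence_1 = [
    Token("Merchants", "nsubj", 2),
    Token("shall", "aux", 2),
    Token("pay", "ROOT", 2),
    Token("duties", "dobj", 2),
]
sentence_2 = [
    Token("They", "nsubj", 3),
    Token("have", "aux", 3),   # auxiliary, but not a modal keyword
    Token("not", "neg", 3),
    Token("agreed", "ROOT", 3),
]

phrase_counts = Counter(modal_phrases(sentence_1) + modal_phrases(sentence_2))
```

Combining the keyword list with the "aux" label filters out non-modal auxiliaries such as "have", as well as keyword forms that are not functioning as auxiliaries.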

Navigating the Evolution of Language and Brain: A Journey through Time and Technology

Embarking on a historical voyage through the intricate landscape of language and the brain, this essay unfolds the chronology of discoveries and technologies that have shaped our understanding. From the classic models of Broca and Wernicke and the cutting-edge neuroimaging tools to the Dual Stream model and computational modeling, the narrative seeks to weave a coherent tapestry of the journey into the neural bases of language.

Truths to be Told: Protein, Veggies, and Carbs for Culinary Sustainability

Zenodo (CERN European Organization for Nuclear Research), Jun 21, 2023

The intricate relationship between personal well-being and the health of our planet calls for mindful choices that can shape a healthier future. This essay delves into the impact of meat consumption, the benefits of vegetables, the misconceptions surrounding carbohydrates, and the path to sustainable living.

Exploring Vocabulary Development of European and Latin American Spanish-Speaking Children: Insights from the Wordbank Dataset

Zenodo (CERN European Organization for Nuclear Research), May 22, 2023

Introduction

Vocabulary development plays a crucial role in children's language acquisition, serving as a foundation for communication and cognitive growth. Understanding how vocabulary development varies across different populations is essential for gaining insights into language acquisition processes and identifying potential linguistic and cultural influences. This study aims to explore the vocabulary development of European and Latin American Spanish-speaking children and investigate the variations that may exist across different age groups. Specifically, we seek to answer the research question: How does vocabulary development vary across age groups between European Spanish and Latin American Spanish-speaking children?

2 Datasets

To investigate this research question, we employ the Wordbank dataset, a valuable resource that provides comprehensive data on children's language development across various linguistic contexts. The Wordbank dataset comprises a large collection of parental reports, offering insights into children's vocabulary acquisition, language exposure, and linguistic milestones. For this study, we extract data from the Wordbank dataset specifically for children growing up in European Spanish-speaking countries, i.e. Spain, and Latin American Spanish-speaking countries, i.e. Mexico. We aim to capture potential variations in vocabulary development influenced by linguistic, cultural, and environmental factors.

Wordbank is a site for archiving, sharing, and exploring anonymized MacArthur-Bates Communicative Development Inventory (CDI) data from the original English form and from CDI adaptations in many languages (such as Croatian, Danish, English, German, Italian, Norwegian, Russian, Spanish, Swedish, and Turkish). Wordbank compiles responses from norming studies but also includes data that individual researchers have contributed from various research projects, large and small (Frank et al 2021). Of the 16,868 entries in the admins dataframe, we keep only those for Spanish in Europe and Mexico, which leaves 2,939 entries.

3 Methods

After obtaining the datasets, we employ several statistical analyses and data visualization methods in R to compare, for example, vocabulary size, growth trajectories, or specific word types between these varieties of Spanish.

4 Results

The dataset contains several variables that could affect the vocabulary development of Spanish-speaking babies in Europe and Mexico: age, gender, maternal education, and birth order.

4.1 Age

4.2 Age and Gender

The charts in sections 4.1 and 4.2 reveal that during the first 20 months, Spanish-speaking babies comprehend more words than they produce. They continue to build their vocabularies and produce more, reaching about 300 words at 25 months, except for girls in both Spain and Mexico, who average only around 200 words. By the age of 30 months, they all reach around 400 words in both comprehension and production.

4.3 Mother Education

We can examine whether the mother's educational level correlates with the child's vocabulary development. The bar chart shows an overview of the education of mothers of Spanish-speaking children in Spain and Mexico, ranging from the lowest level, "None," to the highest level, "Graduate." A majority of mothers in Spain hold a graduate or a college degree (514 and 263, respectively), while in Mexico most mothers have attended some college (361) or have a degree lower than college (1,879). It can be concluded that mothers in Spain have a higher level of education than those in Mexico.

4.3.1 Mother's Education Level and Comprehended Words

To see the correlation between the mother's level of education and the child's vocabulary development, we make two types of visualization. First, we make a box plot. Then, we draw a correlation plot and calculate a correlation coefficient.

4.3.2 Mother's Education Level and Produced Words

The charts in sections 4.3.1 and 4.3.2 show that babies raised by mothers with a primary-level education in Spain understand and produce more words than those in Mexico (199.87 and 163.61 vs. 188.82 and 132.44, respectively). On the other hand, babies who grow up with mothers with a secondary school degree in Spain comprehend and produce fewer words than those in Mexico (188.59 and 154.49 vs. 254.24 and 254.24, respectively). The same pattern holds at the college level, with 182.27 and 152.65 words in Spain vs. 525.17 and 525.17 words in Mexico. It is worth noting that these findings might be biased because there are many more entries in the Mexico dataset than in the Spain dataset (391,419 vs. 169,446 for comprehension and 312,395 vs. 136,242 for production), as in the table below.

Comprehension and Production Counts by Language

A brief Overview of Case System of Moroccan Arabic, Spanish, and Quechua

Zenodo (CERN European Organization for Nuclear Research), Mar 26, 2023

“Case marking is one of the most important areas of linguistic typology and universals” (Croft 2003, p. 214). Case marking signifies the grammatical relationship between a noun or pronoun and other sentence elements. Languages mark cases in various ways, including through inflectional endings on nouns, pronouns, and adjectives; through prepositions or postpositions; through word order; or through a combination of these methods. Chapter 49 in the World Atlas of Language Structures (WALS) website focuses on the morphological case marking (Iggesen 2003). This essay discusses the case marking of Moroccan Arabic, Spanish, and Quechua (Imbabura).

Collocation and colligation analysis of the verb "wax"

Zenodo (CERN European Organization for Nuclear Research), Dec 23, 2022

Start to do VS Start doing: A Diachronic Corpus-based Analysis

Zenodo (CERN European Organization for Nuclear Research), Dec 23, 2022

Thai Script: The Romanization of Thai Script and Thai Magic Tattoos

Zenodo (CERN European Organization for Nuclear Research), Dec 14, 2022

Investigating the Relationship between Case Marking and Word Order in Languages: A Cross-Linguistic Analysis using WALS Data

Zenodo (CERN European Organization for Nuclear Research), May 27, 2023

A diachronic corpus-based analysis: The rise and fall of conjunctions for, as, and because

Zenodo (CERN European Organization for Nuclear Research), Dec 23, 2022

Overview of Case Marking of Moroccan Arabic, Spanish, and Quechua (Imbabura) in WALS (the World Atlas of Language Structures)

Zenodo (CERN European Organization for Nuclear Research), Apr 4, 2023

WALS Exercise (Chapter 49: Number of Cases)

"Case marking is one of the most important areas of linguistic typology and universals" (Croft 2003, p. 214). Case marking signifies the grammatical relationship between a noun or pronoun and other sentence elements. Languages mark cases in various ways, including through inflectional endings on nouns, pronouns, and adjectives; through prepositions or postpositions; through word order; or through a combination of these methods. Chapter 49 in the World Atlas of Language Structures (WALS) website focuses on the morphological case marking (Iggesen 2003). This essay discusses the case marking of Moroccan Arabic, Spanish, and Quechua (Imbabura).

In Moroccan Arabic, prepositions and particles are used extensively as case markers (Harrell 1962). Taha (1993) specifies that the language has five cases: nominative, accusative, genitive, dative, and locative. Beyond these basic grammatical categories, Fassi Fehri (2011) mentions that prepositions and particles in Moroccan Arabic can be used to convey a wide range of semantic relationships. Harrell (1962) exemplifies other uses: the preposition mā (with) shows instrumental relationships, while the particle min (from) indicates an ablative relationship. Reviewing the above reference grammar, I would assign the feature value 7 to Moroccan Arabic.

Similar to Moroccan Arabic, for Spanish, although Butt and Benjamin (2011) do not use the traditional labels, they identify five comparable cases: nominative, accusative, dative, genitive, and locative. Moreover, Bosque and Demonte (1999), De Bruyne, Pountain, and Kattán-Ibarra (2013), and Kattán-Ibarra and Howkins (2014) discuss markers that indicate instrumental and ablative cases like con (with) and de (from), respectively. With a total of 7 cases from this observation, I believe the assigned value of "No morphological case-marking" in WALS is inaccurate.

Quechua (Imbabura) has a complex case system. Quechua marks cases through suffixes on nouns and adjectives (Sánchez-Moreno 2019). Cole (1982) identifies eight cases: nominative, accusative, dative, genitive, ablative, allative, instrumental, and comitative, while Cerrón-Palomino (1994) and Adelaar and Muysken (2004) identify a ninth case: locative. Case markers in Quechua (Imbabura) include, for example, the suffix -wan (with) used for the instrumental case and the suffix -pi (from) used for the ablative case. In my opinion, the feature value "8-9 cases" assigned to Quechua (Imbabura) is accurate.

In conclusion, case marking plays a crucial role in indicating the grammatical relationship between a noun or pronoun and other sentence elements. In this essay, the case marking systems of Moroccan Arabic, Spanish, and Quechua (Imbabura) were discussed. Moroccan Arabic and Spanish have five cases each, and both languages use prepositions and particles to convey different semantic relationships. On the other hand, Quechua (Imbabura) has a complex case system, with eight to nine cases marked through suffixes on nouns and adjectives. WALS assigns values to the number of cases in a language, and the observations in this essay suggest that the values for Moroccan Arabic and Spanish need to be updated.

Coffee and Tea? A diachronic corpus-based, collocation and colligation analysis of the words coffee and tea

Zenodo (CERN European Organization for Nuclear Research), Mar 20, 2023

Coffee and Tea? A corpus-based collocation and colligation analysis of the words coffee and tea

Zenodo (CERN European Organization for Nuclear Research), Mar 20, 2023

Coffee and tea are among the most popular beverages in the world. Coffee and tea have been an integral part of many cultures around the world for centuries, and the words are used widely in everyday life. This paper aims to compare different methods to analyze the words coffee and tea in a corpus. The methods employed in this study are frequency analysis,

East Meets West: A collocation and colligation analysis of "vaccine" with the keywords: Pfizer, Moderna, Sinovac, and Novavax

Zenodo (CERN European Organization for Nuclear Research), Dec 14, 2022

The study of the Northern Thai dialect: Phonetic variations of Sao Wa sub-variant in Chiang Rai province

Zenodo (CERN European Organization for Nuclear Research), Apr 10, 2023

Coffee and Tea? A comparison of different methods in corpus-based natural language processing on "coffee" and "tea"

Zenodo (CERN European Organization for Nuclear Research), Mar 20, 2023

Coffee and tea have been an integral part of many cultures around the world for centuries, and the words are used widely in everyday life. This paper aims to compare different methods to analyze the words coffee and tea in a corpus. The methods employed in this study are frequency analysis,

"What did they say?" Network analysis of Twitter quotes @JoeBiden and @realDonaldTrump during 2020 United States presidential election (second debate)

Zenodo (CERN European Organization for Nuclear Research), Jan 17, 2023