Sarah Schulz - Academia.edu
Papers by Sarah Schulz
The aim of this tutorial is to give participants concrete, practical insights into a standard case of automatic text analysis. Using the example of automatic recognition of entity references, we will discuss general assumptions, procedures, and methodological standards in machine learning. Participants can explore and test the scope of such procedures by editing executable programming code.
Part-of-speech (PoS) tagging is a fundamental step for many tasks in natural language processing (NLP). A PoS tagger annotates each word, in the context of its sentence, with its part of speech, drawn from a fixed inventory of categories (the tagset). Most existing work concentrates on English, and a comparable amount of data is also available for New High German. Historical language stages, by contrast, pose a challenge for NLP tasks such as PoS tagging: they have no standard language and exist only as a range of dialectal varieties, and they were not written down according to uniform rules. The result is high variance, which makes it difficult to annotate a sufficient amount of reference data. In this contribution we present a PoS tagger for Middle High German that was trained on a thematically broad and diachronic corpus. As the tagset we use an inventory of 17 universal part-of-speech categories (Universal Dependency tagset, Nivre et al. 2016). With the annotated data we develop a freely available model for the TreeTagger (Schmid 1995). We compare three ways of training the PoS tagger: first with a small, manually annotated training set, then with a small, automatically disambiguated training set, and finally with the maximum amount of available data. With this tagger we aim not only to fill a "gap in the market" (so far there is no freely usable PoS tagger for Middle High German), but also to achieve the widest possible applicability to Middle High German texts of different genres, centuries, and regional varieties, and to pave the way for further work on Middle High German texts.
The structure of the Digital Humanities master’s program at the University of Stuttgart is characterized by a large proportion of classes related to natural language processing. In this paper, we discuss the motivation for this design and the associated challenges that students and teachers face. To provide background information, we also summarize our underlying perspective on Digital Humanities. Our discussion is driven by a qualitative analysis of a survey given to the students of the program.
ACM Transactions on Intelligent Systems and Technology, 2016
As social media constitutes a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS, SMS, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.
In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluating text normalization approaches. Combining text messages, message board posts, and tweets, these datasets represent a variety of user-generated content. All data was manually normalized to its standard form using newly developed guidelines. We perform automatic lexical normalization experiments on these datasets using statistical machine translation techniques. We focus on both the word and character level and find that we can improve the BLEU score by ca. 20% for both languages. Before this user-generated content can be released publicly to the research community, some issues first need to be resolved. We discuss these in closer detail by focusing on the current legislation and by investigating previous similar data collection projects. With this discussion we hope to shed some light on various difficulties researchers face when trying to share social media data.
ArXiv, 2017
We characterize three notions of explainable AI that cut across research fields: opaque systems that offer no insight into their algorithmic mechanisms; interpretable systems where users can mathematically analyze their algorithmic mechanisms; and comprehensible systems that emit symbols enabling user-driven explanations of how a conclusion is reached. The paper is motivated by a corpus analysis of NIPS, ACL, COGSCI, and ICCV/ECCV paper titles showing differences in how work on explainable AI is positioned in various fields. We close by introducing a fourth notion: truly explainable systems, where automated reasoning is central to producing crafted explanations without requiring human postprocessing as the final step of the generative process.
In this paper, we report on the creation of a web corpus for the variety of German spoken in South Tyrol. We thereby provide an example of compiling a corpus for a language variety that has neighboring varieties and for which the content on the internet is both sparse and published under various top-level domains. We discuss how we tackled the task of finding a balance between data quantity and quality. Our aim was twofold: to create a web corpus that is diverse in terms of text types and highly representative of South Tyrolean German. We present our procedure for collecting relevant texts and an approach to enhance diversity by detecting and filling gaps in a corpus.