Bryan Jurish | Berlin-Brandenburg Academy of Sciences and Humanities
Thesis by Bryan Jurish
This work addresses issues in the automatic preprocessing of historical German input text for use by conventional natural language processing techniques. Conventional techniques cannot adequately account for historical input text due to conventional tools' reliance on a fixed application-specific lexicon keyed by contemporary orthographic surface form on the one hand, and the lack of consistent orthographic conventions in historical input text on the other. Historical spelling variation is treated here as an error-correction problem or "canonicalization" task: an attempt to automatically assign each (historical) input word a unique extant canonical cognate, thus allowing direct application-specific processing (tagging, parsing, etc.) of the returned canonical forms without need for any additional application-specific modifications. In the course of the work, various methods for automatic canonicalization are investigated and empirically evaluated, including conflation by phonetic identity, conflation by lemma instantiation heuristics, canonicalization by weighted finite-state rewrite cascade, and token-wise disambiguation by a dynamic Hidden Markov Model.
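A toy sketch of the conflation idea: the rewrite rules and example words below are invented for exposition only; the systems described in the thesis use weighted finite-state rewrite cascades whose candidate outputs are validated against a contemporary lexicon.

```python
# Toy canonicalizer for historical German spelling variants.
# The rules below are illustrative assumptions, not the thesis's actual cascade.
RULES = [
    ("th", "t"),   # e.g. "Thier" -> "Tier"
    ("ey", "ei"),  # e.g. "seyn" -> "sein"
    ("ay", "ai"),  # hypothetical further variant pattern
]

def canonicalize(word: str) -> str:
    """Apply each unweighted rewrite rule left to right; a weighted
    cascade would instead rank all rewrite paths by total weight."""
    out = word.lower()
    for old, new in RULES:
        out = out.replace(old, new)
    return out

print(canonicalize("seyn"))   # -> sein
print(canonicalize("Thier"))  # -> tier
```

In the pipeline the abstract describes, the returned canonical forms would then be passed directly to downstream taggers or parsers.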
Papers by Bryan Jurish
I hereby declare that I have not opened doctoral proceedings at any other institution of higher education. I further declare that this dissertation was written independently and without impermissible assistance from third parties, that only the aids specified in the dissertation were used in its preparation, and that all passages adopted verbatim or in substance from other sources have been marked as such. Furthermore, I declare that the dissertation, in its present or any other version, has not been submitted for assessment to any other faculty of a scientific institution of higher education within the framework of doctoral proceedings. Bryan Jurish, Berlin, January 2011
We present a simple and effective approach to the task of grapheme-to-phoneme conversion based on a set of manually edited grapheme-phoneme mappings, which drives not only the alignment of words with their corresponding pronunciations, but also the segmentation of words during model training and application. The actual conversion is performed with the help of a conditional random field model, after which a language model selects the most likely string of grapheme-phoneme segment pairs from the set of hypotheses. We evaluate our approach by comparing it to a state-of-the-art joint sequence model with respect to two different datasets of contemporary German and one of contemporary English.
J. Lang. Technol. Comput. Linguistics, 2013
We present a novel method (“waste”) for the segmentation of text into tokens and sentences. Our approach makes use of a Hidden Markov Model for the detection of segment boundaries. Model parameters can be estimated from pre-segmented text which is widely available in the form of treebanks or aligned multi-lingual corpora. We formally define the waste boundary detection model and evaluate the system’s performance on corpora from various languages as well as a small corpus of computer-mediated communication.
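The boundary-detection idea can be sketched as a tiny two-state HMM decoded with Viterbi. All states, features, and probabilities below are invented for illustration; the actual waste model uses a much richer feature set estimated from pre-segmented training text.

```python
import math

# Toy two-state HMM: "S" = sentence boundary after this token, "N" = none.
# All probabilities and the coarse "stop"/"word" features are invented.
STATES = ("S", "N")
START = {"S": 0.1, "N": 0.9}
TRANS = {"S": {"S": 0.1, "N": 0.9}, "N": {"S": 0.2, "N": 0.8}}
EMIT = {"S": {"stop": 0.9, "word": 0.1}, "N": {"stop": 0.05, "word": 0.95}}

def viterbi(obs):
    """Most likely boundary-state sequence for a list of observations."""
    V = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            col[s] = V[-1][best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    path = [max(STATES, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

tokens = ["Dogs", "bark", ".", "Cats", "purr", "."]
obs = ["stop" if t == "." else "word" for t in tokens]
print(viterbi(obs))  # boundary state after each "."
```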
Much research has been devoted to the task of learning lexical classes from unannotated input text. Among the chief difficulties facing any approach to the unsupervised induction of lexical classes are those of token-level ambiguity and the classification of rare and unknown words. Following the work of previous authors, the initial stage of syntactic category induction is treated in the current approach as a clustering problem over a small number of highly frequent word types. An iterative procedure making use of Zipf’s law to generate the clustering schedule classifies less frequent words based on the monotonic Bernoulli entropy of expected co-occurrence probability with respect to the clusters output by the previous stage, employing a fuzzy cluster membership heuristic to approximate type-level ambiguity and reduce error propagation in a simulated melting procedure. In a second processing phase, cluster membership probabilities output by the final clustering stage are used in a pr...
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon indexed by orthographic form. Canonicalization approaches seek to address these issues by assigning an extant equivalent to each word of the input text and deferring application analysis to these canonical forms. Quantitative evaluation of canonicalization techniques in terms of precision and recall requires reference to a ground-truth corpus in which the canonical form for each corpus token has been manually verified, but such manually annotated corpora are difficult to come by and in general both costly and time-consuming to create. In this paper, we describe a method for bootstrapping a ground-truth canonicalization corpus with minimal manual annotation effort by means of automatic alignment of historical texts with curr...
This paper outlines some potential applications of the open-source software tool “DiaCollo” to multi-genre diachronic corpora. Explicitly developed for the efficient extraction, comparison, and interactive visualization of collocations from a diachronic text corpus, DiaCollo – unlike conventional collocation extractors – is suitable for processing collocation pairs whose association strength depends on the date of their occurrence. By tracking changes in a word’s typical collocates over time, DiaCollo can help to provide a clearer picture of diachronic changes in the word’s usage, especially those related to semantic shift or discourse environment. Use of the flexible DDC search engine back-end allows user queries to make explicit reference to genre and other document-level metadata, thus allowing e.g. independent genre-local profiles or cross-genre comparisons. In addition to traditional static tabular display formats, a web-service plugin also offers a number of intuitive interact...
As a tinker of algorithms, a tweaker of data structures, and a dyed-in-the-wool Platonist, I am committed to the (objective) existence of mathematical entities such as numbers and the relations between them. I nonetheless follow Christiane Birr in her skepticism regarding Anderson's (2008) blithe assertion that "given enough data, the numbers speak for themselves". Numbers seldom lie per se, but neither are they renowned for their loquacity. The traditional distinction between deductive truths about formal objects and inductive interpretations of empirical data acquired from a data sample such as a text corpus is useful here. Computers are well-suited to deductive tasks involving counting and other numerical manipulations, and are quite reliable for such purposes. They are not very adept at e.g. deciding what to count (corpus selection) or drawing (creative, interpretative) conclusions based on (numerical) data; as Silke Schwandt suggested: "I cannot find what I d...
This article presents some applications of the open-source software tool DiaCollo for historical research. Developed in a cooperation between computational linguists and historians within the framework of CLARIN-D’s discipline-specific working groups, DiaCollo can be used to explore and visualize diachronic collocation phenomena in large text corpora. In this paper, we briefly discuss the constitution and aims of the CLARIN-D discipline-specific working groups, and then introduce and demonstrate DiaCollo in more detail from a user perspective, providing concrete examples from the newspaper “Die Grenzboten” (“messengers from the borders”) and other historical text corpora. Our goal is to demonstrate the utility of the software tool for historical research, and to raise awareness regarding the need for well-curated data and solutions for specific scientific interests.
Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for conventional search architectures which typically rely on a static inverted index keyed by orthographic form. Additional steps must therefore be taken in order to improve recall, in particular for single-term bareword queries from non-expert users. This paper describes the query processing architecture currently employed for full-text search of the historical German document collection of the Deutsches Textarchiv project.
This chapter presents the formal basis for diachronic collocation profiling as implemented in the open-source software tool “DiaCollo” and sketches some potential applications to multi-genre diachronic corpora. Explicitly developed for the efficient extraction, comparison, and interactive visualization of collocations from a diachronic text corpus, DiaCollo is suitable for processing collocation pairs whose association strength depends on extralinguistic features such as the date of occurrence or text genre. By tracking changes in a word’s typical collocates over time, DiaCollo can help to provide a clearer picture of diachronic changes in the word’s usage, especially those related to semantic shift or discourse environment. Use of the flexible DDC search engine back-end allows user queries to make explicit reference to genre and other document-level metadata, thus allowing e.g. independent genre-local profiles or cross-genre comparisons. In addition to traditional static tabular d...
We investigate the composition of finite-state automata in a multiprocessor environment, presenting a parallel variant of a widely-used composition algorithm. We provide an approximate upper bound for composition speedup of the parallel variant with respect to serial execution, and empirically evaluate the performance of our implementation with respect to this bound.
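The paper derives its own composition-specific bound; purely as a generic point of reference, the familiar Amdahl form of such speedup bounds can be sketched as follows (the numbers below are illustrative, not the paper's results):

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup is limited by the serial fraction of the work.
    Generic bound only, not the composition-specific bound of the paper."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# Even with 8 workers, a 10% serial share caps speedup well below 8x.
print(round(amdahl_speedup(0.9, 8), 2))  # -> 4.71
```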
This paper presents DiaCollo, a software tool developed in the context of CLARIN for the efficient extraction, comparison, and interactive visualization of collocations from a diachronic text corpus. Unlike conventional collocation extractors, DiaCollo is suitable for extraction and analysis of diachronic collocation data, i.e. collocate pairs whose association strength depends on the date of their occurrence. By tracking changes in a word’s typical collocates over time, DiaCollo can help to provide a clearer picture of diachronic changes in the word’s usage, in particular those related to semantic shift. Beyond the domain of linguistics, DiaCollo profiles can be used to provide humanities researchers with an overview of the discourse topics commonly associated with a particular query term and their variation over time or corpus subset, while comparison or “diff” profiles highlight the most prominent differences between two independent target queries. In addition to traditiona...
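The core idea, scoring collocate pairs separately per date slice, can be sketched as below. The corpus, the 100-year slices, and the raw-frequency ranking are toy assumptions for illustration; DiaCollo's actual data model and association measures differ.

```python
from collections import Counter

# Toy diachronic collocation profile: collocates are counted separately
# per date slice. Corpus, slice size, and raw-frequency ranking are
# illustrative assumptions, not DiaCollo's actual data model or measures.
corpus = [
    (1850, ["iron", "horse", "railway"]),
    (1850, ["iron", "horse", "steam"]),
    (1950, ["iron", "curtain", "cold"]),
    (1950, ["iron", "curtain", "war"]),
]

def profile(target, slice_size=100):
    """Top collocate of `target` in each `slice_size`-year epoch."""
    scores = {}
    for date, tokens in corpus:
        if target in tokens:
            epoch = date // slice_size * slice_size
            pairs = scores.setdefault(epoch, Counter())
            for w in tokens:
                if w != target:
                    pairs[w] += 1
    return {epoch: c.most_common(1)[0] for epoch, c in scores.items()}

print(profile("iron"))  # "horse" dominates the 1800s slice, "curtain" the 1900s
```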
Virtually all conventional text-based natural language processing techniques – from traditional information retrieval systems to full-fledged parsers – require reference to a fixed lexicon accessed by surface form, typically trained from or constructed for synchronic input text adhering strictly to contemporary orthographic conventions. Unorthodox input such as historical text which violates these conventions therefore presents difficulties for any such system due to lexical variants present in the input but missing from the application lexicon. Canonicalization approaches (Rayson et al., 2005; Jurish, 2012; Porta et al., 2013) seek to address these issues by assigning an extant equivalent to each word of the input text and deferring application analysis to these canonical cognates. Traditional approaches to the problems arising from an attempt to incorporate historical text into such a system rely on the use of additional specialized (often application-specific) lexical resources t...
Part-of-Speech (PoS) Tagging – the automatic annotation of lexical categories – is a widely used early stage of linguistic text analysis. One approach, rule-based morphological analysis, employs linguistic knowledge in the form of hand-coded rules to derive a set of possible analyses for each input token, but is known to produce highly ambiguous results. Stochastic tagging techniques such as Hidden Markov Models (HMMs) make use of both lexical and bigram probabilities estimated from a tagged training corpus in order to compute the most likely PoS tag sequence for each input sentence, but provide no allowance for prior linguistic knowledge. In this report, I describe the dwdst PoS tagging library, which makes use of a rule-based morphological component to extend traditional HMM techniques by the inclusion of lexical class probabilities and theoretically motivated search space reduction.
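The search-space reduction described above can be sketched as follows: a morphological component licenses a set of candidate tags per token, and the stochastic component scores only sequences drawn from those sets. The lexicon, tag inventory, and bigram scores below are invented toy values, not dwdst's actual model.

```python
from itertools import product

# Sketch of HMM tagging with morphological search-space reduction.
# Lexicon, tagset, and bigram scores are invented toy values; a real
# system would use Viterbi decoding rather than brute-force enumeration.
MORPH = {
    "the": {"DET"},
    "can": {"AUX", "NOUN", "VERB"},
    "rusts": {"NOUN", "VERB"},
}
BIGRAM = {  # toy transition scores P(t2 | t1)
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.05, ("DET", "AUX"): 0.05,
    ("NOUN", "NOUN"): 0.2, ("NOUN", "VERB"): 0.6, ("AUX", "NOUN"): 0.1,
    ("AUX", "VERB"): 0.7, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.1,
}

def tag(tokens):
    """Score only tag sequences licensed by the morphological component."""
    candidates = [sorted(MORPH.get(t, {"NOUN"})) for t in tokens]

    def score(seq):
        p = 1.0
        for a, b in zip(seq, seq[1:]):
            p *= BIGRAM.get((a, b), 1e-6)
        return p

    return max(product(*candidates), key=score)

print(tag(["the", "can", "rusts"]))  # -> ('DET', 'NOUN', 'VERB')
```

Restricting the lattice to morphologically licensed tags shrinks the search space from the full tagset cross-product to just the ambiguous candidates, which is the intuition behind the library's approach.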
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a fixed lexicon accessed by orthographic form, such as information retrieval systems (Sokirko, 2003; Cafarella and Cutting, 2004), part-of-speech taggers (DeRose, 1988; Brill, 1992; Schmid, 1994), simple word stemmers (Lovins, 1968; Porter, 1980), or more sophisticated morphological analyzers (Geyken and Hanneforth, 2006; Zielinski et al., 2009). Traditional approaches to the problems arising from an attempt to incorporate historical text into such a system rely on the use of additional specialized (often application-specific) lexical resources to explicitly encode known historical variants. Such specialized lexica are not only costly and time-consuming to create, but also – in their simplest form of static finite word lists – necessarily ...
The main focus of this paper is the characterization of generic musical structure in terms of the apparatus of formal language theory. It is argued that musical competence falls into the same class as natural language with respect to strong generative capacity – the class of mildly context-sensitive languages described by Joshi (1985).
Digitale Infrastrukturen für die germanistische Forschung, Jul 9, 2018
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon accessed by orthographic form. In this paper, we present three methods for associating unknown historical word forms with synchronically active canonical cognates and evaluate their performance on an information retrieval task over a manually annotated corpus of historical German verse.