Joe McFall - Academia.edu
University of Illinois at Urbana-Champaign
Papers by Joe McFall
4.11 Multiple two-tailed t-tests comparing equivalence classes by group, showing p-values (probability that groups are not different), using Bonferroni correction to control for multiple tests. p<0.
MontyLingua (Liu 2004) is a freeware natural language processing package written in Python and also supplied as a Java archive (.jar file). This document tells you how to compile MontyLingua into a .NET DLL file and call it from C# programs. This process relies on IKVM, a freeware Java-to-.NET conversion utility.
* We want to measure lexical diversity, which reflects the size of a writer's or speaker's vocabulary and the variety of subject matter in a text. (Useful for stylometry and for psycholinguistic research, e.g., tracking topic drift in schizophrenia.)
* Type-token ratio (TTR) = vocabulary size ÷ length of document
* TTR is not a good measure of lexical diversity because it is always lower with longer documents.
* Truncating all texts to a fixed length won't do because the introduction of a long text is not comparable to the whole of a short text.
* There are formulas to "correct" TTR for document length (Herdan 1960, 1966; Carroll 1964; Guiraud 1959; etc.), but Hess et al. (1986, 1989) show that none of them work as intended.
* Vocabulary size vs. length of text examined could be modeled with a parameter (Yule 1944; Malvern & Richards 2002; Tuldava 1995; Panas 2001; etc.) or plotted as a cumulative function (Youmans 1991), but these approaches bring in statistical assumptions and are ...
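The length dependence of TTR described above can be seen in a minimal Python sketch. The toy sentence and the tokenization (lowercasing and splitting on whitespace) are assumptions made for illustration only and are not taken from the slides.

```python
def type_token_ratio(tokens):
    """Return vocabulary size divided by text length (V/N)."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


# Hypothetical toy text; whitespace tokenization is an assumption.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.lower().split()

# TTR of the first five tokens versus TTR of the whole text.
short_ttr = type_token_ratio(tokens[:5])
full_ttr = type_token_ratio(tokens)

print(f"first 5 tokens: {short_ttr:.3f}")
print(f"all {len(tokens)} tokens: {full_ttr:.3f}")
```

In this example the longer sample has the lower ratio, because common words such as "the" recur while the vocabulary grows more slowly than the token count.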
Journal of Quantitative Linguistics, 2010
Type-token ratio (TTR), or vocabulary size divided by text length (V/N), is a time-honoured but unsatisfactory measure of lexical diversity. The problem is that the TTR of a text sample is affected by its length. We present an algorithm for rapidly computing TTR through a moving window that is independent of text length, and we demonstrate that this measurement can detect changes within a text as well as differences between texts.
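One way to compute TTR through a moving window is sketched below in Python. This is an illustrative sketch under stated assumptions, not necessarily the algorithm published in the paper: the function names, the incremental counting strategy, and the choice to average the per-window values are all assumptions introduced here.

```python
from collections import Counter


def moving_window_ttr(tokens, window):
    """Yield the TTR of every run of `window` consecutive tokens."""
    if window <= 0 or window > len(tokens):
        raise ValueError("window must be between 1 and len(tokens)")
    # Count the types in the first window once.
    counts = Counter(tokens[:window])
    yield len(counts) / window
    # Slide the window one token at a time, updating counts incrementally
    # instead of recounting the whole window.
    for i in range(window, len(tokens)):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]  # type no longer present in the window
        counts[tokens[i]] += 1
        yield len(counts) / window


def mean_windowed_ttr(tokens, window):
    """Average the per-window TTR values over the whole text (an assumption)."""
    values = list(moving_window_ttr(tokens, window))
    return sum(values) / len(values)
```

Because every window has the same length, the per-window values are comparable across texts of different sizes, and the sequence of values can be inspected to look for changes within a single text.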
This thesis explores the field of natural language grammar induction as applied to psycholinguistic comparison. It specifically concentrates on one algorithm, the ADIOS algorithm. After a discussion of language, grammar and grammar induction methods, it ...