Hermann Moisl | Newcastle University
Papers by Hermann Moisl
Crossing Boundaries. Interdisciplinary approaches to art, material culture, language, and literature of the early medieval world, 2017
The argument is that the 'Flodibor rex Francorum' cited in the early medieval Irish annals for the year 658 is the Merovingian king Clovis II.
De Gruyter, 2015
The standard scientific methodology in linguistics is empirical testing of falsifiable hypotheses. As such the process of hypothesis generation is central, and involves formulation of a research question about a domain of interest and statement of a hypothesis relative to it. In corpus linguistics the domain is text, and generation involves abstraction of data from text, data analysis, and formulation of a hypothesis based on inference from the results. Traditionally this process has been paper-based, but the advent of electronic text has increasingly rendered it obsolete both because the size of digital corpora is now at or beyond the limit of what can efficiently be used in the traditional way, and because the complexity of data abstracted from them can be impenetrable to understanding. Linguists are increasingly turning to mathematical and statistical computational methods for help, and cluster analysis is such a method. It is used across the sciences for hypothesis generation by identification of structure in data which are too large or complex, or both, to be interpretable by direct inspection. This book aims to show how cluster analysis can be used for hypothesis generation in corpus linguistics, thereby contributing to a quantitative empirical methodology for the discipline.
Aggregating Dialectology, Typology, and Register Analysis, ed. Szmrecsanyi, Benedikt and Wälchli, Bernhard, 2014
The Diachronic Electronic Corpus of Tyneside English (DECTE) is a naturalistic
spoken corpus of interviews with residents of Tyneside and surrounding areas of North East
England. It updates the earlier Newcastle Electronic Corpus of Tyneside English (NECTE),
which combined two sub-corpora dating from the late 1960s and mid 1990s, and supplements
these with materials from an ongoing monitor corpus established in 2007. The first part of this
paper outlines the background and development of the DECTE project. It then reviews
research that has already been conducted on the corpus, comparing the different feature-based
and aggregate analyses that have been employed. In doing so, we hope to highlight the crucial
role that aggregate methods, such as hierarchical cluster analysis, can have in identifying and
explaining the parameters that underpin aspects of language variation, and to demonstrate that
such methods can and do work well in combination with feature-centric approaches.
Methods and Applications of Quantitative Linguistics, edited by Ivan Obradović, Emmerich Kelih and Reinhard Köhler, University of Belgrade, 2013
Most science and engineering disciplines recognize that application of linear analytical methods to data containing nonlinearities can distort results, and in response have developed mathematically and statistically based methods for dealing with nonlinearity. In linguistics, however, there has thus far been little recognition of the possibility that there might be nonlinearity in data abstracted from speech and text corpora or, where found, what the implications for analysis are. The present paper addresses this issue in three main parts. The first part outlines the nature of data nonlinearity, the second reviews existing methods for detection of nonlinearity and proposes a way of measuring nonlinear relationships between data objects, and, using these methods, the third identifies and quantifies the degree of nonlinearity present in data abstracted from the Diachronic Electronic Corpus of Tyneside English, a dialect speech corpus.
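A minimal illustration of the general idea of detecting and quantifying nonlinearity in a pair of variables (a generic proxy, not the measure proposed in the paper; the function name and parameters are illustrative): compare how well a straight line fits the data with how well a low-order polynomial does, and treat the gain in explained variance as a nonlinearity score.

```python
import numpy as np

def nonlinearity_index(x, y, degree=3):
    """Crude nonlinearity indicator: gain in R^2 when moving from a linear
    fit of y on x to a degree-`degree` polynomial fit. Values near 0 suggest
    the relationship is essentially linear; larger values suggest nonlinearity."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    def r_squared(deg):
        coeffs = np.polyfit(x, y, deg)
        residuals = y - np.polyval(coeffs, x)
        return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

    return r_squared(degree) - r_squared(1)

# A quadratic relationship scores high, a linear one scores near zero.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
print(nonlinearity_index(x, x ** 2 + 0.01 * rng.normal(size=200)))  # large
print(nonlinearity_index(x, 2 * x + 0.01 * rng.normal(size=200)))   # ~0
```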
Synergetic Linguistics. Text and Language as Dynamic Systems. To Reinhard Köhler on the Occasion of his 60th Birthday, ed. S. Naumann, P. Grzybek, R. Vulanovic, G. Altmann, 2012
The Newcastle Electronic Corpus of Tyneside English (Necte) is a sample of dialect speech from Tyneside in North-East England (Corrigan et al. 2006; Allen et al. 2007). Jones-Sargent (1983), Moisl and Jones (2005), and Moisl, Maguire and Allen (2006) used cluster analysis to show that the speakers who constitute the earlier of the two chronological strata in the corpus fall into distinct groups defined by relative frequency of usage of phonetic segments, and Moisl and Maguire (2008) went on to identify the main phonetic determinants of that grouping by comparing cluster centroids. The present discussion develops these findings by constructing a map which comprehensively describes the pattern of phonetic variation across the Necte speakers, and, in combination with the earlier studies just cited, is intended as a contribution to a methodology for corpus-based mathematical and statistical study of language variation.
The discussion is in two main parts: the first part briefly describes Necte, the second constructs the phonetic variation map.
Newcastle University, 2013
Welcome to the Diachronic Electronic Corpus of Tyneside English (DECTE), a corpus of dialect speech from the Tyneside area of North-East England.
DECTE is an amalgamation of the existing Newcastle Electronic Corpus of Tyneside English (NECTE) created between 2001 and 2005 (http://research.ncl.ac.uk/necte), and NECTE2, a collection of interviews conducted in the Tyneside area since 2007. It thereby constitutes a rare example of a publicly available on-line corpus presenting dialect material spanning five decades.
The present website is designed for research use. DECTE also, however, includes an interactive website, The Talk of the Toon, which integrates topics and narratives of regional cultural significance in the corpus with relevant still and moving images, and which is designed primarily for use in schools and museums and by the general public.
Journal of Quantitative Linguistics 18, 23-52, 2011
Cluster analysis has long been used across a wide range of science and
engineering disciplines as a way of identifying interesting structure in data
(refs). The advent of digital electronic natural language text has seen its
application in text-oriented disciplines like information retrieval (refs) and data
mining (refs) and, increasingly, in corpus-based linguistics (refs). In all these
domains, the reliability of cluster analytical results is contingent both on the
nature of the particular clustering algorithm being used and on the
characteristics of the data being analyzed, where 'reliability' is understood as
the extent to which the result identifies structure which really is present in the
domain from which the data was abstracted, given some well defined sense of
'really present'. The present discussion focuses on how the reliability of
cluster analysis can be compromised by one particular characteristic of data
abstracted from natural language corpora.
The characteristic in question arises when the aim is to cluster a collection of
length-varying documents based on the frequency of occurrence of one or
more linguistic or textual features; examples are (refs). Because longer
documents are, in general, likely to contain more examples of the feature or
features of interest than shorter ones, the frequencies of the data variables
representing those features will be numerically greater for the longer
documents than for the shorter ones, which in turn leads one to expect that
the documents will cluster in accordance with relative length rather than with
some more interesting criterion latent in the data; this expectation has been
empirically confirmed (refs). The solution is to eliminate relative document
length as a factor in clustering by adjusting the data frequencies using a
length normalization method such as cosine normalization, which is
extensively used in information retrieval for precisely this purpose (refs). This
solution is not a panacea, however. One or more documents in the collection
might be too short to provide accurate population probability estimates for the
data variables, and, because length normalization methods exacerbate such
inaccuracies, the result would be that analysis based on the normalized data
inaccurately clusters the documents in question.
The present discussion proposes a way of dealing with short documents in
clustering of length-varying multi-document corpora: that a threshold length
for acceptably accurate variable probability estimation be defined, and that all
documents shorter than that threshold be eliminated from the analysis. The
discussion is in three main parts. The first part outlines the nature of the problem
in detail, the second develops a method for determining a minimum document
length threshold, and the third exemplifies the application of that method to an
actual corpus.
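A minimal sketch of the cosine length normalization mentioned above, as it is standardly used in information retrieval (the function and variable names are illustrative, not taken from the paper): each document's row of feature frequencies is divided by its Euclidean norm, so that clustering reflects the relative proportions of features rather than raw counts driven by document length.

```python
import numpy as np

def cosine_normalize(freq_matrix):
    """Scale each row (one document's feature-frequency vector) to unit
    Euclidean length, removing relative document length as a clustering factor."""
    F = np.asarray(freq_matrix, dtype=float)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return F / norms

# Two documents with the same feature proportions but different lengths
# become identical after normalization.
docs = np.array([[10, 20, 30],
                 [ 1,  2,  3]])
print(cosine_normalize(docs))
```

As the abstract notes, normalization of this kind rescales but cannot repair frequency estimates from documents that are too short to be reliable, which is why a minimum length threshold is proposed.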
Handbook of Corpus Phonology, ed. J. Durand, U. Gut, G. Kristofferson, Oxford University Press, 2010
The aim of this chapter is to encourage corpus linguists to use quantitative
and more specifically statistical methods in analyzing large digital electronic
corpora, focussing in particular on cluster analysis. The first part of the
discussion motivates the use of cluster analysis in corpus linguistics, the
second gives an outline account of data creation and clustering with reference
to the Newcastle Electronic Corpus of Tyneside English, and the third is a
selective literature review.
Corpus Linguistics and Linguistic Theory 6, 75-103, 2010
Where the variables selected for cluster analysis of linguistic data are
measured on different numerical scales, those whose scales permit relatively
larger values can have a greater influence on clustering than those whose
scales restrict them to relatively smaller ones, and this can compromise the
reliability of the analysis. The first part of this discussion describes the nature
of that compromise. The second part argues that a widely used method for
removing disparity of variable scale, Z-standardization, is unsatisfactory for
cluster analysis because it eliminates differences in variability among
variables, thereby distorting the intrinsic cluster structure of the
unstandardized data, and instead proposes a standardization method based
on variable means which preserves these differences. The proposed mean-based
method is compared to several other alternatives to Z-standardization,
and is found to be superior to them in cluster analysis applications.
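A minimal sketch contrasting Z-standardization with a mean-based rescaling of the kind argued for above (the exact formulation in the paper may differ; the code is illustrative): Z-scores force every variable to unit variance, erasing differences in variability, whereas dividing each variable by its mean removes scale disparities while preserving those differences.

```python
import numpy as np

def z_standardize(X):
    """Subtract each column's mean and divide by its standard deviation:
    every variable ends up with variance 1, so differences in variability
    between variables are lost."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def mean_standardize(X):
    """Divide each column by its mean: scales become comparable, but a
    variable that varies more relative to its mean keeps that extra spread."""
    X = np.asarray(X, dtype=float)
    return X / X.mean(axis=0)

X = np.array([[ 98.0, 0.8],
              [102.0, 1.2],
              [100.0, 4.0]])
print(z_standardize(X).std(axis=0))     # [1.0, 1.0]: variability equalized
print(mean_standardize(X).std(axis=0))  # unequal: variability preserved
```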
Analysing Variation in English: What we know, what we don't, and why it matters, ed. A. McMahon & W. Maguire, Cambridge University Press, 2010
Traditionally, hypothesis generation based on linguistic corpora has involved the researcher listening to or reading through a corpus, often repeatedly, noting features of interest, and then formulating a hypothesis. The advent of information technology in general and of digital representation of text in particular in the past few decades has made this often-onerous process much easier via a range of computational tools, but, as the amount of digitally-represented language available to linguists has grown, a new problem has emerged: data overload. Actual and potential language corpora are growing ever-larger, and even now they can be on the limit of what the individual researcher can work through efficiently in the traditional way. Moreover, as we shall see, data abstracted from such large corpora can be impenetrable to understanding. One approach to the problem is to deal only with corpora of tractable size, or, equivalently, with tractable subsets of large corpora, but ignoring potential data in so unprincipled a way is not scientifically respectable. The alternative is to use mathematically-based computational tools for data exploration developed in the physical and social sciences, where data overload has long been a problem. This latter alternative is the one explored here. Specifically, the discussion shows how a particular type of computational tool, cluster analysis, can be used in the formulation of hypotheses in corpus-based linguistic research.
The discussion is in three main parts. The first describes data abstraction from corpora, the second outlines the principles of cluster analysis, and the third shows how the results of cluster analysis can be used in the formulation of hypotheses. Examples are based on the Newcastle Electronic Corpus of Tyneside English (NECTE), a corpus of dialect speech (Allen et al. 2007). The overall approach is introductory, and as such the aim has been to make the material accessible to as broad a readership as possible.
Durand, J., Gut, U. & Kristofferson, G., (ed.) Handbook of Corpus Phonology, Oxford: Oxford University Press, 2010
This chapter describes the construction of the Newcastle Electronic Corpus of Tyneside
English (NECTE), a legacy corpus based on data collected for two sociolinguistic
surveys conducted on Tyneside in the north-east of England in c.1969 and 1994,
respectively. It focusses on transcription issues relevant for addressing research
questions in phonetics/phonology. There is also discussion of the rationale for the text
encoding systems adopted in the corpus construction phase as well as the
dissemination strategy employed since completion in 2005.
M. Dossena & R. Lass, (ed.) Studies in English and European Historical Dialectology, Bern:Peter Lang, 2009
The proliferation of computational technology has generated an
explosive production of electronically encoded information of all
kinds. In the face of this, traditional philological methods for search
and interpretation of data have been overwhelmed by volume, and a
variety of computational methods have been developed in an attempt
to make the deluge tractable. These developments have clear
implications for corpus-based linguistics in general, and for corpus-based
study of historical dialectology in particular: as more and larger
historical text corpora become available, effective analysis of them
will increasingly be tractable only by adapting the interpretative
methods developed by the statistical (Hair et al. 2005; Tabachnick &
Fidell 2006), information retrieval (Belew 2000; Grossman & Frieder
2004), pattern recognition (Bishop 2006), and related communities.
To use such analytical methods effectively, however, issues that arise
with respect to the abstraction of data from corpora have to be
understood. This paper addresses an issue that has a fundamental
bearing on the validity of analytical results based on such data:
variation in document length. The discussion is in four main parts.
The first part shows how a particular class of computational methods,
exploratory multivariate analysis, can be used in historical
dialectology research, the second explains why variation in document
length can be a problem in such analysis, the third proposes document
length normalization as a solution to that problem, and the fourth
points out some difficulties associated with document length
normalization.
Lüdeling A., Kytö M., (ed.) Corpus Linguistics. An International Handbook, Berlin: Mouton de Gruyter, 874-99, 2009
The present chapter deals with one type of analytical
tool: exploratory multivariate analysis. The discussion is
in six main parts. The first part is the present
introduction, the second explains what is meant by
exploratory multivariate analysis, the third discusses the
characteristics of data and the implications of these
characteristics for generation and interpretation of
analytical results, the fourth gives an overview of the
various exploratory analytical methods currently
available, the fifth reviews the application of exploratory
multivariate analysis in corpus linguistics, and the sixth
is a select bibliography. The material is presented in an
intuitively accessible way, avoiding formalisms as much
as possible. However, in order to work with multivariate
analytical methods some background in mathematics and
ACM Transactions on Asian Language Information Processing, 2009
Thabet [2005] applied cluster analysis to the Qur’an in the hope of generating a classification of the chapters (suras) that is useful for understanding its thematic structure. The result was positive, but variation in sura length was a problem because clustering of the shorter suras was found to be unreliable. The present discussion addresses this problem in four parts. The first part summarizes Thabet’s work. The second part argues that unreliable clustering of the shorter suras is a consequence of poor estimation of lexical population probabilities in those suras. The third part proposes a solution to the problem based on calculation of a minimum length threshold using concepts from statistical sampling theory, followed by selection of suras and lexical variables based on that threshold. The fourth part applies the proposed solution to a reanalysis of the Qur’an.
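A minimal sketch of the kind of sampling-theory calculation alluded to above (the specific formula and parameter values used in the paper are not reproduced here; these are illustrative): the classical sample-size formula for estimating a proportion gives a minimum token count below which frequency estimates for a rare lexical variable cannot be expected to be reliable, and documents shorter than that count would be excluded.

```python
import math

def min_document_length(p_min, error, z=1.96):
    """Minimum number of tokens needed to estimate a feature whose true
    population probability is about p_min to within +/- error at roughly
    95% confidence (z = 1.96), using the standard sample-size formula
    n = z^2 * p * (1 - p) / error^2 from sampling theory."""
    return math.ceil(z ** 2 * p_min * (1 - p_min) / error ** 2)

# E.g. to estimate a lexical variable occurring about once per 200 tokens
# (p = 0.005) to within +/- 0.0025:
print(min_document_length(0.005, 0.0025))  # 3058 tokens
```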
Proceedings of INFOS2008: 6th International Conference on Informatics and Systems, Cairo University, 27-29 March 2008, 2008
The advent of large electronic text corpora has generated a range of technologies for their
search and interpretation. Variation in document length can be a problem for these technologies, and
several normalization methods for mitigating its effects have been proposed. This paper assesses the
effectiveness of such methods in specific relation to exploratory multivariate analysis. The discussion
is in four main parts. The first part states the problem, the second describes some normalization
methods, the third identifies poor estimation of the population probability of variables as a factor that
compromises the effectiveness of the normalization methods for very short documents, and the fourth
proposes elimination of data matrix rows representing documents which are too short to be reliably
normalized and suggests ways of identifying those documents.
Journal of Quantitative Linguistics 15, 46-69, 2008
The Newcastle Electronic Corpus of Tyneside English is a corpus of dialect speech from North-East England. It includes phonetic transcriptions of 63 interviews together with social data relating to each interviewee, and offers an opportunity to study the sociophonetics of Tyneside speech of the late 1960s. In a previous paper we began that study with an exploratory multivariate analysis of the transcriptions. The results were that speakers fell into clearly defined groups on the basis of their phonetic usage, and that these groups correlated well with social characteristics associated with the speakers. The present paper develops these results by trying to identify the main phonetic determinants of the speaker groups.
Tsiplakou, S., Karyolemu, M., Pavlou, P. (ed.) Language Variation. European Perspectives, Amsterdam: John Benjamins, 169-178, 2008
This paper addresses an issue that has a fundamental bearing on the
validity of analytical results based on data abstracted from corpora: sparsity. The discussion is
in three main parts. The first part shows how a particular class of
computational methods, exploratory multivariate analysis, can be used in
language variation research, the second explains why data sparsity can be a
problem in such analysis, and the third outlines some solutions.
Beal, J., Corrigan, K., Moisl, H., (ed.) Creating and Digitizing Language Corpora: Synchronic Databases, Palgrave Macmillan, 1-16, 2007
Six of the contributions to Volume 1 (Anderson et al.; Anderwald and
Wagner; Barbiers et al.; Sebba and Dray; Kallen and Kirk; Tagliamonte)
arose from invited presentations at the workshop on ‘Models and
Methods in the Handling of Unconventional Digital Corpora’ organized
by the editors of the present volume that was held in April 2004 during
the Fifteenth Sociolinguistics Symposium (SS15) at the University of
Newcastle. The book project then evolved by inviting further contributions
from key corpus creators so that the companion volumes would
contain treatments outlining the models and methods underpinning a
variety of digitized diachronic and synchronic corpora with a view to
highlighting synergies and points of contrast between them.