Michael Dalvean | The Australian National University
Papers by Michael Dalvean
arXiv (Cornell University), Apr 11, 2024
Word complexity is defined in a number of different ways. Psycholinguistic, morphological and lexical proxies are often used. Human ratings are also used. The problem here is that these proxies do not measure complexity directly, and human ratings are susceptible to subjective bias. In this study we contend that some form of 'latent complexity' can be approximated by using samples of simple and complex words. We use a sample of 'simple' words from primary school picture books and a sample of 'complex' words from high school and academic settings. In order to analyse the differences between these groups, we look at the letter positional probabilities (LPPs). We find strong statistical associations between several LPPs and complexity. For example, simple words are significantly (p < .001) more likely to start with w, b, s, h, g, k, j, t, y or f, while complex words are significantly (p < .001) more likely to start with i, a, e, r, v, u or d. We find similar strong associations for subsequent letter positions, with 84 letter-position variables in the first 6 positions being significant at the p < .001 level. We then use LPPs as variables in creating a classifier which can classify the two classes with an 83% accuracy. We test these findings using a second data set, with 66 LPPs significant (p < .001) in the first 6 positions common to both datasets. We use these 66 variables to create a classifier that is able to classify a third dataset with an accuracy of 70%. Finally, we create a fourth sample by combining the extreme high and low scoring words generated by three classifiers built on the first three separate datasets and use this sample to build a classifier which has an accuracy of 97%. We use this to score the four levels of English word groups from an ESL program.
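To make the idea of letter positional probabilities concrete, here is a minimal sketch in Python. It assumes an LPP is the share of words in a sample that have a given letter at a given position, which is one plausible reading of the abstract rather than the paper's stated definition. The function name, the toy word lists and the choice of denominator are all hypothetical.

```python
from collections import Counter


def letter_positional_probabilities(words, max_positions=6):
    """Return {(position, letter): share of words with that letter there}.

    Assumption: the denominator is the full sample size, so words shorter
    than a given position simply contribute nothing at that position.
    """
    cleaned = [w.lower() for w in words if w]
    if not cleaned:
        return {}
    counts = Counter()
    for word in cleaned:
        # Positions are 1-indexed; only the first max_positions letters count.
        for pos, letter in enumerate(word[:max_positions], start=1):
            counts[(pos, letter)] += 1
    total = len(cleaned)
    return {key: n / total for key, n in counts.items()}


# Toy samples invented for illustration only (not the study's data).
simple_words = ["wolf", "big", "sun", "hat", "bed"]
complex_words = ["inference", "abstract", "evaluate", "derive", "induce"]

simple_lpp = letter_positional_probabilities(simple_words)
complex_lpp = letter_positional_probabilities(complex_words)

# Difference in first-letter probability between the two samples;
# positive values mean the letter is more common at the start of simple words.
first_letter_gap = {
    letter: simple_lpp.get((1, letter), 0.0) - complex_lpp.get((1, letter), 0.0)
    for letter in "abcdefghijklmnopqrstuvwxyz"
}
```

In a fuller analysis, each (position, letter) pair would become one candidate variable, to be tested for association with the simple/complex label before being passed to a classifier.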
SSRN Electronic Journal, 2012
SSRN Electronic Journal, 2012
SSRN Electronic Journal, 2013
Abstract: In this paper I use computational linguistics to find the differences between poems written by amateurs and poems written by professionals. I identify a number of linguistic variables that are important in distinguishing between the two classes of poems. To a large extent the findings corroborate those of earlier researchers. However, I go on to use the identified characteristics to create an ensemble classifier using the principles of machine learning. The holdout sample classification accuracy of the classifier is 80%, indicating ...
Ministerial Careers and Accountability in the Australian Commonwealth Government, Sep 1, 2012
SSRN Electronic Journal, 2015
ICAME Journal, 2017
There have been significant social and political changes in Australian society since federation in 1901. The issues that are considered politically salient have also changed significantly. The purpose of this article is to examine changes in the style and content of election campaign speeches over the period 1901-2016. The corpus consists of 88 election campaign speeches delivered by the Prime Minister and Opposition leader for the 45 elections from 1901 to 2016. I use natural language processing to extract from the speeches a number of linguistic variables which serve as independent variables and use the year of delivery as the dependent variable. I then use machine learning to develop a regression model which explains 77 per cent of the variance in the dependent variable. Examination of the salient independent variables shows that there have been significant linguistic changes in the style and content of election speeches over the study period. In particular speeches have become l...
Literary and Linguistic Computing, 2013
GOVERNMENT
Why did Barry Jones not become a cabinet minister while Gareth Evans did? Was it a difference in ability, social skill or political judgment? Was it inevitable that Peter McGauran, Martin Ferguson and David Kemp would become cabinet ministers while their brothers, Julian, Laurie and Rod respectively, would not? This chapter contends that there are reasons some individuals make it to cabinet and some do not, and these differences are detectable at an early stage of an individual's career and are far more important in ...
In this paper I use computational linguistics to find the differences between poems written by amateurs and poems written by professionals. I identify a number of linguistic variables that are important in distinguishing between the two classes of poems. To a large extent the findings corroborate those of earlier researchers, such as the fact that professional poems have more concrete language than amateur poems. However, I go on to use the identified characteristics to create an ensemble classifier using the principles of machine learning. ...
Linguistic Research, 2018
Empirical Studies of the Arts, 2016
This article extends recent work on the application of computational linguistics to the analysis of poetry. The dataset consisted of 85 canonical English poems and a matched control group of obscure poems. I used Linguistic Inquiry and Word Count to create more than 65 linguistic variables and then used machine learning to develop a classifier designed to distinguish between the canonical (highly anthologized) poems and the obscure (seldom anthologized) poems. The classifier consists of 6 variables and has an accuracy of 69% in distinguishing between canonical and obscure poems. I then ranked the poems using the probability scores of the classifier and found that Blake's A Poison Tree scored highest. I explain the ranking method as being a means of distilling the “literary” appeal from the “popular” appeal of the poems in the sample. Finally, I discuss the implications for the theory of poetry in general.
Linguistic Research, 2018
Dalvean, Michael and Galbadrakh Enkhbayar. 2018. Linguistic Research 35(Special Edition), 137-170. Standard readability measures are based on the readability of non-fiction texts. This means that the validity of the measures when applied to fiction texts is questionable. Thus, the scores given to fiction texts using such indices may be invalid when used by English teachers to identify fiction texts of appropriate difficulty for students with various reading ability levels. This paper attempts to address this problem by 1) developing a readability measure specifically designed for fiction texts and 2) applying it to 200 English fiction texts. A corpus, consisting of 100 adults' and 100 children's texts, is used for the analysis. In the initial modeling, several standard readability measures are used as variables, and machine learning is used to create a classifier which is able to classify the corpus with an accuracy of 84%. A second classifier is then created using linguistic variables rather than standard readability measures. The latter classifier is able to classify the corpus with an accuracy of 89%, indicating that the standard readability measures are less accurate in classifying fiction texts than linguistic variables. Due to its higher accuracy, the latter classifier is then used to provide a linear complexity or 'readability' rank for each text. The ranking using the linguistic-based classifier provides a more accurate method of determining which texts to choose for students according to their reading levels than the standard readability measures. Importantly, the ranking instantiates a fine-grained increase in complexity. This means that the ranking can be used by an English teacher to select a sequence of texts that represent an increasing challenge to a student without there being a frustratingly discrete rise in difficulty.
(Canberra College ・ Southern Taiwan University of Science and Technology)
There have been significant social and political changes in Australian society since federation in 1901. The issues that are considered politically salient have also changed significantly. The purpose of this article is to examine changes in the style and content of election campaign speeches over the period 1901-2016. The corpus consists of 88 election campaign speeches delivered by the Prime Minister and Opposition leader for the 45 elections from 1901 to 2016. I use natural language processing to extract from the speeches a number of linguistic variables which serve as independent variables and use the year of delivery as the dependent variable. I then use machine learning to develop a regression model which explains 77 per cent of the variance in the dependent variable. Examination of the salient independent variables shows that there have been significant linguistic changes in the style and content of election speeches over the study period. In particular, speeches have become less linguistically complex, less analytical, more focused on work and the home, and contain more social references. I discuss these changes in the context of changes in Australian society over the study period.
The purpose of this article is to examine the psychological elements of the ideology of members of the major parties in the Australian federal parliament using computational linguistics. The cohort consists of the 485 Labor, Liberal and National parliamentarians who were in parliament over the period April 1996 to July 2014. I use computational linguistics to extract linguistic variables from first speeches in parliament of those in the cohort. I draw from methods used in machine learning to develop a classifier which has a 74% out-of-sample (leave-one-out cross-validation) accuracy in classifying parliamentarians as liberal (ALP) or conservative (Liberal/National Party Coalition). I then examine the salient variables and find that there are only six linguistic markers of conservative/liberal ideology. Of these, two are consistent with the previous findings that liberals tend to display more psychological 'openness' than conservatives and less psychological 'conscientiousness'. However, one of these variables strongly challenges the idea that conservatives look to the past and liberals to the future. Two of the six linguistic variables are 'suppressor' variables and I discuss these variables in the context of their role in suppressing 'irrelevant' variance in the other independent variables.
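The leave-one-out cross-validation mentioned in this abstract can be sketched as follows. The abstract does not say which classifier was used, so a simple nearest-centroid rule stands in for it here. The function names, feature values and class labels are all hypothetical and chosen only to show how each item is held out in turn.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x.

    Stand-in classifier (an assumption); the abstract does not name the model.
    """
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # sorted() makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda lab: squared_distance(centroids[lab], x))


def leave_one_out_accuracy(X, y):
    """Hold out each item once, train on the rest, and score the held-out item."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Invented two-feature data for two well-separated hypothetical classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

accuracy = leave_one_out_accuracy(X, y)
```

Because every item is scored by a model that never saw it during training, the resulting figure is an out-of-sample estimate, which is the sense in which the 74% accuracy above is reported.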
English teachers often have difficulty matching the complexity of fiction texts with students' reading levels. Texts that seem appropriate for students of a given level can turn out to be too difficult. Furthermore, it is difficult to choose a series of texts that represent a smooth gradation of text difficulty. This paper attempts to address both problems by providing a complexity ranking of a corpus of 200 fiction texts consisting of 100 adults' and 100 children's texts. Using machine learning, several standard readability measures are used as variables to create a classifier which is able to classify the corpus with an accuracy of 84%. A classifier created with linguistic variables is able to classify the corpus with an accuracy of 89%. The latter classifier is then used to provide a linear complexity rank for each text. The resulting ranking instantiates a fine-grained increase in complexity. This can be used by an English teacher to select a sequence of texts that represent an increasing challenge to a student without there being a frustratingly discrete rise in difficulty.