Michael Oakes | University of Wolverhampton

Papers by Michael Oakes

Research paper thumbnail of 07 - Emotionally unstable personality disorder (EUPD), inadequate term to reflect the true nature, gravity and psychopathology of a complex trauma-based disorder

Research paper thumbnail of Computer stylometry of C. S. Lewis’s The Dark Tower and related texts

Digital Scholarship in the Humanities

This article looks at the provenance of the unfinished novel The Dark Tower, generally attributed to C. S. Lewis. The manuscript was purportedly rescued from a bonfire shortly after Lewis's death by his literary executor Walter Hooper, but the quality of the text is hardly vintage Lewis. Using the computer stylometric programs made available in the 'stylo' package of Eder et al. (2016: Stylometry with R: a package for computational text analysis. R Journal, 8(1): 107-21) and a word length analysis, samples of each chapter of The Dark Tower were compared with works known to be by Lewis, two books by Hooper and a hoax letter concerning the bonfire by Anthony Marchington. Initial experiments found that the first six chapters of The Dark Tower were stylometrically consistent with Lewis's known works, but the incomplete Chapter 7 was not. This may have been due to an abrupt change in genre, from narrative to pseudoscientific style. Using principal components analysis, it was found that the first and subsequent components were able to separate genre and individual style, and thus a plot of the second against the third principal components enabled the effects of genre to be filtered out. This showed that Chapter 7 was also consistent with the other samples of C. S. ...
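The genre-filtering idea described above can be sketched as follows: project word-frequency vectors onto principal components and inspect PC2 against PC3, on the assumption that PC1 mostly captures the genre contrast. The data below are random stand-ins for real most-frequent-word relative frequencies, purely for illustration.

```python
# A sketch of PCA-based genre filtering, not the article's actual pipeline.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((12, 50))          # 12 text samples x 50 word frequencies (synthetic)

Xc = X - X.mean(axis=0)           # centre each word-frequency column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :3] * S[:3]         # sample scores on the first 3 principal components

# Plotting scores[:, 1] against scores[:, 2] (PC2 vs PC3) would give the
# genre-filtered view of authorial style discussed above.
print(scores.shape)
```

In a real analysis, X would hold the relative frequencies of the most frequent words in each chapter sample, and the PC2/PC3 plot would be inspected for clustering by author rather than by genre.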

Research paper thumbnail of The Isle of Wight Suicide Study: a case study of suicide in a limited geographic area

Irish Journal of Psychological Medicine

Background: Suicide is a major public health problem, with mental disorders being one of its major risk factors. The high incidence of suicide on the Isle of Wight motivated this study, the first of its kind on suicide in this small geographic area. Aim: The aim of the study was to identify socio-demographic and clinical risk factors for suicide in the population of service users and non-service users, and gender-related characteristics of suicidal behaviour in a limited geographic region. Method: Data were collected on 68 cases of suicide (ICD-10 X60–X84) from residents of the Isle of Wight District between January 2006 and December 2009. All data were statistically analysed using Pearson's χ2 test and Yates' correction for continuity. Results: The mean annual suicide rates over the period were 5.65 per 100 000 for women and 19.28 for men. Significantly more men than women died as a result of suicide (male/female ratio 3:1; p=0.0006). Relatively more women (56.2%) than men (32.7%) (p=0.07) ...
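The study's main statistical tool, Pearson's chi-squared test with Yates' continuity correction on a 2x2 table, can be sketched like this. The counts in the example are hypothetical, not figures from the study.

```python
# Pearson's chi-squared with Yates' continuity correction for a 2x2 table
# [[a, b], [c, d]]. Example counts are invented for illustration.
def yates_chi2(a, b, c, d):
    """chi2 = N * (|ad - bc| - N/2)^2 / ((a+b)(c+d)(a+c)(b+d))"""
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# e.g. men vs women cross-tabulated against some binary characteristic
print(yates_chi2(30, 21, 9, 8))
```

The correction subtracts N/2 from |ad - bc| before squaring, which makes the test more conservative for the small cell counts typical of a 68-case series.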

Research paper thumbnail of A Scalable Meta-Classifier Combining Search and Classification Techniques for Multi-Level Text Categorization

International Journal of Computational Intelligence and Applications

Nowadays, documents are increasingly associated with multi-level category hierarchies rather than a flat category scheme. As the volume and diversity of documents grow, so do the size and complexity of the corresponding category hierarchies. To be able to access such hierarchically classified documents in real-time, we need fast automatic methods to navigate these hierarchies. Today’s data domains are also very different from each other, such as medicine and politics. These distinct domains can be handled by different classifiers. A document representation system which incorporates the inherent category structure of the data should also add useful semantic content to the data vectors and thus lead to better separability of classes. In this paper, we present a scalable meta-classifier to tackle today’s problem of multi-level data classification in the presence of large datasets. To speed up the classification process, we use a search-based method to detect the level-1 category of a t...

Research paper thumbnail of Towards automatic generation of relevance judgments for a test collection

2016 Eleventh International Conference on Digital Information Management (ICDIM), 2016

Research paper thumbnail of Ngrams and Engrams: the use of structural and conceptual features to discriminate between English translations of religious texts

Corpora, 2016

In this paper, we present experiments using the Linguistic Inquiry and Word Count (LIWC) program, a ‘closed-class keyword’ (CCK) analysis and a ‘correspondence analysis’ (CA) to examine whether the Scientology texts of L. Ron Hubbard are linguistically and conceptually like those of other religions. A Kruskal–Wallis test comparing the frequencies of LIWC category words in the Scientology texts and the English translations of the texts of five other religions showed that there were eighteen categories for which the Scientology texts differed from the others, and between one and seventeen for the other religions. In the CCK experiment, keywords typical of each religion were found, both by comparing the religious texts with one another and with the Brown corpus of general English. The most typical keywords were looked up in a concordancer and were manually coded with conceptual tags. The set of categories found for the Scientology texts showed little overlap with those found for the ot...
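The Kruskal–Wallis comparison described above can be sketched with a minimal H statistic: rank all observations jointly, then compare the groups' rank sums. The per-text frequencies below are invented, and tied values are not corrected for (a real analysis should use tie-adjusted ranks, as in standard statistical packages).

```python
# Minimal Kruskal-Wallis H statistic (no tie correction), sketching the
# comparison of one LIWC category's frequencies across several religions.
def kruskal_h(*groups):
    """H statistic for k independent groups of observations (assumes no ties)."""
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    return (12.0 / (n * (n + 1))
            * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
            - 3 * (n + 1))

# e.g. hypothetical per-text frequencies of one word category in three religions
print(kruskal_h([3.1, 2.8, 3.5], [2.0, 1.9, 2.2], [4.0, 4.4, 3.9]))
```

A large H (compared against the chi-squared distribution with k-1 degrees of freedom) indicates that at least one religion's category frequencies differ from the others.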

Research paper thumbnail of Data Mining for Gender Differences in Tinnitus

Research paper thumbnail of Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus

Ranlp, 2005

Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus. Michael P. Oakes, University of Sunderland, England. Thesauri: Ontologies for Information Retrieval. To standardise semantic terms, many areas ...
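The best-known of Hearst's lexico-syntactic patterns, "NP such as NP, NP and NP", can be sketched with a simple regular expression. This is an illustration only: Hearst's rules operate over noun-phrase chunks rather than raw words, and the example sentence is invented.

```python
# A toy matcher for one Hearst pattern ("X such as Y, Z and W"),
# of the kind used to harvest hyponym-hypernym pairs from a corpus.
import re

PATTERN = re.compile(r"(\w+) such as ([\w ,]+)")

def hearst_such_as(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        for hyponym in re.split(r",\s*|\s+and\s+|\s+or\s+", m.group(2)):
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs

print(hearst_such_as("analgesics such as aspirin, ibuprofen and paracetamol"))
```

On the example sentence this yields aspirin, ibuprofen and paracetamol as hyponyms of analgesics; a pharmaceutical mining system would run such patterns over the whole corpus and aggregate the pairs into a thesaurus.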

Research paper thumbnail of Semantic subspace learning for text classification using hybrid intelligent techniques

International Journal of Hybrid Intelligent Systems, 2011

A vast data repository such as the web contains many broad domains of data which are quite distinct from each other, e.g. medicine, education, sports and politics. Each of these domains constitutes a subspace of the data within which the documents are similar to each other but quite distinct from the documents in another subspace. The data within these domains are frequently further divided into many subcategories. In this paper we present a novel hybrid parallel architecture using different types of classifiers trained on different subspaces to improve text classification within these subspaces. The classifier to be used on a particular input and the relevant feature subset to be extracted are determined dynamically by using maximum significance values. We use the conditional significance vector representation, which enhances the distinction between classes within the subspace. We further compare the performance of our hybrid architecture with that of a single-classifier, full-data-space learning system and show that it outperforms the single-classifier system by a large margin when tested with a variety of hybrid combinations on two different corpora. Our results show that subspace classification accuracy is boosted and learning time significantly reduced with this new hybrid architecture.

Research paper thumbnail of Language variation in corpora

Research paper thumbnail of Chi-Squared, Yule’s Q and Likelihood Ratios in Tabular Audiology Data

Abstract: In this chapter, we have used the chi-squared test and Yule's Q measure to discover associations in tables of patient audiology data. These records are examples of heterogeneous medical records, since they contain audiograms, textual notes and typical relational ...
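Yule's Q, the second association measure named above, has a simple closed form for a 2x2 table [[a, b], [c, d]]: Q = (ad - bc) / (ad + bc), ranging from -1 to +1 with 0 meaning no association. The counts below are hypothetical, not drawn from the audiology data.

```python
# Yule's Q for a 2x2 association table [[a, b], [c, d]].
def yules_q(a, b, c, d):
    """Q in [-1, 1]; 0 means no association between the two binary variables."""
    return (a * d - b * c) / (a * d + b * c)

# e.g. hypothetical counts relating a patient attribute to a hearing-aid choice
print(yules_q(40, 10, 5, 45))   # close to +1: strong positive association
```

Unlike chi-squared, Q directly expresses the direction and strength of the association rather than its statistical significance, which is why the chapter uses the two measures together.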

Research paper thumbnail of Preprocessing multilingual corpora

Research paper thumbnail of Intelligent Access to Text: Integrating Information

In this paper we show how two standard outputs from information extraction (IE) systems (named entity annotations and scenario templates) can be used to enhance access to text collections via a standard text browser. We describe how this information is used in a prototype system designed to support information workers' access to a pharmaceutical news archive as part of their "industry watch" function. We also report the results of a preliminary, qualitative user evaluation of the system, which, while broadly positive, indicates that further work needs to be done on the interface to make users aware of the increased potential of IE-enhanced text browsers.

Research paper thumbnail of A Survey of Uses of GATE

GATE, a General Architecture for Text Engineering, aims to provide a software infrastructure for researchers and developers working in NLP. GATE has now been widely available for four years. In this paper, we review the objectives which motivated the creation of GATE and the functionality and design of the current system. We describe some of the ways in which GATE has been used during this time, and examine the strengths and weaknesses of the current system, identifying areas for improvement.

Research paper thumbnail of Concept suggestion engine for professional multimedia archives

Research paper thumbnail of Mining of Audiology Patient Records: Factors Influencing the Choice of Hearing Aid Type

Research paper thumbnail of Generating topic signatures from a corpus of e-learning content

Virtual Learning Environment (VLE) systems are currently being used for critical services in colleges and universities around the UK. In the last decade, with web technology enhancements, increasing Internet connection speeds and the growth of data on the Internet, more systems and tools have been developed to facilitate the learning experience on the web. One of the basic problems institutions face at the moment is linking and presenting the information that is available within online learning spaces. Many general content repositories have been created, as well as institutions' own sets of learning resources. Nor is this the sole benefit of the web for learners: with current advances in information retrieval and learner support systems on the web, there is more potential to be exploited in providing learners with help based on the information contained in learning materials. In this paper we describe the theoretical background and the techniques applied to extract topic signatures from an existing collection of learning content documents. Topic signatures are sets of topic words, together with other words semantically related to them, that together uniquely identify the topic [Biryukov, M., Angheluta, R. and Moens, M. 2005]. Automatic extraction of topic signatures interests us in this project because of the promising results it has brought in a variety of domains, such as text summarisation, ontology population and question answering. Topic signatures can provide a strong baseline for developing content-driven e-learning applications within VLEs. The work accomplished so far to acquire topic signatures from learning content is described in the sections of this paper.
This work includes the implementation of different statistical weights (such as TF-IDF, Deviation from Randomness (DFR) and likelihood ratio) to acquire topic and signature terms and also an evaluation of the results obtained from these methods.
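Of the statistical weights mentioned, TF-IDF is the simplest to sketch: a term scores highly in a document when it is frequent there but rare across the collection. The three-document "corpus" below is invented for illustration.

```python
# Minimal TF-IDF weighting of candidate topic-signature terms.
import math
from collections import Counter

docs = [
    "neural networks learn from data",
    "decision trees split data on features",
    "networks of learners share data",
]
tokenised = [d.split() for d in docs]
n_docs = len(docs)
df = Counter(t for doc in tokenised for t in set(doc))  # document frequency

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf

# Rank the first document's terms: words shared by every document
# (here "data") get zero weight and sink to the bottom of the ranking.
ranked = sorted(set(tokenised[0]), key=lambda t: -tfidf(t, tokenised[0]))
print(ranked)
```

The highest-weighted terms per topic become candidate signature terms; DFR and the likelihood ratio play the same ranking role with different weighting formulas.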

Research paper thumbnail of Statistical Measures for Corpus Profiling

Kilgarriff (2001) gives a number of reasons for comparing corpora which are relevant to the theme of this workshop. In particular, he considers the difficulty, measurable in terms of time and cost, in porting a new corpus to an existing NLP system. Different types of corpora which have been compared in the past include samples of English spoken in different parts of the world, transcribed speech differentiated by such factors as gender and age, writings by different authors in stylometric studies and texts from different genres. Statistical techniques exist to discriminate between these text types, although in these studies the interest has generally been in the types of text per se, rather than their amenability to NLP tools. In an exemplary case study by Sekine (1997), the performance of a parser was affected by the similarity between the training and test corpora. Sekine's measure of corpus similarity was the information theoretic measure of cross entropy, and the subjective reasonableness of this measure was demonstrated by hierarchical clustering.
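Cross entropy as a corpus-similarity measure, in the spirit of the Sekine (1997) study cited above, can be sketched as follows: build a unigram model of corpus A and measure how surprising corpus B looks under it. The add-one smoothing and the toy word lists are assumptions of this sketch, not details from that paper.

```python
# Cross entropy of corpus B under a unigram model of corpus A (bits per word).
import math
from collections import Counter

def cross_entropy(corpus_a, corpus_b):
    counts = Counter(corpus_a)
    total = len(corpus_a)
    vocab = set(corpus_a) | set(corpus_b)

    def p(w):
        # Add-one smoothing so words unseen in A get non-zero probability
        return (counts[w] + 1) / (total + len(vocab))

    return -sum(math.log2(p(w)) for w in corpus_b) / len(corpus_b)

a = "the cat sat on the mat".split()
b = "the cat sat on the rug".split()
print(cross_entropy(a, b))
```

The more similar the two corpora's word distributions, the lower the cross entropy, which is why it serves as a proxy for how well an NLP system trained on one corpus will transfer to another.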

Research paper thumbnail of Computer-aided discovery of regular sound changes in Spanish and Galician

Research paper thumbnail of Chi-squared and associations in tabular audiology data

Abstract: In this paper, we have used the chi-squared test and Yule's Q measure to discover associations in tables of patient audiology data. These records are examples of heterogeneous medical records, since they ...

Research paper thumbnail of Contrasts between US and British English of the 1990s

In this paper I use the method of Hofland and Johansson (1982), who found words more typical of either British or United States English. They studied word frequencies in British and US English, using two collections of machine-readable text each containing about one million words. These collections were the Brown Corpus, compiled in the 1960s to provide a representative sample of US English using many different authors and genres, and the equivalent LOB (Lancaster-Oslo-Bergen) Corpus of British English. To discover which words were more typical of one form of English than the other, they used a statistical test called chi-square. I have repeated their experiment for 1990s English using the FROWN and FLOB corpora, the updated equivalents of the earlier BROWN and LOB corpora.

1. Background: The work of Leech and Fallon. Leech and Fallon (1992) used the lists of words more typical of either US or UK English produced by Hofland and Johansson as evidence of cultural differences. They excluded differences of spelling (e.g. color and colour), lexical choice (e.g. gasoline and petrol) and proper nouns associated with the two nations (e.g. Chicago was more common in the US texts, and London was more common in the UK texts). Such differences they referred to as linguistic contrasts; all other differences they referred to as non-linguistic contrasts. To explore socio-cultural differences, they concentrated on non-linguistic contrasts, and identified 15 categories where interesting differences occurred. The differences found by Leech and Fallon for the 1960s data, and those found by this author for the 1990s texts, are shown for each of these categories in section 3. Examination of the lists of words more typical of one type of English derived from the 1990s data did not reveal any major new categories of vocabulary, except for forms of the verbs be and have, noted previously by Hofland and Johansson (p. 36).

2. Method: The chi-square test. To find which words are more typical of either US or UK English, the chi-square test can be used. For every distinct word (type) in each corpus, we find how often it occurs in the US corpus and how often it occurs in the UK corpus. For example, the word says is found 655 times in the BROWN corpus of US English, but only 310 times in the LOB corpus of UK English. We need to know whether these figures could have arisen by chance, or whether says is genuinely more typical of US English. To run the chi-square test, we also need to know the total number of words in each corpus, so we can work out how many words (tokens) in each corpus are not the word says. The four values found so far are arranged in a 2-by-2 table called a contingency table, as shown below:

    Number of times says is found in the US corpus        Number of times says is found in the UK corpus
    Number of words other than says in the US corpus      Number of words other than says in the UK corpus

Since these are values we can count directly, they are called observed values. From the observed values, we can use a formula to calculate the corresponding expected values if there were no tendency for the word to occur more often in either US or UK English. If the differences between the observed and expected values are high, we get a high chi-square value. If the resulting chi-square value is more than 10.83, we can be 99.9% confident that the difference between the two corpora did not arise by chance.
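The says example above can be run end to end. For illustration, corpus sizes of exactly one million words each are assumed here (the real BROWN and LOB totals differ slightly from a round million).

```python
# Chi-square for a 2x2 contingency table [[a, b], [c, d]] (no Yates correction),
# applied to the "says" example: 655 occurrences in the US corpus, 310 in the UK.
def chi_square_2x2(a, b, c, d):
    """chi2 = N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))"""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

us_says, uk_says = 655, 310
us_other = 1_000_000 - us_says     # assumed corpus size for illustration
uk_other = 1_000_000 - uk_says

chi2 = chi_square_2x2(us_says, uk_says, us_other, uk_other)
print(chi2, chi2 > 10.83)
```

The resulting value is far above the 10.83 threshold, so says is more typical of US English at the 99.9% confidence level, exactly as the method section predicts.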