Richard Forsyth - Academia.edu
Papers by Richard Forsyth
Recently there has been an upsurge of interest in the problem of text categorization, e.g. of newswire stories (Hayes & Weinstein, 1991; Apté et al., 1993). However, classifying documents is not a new problem: workers in the field of stylometry have been grappling with it for over a hundred years (Mendenhall, 1887). Typically, they have given most attention to authorship attribution, while more modern research in text categorization, conducted from within the paradigm of Artificial Intelligence, has concentrated on discrimination based on subject matter. Nevertheless both fields share similar aims, and it is the contention of the present author that they could profit from being more aware of each other. Accordingly, the present study addresses an issue common to both approaches, the problem of finding an effective set of attributes or features for text discrimination. Stylometers, in their quest to capture consistent and distinctive features of linguistic style, have proposed and used a wide variety of textual features or markers (Holmes, 1994), including measures of vocabulary richness (Yule, 1944), grammatical transition frequencies (Wickmann, 1976), rates of usage of frequent function words (Mosteller & Wallace, 1984), and preferences for words in certain semantic categories (Martindale & McKenzie, 1995). In many text-categorization tasks the choice of textual features is a crucial determinant of success, yet it is not usually treated as a major focus of attention. This is often true of AI-based text-categorization studies as well. It would be desirable if this part of the process were better understood.
This paper, therefore, reports an empirical comparison of nine different methods of textual feature-finding that: (1) do not depend on subjective judgement; (2) do not need background knowledge external to the texts being analyzed, such as a lexicon or thesaurus; (3) do not presuppose that the texts being analyzed are in the English language; and (4) do not presume that words (or word-based measures) are the only possible textual descriptors. Results of a benchmark test on 13 representative text-classification problems suggest that one of these techniques, here designated Monte-Carlo Feature-Finding, has certain advantages that merit consideration by future workers seeking to characterize stylistic habits efficiently without imposing many preconceptions.
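The abstract describes Monte-Carlo Feature-Finding only at a high level. As a rough illustration of the underlying idea, candidate markers can be drawn by randomly sampling substrings from the training texts themselves, requiring no lexicon, no language assumption, and no commitment to words as the unit of description. The function names, parameter values, and scoring choice below are invented for this sketch and are not taken from the paper.

```python
import random
from collections import Counter

def monte_carlo_features(texts, n_candidates=200, min_len=2, max_len=7, seed=42):
    """Randomly sample substrings from the training texts as candidate
    textual markers (no lexicon, no assumption that markers are words)."""
    rng = random.Random(seed)
    pool = " ".join(texts)
    features = set()
    while len(features) < n_candidates:
        length = rng.randint(min_len, max_len)
        start = rng.randrange(len(pool) - length)
        features.add(pool[start:start + length])
    return sorted(features)

def feature_rates(text, features):
    """Rate of occurrence of each candidate substring per 1000 characters,
    so texts of different lengths are comparable."""
    scale = 1000.0 / max(len(text), 1)
    return {f: text.count(f) * scale for f in features}
```

In a full system the sampled candidates would then be ranked by how well their rates discriminate the text categories, keeping only the most informative ones.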
This program uses an evolutionary (Darwinian) optimization technique to perform clustering, i.e. it identifies within a dataset groups of items which in some sense belong together. An important point about CUES is that it decides on the number of groups as part of the optimization process without having to be given the number to find as input, unlike many well-established clustering algorithms. It has been written in Python3 and is released under the GNU General Public License for general usage.
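The core idea, an evolutionary search in which the number of groups emerges from the optimization rather than being supplied as input, can be sketched in miniature. Everything below (the genome encoding, the fitness function, the per-cluster penalty weight) is an illustrative assumption for this sketch, not the actual CUES implementation.

```python
import random

def evolve_clusters(points, generations=300, pop_size=20, seed=1):
    """Toy evolutionary clustering of 1-D points: each genome assigns
    every point a group label; mutation may introduce or empty labels,
    so the number of clusters is decided by the search itself."""
    rng = random.Random(seed)
    n = len(points)

    def fitness(genome):
        # Negative within-group squared distance to group means,
        # minus a small penalty per cluster to discourage oversplitting.
        groups = {}
        for label, p in zip(genome, points):
            groups.setdefault(label, []).append(p)
        cost = 0.0
        for members in groups.values():
            mean = sum(members) / len(members)
            cost += sum((p - mean) ** 2 for p in members)
        return -(cost + 0.5 * len(groups))

    def mutate(genome):
        child = genome[:]
        i = rng.randrange(n)
        child[i] = rng.randrange(n)  # any label; new labels create clusters
        return child

    pop = [[rng.randrange(n) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        pop = pop[:pop_size // 2]              # keep the fitter half
        pop += [mutate(rng.choice(pop)) for _ in range(pop_size - len(pop))]
    best = max(pop, key=fitness)
    return best, len(set(best))
```

Run on two well-separated groups of numbers, the search settles on two clusters without ever being told to look for two.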
Literary and Linguistic Computing, 1999
When his daughter Tullia died in 45 BC, the Roman orator Marcus Tullius Cicero (106-43 BC) was assailed by grief which he attempted to assuage by writing a philosophical work now known as the Consolatio. Despite its high reputation in the classical world, only fragments of this text, in the form of quotations by subsequent authors, are known to have survived the fall of Rome. However, in 1583 a book was printed in Venice purporting to be a rediscovery of Cicero's Consolatio. Its editor was a prominent humanist scholar and Ciceronian stylist called Carlo Sigonio. Some of Sigonio's contemporaries, notably Antonio Riccoboni, voiced doubts about the authenticity of this work; and since that time scholarly opinion has differed over the genuineness of the 1583 Consolatio. The main aim of this study is to bring modern stylometric methods to bear on this question in order to see whether internal linguistic evidence supports the belief that the Consolatio of 1583 is a fake, very probably perpetrated by Sigonio himself. A secondary objective is to test the application of methods previously used almost exclusively on English texts to a language with a different structure, namely Latin. Our findings show that the language of the 1583 Consolatio is extremely uncharacteristic of Cicero, and indeed that the text is much more likely to have been written during the Renaissance than in classical times. The evidence that Sigonio himself was the author is also strong, though not conclusive.
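The specific stylometric tests used in the study are not reproduced in this abstract. One family of methods it alludes to, comparing rates of common function words between a questioned text and an author's undisputed corpus, can be sketched as follows. The word list and the distance measure here are simplified illustrations chosen for this sketch, not the study's actual markers or statistics.

```python
from collections import Counter

# A hypothetical handful of Latin function words, for illustration only.
FUNCTION_WORDS = ["et", "in", "non", "est", "ut", "cum", "sed", "quod"]

def function_word_profile(text, words=FUNCTION_WORDS):
    """Rate per 1000 tokens of each function word: high-frequency,
    topic-independent markers of the kind stylometry relies on."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    scale = 1000.0 / max(len(tokens), 1)
    return [counts[w] * scale for w in words]

def profile_distance(p, q):
    """Mean absolute difference between two profiles (a crude distance)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

A questioned text whose profile lies much farther from Cicero's undisputed works than those works lie from one another would count as evidence against authenticity.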
ACM SIGEVOlution
EuroGP 2016 co-chair James McDermott describes the EvoStar opening keynote talk given by Dr Richard Forsyth. An early GP pioneer, Forsyth covered some interesting historical pathways in his talk, including a description of his BEAGLE system, developed in 1981.
Social scientists face an overload of digitized information. In particular, they must often spend inordinate amounts of time coding and analyzing transcribed speech. This paper describes a study, in the field of learning science, of the feasibility of semi-automatically coding and scoring verbal data. Transcripts from 48 individual learners, comprising two separate data sets of 44,000 and 23,000 words, were used as test domains for the investigation of three research questions: (1) how well can utterance-type codes assigned to text segments by humans be predicted from the linguistic characteristics of those text segments? (2) how well can learning outcomes be predicted from learners' verbalizations? (3) can the material students are learning from be identified from their language? Initial results indicate that the answer to the third question is yes; and that the answer to the first two questions is: well enough to warrant further development of the text-mining techniques so far employed.
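Predicting a human-assigned utterance-type code from the linguistic characteristics of a text segment is, at heart, a text-classification task. One minimal way to frame it, a nearest-centroid coder over bag-of-words vectors, is sketched below; this is an illustrative baseline invented for this sketch, not the study's own method.

```python
from collections import Counter

def centroid(texts):
    """Average bag-of-words vector for utterances sharing one code."""
    total = Counter()
    for t in texts:
        total.update(t.lower().split())
    n = len(texts)
    return {w: c / n for w, c in total.items()}

def code_utterance(utterance, centroids):
    """Assign the code whose centroid overlaps the utterance most."""
    words = Counter(utterance.lower().split())
    def score(cen):
        return sum(words[w] * cen.get(w, 0.0) for w in words)
    return max(centroids, key=lambda code: score(centroids[code]))
```

Agreement between such automatic codes and the human codes on held-out segments is then the natural measure of "how well" the prediction works.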
http://dx.doi.org/10.1080/001401398186603, Nov 10, 2010
In-depth studies of behavioural factors in road accidents using conventional methods are often inconclusive and costly. In a series of studies exploring alternative approaches, 200 cross-flow junction road accidents were sampled from the files of Nottinghamshire Constabulary, UK, coded for computer analysis using a specially devised 'Traffic Related Action Analysis Language', and then examined using different computational and statistical techniques. The present study employed an AI machine-learning method based on Quinlan's 'ID3' algorithm to create decision trees distinguishing the characteristics of accidents that resulted in injury or in damage only; accidents of young male drivers; and those of the relatively more and less dangerous situations. For example, the severity of accidents involving turning onto a main road could be determined with 79% accuracy from the nature of the other vehicle, season, junction type, and whether the turner failed to notice another road user. Accidents involving young male drivers could be identified with 77% accuracy by knowing if the junction was complex, and whether the turner waited or slowed before turning.
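ID3's core criterion, choosing at each node the attribute whose split yields the greatest information gain, is easy to show in miniature. The toy accident records below are invented for illustration and bear no relation to the Nottinghamshire data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(rows, labels, attrs):
    """Pick the attribute with the highest information gain,
    as ID3 does at each node when growing a decision tree."""
    base = entropy(labels)
    def gain(attr):
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[attr], []).append(lab)
        return base - sum(len(p) / len(labels) * entropy(p)
                          for p in parts.values())
    return max(attrs, key=gain)
```

Applied recursively to each branch, this yields decision trees of the kind the study used to separate injury accidents from damage-only ones.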
Accident Analysis & Prevention, 1999
In-depth studies of behavioural factors in road accidents using conventional methods are often inconclusive and costly. In a series of studies exploring alternative approaches, 200 cross-flow junction road accidents were sampled from the files of Nottinghamshire Constabulary, UK, coded for computer analysis using a specially devised Traffic Related Action Analysis Language, and then examined using different computational and statistical techniques. For comparison, the same analyses were carried out on 100 descriptions of safe turns, and 100 descriptions of hypothetical accidents provided by experienced drivers. The present study employed a range of sequence analysis techniques to examine the patterns of events preceding accidents of different types. Differences were found between real accidents, hypothetical ones and safe turns; between accidents turning onto and off a road with the right of way; between the accidents of younger and older drivers; between accidents on minor roads and major roads; and between the accident expectations (but not the real accidents) of male and female drivers. Pairs of successive events often provided particularly good cues for discriminating accident types.
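The finding that pairs of successive events make good discriminating cues amounts to comparing event-bigram frequencies across groups of coded sequences. A minimal sketch of that comparison follows; the event names and the rate-difference ranking are illustrative assumptions, not the study's actual coding scheme or statistics.

```python
from collections import Counter

def event_pairs(sequence):
    """Successive-event pairs (bigrams) from one coded action sequence."""
    return list(zip(sequence, sequence[1:]))

def pair_rates(sequences):
    """Relative frequency of each event pair across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(event_pairs(seq))
    total = sum(counts.values()) or 1
    return {pair: n / total for pair, n in counts.items()}

def distinctive_pairs(group_a, group_b, top=5):
    """Event pairs whose rates differ most between two groups,
    e.g. real accidents versus safe turns."""
    ra, rb = pair_rates(group_a), pair_rates(group_b)
    pairs = set(ra) | set(rb)
    diffs = {p: ra.get(p, 0.0) - rb.get(p, 0.0) for p in pairs}
    return sorted(diffs, key=lambda p: abs(diffs[p]), reverse=True)[:top]
```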
Literary and Linguistic Computing, 2014
Literary and Linguistic Computing, 1999
[Table 3: Frequencies of substrings in two short poems, comparing marker substrings for the younger Yeats with those for the older Yeats, applied to 'Salley Gardens' (1888, 98 words); the extracted table content is not recoverable.]