Will Radford - Academia.edu

Papers by Will Radford

Research paper thumbnail of Automating Financial Surveillance

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2010

Financial surveillance technology alerts analysts to suspicious trading events. Our aim is to identify explainable false positives (e.g., caused by price-sensitive information in company news) and explainable true positives (e.g., caused by ramping in forums) by aligning these alerts with publicly available information. Our system aligns 99% of alerts, which will speed the analysts' task by helping them to eliminate false positives and gather evidence for true positives more rapidly.

Research paper thumbnail of Learning multilingual named entity recognition from Wikipedia

Artificial Intelligence, 2013

We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia. Most NER systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (NE) types, training and evaluating on 7,200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into NE annotations by projecting the target article's classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic NE annotation (Richman and Schone, 2008; Mika et al., 2008); competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.

Research paper thumbnail of TAT: an author profiling tool with application to Arabic emails

Proceedings of the Australasian Language Technology Workshop, 2007

This paper reports on the application of the Text Attribution Tool (TAT) to profiling the authors of Arabic emails. The TAT system has been developed for the purpose of language-independent author profiling and has now been trained on two email corpora, English and Arabic. We describe the overall TAT system and the machine learning experiments resulting in classifiers for the different author traits. Predictions for demographic and psychometric author traits show improvements over the baseline for some of the ...

Research paper thumbnail of Gendered Ambiguous Pronoun (GAP) Shared Task at the Gender Bias in NLP Workshop 2019

Proceedings of the First Workshop on Gender Bias in Natural Language Processing, 2019

The 1st ACL Workshop on Gender Bias in Natural Language Processing included a shared task on gendered ambiguous pronoun (GAP) resolution. This task was based on the coreference challenge defined in Webster et al. (2018), designed to benchmark the ability of systems to resolve pronouns in real-world contexts in a gender-fair way. 263 teams competed via a Kaggle competition, with the winning system achieving a log loss of 0.13667 and near gender parity. We review the approaches of eleven systems with accepted description papers, noting their effective use of BERT (Devlin et al., 2019), both via fine-tuning and for feature extraction, as well as ensembling.

Research paper thumbnail of Joint Apposition Extraction with Syntactic and Semantic Constraints

Appositions are adjacent NPs used to add information to a discourse. We propose systems exploiting syntactic and semantic constraints to extract appositions from OntoNotes. Our joint log-linear model outperforms the state-of-the-art Favre and Hakkani-Tür (2009) model by ∼10% on Broadcast News, and achieves 54.3% F-score on multiple genres.

Research paper thumbnail of (Almost) Total Recall -- SYDNEY_CMCRC at TAC 2012

We explore unsupervised and supervised whole-document approaches to English NEL with naïve and context clustering. Our best system uses unsupervised entity linking and naïve clustering and scores 66.5% B³+ F1. Our KB clustering score is competitive with the top systems at 65.6%.
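The B³ family of scores above is computed per mention: each item's precision is the fraction of its predicted cluster that shares its gold cluster, and recall is the symmetric quantity. A minimal sketch of standard B-cubed follows (TAC's B³+ variant additionally requires the KB link itself to be correct; that part is omitted here):

```python
from collections import defaultdict

def b_cubed(gold, pred):
    """Standard B-cubed precision/recall/F1 over two clusterings.

    gold, pred: dicts mapping each item to a cluster label.
    """
    gold_clusters, pred_clusters = defaultdict(set), defaultdict(set)
    for item, label in gold.items():
        gold_clusters[label].add(item)
    for item, label in pred.items():
        pred_clusters[label].add(item)

    def score(a_clusters, a_labels, b_clusters, b_labels):
        # Mean over items of |own a-cluster ∩ own b-cluster| / |own a-cluster|.
        total = 0.0
        for item in a_labels:
            a = a_clusters[a_labels[item]]
            b = b_clusters[b_labels[item]]
            total += len(a & b) / len(a)
        return total / len(a_labels)

    p = score(pred_clusters, pred, gold_clusters, gold)
    r = score(gold_clusters, gold, pred_clusters, pred)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

A perfect clustering scores 1.0 on all three values; splitting or merging clusters trades recall against precision per mention.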

Research paper thumbnail of SYDNEY CMCRC at TAC 2013

We use a supervised whole-document approach to English Entity Linking with simple clustering approaches. The system extends our TAC 2012 system (Radford et al., 2012), introducing new features for modelling local entity description and type-specific matching, as well as type-specific supervised models and supervised NIL classification. Our rule-based clustering takes advantage of local description and topics to split NIL clusters. The best system uses supervised entity linking and local description type clustering and scores 72.7% B³+ F1. Our KB clustering score is competitive with the top system at 71.4%.

Research paper thumbnail of Can adult mental health be predicted by childhood future-self narratives? Insights from the CLPsych 2018 Shared Task

Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

The CLPsych 2018 Shared Task B explores how childhood essays can predict psychological distress throughout the author's life. Our main aim was to build tools to help our psychologists understand the data, propose features and interpret predictions. We submitted two linear regression models: MODELA uses simple demographic and word-count features, while MODELB uses linguistic, entity, typographic, expert-gazetteer, and readability features. Our models perform best at younger prediction ages, with our best unofficial score, a disattenuated Pearson correlation of 0.426, at age 23. This task is challenging and, although predictive performance is limited, we propose that tight integration of expertise across computational linguistics and clinical psychology is a productive direction.
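The disattenuated Pearson correlation reported above is Spearman's classical correction for attenuation: the observed correlation divided by the geometric mean of the two measures' reliabilities. A minimal sketch (parameter names are illustrative, not from the shared task's scoring code):

```python
import math

def disattenuated_pearson(r_xy, reliability_x, reliability_y):
    """Spearman's correction for attenuation.

    r_xy: observed Pearson correlation between predictions and ratings.
    reliability_x, reliability_y: reliability coefficients of the two measures
    (each in (0, 1]); unreliable measures inflate the corrected value.
    """
    return r_xy / math.sqrt(reliability_x * reliability_y)
```

With perfectly reliable measures (both reliabilities 1.0) the corrected value equals the observed correlation; lower reliabilities scale it up.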

Research paper thumbnail of Tracking Information Flow between Primary and Secondary News Sources

Tracking information flow (IFLOW) is crucial to understanding the evolution of news stories. We present analysis and experiments for IFLOW between company announcements and newswire. Error analysis shows that many false positives are annotation errors and many false negatives are due to coarse-grained document-level modelling. Experiments show that document meta-data features (e.g., category, length, timing) improve F-scores relative to the upper bound by 23%.

Research paper thumbnail of Probabilistic matching for dialog state tracking with limited training data

This report details our submission to the Fourth Dialog State Tracking Challenge (DSTC4), the first time Xerox has participated. We take a segment-specific approach that attempts to identify ontology values as precisely as possible using a statistical model. Our model is inspired by work in Named Entity Linking that extracts mentions, then searches and reranks candidates. This is mainly motivated by the small amount of data available relative to the high complexity of the task. However, we believe this setting is realistic in an industrial environment, where little data is generally available for a given dialog context to automate. This relatively simple approach performs reasonably at 38.5% F1 under schedule 2 evaluation, and is the most precise at 59.4% on the DSTC4 test set.

Research paper thumbnail of Learning to generate one-sentence biographies from Wikidata

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We investigate the generation of one-sentence Wikipedia biographies from facts derived from Wikidata slot-value pairs. We train a recurrent neural network sequence-to-sequence model with attention to select facts and generate textual summaries. Our model incorporates a novel secondary objective that helps ensure it generates sentences that contain the input facts. The model achieves a BLEU score of 41, improving significantly upon the vanilla sequence-to-sequence model and scoring roughly twice that of a simple template baseline. Human preference evaluation suggests the model is nearly as good as the Wikipedia reference. Manual analysis explores content selection, suggesting the model can trade the ability to infer knowledge against the risk of hallucinating incorrect information.

Research paper thumbnail of Classification of mental health forum posts

Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, 2016

Research paper thumbnail of Discovering Entity Knowledge Bases on the Web

Proceedings of the 5th Workshop on Automated Knowledge Base Construction, 2016

Recognition and disambiguation of named entities in text is a knowledge-intensive task. Systems are typically bound by the resources and coverage of a single target knowledge base (KB). In place of a fixed knowledge base, we attempt to infer a set of endpoints which reliably disambiguate entity mentions on the web. We propose a method for discovering web KBs and our preliminary results suggest that web KBs allow linking to entities that can be found on the web, but may not merit a major KB entry.

Research paper thumbnail of Naïve but effective NIL clustering baselines - CMCRC at TAC 2011

This paper describes the CMCRC systems entered in the TAC 2011 entity linking challenge. We used our best-performing system from TAC 2010 to link queries, then clustered NIL links. We focused on naïve baselines that group by attributes of the top entity ...

Research paper thumbnail of Linking named entities to Wikipedia

Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case, Wikipedia. The named entity linking (NEL) task requires systems to identify the entry, or Wikipedia article, that a mention refers to; or, if the KB does not contain the correct entry, return NIL. Entity linking systems can be complex and we present a framework for analysing their different components. First, mentions must be extracted from the text. The KB is searched to build a list of candidate entries for a mention. Finally, a disambiguation component will identify the correct entry or propose a NIL link. This provides a lens through which to understand and compare systems, and a way to characterise how performance in one component affects another. We use this framework to comprehensively analyse three seminal systems: Bunescu and Paşca (2006), Cucerzan (2007) and Varma et al. (2009). These are evaluated on a common dataset ...
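The extract / search / disambiguate framework described in the abstract can be sketched in a few lines. Everything here (the toy alias table, the all-caps mention heuristic, the word-overlap scorer) is an illustrative stand-in, not the method of any of the cited systems:

```python
# Toy alias table standing in for a KB search index: alias -> candidate titles.
TOY_KB = {
    "ABC": ["American Broadcasting Company", "Australian Broadcasting Corporation"],
}

def extract_mentions(text):
    # Stand-in for a real NER step: treat all-caps tokens as mentions.
    return [tok for tok in text.split() if tok.isupper() and len(tok) > 1]

def search_candidates(mention, kb=TOY_KB):
    # KB search: build the candidate list for a mention.
    return kb.get(mention, [])

def disambiguate(mention, candidates, context):
    # Stand-in scorer: prefer the candidate sharing most words with the context.
    if not candidates:
        return "NIL"  # the KB lacks a correct entry
    return max(candidates,
               key=lambda c: len(set(c.lower().split()) & set(context.lower().split())))

def link(text):
    # Full pipeline: extract mentions, search candidates, disambiguate each.
    return {m: disambiguate(m, search_candidates(m), text)
            for m in extract_mentions(text)}
```

The value of the framework is that each stage can be swapped or evaluated independently, e.g. measuring how often the correct entry survives the search stage before disambiguation is even attempted.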

Research paper thumbnail of Cheap and easy entity evaluation

Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014

The AIDA-YAGO dataset is a popular target for whole-document entity recognition and disambiguation, despite lacking a shared evaluation tool. We review evaluation regimens in the literature while comparing the output of three approaches, and identify research opportunities. This utilises our open, accessible evaluation tool. We exemplify a new paradigm of distributed, shared evaluation, in which evaluation software and standardised, versioned system outputs are provided online.

Research paper thumbnail of Evaluating Entity Linking with Wikipedia

Artificial Intelligence, 2013

Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or NIL. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets. We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms.

Research paper thumbnail of Tracking information flow in financial text

Information is fundamental to finance, and understanding how it flows from official sources to news agencies is a central problem. Readers need to digest information rapidly from high-volume news feeds, which often contain duplicate and irrelevant stories, to gain a competitive advantage. We propose a text categorisation task over pairs of official announcements and news stories to identify whether the story repeats announcement information and/or adds value. Using features based on the intersection of the texts and relative timing, our system identifies information flow at 89.5% F-score and three types of journalistic contribution at 73.4% to 85.7% F-score. Evaluation against the majority annotator decision performs 13% better than a bag-of-words baseline.

Research paper thumbnail of Email Document Parsing Method and Apparatus


Research paper thumbnail of Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids

Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, 2010

@Book{CLW:2010, editor = {Michael Piotrowski and Cerstin Mahlow and Robert Dale}, title = {Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids}, month = {June}, year = {2010}, address = {Los Angeles, CA, USA}, publisher = {Association for Computational Linguistics}, url = {http://www.aclweb.org/anthology/W10-04} } @InProceedings{rosener:2010:CLW, author = {R\"{o}sener, Christoph}, title = {Computational Linguistics in the Translator's Workflow---Combining Authoring ...
