Mark Liberman | University of Pennsylvania

Papers by Mark Liberman

Identifying and extracting malignancy types in cancer literature

MTag is an application for identifying and extracting clinical descriptions of malignancy presented in text. The application uses the machine learning technique Conditional Random Fields and incorporates domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Our experiments resulted in 0.85 precision, 0.82 recall, and 0.83 F-measure on the evaluation set. Availability: The software is available at http://bioie.

Perception of disfluency: language differences and listener bias

Proc. INTERSPEECH-2007, 2007

This paper describes a crosslinguistic disfluency perception experiment. We tested the recognizability of pause fillers and partial words in English, German and Mandarin. Subjects were speakers of English with no knowledge of Mandarin or German. We found that subjects could identify disfluent from fluent utterances at a level above chance. Pause fillers were easier to identify than partial words. Accuracy rates were highest for English, followed by German and then Mandarin. Although German accuracy rates were higher ...

Perception of Disfluency: Language Differences and Listener Bias

Proc. INTERSPEECH-2007, Jan 1, 2007

UNIPEN Project of on-Line Data Exchange and Recognizer Benchmarks

… , 1994. Vol. 2- …, Jan 1, 2002

We report the status of the UNIPEN project of data exchange and recognizer benchmarks started two years ago at the initiative of the International Association of Pattern Recognition (Technical Committee 11). The purpose of the project is to propose and implement solutions to the growing need for handwriting samples for on-line handwriting recognizers used by pen-based computers. Researchers from several companies and universities have agreed on a data format, a platform of data exchange and a protocol for recognizer benchmarks. The on-line handwriting data of concern may include handprint and cursive from various alphabets (including Latin and Chinese), signatures and pen gestures. These data will be compiled and distributed by the Linguistic Data Consortium. The benchmarks will be arbitrated by the US National Institute of Standards and Technology. We give a brief introduction to the UNIPEN format. We explain the protocol of data exchange and benchmarks.

Electric words: Dictionaries, computers, and meanings

Computers & Mathematics with Applications, 1996

A funny title--I surmise that it will often be misquoted as Electronic Words. Is there a hidden citation behind it? I haven't been able to trace it. Electric Words (henceforth EW, also used to refer jointly to the three authors) is a report on work done to, and with, machine-readable dictionaries, in particular LDOCE, the Longman Dictionary of Contemporary English (1978 edition). Machine-readable dictionaries are often nothing else than typesetters' tapes, a far cry from lexical data bases. Somewhere in the middle is EW's concept of machine-tractable dictionaries, where the information is formalized to a certain extent, the extent depending on the nature of the information itself, ranging from the easily formalizable (because simply listable: parts of speech, subcategorization codes) to unformalizable (genuine citations, i.e., unrestricted natural language).

On finding the iguana

Session 1: Speech and Natural Language Efforts in the U. S. and Abroad

Naacl, 1991

We see two purposes for this first session: increased communications among research communities in some danger of drifting apart, and a comparison of alternative goals and organizational structures for such communities. Obviously, a single hour-long session is no more than a symbolic gesture in this direction, even if the time had not been truncated further by schedule overruns pressing against an inflexible dinner hour, but we feel that the symbol was nevertheless a worthwhile and important one.

"Surviving Tough Times" 20th Annual Southern California Visitor Industry Outlook Conference

How Hard is Syntax (abstract)

Computational approaches to analyzing weblogs

The intonational structure of English

Error analysis and disfluency modeling in the Switchboard domain

Tutorial on Text Corpora

Electric Words: Dictionaries, Computers, and Meanings


Variation and change in the use of hesitation markers in Germanic languages

Language Dynamics and Change


Lessons for Reproducible Science from the DARPA Speech and Language Program

Annual Review of Linguistics Volume 1, 2015 Introduction

A Status Report on the ACL/DCI

Computational Linguistics, 1991

Organizing Committee

Active Dependency Formation in the Processing of Backwards Anaphora
