Mark Liberman | University of Pennsylvania
Papers by Mark Liberman
MTag is an application for identifying and extracting clinical descriptions of malignancy presented in text. The application uses the machine learning technique Conditional Random Fields and incorporates domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Our experiments resulted in 0.85 precision, 0.82 recall, and 0.83 F-measure on the evaluation set. Availability: The software is available at http://bioie.
Proc. INTERSPEECH-2007, 2007
This paper describes a crosslinguistic disfluency perception experiment. We tested the recognizability of pause fillers and partial words in English, German and Mandarin. Subjects were speakers of English with no knowledge of Mandarin or German. We found that subjects could identify disfluent from fluent utterances at a level above chance. Pause fillers were easier to identify than partial words. Accuracy rates were highest for English, followed by German and then Mandarin. Although German accuracy rates were higher ...
Proc. INTERSPEECH-2007, Jan 1, 2007
… , 1994. Vol. 2- …, Jan 1, 2002
We report the status of the UNIPEN project of data exchange and recognizer benchmarks, started two years ago at the initiative of the International Association for Pattern Recognition (Technical Committee 11). The purpose of the project is to propose and implement solutions to the growing need for handwriting samples for on-line handwriting recognizers used by pen-based computers. Researchers from several companies and universities have agreed on a data format, a platform of data exchange and a protocol for recognizer benchmarks. The on-line handwriting data of concern may include handprint and cursive from various alphabets (including Latin and Chinese), signatures and pen gestures. These data will be compiled and distributed by the Linguistic Data Consortium. The benchmarks will be arbitrated by the US National Institute of Standards and Technology. We give a brief introduction to the UNIPEN format. We explain the protocol of data exchange and benchmarks.
Computers & Mathematics with Applications, 1996
A funny title--I surmise that it will often be misquoted as Electronic Words. Is there a hidden citation behind it? I haven't been able to trace it. Electric Words (henceforth EW, also used to refer jointly to the three authors) is a report on work done to, and with, machine-readable dictionaries, in particular LDOCE, the Longman Dictionary of Contemporary English (1978 edition). Machine-readable dictionaries are often nothing more than typesetters' tapes, a far cry from lexical data bases. Somewhere in the middle is EW's concept of machine-tractable dictionaries, where the information is formalized to a certain extent, the extent depending on the nature of the information itself, ranging from the easily formalizable (because simply listable: parts of speech, subcategorization codes) to the unformalizable (genuine citations, i.e., unrestricted natural language).
Naacl, 1991
We see two purposes for this first session: increased communications among research communities in some danger of drifting apart, and a comparison of alternative goals and organizational structures for such communities. Obviously, a single hour-long session is no more than a symbolic gesture in this direction, even if the time had not been truncated further by schedule overruns pressing against an inflexible dinner hour, but we feel that the symbol was nevertheless a worthwhile and important one.
Language Dynamics and Change
Computational Linguistics, 1991