Self-training for Parsing Biomedical Literature with the Charniak Parser

Note: If you're looking for our biomedical event extraction software, please see this page instead.

I am now (June 16, 2009) distributing my division of the GENIA 1.0 trees in Penn Treebank format. You can download them here.

Using the above trees, I repeated the self-training experiments from our ACL 2008 paper with the GENIA 1.0 trees as the labeled data. This also allowed me to create a GENIA reranker. The results (on the dev set from my division) are quite dramatic; a sketch of the self-training recipe follows the table and its footnotes:

| Model | _f_-score |
|-------|-----------|
| WSJ | 74.9 |
| WSJ + WSJ reranker | 76.8 |
| WSJ + PubMed (parsed by WSJ) + WSJ reranker | 80.7 [1] |
| GENIA | 83.6 |
| GENIA + WSJ reranker | 84.5 |
| GENIA + GENIA reranker | 85.7 |
| GENIA + PubMed (parsed by GENIA) + GENIA reranker | 87.6 [2] |

[1] Original self-trained biomedical parsing model (ACL 2008)
[2] Improved self-trained biomedical parsing model (please see my thesis)
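
For readers who want the shape of the recipe, here is a minimal sketch of one self-training round. The `train_parser` and `parse_corpus` helpers are hypothetical stand-ins; the actual experiments used the Charniak parser's own training and parsing tools.

```python
def self_train(gold_trees, unlabeled_sentences, train_parser, parse_corpus):
    """One round of self-training, in the style of the ACL 2008 setup.

    train_parser and parse_corpus are hypothetical stand-ins for the
    Charniak parser's training and parsing machinery.
    """
    # 1. Train a base parser on the labeled treebank (e.g. GENIA 1.0 trees).
    base_parser = train_parser(gold_trees)
    # 2. Parse the unlabeled corpus (e.g. PubMed abstracts) with it.
    auto_trees = parse_corpus(base_parser, unlabeled_sentences)
    # 3. Retrain on the gold trees plus the automatically parsed trees.
    return train_parser(gold_trees + auto_trees)
```

The "WSJ + PubMed (parsed by WSJ)" and "GENIA + PubMed (parsed by GENIA)" rows in the table above correspond to this recipe with different labeled seeds.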

Improved self-trained biomedical parsing model

Available here. Please cite my thesis if you use this model.
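
As a usage sketch: newer releases of the BLLIP reranking parser ship a Python wrapper, and to the best of my knowledge this model is the one distributed there under the name `GENIA+PubMed`. The model name below is an assumption, so verify it against your installation's model list.

```python
from bllipparser import RerankingParser

# Download and load the self-trained biomedical model. 'GENIA+PubMed' is,
# as far as I know, the name this model carries in the bllipparser model
# list -- check your installation if the fetch fails.
rrp = RerankingParser.fetch_and_load('GENIA+PubMed', verbose=True)

print(rrp.simple_parse('Induction of NF-kappa-B requires protein kinase C.'))
```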

Original self-trained biomedical parsing model

Available here. This model is deprecated and is kept only for historical purposes.

The DATA/ directory is an alternate parser data directory, trained on WSJ plus 266,664 randomly collected biomedical abstracts from PubMed. Using the standard WSJ-trained reranker (included with the BLLIP reranking parser), this model achieves an _f_-score of 84.3% on the GENIA treebank beta 2 test set. For more details, please see our ACL 2008 paper.
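
If you want to drive this DATA/ directory from the Python wrapper rather than the command-line tools, a sketch along these lines should work; the paths below are hypothetical placeholders for the alternate DATA/ directory and the stock WSJ reranker files.

```python
from bllipparser import RerankingParser

rrp = RerankingParser()
# Point the first-stage parser at the alternate DATA/ directory
# (hypothetical path -- substitute wherever you unpacked it).
rrp.load_parser_model('/path/to/DATA')
# Use the standard WSJ-trained reranker shipped with the BLLIP parser
# (hypothetical paths to its feature and weight files).
rrp.load_reranker_model('/path/to/reranker/features.gz',
                        '/path/to/reranker/weights.gz')

print(rrp.simple_parse('MoDC were generated from peripheral blood monocytes.'))
```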

More information about self-training can be found in these papers: