GitHub - dmcc/PyStanfordDependencies: Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Dependencies

PyStanfordDependencies

Python interface for converting Penn Treebank trees to Universal Dependencies and Stanford Dependencies.

Example usage

Start by getting a StanfordDependencies instance with StanfordDependencies.get_instance():

    >>> import StanfordDependencies
    >>> sd = StanfordDependencies.get_instance(backend='subprocess')

get_instance() takes several options. backend can currently be subprocess or jpype (see below). If you have an existing Stanford CoreNLP or Stanford Parser jar file, use the jar_filename parameter to point to the full path of the jar file. Otherwise, PyStanfordDependencies will download a jar file for you and store it locally (~/.local/share/pystanforddeps). You can request a specific version with the version parameter, e.g., version='3.4.1'.

To convert trees, use the convert_tree() or convert_trees() method (note that by default, convert_trees() can be considerably faster if you're doing batch conversion). These return a sentence (list of Token objects) or a list of sentences (list of list of Token objects) respectively:

    >>> sent = sd.convert_tree('(S1 (NP (DT some) (JJ blue) (NN moose)))')
    >>> for token in sent:
    ...     print(token)
    ...
    Token(index=1, form='some', cpos='DT', pos='DT', head=3, deprel='det')
    Token(index=2, form='blue', cpos='JJ', pos='JJ', head=3, deprel='amod')
    Token(index=3, form='moose', cpos='NN', pos='NN', head=0, deprel='root')

This tells you that moose is the head of the sentence and is modified by some (with a det = determiner relation) and blue (with an amod = adjective modifier relation). Fields on Token objects are readable as attributes. See docs for additional options in convert_tree() and convert_trees().
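
Because each token's head field holds the index of its governor (0 for the root), the structure can be walked with ordinary Python. The sketch below does not call PyStanfordDependencies: it uses a stand-in Token namedtuple (an assumption that models only the fields printed above) and a hypothetical helper, dependents_by_head(), to group tokens by their head.

```python
from collections import defaultdict, namedtuple

# Stand-in for the library's Token class (assumption: only the fields
# printed in the example above are modelled here).
Token = namedtuple('Token', 'index form cpos pos head deprel')

# The sentence from the example above, written out by hand.
sent = [
    Token(1, 'some', 'DT', 'DT', 3, 'det'),
    Token(2, 'blue', 'JJ', 'JJ', 3, 'amod'),
    Token(3, 'moose', 'NN', 'NN', 0, 'root'),
]


def dependents_by_head(tokens):
    """Map each head index to the tokens it governs (0 is the artificial root)."""
    children = defaultdict(list)
    for token in tokens:
        children[token.head].append(token)
    return dict(children)


children = dependents_by_head(sent)
root = children[0][0]  # the single token whose head is 0
print(root.form)
for dep in children[root.index]:
    print('%s --%s--> %s' % (root.form, dep.deprel, dep.form))
```

With a sentence returned by convert_tree(), the same loop should apply as long as the tokens expose head, index, form and deprel as attributes, which the text above says they do.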

Visualization

If you have the asciitree package, you can use a prettier ASCII formatter:

    >>> print(sent.as_asciitree())
    moose [root]
     +-- some [det]
     +-- blue [amod]

If you have Python 2.7 or later, you can use Graphviz to render your graphs. You'll need the Python graphviz package to call as_dotgraph():

    >>> dotgraph = sent.as_dotgraph()
    >>> print(dotgraph)
    digraph {
        0 [label=root]
        1 [label=some]
            3 -> 1 [label=det]
        2 [label=blue]
            3 -> 2 [label=amod]
        3 [label=moose]
            0 -> 3 [label=root]
    }
    >>> dotgraph.render('moose')  # renders a PDF by default
    'moose.pdf'
    >>> dotgraph.format = 'svg'
    >>> dotgraph.render('moose')
    'moose.svg'
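
To make the mapping from token fields to graph edges concrete, here is a small sketch that builds comparable DOT text by hand, without the graphviz package. The Token namedtuple and the to_dot() and quote() helpers are hypothetical stand-ins; as_dotgraph() itself may build its output differently. Unlike the output above, this sketch quotes every label so that word forms containing punctuation stay valid DOT.

```python
from collections import namedtuple

# Stand-in for the library's Token class (assumption: same fields as the
# printed tokens earlier in this README).
Token = namedtuple('Token', 'index form cpos pos head deprel')

sent = [
    Token(1, 'some', 'DT', 'DT', 3, 'det'),
    Token(2, 'blue', 'JJ', 'JJ', 3, 'amod'),
    Token(3, 'moose', 'NN', 'NN', 0, 'root'),
]


def quote(text):
    """Wrap text in double quotes, escaping backslashes and quotes for DOT."""
    return '"%s"' % text.replace('\\', '\\\\').replace('"', '\\"')


def to_dot(tokens):
    """Return DOT source with one node per token and one edge per head link."""
    lines = ['digraph {', '\t0 [label=%s]' % quote('root')]
    for token in tokens:
        # Node for the token itself, labelled with its word form.
        lines.append('\t%d [label=%s]' % (token.index, quote(token.form)))
        # Edge from the governor (head) to this token, labelled with deprel.
        lines.append('\t%d -> %d [label=%s]'
                     % (token.head, token.index, quote(token.deprel)))
    lines.append('}')
    return '\n'.join(lines)


dot_source = to_dot(sent)
print(dot_source)
```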

The Python xdot package provides an interactive visualization:

    >>> import xdot
    >>> window = xdot.DotWindow()
    >>> window.set_dotcode(dotgraph.source)

Both as_asciitree() and as_dotgraph() allow customization. See the docs for additional options.

Backends

Currently PyStanfordDependencies includes two backends, subprocess and jpype.

By default, PyStanfordDependencies will attempt to use the jpype backend. If jpype isn't available or crashes on startup, PyStanfordDependencies will fall back to subprocess with a warning.
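
The fallback behaviour described above can be sketched generically as follows. This is an illustration of the pattern, not the library's actual code: get_with_fallback() and fake_factory() are hypothetical names, and fake_factory() merely simulates a missing jpype so the example runs without Java or PyStanfordDependencies installed.

```python
import warnings


def get_with_fallback(factory, preferred='jpype', fallback='subprocess',
                      **options):
    """Call factory(backend=preferred); on failure, warn and use fallback."""
    try:
        return factory(backend=preferred, **options)
    except Exception as error:  # e.g. ImportError when jpype is missing
        warnings.warn('%s backend unavailable (%s); falling back to %s'
                      % (preferred, error, fallback))
        return factory(backend=fallback, **options)


def fake_factory(backend, **options):
    """Hypothetical stand-in for get_instance() that pretends jpype is missing."""
    if backend == 'jpype':
        raise ImportError('No module named jpype')
    return 'converter using %s' % backend


# Record warnings so the fallback can be observed instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    converter = get_with_fallback(fake_factory)

print(converter)
print(len(caught))
```

Since get_instance() already falls back by default, a wrapper like this would only be worth writing if you want different handling, for example logging the failure or refusing to fall back at all.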

Universal Dependencies status

PyStanfordDependencies supports most features in Universal Dependencies (see issue #10 for the most up-to-date status). PyStanfordDependencies output matches Universal Dependencies in terms of structure and dependency labels, but Universal POS tags and features are missing. Currently, PyStanfordDependencies will output Universal Dependencies by default (unless you're using Stanford CoreNLP 3.5.1 or earlier).

More information

Licensed under Apache 2.0.

Written by David McClosky (homepage, code)

Bug reports and feature requests: GitHub issue tracker

Release summaries