Vitor R. Carvalho - Software and Datasets

Software and Datasets

Software: Jangada

Jangada is an API for extracting signature blocks and reply lines from email messages. It follows the approach of the paper "Learning to Extract Signature and Reply Lines from Email" (CEAS 2004), but performance was slightly improved by using a new set of features not mentioned in the original reference.

Some Features:

Extracts signature blocks and reply lines from email messages with very good accuracy.

Can be easily integrated into other Java applications (for instance, the entire email message can be passed in as a String).

Can be easily integrated into other Minorthird applications (using the TextLabels format, it accepts email messages that already carry other annotations, such as dates, personal names, speech acts, etc.).
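Since the entire message can be supplied as a String, the calling application first has to load each email file. Here is a minimal sketch of that loading step only. It assumes Java 11 or later (newer than the j2sdk1.4 minimum listed below), and `EmailLoader` is a hypothetical helper name, not part of Jangada. The Jangada call itself is left out on purpose, because its API is described only in the javadocs and the example files. ISO-8859-1 is an assumed encoding, chosen because it decodes any byte sequence without failing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Jangada.
final class EmailLoader {
    private EmailLoader() {}

    /** Reads a whole email file (headers and body) into one String. */
    static String readMessage(Path file) throws IOException {
        // ISO-8859-1 maps every byte to a char, so decoding never fails on raw mail.
        return new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            String message = readMessage(Path.of(name));
            // 'message' is the String that would be handed to the extraction API.
            System.out.println(name + ": " + message.length() + " characters");
        }
    }
}
```

Run it with the email file names as arguments; each message is read in full, so the headers stay available to whatever processes the String next.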

**Licensing:** University of Illinois/NCSA Open Source License

Documentation: Very poor. An initial javadocs page is here. There is some documentation on how to use Jangada in the example files below.

**Requires:** j2sdk1.4 or later. Uses MinorThird.jar.

Recommended: When using email files as input, results will be better if the messages are in MIME (.eml) format.

Usage example:

  1. create a new directory (for instance, jangadaDir)

  2. download jangada.jar, minorThird.jar, the example files, and the email files to jangadaDir

  3. Unzip (gunzip Demos.tar.gz) and untar (tar -xvf Demos.tar) the example files, and do the same for the email files.

  4. add jangadaDir, jangadaDir/minorThird.jar and jangadaDir/jangada.jar to the CLASSPATH

  5. For a quick demo:

     a. compile the example files, for instance: "javac Demo2.java" (in case of errors, please check your CLASSPATH again)

     b. run the examples on the email files directory: "java Demo2 emails/*"

  6. Check the documentation in the DemoX.java files and try your own application.

Reminder 1: If you'd like to have access to the source code, please send me an email.

Reminder 2: If you use this package, please cite the following reference:

Learning to Extract Signature and Reply Lines from Email, Vitor R. Carvalho and William W. Cohen, CEAS-2004 (Conference on Email and Anti-Spam), Mountain View, CA, July 2004

Software: Ciranda

A Java application that predicts the email acts (or email speech acts) of email messages. It follows the approach of the following papers (emnlp04 and sigir05), but performance was significantly improved by careful feature selection and additional features.

Some Features:

Predicts the following acts: Request, Commit, Deliver, Propose, Meet, dData.

Provides the confidence in each prediction.

Easy way to use these acts as features in your application.
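One way to use the predicted acts as features is to place the per-act confidences in a fixed-order numeric vector. The sketch below shows only that conversion. `ActFeatures` and `toVector` are hypothetical names, not part of Ciranda, and the confidence map is filled with made-up values standing in for whatever the real predictions return (see Example.java for the actual calls). It assumes Java 9 or later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Ciranda.
final class ActFeatures {
    // Act names as listed above, in a fixed order so feature indices stay stable.
    static final List<String> ACTS =
        List.of("Request", "Commit", "Deliver", "Propose", "Meet", "dData");

    private ActFeatures() {}

    /** Turns per-act confidences into a dense vector; missing acts become 0.0. */
    static double[] toVector(Map<String, Double> confidences) {
        double[] features = new double[ACTS.size()];
        for (int i = 0; i < ACTS.size(); i++) {
            features[i] = confidences.getOrDefault(ACTS.get(i), 0.0);
        }
        return features;
    }

    public static void main(String[] args) {
        // Made-up confidences standing in for real predictions.
        Map<String, Double> confidences = new LinkedHashMap<>();
        confidences.put("Request", 0.75);
        confidences.put("Meet", 0.5);
        System.out.println(Arrays.toString(toVector(confidences)));
    }
}
```

Keeping the act order fixed means the same index always refers to the same act, so the vector can be appended to an application's other features.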

**Licensing:** No guarantees are provided. Lots of bugs for sure. Use at your own risk!

Documentation: Very poor. An initial javadocs page is here. Please check Example.java on how to use it.

**Requires:** j2sdk1.4 or later. Uses MinorThird.jar (see below)

**Questions:** I'll be happy to help, especially if you tell me what a good Ciranda is... :-)

Usage example:

  1. create a new directory called ciranda, and a subdirectory ciranda/lib

  2. download ciranda.jar and minorThird.jar to ciranda/lib

  3. add ciranda/ and lib/ciranda.jar to the CLASSPATH

  4. download the example file Example.java to ciranda/

  5. compile it: "javac Example.java" (in case of errors, please check your CLASSPATH again)

  6. run the example: "java Example"

  7. or run the main application on a directory with emails in text format (without headers):

     a. create the test directory ciranda/testdir

     b. add some emails in text format (such as msg1, msg2, msg3) to ciranda/testdir

     c. run "java -jar lib/ciranda.jar testdir"

  8. or try your own application.

**Reminder:** Send me an email if you'd like the source code. If you use this package, please cite the following reference:

Learning to Classify Email into "Speech Acts", William W. Cohen, Vitor R. Carvalho and Tom M. Mitchell, EMNLP-2004 (Conference on Empirical Methods in Natural Language Processing), Barcelona, Spain, July 2004

**Dataset:** Signature and Reply Dataset [Datasets in Minorthird Format]

These 617 email messages are annotated with signature lines and reply-to lines. The messages are a subset of the 20 Newsgroups dataset (produced by Ken Lang at CMU in the mid-90s).
