Dana Movshovitz-Attias

Ph.D., Computer Science Department
Carnegie Mellon University

Welcome!

I'm now at Google Research. I received a Ph.D. from the Computer Science Department, in the School of Computer Science at Carnegie Mellon University. My PhD adviser was William Cohen.

I am interested in the intersection of Natural Language Processing, Information Retrieval, and Machine Learning. My research experience includes the following topics:
grounded language learning, learning semantic relations, topic models, mining software repositories and software-focused corpora, bootstrapping on biomedical ontologies, knowledge base population, bootstrap learning and semantic drift, seed set refinement, text alignment with Hidden Markov Models, social media analysis, and computational biology.

Before coming to CMU, I got my M.Sc. and B.Sc. degrees in the Computer Science and Computational Biology program at the School of Computer Science and Engineering of The Hebrew University of Jerusalem. During that time, I did research at the Furman Lab (Dept. of Molecular Genetics and Biotechnology), and my adviser was Prof. Ora Schueler-Furman. In this group, we used computational methods to understand protein-protein interactions from a structural bioinformatics perspective. More specifically, we made predictions of the structural changes that take place in proteins during docking.

Apart from doing research, I had a chance to get some great industry experience working for IBM, Facebook, and Google.

You can find my full research and work history in my CV.

About me

One of the things I enjoy most is hiking and traveling around the world. So far one of my favorite hiking locations has been New Zealand, and I plan to return! I have an awesome husband, who was also a CSD PhD student at CMU.

PhD Thesis: Grounded Knowledge Bases for Scientific Domains
Dana Movshovitz-Attias, August 2015
Committee: William Cohen, Tom Mitchell, Roni Rosenfeld, Alon Halevy
[pdf] [Thesis oral presentation]

KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts
Dana Movshovitz-Attias and William Cohen, 2015, Association for Computational Linguistics (ACL)
[pdf] [data] [ACL presentation] [bibtex]

Discovering Subsumption Relationships for Web-Based Ontologies
Dana Movshovitz-Attias, Steven Euijong Whang, Natalya Noy, and Alon Halevy, 2015, Proc. 18th International Workshop on the Web and Databases (WebDB) at ACM SIGMOD
Winner of the WebDB Best Paper Award.
[pdf] [WebDB presentation] [bibtex]

Grounded Discovery of Coordinate Term Relationships between Software Entities
Dana Movshovitz-Attias and William Cohen, 2015, arXiv preprint arXiv:1505.00277
[pdf] [arXiv link] [bibtex]

Natural Language Models for Predicting Programming Comments
Dana Movshovitz-Attias and William Cohen, 2013, Association for Computational Linguistics (ACL)
[pdf] [corpus] [code (as Eclipse plugin)] [ACL presentation] [bibtex]

Analysis of the Reputation System and User Contributions on a Question Answering Website: StackOverflow
Dana Movshovitz-Attias*, Yair Movshovitz-Attias*, Peter Steenkiste and Christos Faloutsos, 2013, ASONAM
[pdf] [bibtex]

Alignment-HMM-based Extraction of Abbreviations from Biomedical Text
Dana Movshovitz-Attias and William Cohen, 2012, BioNLP in NAACL
[pdf] [github code (within the second-string package)] [code description and downloadable data] [abbreviations extracted from PubMed] [BioNLP presentation] [bibtex]

Bootstrapping Biomedical Ontologies for Scientific Text using NELL
Dana Movshovitz-Attias and William Cohen, 2012, BioNLP in NAACL
[pdf] [tech report] [BioNLP presentation] [bibtex]

Detection of Peptide‐Binding Sites on Protein Surfaces: The First Step Towards the Modeling and Targeting of Peptide‐Mediated Interactions
Assaf Lavi, Chi Ho Ngan, Dana Movshovitz‐Attias, Tanggis Bohnuud, Christine Yueh, Dmitri Beglov, Ora Schueler‐Furman, Dima Kozakov, 2013, Proteins: Structure, Function and Bioinformatics
[pdf] [bibtex]

Can Self-Inhibitory Peptides Be Derived from the Interfaces of Globular Protein-Protein Interactions?
Nir London, Barak Raveh, Dana Movshovitz-Attias and Ora Schueler-Furman, 2010, Proteins: Structure, Function and Bioinformatics
[pubmed] [bibtex]

On The Use of Structural Templates for High-Resolution Docking
Dana Movshovitz-Attias, Nir London and Ora Schueler-Furman, 2010, Proteins: Structure, Function and Bioinformatics
[pdf] [pubmed] [bibtex]
Poster presented at the 11th Israeli Bioinformatics Symposium at Tel-Aviv University, Israel, 4/2008.

The Structural Basis of Peptide-Protein Binding Strategies
Nir London, Dana Movshovitz-Attias and Ora Schueler-Furman, 2010, Structure
[pdf] [pubmed] [bibtex]
Poster presented at the 12th Israeli Bioinformatics Symposium at Weizmann Institute, Israel, 4/2009.

Software, Code, and Data

Code, tools, and research-related data.
If you have questions about this content, or if there is other data you would like to use, please contact me at: dma [at] cs.cmu.edu

KnowledgeBase-LDA (KB-LDA) Data

Dataset based on StackOverflow that was used to train the KB-LDA model from our ACL 2015 paper.

Training Data:

The KB-LDA dataset [tar] was extracted from StackOverflow, parsed and cleaned. SVO and concept-instance relations were extracted based on the full data. We also include a clean list of the noun and verb tokens used from a sample of ~60K documents. The included files are:

so2013_svo_clean.csv

Subject-verb-object tuples (37k) extracted from the StackOverflow corpus.

Format: id,subject,verb,object,count

so2013_hypernyms_clean.csv

Concept-instance pairs (17k) extracted from the StackOverflow corpus.

Format: id,concept,instance,count

so2013_document_nouns_clean.csv

Document nouns (1.3m) from a sample of the StackOverflow corpus.

Format: question_id,noun,count

so2013_document_verbs_clean.csv

Document verbs (880k) from a sample of the StackOverflow corpus.

Format: question_id,verb,count
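
As a rough illustration of how the comma-separated formats listed above might be read, here is a minimal Python sketch for the SVO file. It relies only on the column order given for so2013_svo_clean.csv. The class name `SvoTuple`, the function `read_svo`, and the `has_header` flag are hypothetical names introduced here, and whether the real file starts with a header row is an assumption you should check against the data. The two sample rows are made up for the example and are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class SvoTuple:
    """One row in the documented order: id,subject,verb,object,count."""
    row_id: str
    subject: str
    verb: str
    obj: str
    count: int


def read_svo(lines: Iterable[str], has_header: bool = False) -> Iterator[SvoTuple]:
    """Yield SvoTuple records from CSV lines.

    has_header is an assumption: set it to True if the file turns out
    to begin with a header row.
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row
    for row in reader:
        if len(row) != 5:
            continue  # skip rows that do not match the documented format
        row_id, subject, verb, obj, count = row
        yield SvoTuple(row_id, subject, verb, obj, int(count))


# Made-up rows for illustration only; not taken from the dataset.
sample = io.StringIO("1,user,click,button,12\n2,method,return,value,7\n")
tuples = list(read_svo(sample))
```

To read the actual file, pass an open file handle instead of the sample, for example `open("so2013_svo_clean.csv", newline="", encoding="utf-8")`. The other three files have fewer columns but could be handled the same way.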

Learned Software Knowledge Base:

The following software knowledge base [zip] was learned with KB-LDA, and is evaluated in the paper. The data includes the learned topics, topic names (concepts), a topic hierarchy, and the top 100 learned relations. The README file details the format of the files and how the data was extracted.

Paper:

Dana Movshovitz-Attias and William Cohen, KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts, ACL, 2015

Update: July 23, 2015

Abbreviation Alignment HMM

This is an abbreviation extractor based on a Hidden Markov Model. With this code you can extract abbreviations and their definitions from a text corpus. The Abbreviation Alignment HMM code is a part of the second-string open source package.

Main Classes:

Code:

github code (within the second-string package).

Data: Abbreviations Extracted from PubMed

Using this method we extracted 1.4 million abbreviations from a corpus of 200K full text PubMed articles. The extracted abbreviations are available here.

Each line in this file contains:

  1. The probability of the abbreviation as given by the HMM
  2. The ID of the document in our corpus
  3. The Medline ID of the original text
  4. The Short Form of the abbreviation
  5. The Long Form of the abbreviation
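
The five fields above could be parsed with a short Python sketch like the following. The field delimiter is not stated here, so the tab separator is an assumption (exposed as the `sep` parameter). The names `AbbreviationRecord` and `parse_abbreviation_line` are hypothetical, and the example line is made up rather than taken from the extracted data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbbreviationRecord:
    """Fields in the order listed for each line of the abbreviations file."""
    probability: float  # probability of the abbreviation as given by the HMM
    doc_id: str         # ID of the document in the corpus
    medline_id: str     # Medline ID of the original text
    short_form: str
    long_form: str


def parse_abbreviation_line(line: str, sep: str = "\t") -> AbbreviationRecord:
    """Parse one line; the tab separator is an assumption, not a documented fact."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    prob, doc_id, medline_id, short_form, long_form = fields
    return AbbreviationRecord(float(prob), doc_id, medline_id, short_form, long_form)


# Made-up line for illustration only.
example = parse_abbreviation_line("0.93\tdoc42\t12345678\tHMM\tHidden Markov Model\n")
```

If the file uses a different delimiter, pass it as `sep`. Filtering on `probability` is one way to keep only high-confidence abbreviation pairs.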

Paper:

Dana Movshovitz-Attias and William Cohen, Alignment-HMM-based Extraction of Abbreviations from Biomedical Text, BioNLP in NAACL, 2012

Software Update: Sep 18, 2012

CMU

Courses and TA experience

TA at CMU

Spring 2014

Fall 2012

Contact

Email

dma [at] cs.cmu.edu

Office

GHC 7513

Office Phone

+1-412-268-3066

Address

Computer Science Department,
Carnegie Mellon University,
5000 Forbes Avenue,
Pittsburgh, PA 15213

CV

PDF