Large Scale Syntactic Annotation of written Dutch
LASSY
... he refused to be a dog just like Lassy was ...
LASSY (Large Scale Syntactic Annotation of written Dutch) was a STEVIN project. STEVIN was the Flemish-Dutch Language and Speech Processing Technology Programme launched by the Nederlandse Taalunie.
A large corpus of written Dutch texts (1,000,000 words) has been syntactically annotated, with manual correction, based on CGN and D-COI. In addition, a very large corpus (more than 700,000,000 words) has been syntactically annotated automatically. The project extends the available syntactically annotated corpora for Dutch both in size and with respect to the text genres and topical domains covered. In addition, various browse and search tools for syntactically annotated corpora have been developed and made available. Their potential for applications in corpus linguistics and information extraction is illustrated and evaluated in a series of case studies.
Partners
Lassy was carried out by a consortium consisting of the University of Groningen and the Katholieke Universiteit Leuven. Researchers involved in the project include:
Lassy Initiatives
- Lassy sponsored the invited lecture of Anette Frank at the ACL 2007 workshop Deep Linguistic Processing, June 28, 2007 in Prague. Further information is available from the DLP website.
- Lassy initiated the local organization of TLT7: the Seventh International Conference on Treebanks and Linguistic Theories, January 23-24, 2009 in Groningen. Further information can be obtained from the TLT7 webpage.
- Lassy sponsored an invited keynote lecture by Ken Church (Microsoft Research) at the 30th anniversary TaBu symposium on June 11 and 12, 2009 in Groningen. Further information is available from the conference website.
- Lassy sponsored a workshop entitled Distributional Semantics Workshop, on June 23, 2010 in Groningen. More information can be obtained from the workshop webpage. The sponsorship made possible invited presentations at the workshop by Yves Peirsman (Leuven), Sophia Katrenko (Amsterdam) and Diarmuid Ó Séaghdha (Cambridge).
List of Resources
Descriptions of the project
- Project Proposal
- Short project description in DIXIT (in Dutch)
- A0 portrait poster (June 2008)
- A0 landscape poster (September 2008)
Annotation Manuals
DTD for Lassy XML files
- Use your right mouse button to save the following link: DTD for Lassy Dependency Structures
Tools for Lassy
- DACT, an easy-to-use corpus tool for Lassy corpora, developed by Daniel de Kok, with help from Jelmer van der Linde.
- Command-line tools with functionality similar to DACT, developed by Daniel de Kok, with help from Jelmer van der Linde, Lars Buitinck and Peter Kleiweg.
- GrETEL, another tool for querying Lassy treebanks, developed by Liesbeth Augustinus.
- Peter's version of Erik's Search Application, web application for searching pairs of words, initially developed by Erik Tjong Kim Sang, further developed by Peter Kleiweg
- Alpino parser
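The tools listed above search treebanks stored as XML files. As a rough illustration of what such a search amounts to, the following Python sketch selects nodes by dependency relation with a simple XPath expression. It is not taken from any of the tools themselves. The element and attribute names (`alpino_ds`, `node`, `rel`, `cat`, `word`) are assumptions about the file layout and should be checked against the DTD, and the sample sentence and the helper `words_with_relation` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample tree. The element and attribute names
# are assumptions about the Lassy XML layout; verify them against the DTD.
SAMPLE = """<alpino_ds version="1.3">
  <node begin="0" end="3" cat="top" rel="top" id="0">
    <node begin="0" end="3" cat="smain" rel="--" id="1">
      <node begin="0" end="1" rel="su" pos="noun" word="Lassy" id="2"/>
      <node begin="1" end="2" rel="hd" pos="verb" word="annoteert" id="3"/>
      <node begin="2" end="3" rel="obj1" pos="noun" word="zinnen" id="4"/>
    </node>
  </node>
  <sentence>Lassy annoteert zinnen</sentence>
</alpino_ds>"""


def words_with_relation(xml_text, rel):
    """Return the words of all leaf nodes carrying the given dependency relation."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, enough for attribute tests.
    matches = root.findall(f".//node[@rel='{rel}']")
    # Phrasal nodes have no 'word' attribute, so they are skipped here.
    return [n.get("word") for n in matches if n.get("word") is not None]


print(words_with_relation(SAMPLE, "su"))
```

Queries over a whole corpus would apply the same kind of expression to every file; the dedicated tools are the better choice for anything beyond small experiments.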
Some annotated sentences
In Lassy two treebanks have been delivered. The treebanks can be obtained from the TST-Centrale.
- Lassy Small is a 1 million word corpus with manually verified syntactic annotations. Lassy Small contains, among other material, a subset of SONAR500, but for historical reasons the identifiers of some of the sentences are different. An overview is given here.
- Lassy Large is a 700 million word corpus with automatically assigned syntactic annotations. The Wikipedia part is available on-line, as an example. Lassy Large contains the following corpora:
  - Eindhoven corpus. 40 thousand sentences, 713 thousand tokens.
  - EMEA corpus. Over 1 million sentences, 13 million tokens.
  - Europarl corpus. Over 1 million sentences, 37 million tokens.
  - Wikipedia dump of 2011. 9 million sentences, 145 million tokens.
  - Senseval corpus of Dutch. 12 thousand sentences, 156 thousand tokens.
  - SONAR500 corpus. 41 million sentences, 510 million tokens.
  - Small corpus including the annual "Troonrede" of Queen Beatrix since 1990.
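The per-corpus token counts listed above can be summed to check that they are consistent with the stated size of roughly 700 million words. This is a minimal sketch in Python that uses only the rounded figures given in the list. The Troonrede corpus is left out because no size is given for it, and the variable names are illustrative.

```python
# Rounded token counts as listed for the parts of Lassy Large.
# The Troonrede corpus is omitted: the list gives no size for it.
tokens_per_corpus = {
    "Eindhoven": 713_000,
    "EMEA": 13_000_000,
    "Europarl": 37_000_000,
    "Wikipedia 2011": 145_000_000,
    "Senseval": 156_000,
    "SONAR500": 510_000_000,
}

total_tokens = sum(tokens_per_corpus.values())
print(f"{total_tokens:,} tokens in total")  # 705,869,000 tokens in total

# Share of each corpus in the total, largest first.
for name, count in sorted(tokens_per_corpus.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {count / total_tokens:6.1%}")
```

The sum of about 706 million tokens matches the "700 million word" description, with SONAR500 accounting for most of the material.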
User Manuals
- DACT user manual
- DACT cookbook
- User Guide: How to use the Alpino/D-Coi/Lassy Treebank Tools
- User Guide: How to annotate with Alpino (out of date)
- User Guide: How to use Alpino
Deliverables
- Deliverable 1.1 (report)
- Deliverable 1.2 (report)
- Deliverables 2.1, 2.2, 3.1, 3.2 and 3.3 were collectively available as the first release of the Lassy corpus. Current releases are available from the TST-Centrale; later in 2016 this will be the Instituut voor Nederlandse Taal (INT).
- Deliverable 3.4 (report)
- Deliverable 3.5 (manual)
- Deliverable 4.1 is an improved version of Alpino. The most recent version of Alpino is found here.
- Deliverable 4.2 (report)
- Deliverable 5.1 (report)
- Deliverable 5.2 (report)
- Deliverable 5.2 is the first release of the tools. These are part of the Alpino distribution available here.
- Deliverable 6.1 (case study 1)
- Deliverable 6.2 (case study 2)
- Deliverable 6.3 (case study 3)
Internal stuff
- Wiki for STEVIN projects
- Progress report 1 (pdf)
- Progress report 2 (pdf)
- Appendix to Progress report 2 (xls)
- Progress report 3 (pdf)
- Appendix to Progress report 3 (xls)
- Progress report 4 (pdf)
- Appendix to Progress report 4 (xls)
- Progress report 5 (pdf)
- Appendix to Progress report 5 (xls)
- Progress report 6 (pdf)
- Appendix to Progress report 6 (xls)
Publications about Lassy
- Gertjan van Noord, Ineke Schuurman, Vincent Vandeghinste. Syntactic Annotation of Large Corpora in STEVIN. In: LREC 2006. [pdf]
- Gosse Bouma and Geert Kloosterman. Mining Syntactically Annotated Corpora with XQuery. In: LAW 2007, Prague. [pdf]
- Martijn Wieling, Mark-Jan Nederhof, Gertjan van Noord. Parsing Partially Bracketed Input. In: CLIN 2005. Proceedings of the 16th Meeting of Computational Linguistics in the Netherlands. Pages 1--16. 2007. [pdf]
- Nelleke Oostdijk, Martin Reynaert, Paola Monachesi, Gertjan van Noord, Roland Ordelman, Ineke Schuurman, Vincent Vandeghinste. From D-Coi to SoNaR: A reference corpus for Dutch. In: LREC 2008. [pdf]
- Gertjan van Noord. Huge Parsed Corpora in LASSY. In: Frank van Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors), Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The Netherlands. LOT Occasional Series. [LOT site]
- Frank van Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors), Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The Netherlands. LOT Occasional Series. [LOT site]
- Ineke Schuurman, Veronique Hoste and Paola Monachesi. Cultivating Trees: Adding Several Semantic Layers to the Lassy Treebank in SoNaR. In: Frank van Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors), Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The Netherlands. LOT Occasional Series. [LOT site]
- Gertjan van Noord and Gosse Bouma. Parsed Corpora for Linguistics. In: Proceedings of EACL Workshop The Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous? Athens, 2009. pp 33-39. [pdf]
- Gertjan van Noord, Gosse Bouma, Frank van Eynde, Daniel de Kok, Jelmer van der Linde, Ineke Schuurman, Erik Tjong Kim Sang, Vincent Vandeghinste. Large Scale Syntactic Annotation of Written Dutch: Lassy. In: STEVIN volume to be published by Springer.
Research which makes use of the Lassy treebanks
The list below does not include the publications of related STEVIN projects that build upon the Lassy corpora, such as SoNaR, Paco-MT, and DPC.
- Gertjan van Noord. Using Self-Trained Bilexical Preferences to Improve Disambiguation Accuracy. In: IWPT2007, Prague, 2007. [pdf]
- Gosse Bouma, Jori Mur, Gertjan van Noord, Lonneke van der Plas, Jörg Tiedemann. Question Answering with Joost at CLEF 2008. CLEF 2008 Working Notes. Aarhus, Denmark, 2008. [pdf]
- Barbara Plank and Gertjan van Noord. Exploring An Auxiliary Distribution based approach to Domain Adaptation of a Syntactic Disambiguation Model. In: Coling Workshop 'Cross Framework and Cross Domain Parser Evaluation'. 2008. [pdf]
- Gosse Bouma, Geert Kloosterman, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann. Question Answering with Joost at CLEF 2007. In: Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Mueller, Douglas W. Oard, Anselmo Penas, Vivien Petras, Diana Santos (editors), Advances in Multilingual and Multimodal Information Retrieval, 8th workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers. Lecture Notes in Computer Science 5152, Springer 2008. pp 257-260.
- Erik Tjong Kim Sang. To Use a Treebank or Not - Which Is Better for Hypernym Extraction? In: Frank van Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors), Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The Netherlands. LOT Occasional Series. [LOT site]
- Anna Lobanova, Jennifer Spenader, Tim van de Cruys, Tom van der Kleij and Erik Tjong Kim Sang. Automatic Relation Extraction - Can Synonym Extraction Benefit from Antonym Knowledge? In: Proceedings of WordNets and other Lexical Semantic Resources - between Lexical Semantics, Lexicography, Terminology and Formal Ontologies (NODALIDA2009 workshop), Odense, Denmark, May 2009. [pdf]
- Gosse Bouma and Jennifer Spenader. The Distribution of Weak and Strong Object Reflexives in Dutch. In: Frank van Eynde, Anette Frank, Koenraad de Smedt, Gertjan van Noord (editors), Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories (TLT 7). January 23-24, 2009, Groningen, The Netherlands. LOT Occasional Series. [LOT site]
- Erik Tjong Kim Sang and Katja Hofmann, Lexical Patterns or Dependency Patterns: Which Is Better for Hypernym Extraction? In: Proceedings of CoNLL-2009, Boulder, CO, USA, June 2009. [pdf]
- Gertjan van Noord, Learning Efficient Parsing. In: EACL 2009. The 12th Conference of the European Chapter of the Association for Computational Linguistics. 30 March - 3 April 2009, Athens, Greece. pp 817-825. [pdf]
- Daniël de Kok, Jianqiang Ma and Gertjan van Noord, A generalized method for iterative error mining in parsing results. In: ACL2009 Workshop Grammar Engineering Across Frameworks (GEAF), Singapore, 2009. [pdf]
- Vincent Vandeghinste. Tree-based target language modeling. In: 13th Annual Conference of the European Association for Machine Translation, pages 152-159. Barcelona, 2009.
- Barbara Plank. Improved statistical measures to assess natural language parser performance across domains. In: LREC 2010. [pdf]
- Kostadin Cholakov and Gertjan van Noord. Using Unknown Word Techniques To Learn Known Words. In: EMNLP 2010. [pdf]
- Barbara Plank and Gertjan van Noord. Grammar-driven versus data-driven: which parsing system is more affected by domain shifts? In: NLPLING '10. Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground. Pages 25-33. 2010. [web-page]
- Barbara Plank and Gertjan van Noord. Dutch Dependency Parser Performance Across Domains. In: Proceedings of the 20th Meeting of Computational Linguistics in the Netherlands. [pdf]
- Kostadin Cholakov and Gertjan van Noord. Acquisition of Unknown Word Paradigms for Large Scale Grammars. In: COLING 2010: Poster Volume, pages 153-161. August 23-27, Beijing, China. [pdf]
- Gertjan van Noord. Self-trained Bilexical Preferences to Improve Disambiguation Accuracy. In: Harry Bunt, Paola Merlo and Joakim Nivre (editors), Trends in Parsing Technology. Dependency Parsing, Domain Adaptation, and Deep Parsing. Springer Verlag. pp 183-200. 2010. [draft pdf; book page of publisher]
- Daniel de Kok and Barbara Plank and Gertjan van Noord. Reversible Stochastic Attribute-value Grammars. In: ACL 2011. [pdf]
- Kostadin Cholakov, Gertjan van Noord, Valia Kordoni, Yi Zhang. An empirical comparison of Unknown Word Prediction Methods. In: IJCNLP 2011. [pdf]
- Philip van Oosten, Véronique Hoste and Dries Tanghe, A Posteriori Agreement as a Quality Measure for Readability Prediction Systems. In: Computational Linguistics and Intelligent Text Processing. Lecture Notes in Computer Science, 2011, Volume 6609/2011, 424-435, DOI: 10.1007/978-3-642-19437-5_35. [web-page]
- Nick Ruiz and Edgar Weiffenbach. Using corpora tools to analyze gradable nouns in Dutch. In: Computational Linguistics in the Netherlands Journal 1 (2011) 41-59. [pdf]
- Philip van Oosten, Veronique Hoste. Readability annotation: replacing the expert by the crowd. In: IUNLPBEA '11 Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications. Pages 120-129. [web-page]
- Daniel de Kok. Discriminative features in reversible stochastic attribute-value grammars. In: Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop. EMNLP 2011. [pdf]
- Barbara Plank. Domain Adaptation for Parsing. Ph.D. thesis, University of Groningen, 2011.
- Liesbeth Augustinus, Vincent Vandeghinste, Frank Van Eynde, Example-Based Treebank Querying. In: LREC 2012. Istanbul, 2012. [pdf]