Dina Wonsever | Universidad de la República (Uruguay)

Papers by Dina Wonsever

Automatic Curation of Court Documents: Anonymizing Personal Data

Information, 2022

In order to provide open access to data of public interest, it is often necessary to perform several data curation processes. In some cases, such as biological databases, curation involves quality control to ensure reliable experimental support for biological sequence data. In others, such as medical records or judicial files, publication must not interfere with the right to privacy of the persons involved. There are also interventions in the published data with the aim of generating metadata that enable a better experience of querying and navigation. In all cases, the curation process constitutes a bottleneck that slows down general access to the data, so it is of great interest to have automatic or semi-automatic curation processes. In this paper, we present a solution aimed at the automatic curation of our National Jurisprudence Database, with special focus on the process of the anonymization of personal information. The anonymization process aims to hide the names of the partici...

A computational framework for the analysis of the Uruguayan dictatorship archives

Between 1973 and 1985, a civic-military dictatorship ruled in Uruguay. Systematic violations of human rights marked this period. Project Cruzar.uy aims to develop tools and methodologies to analyze historical documents from that period. We present the advances in this ongoing project. We describe a set of tools to automate the extraction and organization of information from the archives, using computational techniques including image processing, machine learning, natural language processing, information extraction and integration.

Towards De-identification of Legal Texts

ArXiv, 2019

In many countries, personal information that can be published or shared between organizations is regulated and, therefore, documents must undergo a process of de-identification to eliminate or obfuscate confidential data. Our work focuses on the de-identification of legal texts, where the goal is to hide the names of the actors involved in a lawsuit without losing the sense of the story. We present a first evaluation on our corpus of NLP tools in tasks such as segmentation, tokenization and recognition of named entities, and we analyze several evaluation measures for our de-identification task. Results are meager: 84% of the documents have at least one name not covered by NER tools, something that might lead to the re-identification of involved names. We conclude that tools must be strongly adapted for processing texts of this particular domain.

TEMANTEX: A Markup Language for Spanish Temporal Expressions and Indicators

Research in Computing Science, 2015

We describe the TEMANTEX annotation scheme for temporal expressions and other lexical indicators of temporality and we analyze a first annotation experience. TEMANTEX is mainly a revision of the markup language TIMEX3, but with some additions and a different treatment for relative expressions. Our alternative proposal is justified for two reasons. First, our system aims to cover other temporality-related lexical elements by defining annotations for what we call temporal indicators, which do not have an equivalent in the TimeML system. Second, regarding temporal expressions, our scheme has relevant differences that improve the annotation process and the interpretation potential. A first task of corpus annotation on a set of 2,300 words, comprising 33 temporal expressions and 35 temporal indicators, showed encouraging results.

Marcadores Del Discurso en Español Análisis y Representación

Nat-Multilingual: tools for multilingual interfaces in databases

This report describes the beginning, development, and implementation of Atlanta's four-quarter school year program. Under the plan, students attend any three of the four quarters offered each year, or they may enroll in all four quarters to take remedial or enrichment courses or to graduate early. The report indicates that during the first summer quarter of operation, approximately 39 percent of Atlanta's high school students enrolled in one or more courses. Main sections provide background information on (1) Atlanta and the needs of Atlanta students, (2) educational planning in Atlanta, (3) the development of the quarter plan and the process of informing the community, and (4) implementation of the plan. One section offers answers to frequently asked questions about the plan. ACKNOWLEDGEMENTS: This report was developed in response to a large number of requests for a description of the four-quarter school plan of Atlanta. Obviously, the report would not have been possible had not thousands of individuals participated in developing and implementing the program. To them a note of appreciation is expressed. In the preparation of this report, grateful appreciation is expressed specifically to Miss Edith Miller for her assistance in organizing and editing and to Miss Marie Jamhoor for her untiring efforts and invaluable assistance in typing, organizing, and providing other related services during the entire preparation. PLANNING FOR ATLANTA: With appropriate adaptations, the Atlanta School System followed the procedure outlined by the metropolitan group in organizing its task forces for the development of the new curriculum for the four-quarter plan. The overall steering committee for the system was composed of all high school principals, area superintendents, the assistant superintendent for instruction, and some members of his staff.
[Flow chart: Teachers; Department Chairmen; Committees; Small Group To Consolidate Reports; Principals With Counselors; Small Group To Consolidate Reports; Superintendent's Staff; Board of Education.] Flow charts showing the non-sequential courses in science and mathematics are presented on pages 31 through 35.

Corpus informatizado: textos del español del Uruguay (CORIN)

Event annotation schemes and event recognition in Spanish texts

Computational Linguistics and Intelligent Text Processing, 2012

This paper presents an annotation scheme for events in Spanish texts, based on TimeML for English. This scheme is contrasted with different proposals, all of them based on TimeML, for various Romance languages: Italian, French and Spanish. Two manually annotated corpora for Spanish, under the proposed scheme, are now available. While manual annotation is far from trivial, we obtained a very good event identification agreement (93% of events were identically identified by both annotators). Part of the annotated text was used as a training ...

Improving Speculative Language Detection using Linguistic Knowledge

Extra-Propositional Aspects of Meaning in …, Jul 13, 2012

In this paper we present an iterative methodology to improve classifier performance by incorporating linguistic knowledge, and propose a way to incorporate domain rules into the learning process. We applied the methodology to the tasks of hedge cue recognition and scope detection and obtained competitive results on a publicly available corpus.

Contextual Rules for Text Analysis

Lecture Notes in Computer Science, 2001
