Gerard Casamayor | Pompeu Fabra University
Papers by Gerard Casamayor
Natural Language Generation (NLG) from knowledge bases (KBs) has repeatedly been the subject of research. However, most proposals have in common that they start from KBs of limited size that either already contain linguistically-oriented knowledge structures or to whose structures different ways of realization are explicitly assigned. To avoid these limitations, we propose a three-layer OWL-based ontology framework in which domain, domain communication and linguistic knowledge structures are clearly separated, and show how a large-scale instantiation of this framework in the environmental domain serves multilingual NLG.
Environmental Software …, Jan 1, 2011
Citizens are increasingly aware of the influence of environmental and meteorological conditions on the quality of their life. This results in an increasing demand for personalized environmental information, i.e., information tailored to citizens' specific context and background. In this work, we describe the development of an environmental information system that addresses this demand in its full complexity. Specifically, we aim at developing a system that supports the submission of user-generated queries related to environmental conditions. From the technical point of view, the system is tuned to discover reliable data on the web and to process these data in order to convert them into knowledge, which is stored in a dedicated repository. At run time, this information is transferred into an ontology-structured knowledge base, from which information relevant to the specific user is deduced and communicated in the language of their preference.
The Semantic Web: …, Jan 1, 2011
We present a two-layer OWL ontology-based Knowledge Base (KB) that allows for flexible content selection and discourse structuring in Natural Language Generation (NLG), and discuss its use for these two tasks. The first layer of the ontology contains an application-independent base ontology. It models the domain and was not designed with NLG in mind. The second layer, which is added on top of the base ontology, models entities and events that can be inferred from the base ontology, including inferable logico-semantic relations between individuals. The nodes in the KB are weighted according to learnt models of content selection, such that a subset of them can be extracted. The extraction is done using templates that also consider semantic relations between the nodes and a simple user profile. The discourse structuring submodule maps the semantic relations to discourse relations and forms discourse units, which it then arranges into a coherent discourse graph. The approach is illustrated and evaluated on a KB that models the First Spanish Football League.
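The two steps described in the abstract, selecting the highest-weighted KB nodes and mapping the semantic relations between them to discourse relations, can be sketched roughly as follows. This is a minimal illustration only: the node names, weights, threshold, and relation mapping are invented, and the paper's template-based extraction and user profile are not modeled.

```python
# Hypothetical KB nodes with learned relevance weights (higher = more relevant).
nodes = {
    "match_result": 0.9,
    "goal_scorer": 0.8,
    "stadium_capacity": 0.2,
    "referee_name": 0.1,
}

# Logico-semantic relations between individuals (invented examples).
semantic_relations = [
    ("match_result", "cause", "goal_scorer"),
    ("match_result", "background", "stadium_capacity"),
]

# Assumed mapping from semantic relations to discourse relations.
DISCOURSE_MAP = {"cause": "Cause", "background": "Background"}

def select_content(weighted_nodes, threshold=0.5):
    """Keep the subset of nodes whose learned weight reaches the threshold."""
    return {n for n, w in weighted_nodes.items() if w >= threshold}

def structure_discourse(selected, relations):
    """Turn semantic relations between selected nodes into discourse edges."""
    return [
        (src, DISCOURSE_MAP[rel], tgt)
        for src, rel, tgt in relations
        if src in selected and tgt in selected
    ]

selected = select_content(nodes)
edges = structure_discourse(selected, semantic_relations)
# selected -> {"match_result", "goal_scorer"}
# edges    -> [("match_result", "Cause", "goal_scorer")]
```

Relations whose endpoints were not both selected are dropped, which is one simple way to keep the resulting discourse graph restricted to the chosen content.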
Citizens are increasingly aware of the influence of environmental and meteorological conditions on the quality of their life. The consequence of this awareness is a demand for personalized environmental information, i.e., information tailored to their specific context and background. The EU-funded project PESCaDO addresses this demand in its full complexity. It aims to develop a service that supports the user in questions related to environmental conditions: it searches for reliable data on the web, processes these data to deduce the relevant information, and communicates this information to the user in the language of their preference. In this paper, we describe the requirements and the working service-based realization of the infrastructure of the service.
Citeseer
Collocations play a significant role in second language acquisition. In order to offer efficient support to learners, an NLP-based CALL environment for learning collocations should be based on a representative collocation-error-annotated learner corpus. However, so far, no theoretically motivated collocation error tag set is available. Existing learner corpora tag collocation errors simply as "lexical errors", which is clearly insufficient given the wide range of different collocation errors that learners make. In this paper, we present a fine-grained three-dimensional typology of collocation errors that has been derived in an empirical study from the learner corpus CEDEL2, compiled by a team at the Autonomous University of Madrid. The first dimension captures whether the error concerns the collocation as a whole or one of its elements; the second dimension captures the language-oriented error analysis, while the third covers the interpretative error analysis. To facilitate smooth annotation along this typology, we adapted Knowtator, a flexible off-the-shelf annotation tool implemented as a Protégé plugin.
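A three-dimensional error tag of this kind can be pictured as a simple record with one field per dimension. The sketch below is purely illustrative: the field values are invented examples, not the paper's actual tag inventory.

```python
from dataclasses import dataclass

@dataclass
class CollocationErrorTag:
    scope: str           # dim. 1: "whole" collocation or one "element" of it
    language: str        # dim. 2: language-oriented analysis (e.g. lexical)
    interpretation: str  # dim. 3: interpretative analysis (e.g. L1 transfer)

# A learner writing *"do a mistake" for "make a mistake" might be tagged:
tag = CollocationErrorTag(scope="element",
                          language="lexical",
                          interpretation="transfer")
```

Keeping the three dimensions as separate fields lets an annotation tool offer independent value lists for each, rather than one flat "lexical error" label.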
In accordance with international patent writing regulations, patent claims must be rendered in a single sentence. As a result, sentences with more than 250 words are not uncommon. To make patent claims easier to comprehend, we developed a rule-based paraphrasing and summarization module that consists of three main submodules: the claim simplification submodule, the parsing submodule, and the (multilingual) regeneration submodule. The focus of this paper is on the simplification submodule. Claim simplification is responsible for segmenting claim sentences into clausal discourse units and transforming them into complete sentences, establishing coreference relations, and building a discourse structure between the discourse units. These stages are crucial to ensure, on the one hand, that the off-the-shelf dependency parser we use outputs surface-syntactic structures of sufficiently high quality to be used by the regeneration submodule, and, on the other hand, that the regeneration submodule, which takes as input the parser's surface-syntactic structures together with the discourse structure and coreference relations, succeeds in generating a cohesive and coherent paraphrase or summary.
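The segmentation step can be sketched as a surface split of the single-sentence claim at punctuation and at a few clause-opening markers. This is a toy illustration under invented assumptions: the marker list and the example claim are made up, and the actual module's rules, coreference handling, and discourse structuring are far richer than a regex split.

```python
import re

# A few surface markers that typically open a new clausal unit in claims
# (illustrative list only).
MARKERS = r"(?:wherein|characterized in that|comprising)"

def segment_claim(claim: str) -> list[str]:
    """Split a claim sentence into candidate clausal discourse units."""
    # Insert a break before each marker, then split on breaks and semicolons.
    marked = re.sub(rf"\b({MARKERS})\b", r"|\1", claim)
    parts = re.split(r"[|;]", marked)
    return [p.strip().rstrip(",") for p in parts if p.strip()]

claim = ("A device comprising a sensor, wherein the sensor measures "
         "temperature; and a display, wherein the display shows the value.")
units = segment_claim(claim)
# units -> ["A device", "comprising a sensor",
#           "wherein the sensor measures temperature", "and a display",
#           "wherein the display shows the value."]
```

Each resulting unit would then still need to be turned into a complete sentence and linked by coreference and discourse relations, as the abstract describes.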
Proceedings of the …, Jan 1, 2009
With their abstract vocabulary and overly long sentences, patent claims, like several other genres of legal discourse, are notoriously difficult to read and comprehend. The enormous number of both native and non-native users reading patent claims on a daily basis raises the demand for means that make them easier and faster to understand. An obvious way to satisfy this demand is to paraphrase the original material, i.e., to rewrite it in a more appropriate style, or, even better, to summarize it in the reader's language of preference such that the reader can rapidly grasp its essence. PATExpert is a patent processing service which incorporates, among other technologies, paraphrasing and multilingual summarization of patent claims. With the goal of offering the user the most suitable options and of evaluating alternative techniques based on different contextual and linguistic criteria, both paraphrasing and summarization implement "surface-oriented" strategies and "deep" strategies. The surface strategies make use of shallow linguistic criteria such as punctuation and syntactic and lexical markers. The deep strategies operate on deep-syntactic structures of the claims, using a full-fledged text generator for synthesis of the paraphrase or summary, respectively.