Aparna Garimella - Academia.edu
Papers by Aparna Garimella
Several linguistic studies have shown the prevalence of various lexical and grammatical patterns in texts authored by a person of a particular gender, but models for part-of-speech tagging and dependency parsing have still not been adapted to account for these differences. To address this, we annotate the Wall Street Journal portion of the Penn Treebank with the gender of the articles' authors, and build taggers and parsers trained on this data that show performance differences on text written by men and women. Further analyses reveal numerous part-of-speech tags and syntactic relations whose prediction performance benefits from the prevalence of a specific gender in the training data. The results underscore the importance of accounting for gendered differences in syntactic tasks, and outline future avenues for developing more accurate taggers and parsers. We release our data to the research community.
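The experimental design is straightforward to reproduce in outline. Below is a minimal sketch, not the authors' actual setup, of measuring the gender gap of a tagger trained on such gender-annotated splits; it uses NLTK's averaged-perceptron tagger, and the split names (`train_sents`, `test_male`, `test_female`) are hypothetical placeholders for the annotated WSJ data.

```python
from nltk.tag.perceptron import PerceptronTagger

def accuracy(tagger, tagged_sents):
    """Token-level tagging accuracy on gold-annotated sentences."""
    correct = total = 0
    for sent in tagged_sents:
        words = [w for w, _ in sent]
        gold = [t for _, t in sent]
        pred = [t for _, t in tagger.tag(words)]
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total

def gender_gap(train_sents, test_male, test_female):
    """Train on one split, then compare accuracy on male- vs
    female-authored held-out sentences (hypothetical split names)."""
    tagger = PerceptronTagger(load=False)  # start from an untrained model
    tagger.train(train_sents)              # list of [(word, tag), ...] sentences
    return accuracy(tagger, test_male) - accuracy(tagger, test_female)
```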
International Conference on Computational Linguistics, Dec 1, 2016
Men are from Mars and women are from Venus, or so the genre of relationship literature would have us believe. But there is some truth in this idea, and researchers in fields as diverse as psychology, sociology, and linguistics have explored ways to better understand the differences between genders. In this paper, we take another look at the problem of gender discrimination and attempt to move beyond the typical surface-level text classification approach, by (1) identifying semantic and psycholinguistic word classes that reflect systematic differences between men and women, and (2) finding differences between genders in the ways they use the same words. We describe several experiments and report results on a large collection of blogs authored by men and women.
arXiv (Cornell University), May 23, 2023
In this paper, we study the generation quality of interpolation-based retrieval-augmented language models (LMs). These methods, best exemplified by the kNN-LM (Khandelwal et al., 2020), interpolate the LM's predicted distribution of the next word with a distribution formed from the most relevant retrievals for a given prefix. While the kNN-LM and related methods yield impressive decreases in perplexity, we discover that they do not exhibit corresponding improvements in open-ended generation quality, as measured by both automatic evaluation metrics (e.g., MAUVE) and human evaluations. Digging deeper, we find that interpolating with a retrieval distribution actually increases perplexity compared to a baseline Transformer LM for the majority of tokens in the WikiText-103 test set, even though the overall perplexity is lower due to a smaller number of tokens for which perplexity dramatically decreases after interpolation. However, when decoding a long sequence at inference time, significant improvements on this smaller subset of tokens are washed out by slightly worse predictions on most tokens. Furthermore, we discover that the entropy of the retrieval distribution increases faster than that of the base LM as the generated sequence becomes longer, which indicates that retrieval is less reliable when using model-generated text as queries (i.e., is subject to exposure bias). We hope that our analysis spurs future work on improved decoding algorithms and interpolation strategies for retrieval-augmented language models.
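The interpolation under study is simple to state. Here is a minimal sketch in PyTorch, assuming the standard kNN-LM formulation rather than any particular codebase: the retrieval distribution is a softmax over negative neighbor distances scattered onto the vocabulary, mixed with the base LM's distribution by a weight λ (≈0.25 in the original kNN-LM paper).

```python
import torch

def knnlm_next_token(p_lm, neighbor_ids, neighbor_scores, lam=0.25):
    """Interpolate the base LM's next-token distribution with a
    retrieval distribution, kNN-LM style (illustrative sketch).

    p_lm:            (V,) softmax distribution from the base LM.
    neighbor_ids:    (k,) LongTensor of the k nearest neighbors' target tokens.
    neighbor_scores: (k,) negative distances of those neighbors.
    """
    V = p_lm.shape[0]
    # Softmax over neighbor scores, then scatter the mass onto the vocabulary.
    weights = torch.softmax(neighbor_scores, dim=0)
    p_knn = torch.zeros(V).scatter_add_(0, neighbor_ids, weights)
    # The mixture that lowers perplexity on average but, per the paper's
    # analysis, can raise it for the majority of individual tokens.
    return lam * p_knn + (1 - lam) * p_lm
```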
Variations in word associations across different groups of people can provide insights into people's psychologies and their world views. To capture these variations, we introduce the task of demographic-aware word associations. We build a new gold-standard dataset consisting of word association responses for approximately 300 stimulus words, collected from more than 800 respondents of different genders (male/female) and from different locations (India/United States), and show that there are significant variations in the word associations made by these groups. We also introduce a new demographic-aware word association model based on a neural skip-gram architecture, and show how computational methods for measuring word associations that specifically account for writer demographics can outperform generic methods that are agnostic to such information. (Note: this work is not centered around comparing different word forms, as one would encounter, for example, in British English and American English, but rather around the different word associations that people with a particular demographic characteristic are inclined to make; e.g., "health" in India is more strongly associated with "wealth", while in the United States it is more strongly associated with "sick".)
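As a rough illustration of how demographic conditioning can enter a skip-gram model, here is a PyTorch sketch; the specific fusion (adding a learned group vector to the center-word embedding before scoring context words) is an assumption for illustration, not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DemographicSkipGram(nn.Module):
    """Skip-gram with negative sampling where the center-word embedding
    is shifted by a learned demographic-group vector (sketch)."""
    def __init__(self, vocab_size, n_groups, dim=100):
        super().__init__()
        self.word_in = nn.Embedding(vocab_size, dim)
        self.group = nn.Embedding(n_groups, dim)   # e.g., gender x location cells
        self.word_out = nn.Embedding(vocab_size, dim)

    def forward(self, center, context, group_id):
        # center, context, group_id: (batch,) LongTensors
        h = self.word_in(center) + self.group(group_id)
        scores = (h * self.word_out(context)).sum(-1)
        return torch.sigmoid(scores)  # probability the pair co-occurs
```

Word associations for a given group can then be read off as nearest neighbors of the group-shifted embedding, which is how "health" could land near "wealth" for one group and "sick" for another.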
arXiv (Cornell University), Oct 26, 2021
Contracts are a common type of legal document that features in several day-to-day business workflows. However, there has been very limited NLP research in processing such documents, and even less in generating them. These contracts are made up of clauses, and the unique nature of these clauses calls for specific methods to understand and generate such documents. In this paper, we introduce the task of clause recommendation as a first step to aid and accelerate the authoring of contract documents. We propose a two-stage pipeline to first predict whether a specific clause type is relevant to be added to a contract, and then recommend the top clauses for the given type based on the contract context. We pretrain BERT on an existing library of clauses with two additional tasks and use it for our prediction and recommendation. We experiment with classification methods and similarity-based heuristics for clause relevance prediction, and with generation-based methods for clause recommendation, and evaluate the results from various methods on several clause types. We provide analyses of the results, and further outline the advantages and limitations of the various methods for this line of research.
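For concreteness, a minimal sketch of stage one (clause-type relevance prediction) using Hugging Face Transformers; it substitutes vanilla `bert-base-uncased` for the paper's clause-library-pretrained BERT, and the function name and label convention are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Binary classifier: is this clause type relevant to add to this contract?
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def clause_type_relevant(contract_text, clause_type):
    # Encode the clause type and the contract context as a sentence pair;
    # long contracts are truncated to BERT's input limit.
    inputs = tok(clause_type, contract_text, truncation=True,
                 return_tensors="pt")
    with torch.no_grad():
        logits = clf(**inputs).logits
    return logits.softmax(-1)[0, 1].item()  # P(relevant), assuming label 1 = relevant
```

Stage two would then rank or generate candidate clauses of the predicted-relevant type, conditioned on the same contract context.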
arXiv (Cornell University), Jan 30, 2021
Affect preferences vary with user demographics, and tapping into demographic information provides important cues about users' language preferences. In this paper, we utilize user demographics and propose EmpathBERT, a demographic-aware framework for empathy prediction based on BERT. Through several comparative experiments, we show that EmpathBERT surpasses traditional machine learning and deep learning models, and illustrate the importance of user demographics for predicting empathy and distress in user responses to stimulative news articles. We also highlight the importance of affect information in the responses by developing affect-aware models to predict user demographic attributes.
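One simple way to realize a demographic-aware BERT model is to fuse demographic features with the [CLS] representation before the prediction head. The PyTorch sketch below illustrates that idea; the concatenation-based fusion and layer sizes are assumptions, not necessarily EmpathBERT's exact design.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DemographicAwareRegressor(nn.Module):
    """Predict an empathy/distress score from text plus a demographic
    feature vector concatenated to BERT's [CLS] embedding (sketch)."""
    def __init__(self, n_demo_features, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.head = nn.Linear(hidden + n_demo_features, 1)

    def forward(self, input_ids, attention_mask, demo_features):
        # demo_features: (batch, n_demo_features), e.g., one-hot gender/age bins
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(torch.cat([cls, demo_features], dim=-1))
```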
arXiv (Cornell University), May 23, 2023
Learning template-based information extraction (IE) from documents is a crucial yet difficult task. Prior template-based IE approaches assume foreknowledge of the domain's templates; however, many real-world IE scenarios do not have pre-defined schemas. To "figure out as you go" requires a solution with zero or minimal prior supervision: to quickly bootstrap templates in a real-world setting, we need to induce template slots from the documents with zero or minimal supervision. To address these needs, we introduce InteractiveIE, a human-in-the-loop interactive interface where questions are initially generated automatically from entities in the corpus, followed by explanation-driven clustering of these questions, after which users can modify, add, or otherwise edit questions based on their specific information needs. We also provide agency to humans at intermediate steps, such as tweaking the automatically generated questions and rearranging them into different clusters to generate a schema. In an empirical human study, we observe a gradual improvement in mapping information to the desired slots using InteractiveIE over an AI-only baseline, with a minimal number of interactions with the interface. Our method is easily extensible to new domains (biomedical or legal), where procuring training data is expensive. Furthermore, we observe that the explanations provided by the clustering model helped guide the users in making sense of the IE schema over time.
IEEE Intelligent Systems, Jul 1, 2016
arXiv (Cornell University), Jan 28, 2021
Author-stylized rewriting is the task of rewriting an input text in a particular author's style. Recent works in this area have leveraged Transformer-based language models in a denoising autoencoder setup to generate author-stylized text without relying on a parallel corpus of data. However, these approaches are limited by the lack of explicit control over target attributes and by being entirely data-driven. In this paper, we propose a Director-Generator framework to rewrite content in the target author's style, specifically focusing on certain target attributes. We show that our proposed framework works well even with a limited-sized target author corpus. Our experiments on corpora consisting of relatively small-sized texts authored by three distinct authors show significant improvements over existing works in rewriting input texts in the target author's style. Our quantitative and qualitative analyses further show that our model has better meaning retention and results in more fluent generations. (This work was carried out while the author was at Adobe Research.)
This dissertation is in honor of my father, Dr. Venkateswarlu Garimella, and my mother, Dr. Lakshmi VVS, who give education the utmost importance. They have always encouraged me to pursue my passions, even when society at the time looked differently at the education of girls.
Findings of the Association for Computational Linguistics: EACL 2023
Proceedings of the Natural Legal Language Processing Workshop 2022
Legal documents such as contracts contain complex and domain-specific jargon, long and nested sentences, and often present several details that may be difficult for laypeople without domain expertise to understand. In this paper, we explore the problem of text simplification (TS) in the legal domain. The main challenge is the lack of complex-simple parallel datasets for the legal domain. We investigate some of the existing datasets, methods, and metrics in the TS literature for simplifying legal texts, and perform human evaluation to analyze the gaps. We present some of the challenges involved, and outline a few open questions that need to be addressed in future research in this direction. (Equal contribution; work done while at Adobe Research. The model outputs and human ratings are available at https://bit.ly/3U3ddIl.)
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Transformer-based language models trained on large natural language corpora have been very useful in downstream entity extraction tasks. However, they often perform poorly when applied to domains different from those they were pretrained on. Continued pretraining using unlabeled data from target domains can help improve the performance of these language models on downstream tasks. However, using all of the available unlabeled data for pretraining can be time-intensive; it can also be detrimental to the performance of the downstream tasks if the unlabeled data is not aligned with the data distribution for the target tasks. Previous works employed external supervision in the form of ontologies for selecting appropriate data samples for pretraining, but external supervision can be quite hard to obtain in low-resource domains. In this paper, we introduce effective ways to select data from unlabeled corpora of target domains for language model pretraining to improve performance on target entity extraction tasks. Our data selection strategies do not require any external supervision. We conduct extensive experiments for the task of named entity recognition (NER) on seven different domains and show that language models pretrained on target-domain unlabeled data obtained using our data selection strategies achieve better performance compared to those using data selection strategies from previous works that rely on external supervision. We also show that these pretrained language models using our data selection strategies outperform those pretrained on all of the available unlabeled target-domain data.
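A sketch of the general idea of supervision-free data selection: embed both the unlabeled corpus and (unlabeled) sentences from the target task, then keep the unlabeled sentences closest to the task data. The centroid-plus-cosine heuristic below is an illustrative assumption; the paper's actual selection strategies may differ.

```python
import numpy as np

def select_pretraining_data(unlabeled_embs, task_embs, budget):
    """Rank unlabeled sentences by cosine similarity to the centroid of
    the target task's sentence embeddings; keep the top `budget`.

    unlabeled_embs: (N, d) embeddings of candidate pretraining sentences.
    task_embs:      (M, d) embeddings of target-task sentences.
    Returns indices of the sentences selected for continued pretraining.
    """
    centroid = task_embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    norms = np.linalg.norm(unlabeled_embs, axis=1, keepdims=True)
    sims = (unlabeled_embs / norms) @ centroid
    return np.argsort(-sims)[:budget]
```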
Proceedings of the Natural Legal Language Processing Workshop 2022
Generating domain-specific content such as legal clauses based on minimal user-provided information can be of significant benefit in automating legal contract generation. In this paper, we propose a controllable graph-based mechanism that can generate legal clauses using only the topic or type of the legal clauses. Our pipeline consists of two stages: a graph-based planner followed by a clause generator. The planner outlines the content of a legal clause as a sequence of keywords, ordered from generic to more specific clause information, based on the input topic using a controllable graph-based mechanism. The generation stage takes in a given plan and generates a clause. We illustrate the effectiveness of our proposed two-stage approach on a broad set of clause topics in contracts.
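A minimal sketch of the planning stage: treat the clause vocabulary as a weighted keyword graph and walk from the topic node toward more specific terms, then hand the keyword sequence to a generator. Graph construction, edge weights, and the controllability mechanism here are simplified assumptions for illustration.

```python
import networkx as nx

def plan_keywords(graph, topic, length=5):
    """Greedy walk over a keyword graph: from the topic node, repeatedly
    follow the highest-weight edge to an unvisited keyword (sketch)."""
    plan, node = [topic], topic
    for _ in range(length):
        nbrs = [(graph[node][n].get("weight", 1.0), n)
                for n in graph.neighbors(node) if n not in plan]
        if not nbrs:
            break
        _, node = max(nbrs)
        plan.append(node)
    return plan

# Toy usage with a hypothetical graph for a "governing law" clause topic.
g = nx.Graph()
g.add_edge("governing law", "state", weight=3.0)
g.add_edge("governing law", "construed", weight=2.0)
g.add_edge("state", "laws", weight=2.5)
g.add_edge("laws", "accordance", weight=1.5)
print(plan_keywords(g, "governing law"))
# The generation stage would condition a seq2seq model on this plan,
# e.g. "governing law | state, laws, accordance" -> full clause text.
```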
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Legal documents are typically long and written in legalese, which makes it particularly difficult for laypeople to understand their rights and duties. While natural language understanding technologies can be valuable in supporting such understanding in the legal domain, the limited availability of datasets annotated for deontic modalities in the legal domain, due to the cost of hiring experts and privacy issues, is a bottleneck. To this end, we introduce LEXDEMOD, a corpus of English contracts annotated with deontic modality expressed with respect to a contracting party or agent, along with the modal triggers. We benchmark this dataset on two tasks: (i) agent-specific multi-label deontic modality classification, and (ii) agent-specific deontic modality and trigger span detection using Transformer-based (Vaswani et al., 2017) language models. Transfer learning experiments show that the linguistic diversity of modal expressions in LEXDEMOD generalizes reasonably from lease to employment and rental agreements. A small case study indicates that a model trained on LEXDEMOD can detect red flags with high recall. We believe our work offers a new research direction for deontic modality detection in the legal domain.
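Task (i) is a standard multi-label setup conditioned on the agent. A minimal sketch with Hugging Face Transformers is below; the modality label set, the pairing of agent and sentence as two text segments, and the untuned `bert-base-uncased` checkpoint are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical label set; the corpus's actual inventory may differ.
MODALITIES = ["obligation", "entitlement", "prohibition", "permission"]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(MODALITIES),
    problem_type="multi_label_classification")

def deontic_labels(sentence, agent, threshold=0.5):
    # Condition on the agent by encoding it as the first text segment.
    inputs = tok(agent, sentence, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.sigmoid()[0]
    # Independent sigmoid per label: a sentence can express several
    # modalities for the same agent.
    return [m for m, p in zip(MODALITIES, probs) if p > threshold]
```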
Frontiers in Artificial Intelligence and Applications, Dec 5, 2022
Clause recommendation is the problem of recommending a clause to a legal contract, given the context of the contract in question and the clause type to which the clause should belong. With little prior work on the generation of legal contracts, this problem was proposed as a first step toward the bigger problem of contract generation. As an open-ended text generation problem, its distinguishing characteristics lie in the nature of legal language as a sublanguage and the considerable similarity of textual content within the clauses of a specific type. This similarity in legal clauses drives us to investigate the importance of similar contracts' representations for recommending clauses. In our work, we experiment with generating clauses for 15 commonly occurring clause types in contracts, expanding upon previous work on this problem and analyzing clause recommendations in varying settings using information derived from similar contracts.
arXiv (Cornell University), Jan 7, 2023
Generating domain-specific content such as legal clauses based on minimal user-provided information can be of significant benefit in automating legal contract generation. In this paper, we propose a controllable graph-based mechanism that can generate legal clauses using only the topic or type of the legal clauses. Our pipeline consists of two stages: a graph-based planner followed by a clause generator. The planner outlines the content of a legal clause as a sequence of keywords, ordered from generic to more specific clause information, based on the input topic using a controllable graph-based mechanism. The generation stage takes in a given plan and generates a clause. We illustrate the effectiveness of our proposed two-stage approach on a broad set of clause topics in contracts.
arXiv (Cornell University), Dec 19, 2022
Reviewing and comprehending key obligations, entitlements, and prohibitions in legal contracts can be a tedious task due to their length and domain-specificity. Furthermore, the key rights and duties requiring review vary for each contracting party. In this work, we propose a new task of party-specific extractive summarization for legal contracts to facilitate faster reviewing and improved comprehension of rights and duties. To facilitate this, we curate a dataset comprising party-specific pairwise importance comparisons annotated by legal experts, covering ∼293K sentence pairs that include obligations, entitlements, and prohibitions extracted from lease agreements. Using this dataset, we train a pairwise importance ranker and propose a pipeline-based extractive summarization system that generates a party-specific contract summary. We establish the need for incorporating a domain-specific notion of importance during summarization by comparing our system against various baselines using both automatic and human evaluation methods.
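Such a ranker can be trained with a standard pairwise (margin) objective over the expert comparisons. Below is a minimal PyTorch sketch, assuming precomputed sentence embeddings and a simple MLP scorer; the encoder and party conditioning are simplified away.

```python
import torch
import torch.nn as nn

class ImportanceRanker(nn.Module):
    """Pairwise importance ranker: score each sentence and train so the
    expert-preferred sentence of a pair scores higher (sketch)."""
    def __init__(self, emb_dim=768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 1))

    def forward(self, more_important, less_important):
        # Both inputs: (batch, emb_dim) sentence embeddings for one party.
        s_hi = self.score(more_important).squeeze(-1)
        s_lo = self.score(less_important).squeeze(-1)
        # Hinge/margin ranking loss over the expert-annotated pairs.
        return torch.relu(1.0 - (s_hi - s_lo)).mean()
```

At summarization time, sentences would be ranked by `self.score` and the top-ranked obligations, entitlements, and prohibitions for the chosen party would form the extractive summary.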
Computational Linguistics
The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the...
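The underlying test for usage bias can be framed as a per-word classification problem: if a classifier can tell, from the contexts in which a word occurs, which demographic group the writer belongs to, the word's usage differs across groups. A minimal scikit-learn sketch with bag-of-words features follows; the article's stronger models use topic-based features instead, so this is an assumption-laden baseline, not the authors' method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def usage_bias_score(contexts, labels):
    """For one target word: classify the demographic group of each
    context it appears in. Cross-validated accuracy well above chance
    suggests the word is used differently across groups (usage bias).

    contexts: list of strings, each a context window around the word.
    labels:   demographic group of each context's writer.
    """
    X = TfidfVectorizer(min_df=2).fit_transform(contexts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5).mean()
```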
arXiv (Cornell University), Nov 23, 2022
Legal documents are typically long and written in legalese, which makes it particularly difficult for laypeople to understand their rights and duties. While natural language understanding technologies can be valuable in supporting such understanding in the legal domain, the limited availability of datasets annotated for deontic modalities in the legal domain, due to the cost of hiring experts and privacy issues, is a bottleneck. To this end, we introduce LEXDEMOD, a corpus of English contracts annotated with deontic modality expressed with respect to a contracting party or agent, along with the modal triggers. We benchmark this dataset on two tasks: (i) agent-specific multi-label deontic modality classification, and (ii) agent-specific deontic modality and trigger span detection using Transformer-based (Vaswani et al., 2017) language models. Transfer learning experiments show that the linguistic diversity of modal expressions in LEXDEMOD generalizes reasonably from lease to employment and rental agreements. A small case study indicates that a model trained on LEXDEMOD can detect red flags with high recall. We believe our work offers a new research direction for deontic modality detection in the legal domain.