Ali Payani - Academia.edu
Papers by Ali Payani
CHESSFL: Clustering Hierarchical Embeddings for Semi-Supervised Federated Learning
arXiv (Cornell University), May 30, 2024
Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts on their own. Specifically, we employ a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.
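A minimal sketch of the layered prompting idea described in this abstract, assuming a generic chat-completion client; the `call_llm` placeholder and the layer prompts below are illustrative, not the prompts or agent roles used in HMAW:

```python
def call_llm(system: str, user: str) -> str:
    """Placeholder for any chat-completion API client."""
    raise NotImplementedError("plug in an LLM client here")

def hmaw_answer(user_query: str) -> str:
    # Upper layer: draft high-level instructions for the layer below.
    brief = call_llm(
        system="You write concise instructions for another assistant.",
        user=f"Draft instructions for answering this request: {user_query}",
    )
    # Middle layer: turn the rough instructions into a precise system prompt.
    refined_prompt = call_llm(
        system="You turn rough instructions into a precise system prompt.",
        user=brief,
    )
    # Bottom layer: answer the original query under the generated prompt.
    return call_llm(system=refined_prompt, user=user_query)
```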
arXiv (Cornell University), May 29, 2024
Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants have been proposed to ensure rigorous privacy for fine-tuning large-scale pre-trained language models. However, they rely heavily on the Gaussian mechanism, which may overly perturb the gradients and degrade the accuracy, especially in stronger privacy regimes (e.g., a privacy budget ϵ < 3; most state-of-the-art methods demonstrate high accuracy only under relatively weaker DP guarantees, e.g., ϵ ≥ 3, but not for small ϵ). To address such limitations, we propose a novel Language Model-based Optimal Differential Privacy (LMO-DP) mechanism, which takes the first step toward enabling the tight composition of accurately fine-tuning (large) language models with a sub-optimal DP mechanism, even in strong privacy regimes (e.g., 0.1 ≤ ϵ < 3). Furthermore, we propose a novel offline optimal noise search method to efficiently derive the sub-optimal DP noise, which significantly reduces the noise magnitude. For instance, fine-tuning RoBERTa-large (with 300M parameters) on the SST-2 dataset can achieve an accuracy of 92.20% (given ϵ = 0.3, δ = 10⁻¹⁰), drastically outperforming the Gaussian mechanism (by ∼50% for small ϵ and δ). We draw similar findings on text generation tasks with GPT-2. Finally, to the best of our knowledge, LMO-DP is also the first solution to accurately fine-tune Llama-2 with strong differential privacy guarantees. The code will be released soon and is available upon request.
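For context, the standard DP-SGD step that the abstract contrasts against uses per-example gradient clipping plus Gaussian noise; the sketch below shows that baseline in plain NumPy, with the understanding that LMO-DP is described as replacing the Gaussian noise with a distribution found by an offline search (not shown here):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0, noise_multiplier=1.0, lr=0.1):
    """One DP-SGD update: clip each per-example gradient to `clip_norm`,
    add Gaussian noise calibrated to the clipping bound, then average.
    This is the standard Gaussian-mechanism baseline, not LMO-DP."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    noisy_mean = (summed + noise) / len(per_example_grads)
    return params - lr * noisy_mean
```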
arXiv (Cornell University), May 28, 2024
In this paper, we present SketchQL, a video database management system (VDBMS) for retrieving video moments with a sketch-based query interface. This novel interface allows users to specify object trajectory events with simple mouse drag-and-drop operations. Users can use trajectories of single objects as building blocks to compose complex events. Using a pre-trained model that encodes trajectory similarity, SketchQL achieves zero-shot video moment retrieval by performing similarity searches over the video to identify clips that are most similar to the visual query. In this demonstration, we introduce the graphical user interface of SketchQL and detail its functionalities and interaction mechanisms. We also demonstrate the end-to-end usage of SketchQL, from query composition to video moment retrieval, using real-world scenarios.
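A sketch of the retrieval step described above, assuming clip embeddings have already been produced by the pre-trained trajectory encoder; only the cosine-similarity ranking is shown, and the function and variable names are hypothetical:

```python
import numpy as np

def top_k_moments(query_vec, clip_vecs, k=5):
    """Rank pre-computed clip embeddings by cosine similarity to the query
    embedding produced from the sketched trajectory. Retrieval logic only;
    the trajectory encoder itself is outside this sketch."""
    q = query_vec / np.linalg.norm(query_vec)
    c = clip_vecs / np.linalg.norm(clip_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```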
arXiv (Cornell University), Apr 14, 2024
Generalization error bounds from learning theory provide statistical guarantees on how well an algorithm will perform on previously unseen data. In this paper, we characterize the impacts of data non-IIDness due to censored feedback (a.k.a. selective labeling bias) on such bounds. We first derive an extension of the well-known Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which characterizes the gap between empirical and theoretical CDFs given IID data, to problems with non-IID data due to censored feedback. We then use this CDF error bound to provide a bound on the generalization error guarantees of a classifier trained on such non-IID data. We show that existing generalization error bounds (which do not account for censored feedback) fail to correctly capture the model's generalization guarantees, verifying the need for our bounds. We further analyze the effectiveness of (pure and bounded) exploration techniques, proposed in recent literature as a way to alleviate censored feedback, in improving our error bounds. Together, our findings illustrate how a decision maker should account for the trade-off between strengthening the generalization guarantees of an algorithm and the costs incurred in data collection when future data availability is limited by censored feedback.
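For reference, the classical IID form of the DKW inequality that the paper extends reads:

```latex
% Dvoretzky-Kiefer-Wolfowitz inequality (IID case), for the empirical CDF
% \hat{F}_n of n IID samples drawn from a distribution with CDF F:
\Pr\Big(\sup_{x}\big|\hat{F}_n(x) - F(x)\big| > \varepsilon\Big)
  \le 2\, e^{-2 n \varepsilon^{2}}, \qquad \varepsilon > 0 .
```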
Classifying Functional Brain Graphs Using Graph Hypervector Representation
arXiv (Cornell University), Mar 25, 2024
Federated Learning (FL) has emerged as a practical approach to training a model from decentralized data. The proliferation of FL has led to the development of numerous FL algorithms and mechanisms. Many prior efforts have focused primarily on the accuracy of these approaches, but little is understood about other aspects such as computational overhead, performance, and training stability. To bridge this gap, we conduct an extensive performance evaluation of several canonical FL algorithms (FedAvg, FedProx, FedYogi, FedAdam, SCAFFOLD, and FedDyn) by leveraging an open-source federated learning framework called Flame. Our comprehensive measurement study reveals that no single algorithm works best across different performance metrics. A few key observations are: (1) While some state-of-the-art algorithms achieve higher accuracy than others, they incur either higher computation overheads (FedDyn) or communication overheads (SCAFFOLD). (2) Recent algorithms present a smaller standard deviation in accuracy across clients than FedAvg, indicating that the advanced algorithms' performance is stable. (3) However, algorithms such as FedDyn and SCAFFOLD are more prone to catastrophic failures without the support of additional techniques such as gradient clipping. We hope that our empirical study can help the community build best practices in evaluating FL algorithms.
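As a reference point for the algorithms being compared, the FedAvg baseline aggregates client updates by a weighted average over local dataset sizes; a minimal NumPy sketch (not the Flame framework's API) is:

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg aggregation: weighted average of client model weights.
    client_weights: list (one entry per client) of lists of numpy arrays (layers).
    client_sizes: number of local training examples per client."""
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    aggregated = []
    for layer in range(num_layers):
        layer_avg = sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        aggregated.append(layer_avg)
    return aggregated
```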
arXiv (Cornell University), Mar 18, 2024
Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by ClipCov achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at: .
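A rough sketch of the cross-covariance criterion named in the abstract, assuming precomputed image and caption embeddings; the scoring below (Frobenius gap between full-data and subset cross-covariance) is an illustrative reading of that criterion, not ClipCov's actual selection algorithm:

```python
import numpy as np

def cross_cov(img_emb, txt_emb):
    """Empirical cross-covariance between image and caption embeddings."""
    img_c = img_emb - img_emb.mean(axis=0)
    txt_c = txt_emb - txt_emb.mean(axis=0)
    return img_c.T @ txt_c / len(img_emb)

def subset_score(img_emb, txt_emb, idx):
    """How closely a candidate subset preserves the full-data cross-covariance
    (smaller Frobenius gap is better). A scoring sketch only."""
    full = cross_cov(img_emb, txt_emb)
    sub = cross_cov(img_emb[idx], txt_emb[idx])
    return float(np.linalg.norm(full - sub, ord="fro"))
```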
arXiv (Cornell University), Mar 6, 2024
With the advancement of large language models, language-based forecasting has recently emerged as an innovative approach for predicting human mobility patterns. The core idea is to use prompts to transform raw mobility data, given as numerical values, into natural language sentences so that language models can be leveraged to generate descriptions of future observations. However, previous studies have only employed fixed and manually designed templates to transform numerical values into sentences. Since the forecasting performance of language models heavily relies on prompts, using fixed templates for prompting may limit the forecasting capability of language models. In this paper, we propose a novel framework for prompt mining in language-based mobility forecasting, aiming to explore diverse prompt design strategies. Specifically, the framework includes a prompt generation stage based on the information entropy of prompts and a prompt refinement stage to integrate mechanisms such as chain of thought. Experimental results on real-world large-scale data demonstrate the superiority of the prompts generated by our prompt mining pipeline. Additionally, the comparison of different prompt variants shows that the proposed prompt refinement process is effective. Our study presents a promising direction for further advancing language-based mobility forecasting. CCS Concepts: Applied computing → Forecasting; Computing methodologies → Natural language generation.
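As an illustration of the entropy-based generation stage, a simple Shannon-entropy score over a prompt's token frequencies might look like the following; the exact scoring function used in the paper is not specified here, so this is only an assumed stand-in:

```python
import math
from collections import Counter

def prompt_entropy(prompt: str) -> float:
    """Shannon entropy (bits) of the token-frequency distribution of a prompt,
    a simple proxy for entropy-based prompt scoring (illustrative only)."""
    tokens = prompt.split()
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(prompt_entropy("Predict the next location given the visit history below."))
```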
arXiv (Cornell University), Feb 16, 2024
In this paper, we examine how large language models (LLMs) solve multi-step problems under a language agent framework with three components: a generator, a discriminator, and a planning method. We investigate the practical utility of two advanced planning methods, iterative correction and tree search. We present a comprehensive analysis of how discrimination accuracy affects the overall performance of agents when using these two methods or a simpler method, re-ranking. Experiments on two tasks, text-to-SQL parsing and mathematical reasoning, show that: (1) advanced planning methods demand discriminators with at least 90% accuracy to achieve significant improvements over re-ranking; (2) current LLMs' discrimination abilities have not met the needs of advanced planning methods to achieve such improvements; (3) with LLM-based discriminators, advanced planning methods may not adequately balance accuracy and efficiency. For example, compared to the other two methods, tree search is at least 10-20 times slower but leads to negligible performance gains, which hinders its real-world application.
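A sketch of the simplest of the three planning methods, re-ranking: generator candidates are rescored by a discriminator and the top one is returned. The function signature is an assumption for illustration:

```python
def rerank(question, candidates, discriminator):
    """Re-ranking: score each generator candidate with the discriminator and
    return the highest-scoring one. `discriminator` is any callable mapping
    (question, candidate) to a correctness score; names are illustrative."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = discriminator(question, cand)
        if score > best_score:
            best, best_score = cand, score
    return best
```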
arXiv (Cornell University), Dec 24, 2023
Conventional embedding-based models approach event time prediction in temporal knowledge graphs (TKGs) as a ranking problem. However, they often fall short in capturing essential temporal relationships such as order and distance. In this paper, we propose TEILP, a logical reasoning framework that naturally integrates such temporal elements into knowledge graph predictions. We first convert TKGs into a temporal event knowledge graph (TEKG), which has a more explicit representation of time in terms of the nodes of the graph. The TEKG equips us to develop a differentiable random walk approach to time prediction. Finally, we introduce conditional probability density functions, associated with the logical rules involving the query interval, using which we arrive at the time prediction. We compare TEILP with state-of-the-art methods on five benchmark datasets. We show that our model achieves a significant improvement over baselines while providing interpretable explanations. In particular, we consider several scenarios where training samples are limited, event types are imbalanced, and forecasting the time of future events based only on past events is desired. In all these cases, TEILP outperforms state-of-the-art methods in terms of robustness.
arXiv (Cornell University), Nov 14, 2023
This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks in order to improve their fairness robustness. We propose a simple yet effective framework, FABLE, that leverages backdoor attacks, as they allow targeted control over fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. Specifically, the adversary can inject triggers into samples in the minority group with the favored outcome (i.e., "non-abusive") and flip their labels to the unfavored outcome (i.e., "abusive"). Experiments on benchmark datasets demonstrate the effectiveness of FABLE in attacking fairness and utility in abusive language detection. CCS Concepts: Social and professional topics → Fairness and equity; Ethics; Computing methodologies → Machine learning; Natural language processing; Security and privacy → Human and societal aspects of security and privacy.
arXiv (Cornell University), Nov 14, 2023
Continual Learning (CL) has generated attention as a method of avoiding Catastrophic Forgetting (CF) in the sequential training of neural networks, improving network efficiency and adaptability to different tasks. Additionally, CL serves as an ideal setting for studying network behavior and Forward Knowledge Transfer (FKT) between tasks. Pruning methods for CL train subnetworks to handle the sequential tasks, which allows us to take a structured approach to investigating FKT. Sharing prior subnetworks' weights leverages past knowledge for the current task through FKT. Understanding which weights to share is important, as sharing all weights can yield sub-optimal accuracy. This paper investigates how different sharing decisions affect the FKT between tasks. Through this lens we demonstrate how task complexity and similarity influence the optimal weight-sharing decisions, giving insights into the relationships between tasks and helping inform decision making in similar CL methods. We implement three sequential datasets designed to emphasize variation in task complexity and similarity, reporting results for both ResNet-18 and VGG-16. By sharing in accordance with the decisions supported by our findings, we show that we can improve task accuracy compared to other sharing decisions. Our contributions are: (1) to improve interpretability, we implement a pruning and sharing strategy which ensures that a given filter's feature representation remains consistent in both its original task and any task for which it is shared; (2) we methodically investigate how sharing decisions can be made based on the properties of available subnetworks to improve accuracy on a new task; (3) we evaluate different sharing strategies leveraging these subnetwork properties on three CL datasets.
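A minimal sketch of mask-based weight sharing in pruning-based CL, where frozen weights from selected earlier subnetworks are reused and the remainder stays trainable; the data layout and function below are illustrative, not the paper's implementation:

```python
import numpy as np

def compose_task_weights(base_weights, task_masks, share_from):
    """Build the weights used for a new task by sharing frozen weights from
    selected earlier subnetworks (binary masks over `base_weights`) while
    leaving the unshared positions free to train. Illustrative sketch only."""
    shared_mask = np.zeros_like(base_weights, dtype=bool)
    for t in share_from:                       # indices of earlier tasks to share from
        shared_mask |= task_masks[t].astype(bool)
    frozen = np.where(shared_mask, base_weights, 0.0)   # reused, frozen values
    trainable_mask = ~shared_mask                       # positions the new task may update
    return frozen, trainable_mask
```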
arXiv (Cornell University), Jan 11, 2024
While large language models (LLMs) have demonstrated remarkable reasoning capabilities, they are not without their flaws and inaccuracies. Recent studies have introduced various methods to mitigate these limitations. Temporal reasoning (TR), in particular, presents a significant challenge for LLMs due to its reliance on diverse temporal concepts and intricate temporal logic. In this paper, we propose TG-LLM, a novel framework for language-based TR. Instead of reasoning over the original context, we adopt a latent representation, the temporal graph (TG), which enhances the learning of TR. A synthetic dataset (TGQA), which is fully controllable and requires minimal supervision, is constructed for fine-tuning LLMs on this text-to-TG translation task. We confirm in experiments that the capability of TG translation learned on our dataset can be transferred to other TR tasks and benchmarks. On top of that, we teach the LLM to perform deliberate reasoning over the TGs via Chain-of-Thought (CoT) bootstrapping and graph data augmentation. We observe that these strategies, which maintain a balance between usefulness and diversity, yield more reliable CoTs and final results than vanilla CoT distillation.
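A toy illustration of the text-to-TG target described above: a temporal graph held as time-stamped relation tuples and serialized line by line. The tuple layout and serialization format are assumptions, not the schema used by TG-LLM or TGQA:

```python
from dataclasses import dataclass

@dataclass
class TemporalEdge:
    subject: str
    relation: str
    obj: str
    start: int  # e.g., year
    end: int

def serialize_tg(edges):
    """Render a temporal graph as one line per edge, a simple target format
    for a text-to-TG translation model (illustrative schema only)."""
    return "\n".join(
        f"({e.subject}, {e.relation}, {e.obj}, {e.start}-{e.end})" for e in edges
    )

# Example usage with a hypothetical two-edge graph.
tg = [
    TemporalEdge("Alice", "worked_at", "Acme", 2010, 2015),
    TemporalEdge("Alice", "lived_in", "Paris", 2012, 2018),
]
print(serialize_tg(tg))
```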
arXiv (Cornell University), May 22, 2023
Despite recent progress in text-to-SQL parsing, current semantic parsers are still not accurate enough for practical use. In this paper, we investigate how to build automatic text-to-SQL error correction models. Noticing that token-level edits are out of context and sometimes ambiguous, we propose building clause-level edit models instead. Besides, while most language models of code are not specifically pre-trained for SQL, they know common data structures and their operations in programming languages such as Python. Thus, we propose a novel representation for SQL queries and their edits that adheres more closely to the pre-training corpora of language models of code. Our error correction model improves the exact set match accuracy of different parsers by 2.4-6.5 points and obtains up to a 4.3-point absolute improvement over two strong baselines.
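To make the clause-level granularity concrete, here is a toy splitter that breaks a SQL string into clause units; a real system would use a proper SQL parser, and the keyword list is an assumption:

```python
import re

CLAUSE_KEYWORDS = r"\b(SELECT|FROM|WHERE|GROUP BY|HAVING|ORDER BY|LIMIT)\b"

def split_clauses(sql: str):
    """Split a SQL query into (clause keyword, clause body) pairs, the edit
    granularity argued for above. Toy splitter for illustration only."""
    parts = re.split(CLAUSE_KEYWORDS, sql, flags=re.IGNORECASE)
    clauses, i = [], 1
    while i < len(parts) - 1:
        clauses.append((parts[i].upper(), parts[i + 1].strip()))
        i += 2
    return clauses

print(split_clauses("SELECT name FROM users WHERE age > 30 ORDER BY name"))
```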
arXiv (Cornell University), May 16, 2023
The issue of group fairness in machine learning models, where certain subpopulations or groups are favored over others, has been recognized for some time. While many mitigation strategies have been proposed in centralized learning, many of these methods are not directly applicable in federated learning, where data is privately stored on multiple clients. To address this, many proposals try to mitigate bias at the level of clients before aggregation, which we call locally fair training. However, the effectiveness of these approaches is not well understood. In this work, we investigate the theoretical foundation of locally fair training by studying the relationship between global model fairness and local model fairness. Additionally, we prove that for a broad class of fairness metrics, the global model's fairness can be obtained using only summary statistics from local clients. Based on that, we propose a globally fair training algorithm that directly minimizes the penalized empirical loss. Real-data experiments demonstrate the promising performance of our proposed approach for enhancing fairness while retaining high accuracy compared to locally fair training methods.
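A small sketch of the abstract's claim that global fairness can be computed from client summary statistics, using demographic parity as one example metric; the statistics layout is hypothetical:

```python
def global_demographic_parity_gap(client_stats):
    """Recover a global group-fairness metric from per-client summary statistics.
    client_stats: list of dicts with counts of positive predictions and totals
    per protected group, e.g. {"pos_a": 30, "n_a": 100, "pos_b": 10, "n_b": 80}.
    Demographic parity is used here as one member of the broad metric class
    the paper covers; the layout and names are illustrative."""
    pos_a = sum(s["pos_a"] for s in client_stats)
    n_a = sum(s["n_a"] for s in client_stats)
    pos_b = sum(s["pos_b"] for s in client_stats)
    n_b = sum(s["n_b"] for s in client_stats)
    return abs(pos_a / n_a - pos_b / n_b)

print(global_demographic_parity_gap([
    {"pos_a": 30, "n_a": 100, "pos_b": 10, "n_b": 80},
    {"pos_a": 25, "n_a": 90, "pos_b": 12, "n_b": 60},
]))
```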
IEEE Signal Processing Magazine, Mar 1, 2018
Compression for seismic data acquisition: The next generation of oil and gas exploration technology is moving toward large-scale seismic acquisition, automation, and flexibility. This phenomenon has accelerated the interest in moving away from traditional seismic acquisition systems that are heavily mechanical. Currently, on a daily basis, a seismic survey may require 800 or more crew members to place more than 200,000 prewired geophones over a field of several square miles. As such, the cost of cabling accounts for up to 50% of the total operating cost of a typical land survey, and up to 75% of the total equipment weight. This labor-intensive deployment of the prewired geophones, in addition to cost, prolongs the survey time and places a huge barrier on scaling seismic acquisition and its adaptation/automation. Therefore, there has been growing interest in switching from prewired geophones to wireless seismic acquisition. On the other hand, a typical seismic survey may generate tens of terabytes of raw seismic data per day. Hence, wireless communication faces great challenges in light of the enormous amounts of data that must be transmitted from geophones to on-site data collection centers.
Memory-assisted compression of seismic data: Tackling a large alphabet-size problem by statistical methods
Learning dictionary for efficient signal compression
We consider the problem of learning dictionaries for data compression. Different from ordinary learning methods, the objective is to design a dictionary such that the signal has a low-entropy representation in the basis of the dictionary, rather than a sparse or low-energy representation. To achieve this goal, we need to consider the effect of quantization on the rate-distortion curve as well as an estimation of the distributions of the coefficients. Based on this probability estimation, the coefficients are computed, quantized, and then entropy-coded. As such, we have developed algorithms for different classes of dictionaries (orthonormal, unions of orthonormals, and general dictionaries with unit-norm atoms) to iteratively learn the dictionary and the distribution models of the coefficients. A mixture of Gaussians is adopted to estimate the probability and is updated using the expectation-maximization algorithm together with the dictionary learning. Simulation results on real seismic data show the effectiveness of the proposed algorithm compared to ordinary dictionary learning methods.
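To illustrate the rate term being optimized, here is a toy routine that uniformly quantizes transform coefficients and estimates rate and distortion from their empirical histogram; the paper instead models the coefficients with a Gaussian mixture updated by EM, which this sketch does not implement:

```python
import numpy as np

def quantized_rate_distortion(coeffs, step):
    """Uniformly quantize coefficients with the given step size and estimate
    (rate in bits per coefficient from the empirical symbol distribution,
    mean-squared reconstruction distortion). A toy stand-in for the
    rate-distortion terms in entropy-oriented dictionary learning."""
    q = np.round(coeffs / step).astype(int)
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    rate_bits = -np.sum(p * np.log2(p))
    distortion = np.mean((coeffs - q * step) ** 2)
    return rate_bits, distortion

rate, dist = quantized_rate_distortion(np.random.randn(10000), step=0.25)
print(rate, dist)
```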
arXiv (Cornell University), May 24, 2023
Translating natural language sentences to first-order logic (NL-FOL translation) is a longstanding challenge in the NLP and formal logic literature. This paper introduces LOGICLLAMA, a LLaMA-7B model fine-tuned for NL-FOL translation using LoRA on a single GPU. LOGICLLAMA is capable of directly translating natural language into FOL rules, outperforming GPT-3.5. LOGICLLAMA is also equipped to correct FOL rules predicted by GPT-3.5, and can achieve performance similar to GPT-4 at a fraction of the cost. This correction ability was achieved by a novel supervised fine-tuning (SFT) + reinforcement learning with human feedback (RLHF) framework, which initially trains on synthetically perturbed NL-FOL pairs to encourage chain-of-thought reasoning and then fine-tunes with RLHF on GPT-3.5 outputs using a FOL verifier as the reward model. To train LOGICLLAMA, we present MALLS (large language Model generAted NL-FOL pairS), a dataset of 34K high-quality and diverse sentence-level NL-FOL pairs collected from GPT-4. The dataset was created by implementing a pipeline that prompts GPT-4 for pairs, dynamically adjusts the prompts to ensure the collection of pairs with rich and diverse contexts at different levels of complexity, and verifies the validity of the generated FOL rules. Code, weights, and data are available at https://github.com/gblackout/LogicLLaMA.
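As a rough illustration of the FOL-verifier-as-reward idea mentioned above, here is a toy syntactic well-formedness check; the allowed symbol set and the binary scoring are assumptions, and the paper's verifier is a separate, more complete component:

```python
import re

FOL_TOKEN = re.compile(r"∀|∃|¬|∧|∨|→|↔|[A-Za-z_][A-Za-z0-9_]*|[(),]|\s+")

def fol_reward(formula: str) -> float:
    """Return 1.0 if the string tokenizes with the allowed FOL symbols and has
    balanced parentheses, else 0.0. A toy reward signal for illustration only."""
    pos, depth = 0, 0
    while pos < len(formula):
        m = FOL_TOKEN.match(formula, pos)
        if m is None:
            return 0.0  # unexpected character
        tok = m.group(0)
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return 0.0
        pos = m.end()
    return 1.0 if depth == 0 else 0.0

print(fol_reward("∀x (Cat(x) → Animal(x))"))  # 1.0
```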