Shuai Li | Shanghai Jiao Tong University
Papers by Shuai Li
arXiv (Cornell University), May 29, 2021
The bandit problem with graph feedback, proposed in [Mannor and Shamir, NeurIPS 2011], is modeled by a directed graph G = (V, E), where V is the collection of bandit arms; once an arm is triggered, all its incident arms are observed. A fundamental question is how the structure of the graph affects the min-max regret. We propose the notions of the fractional weak domination number δ* and the k-packing independence number, capturing the upper bound and the lower bound on the regret respectively. We show that the two notions are inherently connected by aligning them with the linear program of the weakly dominating set and its dual, the fractional vertex packing set, respectively. Based on this connection, we utilize the strong duality theorem to prove a general regret upper bound O((δ* log |V|)^{1/3} T^{2/3}) and a lower bound Ω((δ*/α)^{1/3} T^{2/3}), where α is the integrality gap of the dual linear program. Therefore, our bounds are tight up to a (log |V|)^{1/3} factor on graphs with bounded integrality gap for the vertex packing problem, including trees and graphs with bounded degree. Moreover, we show that for several special families of graphs, we can get rid of the (log |V|)^{1/3} factor and establish optimal regret.
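The feedback structure described above, where pulling an arm also reveals the losses of the arms it points to, can be sketched as a tiny simulation. The function name and the 3-arm instance below are hypothetical, chosen only to illustrate the model, not taken from the paper:

```python
def play_with_graph_feedback(graph, losses, arm):
    """Pull `arm`: incur its loss and observe the losses of the arm
    itself plus every arm it points to in the feedback graph."""
    observed = {arm: losses[arm]}
    for neighbour in graph.get(arm, []):
        observed[neighbour] = losses[neighbour]
    return losses[arm], observed

# Toy 3-arm instance: arm 0 is a costly "revealing" arm observing 1 and 2.
graph = {0: [1, 2], 1: [], 2: []}
losses = {0: 0.9, 1: 0.1, 2: 0.5}
incurred, seen = play_with_graph_feedback(graph, losses, 0)
```

The trade-off the regret bounds capture is visible even here: a small set of arms that weakly dominates the graph can be probed to observe everything, at the price of incurring those arms' own losses.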
ArXiv, 2019
We introduce a new model for online ranking in which the click probability factors into an examination function and an attractiveness function, where the attractiveness function is linear in a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. A novel algorithm for this setup is analysed, showing that the dependence on the number of items is replaced by a dependence on the dimension, allowing the new algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state-of-the-art.
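The factored click model can be illustrated with a short sketch. The function name, the toy examination weights, and the feature vectors below are made up for illustration and are not from the paper:

```python
import numpy as np

def click_probabilities(exam, item_features, theta):
    """Click probability factors into a position-dependent examination
    term and a linear attractiveness term <x, theta>."""
    attractiveness = item_features @ theta   # one scalar per ranked item
    return exam * attractiveness

# Toy instance: 3 ranked slots, 2-dimensional item features.
exam = np.array([1.0, 0.6, 0.3])            # examination decays with position
X = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]])
theta = np.array([0.7, 0.3])
probs = click_probabilities(exam, X, theta)
```

Because attractiveness is linear in the features, the learner estimates the d-dimensional theta instead of one parameter per item, which is what replaces the dependence on the number of items with a dependence on the dimension.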
ArXiv, 2021
We study the problem of stochastic bandits with adversarial corruptions in the cooperative multi-agent setting, where V agents interact with a common K-armed bandit problem, and each pair of agents can communicate with each other to expedite the learning process. In the problem, the rewards are independently sampled from distributions across all agents and rounds, but they may be corrupted by an adversary. Our goal is to minimize both the overall regret and communication cost across all agents. We first show that an additive term of corruption is unavoidable for any algorithm in this problem. Then, we propose a new algorithm that is agnostic to the level of corruption. Our algorithm not only achieves near-optimal regret in the stochastic setting, but also obtains a regret with an additive term of corruption in the corrupted setting, while maintaining efficient communication. The algorithm is also applicable for the single-agent corruption problem, and achieves a high probability reg...
ArXiv, 2021
Motivated by problems of learning to rank long item sequences, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to vanilla cascading bandits, result in sharper guarantees than previously available in the literature. We evaluate our algorithms on a number of real-world datasets, and show significantly improved empirical performance as compared to known cascading bandit baselines.
Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining
Temporal difference (TD) learning with nonlinear function approximation (nonlinear TD learning for short) permits evaluating policies using neural networks, and is a core component in modern deep reinforcement learning. Though significant advances have been made to improve its effectiveness, little attention has been paid to the data privacy risks faced when applying it in real applications. To mitigate the privacy concerns in practical applications of nonlinear TD learning, in this paper, we consider preserving its privacy under the notion of differential privacy (DP). This problem is challenging since nonlinear TD learning is usually studied in the formulation of stochastic nonconvex-strongly-concave optimization to obtain finite-sample analysis, which requires simultaneously preserving privacy on both primal and dual sides. To this end, we adopt a single-timescale algorithm, which optimizes both sides using learning rates of the same order, to avoid unnecessary privacy costs. Further, we achieve a good trade-off between the privacy and utility guarantees by perturbing gradients on both sides using Gaussian noises with well-calibrated variances. Consequently, our algorithm achieves a rigorous (ε, δ)-DP guarantee with the utility upper bounded by O((log(1/δ))^{1/8} (d/T)^{1/4}), where T is the trajectory length and d is the ambient dimension of the feature space. Extensive experiments conducted in OpenAI Gym validate the advantages of our algorithm. CCS CONCEPTS • Security and privacy → Formal security models; • Computing methodologies → Machine learning algorithms.
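The gradient-perturbation step, clipping each gradient to bound its sensitivity and then adding calibrated Gaussian noise, is the standard Gaussian mechanism. The following is a generic sketch of that mechanism, not the paper's exact noise calibration; the function name and parameters are hypothetical:

```python
import numpy as np

def gaussianize_gradient(grad, clip_norm, noise_multiplier, rng):
    """Clip the gradient to bound its L2 sensitivity, then add Gaussian
    noise whose scale is calibrated to the clipping norm."""
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

rng = np.random.default_rng(0)
g = np.array([3.0, 4.0])                     # L2 norm 5, gets clipped
private_g = gaussianize_gradient(g, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
```

In the single-timescale setting described above, the same perturbation would be applied to both the primal and the dual gradient updates.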
Proceedings of the 31st ACM International Conference on Information & Knowledge Management
The recent advances of conversational recommendations provide a promising way to efficiently elicit users' preferences via conversational interactions. To achieve this, the recommender system conducts conversations with users, asking their preferences for different items or item categories. Most existing conversational recommender systems for cold-start users utilize a multi-armed bandit framework to learn users' preference in an online manner. However, they rely on a pre-defined conversation frequency for asking about item categories instead of individual items, which may incur excessive conversational interactions that hurt user experience. To enable more flexible questioning about key-terms, we formulate a new conversational bandit problem that allows the recommender system to choose either a key-term or an item to recommend at each round and explicitly models the rewards of these actions. This motivates us to handle a new exploration-exploitation (EE) trade-off between key-term asking and item recommendation, which requires us to accurately model the relationship between key-term and item rewards. We conduct a survey and analyze a real-world dataset to find that, unlike assumptions made in prior works, key-term rewards are mainly affected by rewards of representative items. We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items to efficiently learn which items to recommend. We theoretically prove that our algorithm can reduce the regret bound's dependency on the total number of items from previous work. We validate our proposed algorithms and regret bound on both synthetic and real-world data. CCS CONCEPTS • Information systems → Recommender systems; • Theory of computation → Online learning algorithms.
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Interactive recommender systems (IRS) have received wide attention in recent years. To capture users' dynamic preferences and maximize their long-term engagement, IRS are usually formulated as reinforcement learning (RL) problems. Despite the promise to solve complex decision-making problems, RL-based methods generally require a large amount of online interaction, restricting their applications due to economic considerations. One possible direction to alleviate this issue is cross-domain recommendation that aims to leverage abundant logged interaction data from a source domain (e.g., adventure genre in movie recommendation) to improve the recommendation quality in the target domain (e.g., crime genre). Nevertheless, prior studies mostly focus on adapting the static representations of users/items. Few have explored how the temporally dynamic user-item interaction patterns transform across domains. Motivated by the above consideration, we propose DACIR, a novel Doubly-Adaptive deep RL-based framework for Cross-domain Interactive Recommendation. We first pinpoint how users behave differently in two domains and highlight the potential to leverage the shared user dynamics to boost IRS. To transfer static user preferences across domains, DACIR enforces consistency of item representation by aligning embeddings into a shared latent space. In addition, given the user dynamics in IRS, DACIR calibrates the dynamic interaction patterns in two domains via reward correlation. Once the double adaptation narrows the cross-domain gap, we are able to learn a transferable policy for the target recommender by leveraging logged data. Experiments on real-world datasets validate the superiority of our approach, which consistently achieves significant improvements over the baselines.
ArXiv, 2021
Motivated by the common strategic activities in crowdsourcing labeling, we study the problem of sequentially eliciting information without verification (EIWV) from a heterogeneous and unknown crowd of workers. We propose a reinforcement learning-based approach that is effective against a wide range of settings, including potential irrationality and collusion among workers. With the aid of a costly oracle and an inference method, our approach dynamically decides the oracle calls and gains robustness even under the presence of frequent collusion activities. Extensive experiments show the advantage of our approach. Our results also present the first comprehensive experiments of EIWV on large-scale real datasets and the first thorough study of the effects of environmental variables.
Proceedings of the AAAI Conference on Artificial Intelligence
Online learning to rank (OLTR) interactively learns to choose lists of items from a large collection based on certain click models that describe users' click behaviors. Most recent works for this problem focus on the stochastic environment where the item attractiveness is assumed to be invariant during the learning process. In many real-world scenarios, however, the environment could be dynamic or even arbitrarily changing. This work studies the OLTR problem in both stochastic and adversarial environments under the position-based model (PBM). We propose a method based on the follow-the-regularized-leader (FTRL) framework with Tsallis entropy and develop a new self-bounding constraint especially designed for PBM. We prove the proposed algorithm simultaneously achieves O(log T) regret in the stochastic environment and O(m√nT) regret in the adversarial environment, where T is the number of rounds, n is the number of items and m is the number of positions. We also provide a lower bo...
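The core computational step of FTRL with 1/2-Tsallis entropy is solving for the sampling weights each round. The sketch below is a generic Tsallis-INF weight computation for plain multi-armed bandits, not the PBM-specific algorithm of the paper; the function name and bisection bounds are choices made here for illustration:

```python
import math

def tsallis_inf_weights(cum_losses, eta, iters=200):
    """Weights for FTRL with 1/2-Tsallis entropy:
    p_i = 4 / (eta * (L_i - x))**2, with the normaliser x < min(L)
    found by bisection so the weights sum to one."""
    K = len(cum_losses)
    lo = min(cum_losses) - 2.0 * math.sqrt(K) / eta - 1.0   # sum < 1 here
    hi = min(cum_losses) - 1e-12                            # sum blows up here
    for _ in range(iters):
        x = (lo + hi) / 2.0
        total = sum(4.0 / (eta * (L - x)) ** 2 for L in cum_losses)
        lo, hi = (lo, x) if total > 1.0 else (x, hi)
    x = (lo + hi) / 2.0
    return [4.0 / (eta * (L - x)) ** 2 for L in cum_losses]

weights = tsallis_inf_weights([0.0, 0.0, 0.0], eta=1.0)  # equal losses -> uniform
```

Algorithms of this family are "best of both worlds": the same update achieves logarithmic regret in stochastic environments and √T-type regret in adversarial ones, which matches the dual guarantee stated in the abstract.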
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper studies differential privacy (DP) and local differential privacy (LDP) in cascading bandits. Under DP, we propose an algorithm which guarantees ε-indistinguishability and a regret of O((log T)^{1+ξ}) for an arbitrarily small ξ. This is a significant improvement over the previous regret of O(log³ T). Under (ε, δ)-LDP, we relax the K² dependence through the trade-off between privacy budget ε and error probability δ, and obtain a regret of O(K log(1/δ) log T / ε²), where K is the size of the arm subset. This result holds for both the Gaussian mechanism and the Laplace mechanism by analyses of the composition. Our results extend to combinatorial semi-bandits. We show respective lower bounds for DP and LDP cascading bandits. Extensive experiments corroborate our theoretical findings.
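Complementing the Gaussian mechanism, the Laplace mechanism mentioned above adds Laplace-distributed noise scaled to sensitivity/ε. A minimal generic sketch (the function name and the inverse-CDF sampling are illustration choices, not the paper's construction):

```python
import math
import random

def laplace_privatize(value, sensitivity, epsilon, rng):
    """Laplace mechanism: adding Lap(sensitivity / epsilon) noise to a
    statistic with the given sensitivity yields epsilon-DP for it."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                    # uniform on [-0.5, 0.5)
    # Inverse-CDF sampling of the Laplace distribution.
    noise = -scale * math.copysign(1.0, u) * math.log(max(1e-300, 1.0 - 2.0 * abs(u)))
    return value + noise

rng = random.Random(0)
noisy_sum = laplace_privatize(10.0, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A larger privacy budget ε shrinks the noise scale, which is the privacy-utility trade-off the regret bounds quantify.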
Proceedings of the ACM Web Conference 2022
Conversational recommender systems (CRSs) have been proposed recently to mitigate the cold-start problem suffered by traditional recommender systems. By introducing conversational key-terms, existing conversational recommenders can effectively reduce the need for extensive exploration and elicit user preferences faster and more accurately. However, existing conversational recommenders leveraging key-terms heavily rely on the availability and quality of the key-terms, and their performance might degrade significantly when the key-terms are incomplete or not well labeled, which usually happens when new items are consistently incorporated into the system and acquiring well-labeled key-terms requires costly human effort. Besides, existing CRS methods leverage the feedback to different conversational key-terms separately, without considering the underlying relations between the key-terms. In this case, the learning of the conversational recommenders is sample inefficient, especially when there is a large number of candidate conversational key-terms. In this paper, we propose a knowledge-aware conversational preference elicitation framework and a bandit-based algorithm, GraphConUCB. To achieve efficient preference elicitation given items with incompletely labeled key-terms, our algorithm leverages the underlying relations between the key-terms, guided by the knowledge graph. Being knowledge-aware, our algorithm propagates the user preferences via a pseudo graph feedback module, which also accelerates the exploration in the large action space of key-terms and improves the conversational sample efficiency. To select the most informative conversational key-terms in the graphs to conduct conversations, we further devise a graph-based optimal design module which leverages the graph structure.
We provide the theoretical analysis of the regret upper bound for GraphConUCB.
Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022
Strategic behavior against sequential learning methods, such as "click framing" in real recommendation systems, has been widely observed. Motivated by such behavior, we study the problem of combinatorial multi-armed bandits (CMAB) under strategic manipulations of rewards, where each arm can modify the emitted reward signals for its own interest. This characterization of the adversarial behavior is a relaxation of previously well-studied settings such as adversarial attacks and adversarial corruption. We propose a strategic variant of the combinatorial UCB algorithm, which has a regret of at most O(m log T + m B_max) under strategic manipulations, where T is the time horizon, m is the number of arms, and B_max is the maximum budget of an arm. We provide lower bounds on the budget for arms to incur certain regret of the bandit algorithm. Extensive experiments on online worker selection for crowdsourcing systems, online influence maximization and online recommendations with both synthetic and real datasets corroborate our theoretical findings on robustness and regret bounds, in a variety of regimes of manipulation budgets.
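For context, the index computation of the plain (non-strategic) combinatorial UCB baseline can be sketched in a few lines; the strategic variant in the paper additionally inflates the confidence radius to absorb the manipulation budget, which is omitted here. Names below are illustration choices:

```python
import math

def ucb_indices(counts, means, t):
    """Standard UCB1 indices: empirical mean plus a confidence radius.
    Unpulled arms get an infinite index so they are tried first."""
    return [
        m + math.sqrt(2.0 * math.log(t) / n) if n > 0 else float("inf")
        for n, m in zip(counts, means)
    ]

indices = ucb_indices([0, 4], [0.0, 0.5], t=10)
```

Under manipulation, an arm can spend its budget to inflate its empirical mean, which is why the regret bound above picks up the additive B_max term.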
ArXiv, 2020
We analyze the Gambler's problem, a simple reinforcement learning problem where the gambler has the chance to double or lose their bets until the target is reached. This is an early example introduced in the reinforcement learning textbook by Sutton and Barto (2018), where they mention an interesting pattern of the optimal value function with high-frequency components and repeating non-smooth points, but without further investigation. We provide the exact formula for the optimal value function for both the discrete and the continuous case. Though simple as it might seem, the value function is pathological: fractal, self-similar, non-smooth on any interval, with zero derivative almost everywhere, and not expressible in elementary functions. Sharing these properties with the Cantor function, it holds a complexity that has been uncharted thus far. With the analysis, our work could lead to insights on improving value function approximation, Q-learning, and gradient-based algorithms in rea...
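The discrete value function discussed above can be reproduced numerically with plain value iteration. The parameters (win probability 0.4, goal 100) follow the textbook example; the code is a numerical sketch, not the paper's closed-form solution:

```python
def gambler_value_function(p_win=0.4, goal=100, sweeps=500):
    """Value iteration for the Gambler's problem: from capital s the
    gambler stakes a in 1..min(s, goal - s), winning with probability
    p_win and losing the stake otherwise. V[goal] = 1, V[0] = 0."""
    V = [0.0] * (goal + 1)
    V[goal] = 1.0
    for _ in range(sweeps):
        for s in range(1, goal):
            V[s] = max(
                p_win * V[min(s + a, goal)] + (1.0 - p_win) * V[s - a]
                for a in range(1, min(s, goal - s) + 1)
            )
    return V

V = gambler_value_function()
```

Plotting V reveals the self-similar, non-smooth pattern the abstract refers to; the repeating kinks appear at dyadic fractions of the goal.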
Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021
Conversational recommender systems elicit user preference via interactive conversational interactions. By introducing conversational key-terms, existing conversational recommenders can effectively reduce the need for extensive exploration in a traditional interactive recommender. However, there are still limitations of existing conversational recommender approaches eliciting user preference via key-terms. First, the key-term data of the items needs to be carefully labeled, which requires a lot of human efforts. Second, the number of the human labeled key-terms is limited and the granularity of the key-terms is fixed, while the elicited user preference is usually from coarse-grained to fine-grained during the conversations. In this paper, we propose a clustering of conversational bandits algorithm. To avoid the human labeling efforts and automatically learn the key-terms with the proper granularity, we online cluster the items and generate meaningful key-terms for the items during the conversational interactions. Our algorithm is general and can also be used in the user clustering when the feedback from multiple users is available, which further leads to more accurate learning and generations of conversational key-terms. We analyze the regret bound of our learning algorithm. In the empirical evaluations, without using any human labeled key-terms, our algorithm effectively generates meaningful coarse-to-fine grained key-terms and performs as well as or better than the state-of-the-art baseline. CCS CONCEPTS • Information systems → Recommender systems; Users and interactive retrieval; • Computing methodologies → Online learning settings.
Proceedings of the 29th ACM International Conference on Multimedia, 2021
Proceedings of the AAAI Conference on Artificial Intelligence
We consider a new setting of online clustering of contextual cascading bandits, an online learning problem where the underlying cluster structure over users is unknown and needs to be learned from a random prefix feedback. More precisely, a learning agent recommends an ordered list of items to a user, who checks the list and stops at the first satisfactory item, if any. We propose an algorithm of CLUB-cascade for this setting and prove an n-step regret bound of order O(√n). Previous work corresponds to the degenerate case of only one cluster, and our general regret bound in this special case also significantly improves theirs. We conduct experiments on both synthetic and real data, and demonstrate the effectiveness of our algorithm and the advantage of incorporating online clustering method.
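The prefix feedback described above, where the user scans the list top-down and stops at the first satisfactory item, is the cascading click model and can be sketched directly. The function name and the toy attraction values are illustration choices:

```python
import random

def cascade_feedback(ranked_list, attraction, rng):
    """Cascading click model: the user scans the list top-down and
    clicks the first attractive item, if any. Items below the clicked
    position go unexamined, so only a random prefix is observed."""
    for position, item in enumerate(ranked_list):
        if rng.random() < attraction[item]:
            return position          # index of the clicked slot
    return None                      # no click: the whole list was examined

rng = random.Random(0)
click = cascade_feedback(["a", "b", "c"], {"a": 0.0, "b": 1.0, "c": 0.5}, rng)
```

The learner thus observes feedback on every item up to and including the click, but nothing about the items below it.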
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018
Proceedings of the AAAI Conference on Artificial Intelligence, 2020
We consider a problem of stochastic online learning with general probabilistic graph feedback, where each directed edge (i, j) in the feedback graph has probability pij. Two cases are covered. (a) The one-step case, where after playing arm i the learner observes a sample reward feedback of arm j with independent probability pij. (b) The cascade case, where after playing arm i the learner observes feedback of all arms j in a probabilistic cascade starting from i: for each edge (i, j), if arm i is played or observed, then a reward sample of arm j is observed with independent probability pij. Previous works mainly focus on deterministic graphs, which correspond to the one-step case with pij ∈ {0,1}, an adversarial sequence of graphs with certain topology guarantees, or a specific type of random graphs. We analyze the asymptotic lower bounds and design algorithms in both cases. The regret upper bounds of the algorithms match the lower bounds with high probability.
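The cascade case is a percolation process over the feedback graph and can be sketched as a small traversal. The function name and the toy graph below are hypothetical illustrations of the model, not code from the paper:

```python
import random

def probabilistic_cascade(graph_probs, start, rng):
    """Cascade-case feedback: starting from the played arm, each edge
    (i, j) fires independently with probability p_ij, and every arm
    reached this way has its reward observed."""
    observed, frontier = {start}, [start]
    while frontier:
        i = frontier.pop()
        for j, p_ij in graph_probs.get(i, {}).items():
            if j not in observed and rng.random() < p_ij:
                observed.add(j)
                frontier.append(j)
    return observed

# Toy chain 0 -> 1 -> 2 with sure edges: playing 0 observes everything.
g = {0: {1: 1.0}, 1: {2: 1.0}, 2: {}}
```

Setting every p_ij to 0 or 1 recovers the deterministic one-step graphs studied in prior work, as noted in the abstract.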
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019
We generalize the setting of online clustering of bandits by allowing non-uniform distribution over user frequencies. A more efficient algorithm is proposed with simple set structures to represent clusters. We prove a regret bound for the new algorithm which is free of the minimal frequency over users. The experiments on both synthetic and real datasets consistently show the advantage of the new algorithm over existing methods.
arXiv (Cornell University), May 29, 2021
A. The bandit problem with graph feedback, proposed in [Mannor and Shamir, NeurIPS 2011], is mode... more A. The bandit problem with graph feedback, proposed in [Mannor and Shamir, NeurIPS 2011], is modeled by a directed graph = (,) where is the collection of bandit arms, and once an arm is triggered, all its incident arms are observed. A fundamental question is how the structure of the graph affects the min-max regret. We propose the notions of the fractional weak domination number * and the-packing independence number capturing upper bound and lower bound for the regret respectively. We show that the two notions are inherently connected via aligning them with the linear program of the weakly dominating set and its dual-the fractional vertex packing set respectively. Based on this connection, we utilize the strong duality theorem to prove a general regret upper bound (* log | |) 1 3 2 3 and a lower bound Ω (* /) 1 3 2 3 where is the integrality gap of the dual linear program. Therefore, our bounds are tight up to a (log | |) 1 3 factor on graphs with bounded integrality gap for the vertex packing problem including trees and graphs with bounded degree. Moreover, we show that for several special families of graphs, we can get rid of the (log | |) 1 3 factor and establish optimal regret.
ArXiv, 2019
We introduce a new model for online ranking in which the click probability factors into an examin... more We introduce a new model for online ranking in which the click probability factors into an examination and attractiveness function and the attractiveness function is a linear function of a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. A novel algorithm for this setup is analysed, showing that the dependence on the number of items is replaced by a dependence on the dimension, allowing the new algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state-of-the-art.
ArXiv, 2021
We study the problem of stochastic bandits with adversarial corruptions in the cooperative multi-... more We study the problem of stochastic bandits with adversarial corruptions in the cooperative multi-agent setting, where V agents interact with a common K-armed bandit problem, and each pair of agents can communicate with each other to expedite the learning process. In the problem, the rewards are independently sampled from distributions across all agents and rounds, but they may be corrupted by an adversary. Our goal is to minimize both the overall regret and communication cost across all agents. We first show that an additive term of corruption is unavoidable for any algorithm in this problem. Then, we propose a new algorithm that is agnostic to the level of corruption. Our algorithm not only achieves near-optimal regret in the stochastic setting, but also obtains a regret with an additive term of corruption in the corrupted setting, while maintaining efficient communication. The algorithm is also applicable for the single-agent corruption problem, and achieves a high probability reg...
ArXiv, 2021
Motivated by problems of learning to rank long item sequences, we introduce a variant of the casc... more Motivated by problems of learning to rank long item sequences, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence algorithms for it. Our analysis delivers tight regret bounds which, when specialized to vanilla cascading bandits, results in sharper guarantees than previously available in the literature. We evaluate our algorithms on a number of real-world datasets, and show significantly improved empirical performance as compared to known cascading bandit baselines.
Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining
Temporal difference (TD) learning with nonlinear function approximation (nonlinear TD learning fo... more Temporal difference (TD) learning with nonlinear function approximation (nonlinear TD learning for short) permits evaluating policies using neural networks, and is a core component in modern deep reinforcement learning. Though significant advances have been made to improve its effectiveness, little attention has been paid to the data privacy faced when applying it in real applications. To mitigate the privacy concerns in practical applications of nonlinear TD learning, in this paper, we consider preserving its privacy under the notion of differential privacy (DP). This problem is challenging since nonlinear TD learning is usually studied in the formulation of stochastic nonconvex-strongly-concave optimization to obtain finite-sample analysis, which requires simultaneously preserving privacy on both primal and dual sides. To this end, we adopt a single-timescale algorithm, which optimizes both sides using learning rates of the same order, to avoid unnecessary privacy costs. Further, we achieve a good trade-off between the privacy and utility guarantees by perturbing gradients on both sides using Gaussian noises with well-calibrated variances. Consequently, our algorithm achieves rigorous (,)-DP guarantee with the utility upper bounded by O (log(1/)) 1/8 () 1/4 where is the trajectory length and is the ambient dimension of the feature space. Extensive experiments conducted in OpenAI Gym validate the advantages of our algorithm. CCS CONCEPTS • Security and privacy → Formal security models; • Computing methodologies → Machine learning algorithms.
Proceedings of the 31st ACM International Conference on Information & Knowledge Management
The recent advances of conversational recommendations provide a promising way to efficiently elic... more The recent advances of conversational recommendations provide a promising way to efficiently elicit users' preferences via conversational interactions. To achieve this, the recommender system conducts conversations with users, asking their preferences for different items or item categories. Most existing conversational recommender systems for cold-start users utilize a multi-armed bandit framework to learn users' preference in an online manner. However, they rely on a pre-defined conversation frequency for asking about item categories instead of individual items, which may incur excessive conversational interactions that hurt user experience. To enable more flexible questioning about key-terms, we formulate a new conversational bandit problem that allows the recommender system to choose either a key-term or an item to recommend at each round and explicitly models the rewards of these actions. This motivates us to handle a new exploration-exploitation (EE) trade-off between key-term asking and item recommendation, which requires us to accurately model the relationship between key-term and item rewards. We conduct a survey and analyze a real-world dataset to find that, unlike assumptions made in prior works, key-term rewards are mainly affected by rewards of representative items. We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items to efficiently learn which items to recommend. We theoretically prove that our algorithm can reduce the regret bound's dependency on the total number of items from previous work. We validate our proposed algorithms and regret bound on both synthetic and real-world data. CCS CONCEPTS • Information systems → Recommender systems; • Theory of computation → Online learning algorithms.
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Interactive recommender systems (IRS) have received wide attention in recent years. To capture us... more Interactive recommender systems (IRS) have received wide attention in recent years. To capture users' dynamic preferences and maximize their long-term engagement, IRS are usually formulated as reinforcement learning (RL) problems. Despite the promise to solve complex decision-making problems, RL-based methods generally require a large amount of online interaction, restricting their applications due to economic considerations. One possible direction to alleviate this issue is cross-domain recommendation that aims to leverage abundant logged interaction data from a source domain (e.g., adventure genre in movie recommendation) to improve the recommendation quality in the target domain (e.g., crime genre). Nevertheless, prior studies mostly focus on adapting the static representations of users/items. Few have explored how the temporally dynamic user-item interaction patterns transform across domains. Motivated by the above consideration, we propose DACIR, a novel Doubly-Adaptive deep RL-based framework for Cross-domain Interactive Recommendation. We first pinpoint how users behave differently in two domains and highlight the potential to leverage the shared user dynamics to boost IRS. To transfer static user preferences across domains, DACIR enforces consistency of item representation by aligning embeddings into a shared latent space. In addition, given the user dynamics in IRS, DACIR calibrates the dynamic interaction patterns in two domains via reward correlation. Once the double adaptation narrows the cross-domain gap, we are able to learn a transferable policy for the target recommender by leveraging logged data. Experiments on real-world datasets validate the superiority of our approach, which consistently achieves significant improvements over the baselines.
ArXiv, 2021
Motivated by the common strategic activities in crowdsourcing labeling, we study the problem of sequentially eliciting information without verification (EIWV) from a heterogeneous and unknown crowd of workers. We propose a reinforcement learning-based approach that is effective across a wide range of settings, including potential irrationality and collusion among workers. With the aid of a costly oracle and an inference method, our approach dynamically decides the oracle calls and gains robustness even in the presence of frequent collusion. Extensive experiments show the advantage of our approach. Our results also present the first comprehensive experiments of EIWV on large-scale real datasets and the first thorough study of the effects of environmental variables.
Proceedings of the AAAI Conference on Artificial Intelligence
Online learning to rank (OLTR) interactively learns to choose lists of items from a large collection based on certain click models that describe users' click behaviors. Most recent works on this problem focus on the stochastic environment, where item attractiveness is assumed to be invariant during the learning process. In many real-world scenarios, however, the environment can be dynamic or even arbitrarily changing. This work studies the OLTR problem in both stochastic and adversarial environments under the position-based model (PBM). We propose a method based on the follow-the-regularized-leader (FTRL) framework with Tsallis entropy and develop a new self-bounding constraint especially designed for PBM. We prove the proposed algorithm simultaneously achieves O(log T) regret in the stochastic environment and O(m√(nT)) regret in the adversarial environment, where T is the number of rounds, n is the number of items and m is the number of positions. We also provide a lower bound...
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper studies differential privacy (DP) and local differential privacy (LDP) in cascading bandits. Under DP, we propose an algorithm which guarantees ε-indistinguishability and a regret of O((log T)^{1+ξ}) for an arbitrarily small ξ. This is a significant improvement over the previous O(log³ T) regret. Under (ε, δ)-LDP, we relax the K² dependence through the trade-off between privacy budget ε and error probability δ, and obtain a regret of O(K log(1/δ) log T / ε²), where K is the size of the arm subset. This result holds for both the Gaussian mechanism and the Laplace mechanism by analyses of the composition. Our results extend to combinatorial semi-bandits. We show respective lower bounds for DP and LDP cascading bandits. Extensive experiments corroborate our theoretic findings.
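The Laplace mechanism used in the abstract above is standard and easy to sketch. The snippet below is a generic illustration, not the paper's algorithm; the function name `laplace_privatize` and the click-count example are our own assumptions. Adding Lap(sensitivity/ε) noise to a released statistic makes that release ε-differentially private.

```python
import numpy as np

def laplace_privatize(value, sensitivity, epsilon, rng):
    """Release `value` with additive Laplace noise of scale sensitivity/epsilon;
    the released statistic is then epsilon-differentially private."""
    return value + rng.laplace(0.0, sensitivity / epsilon)

# Example: privatize an empirical click count (sensitivity 1, budget epsilon = 1).
rng = np.random.default_rng(0)
noisy_count = laplace_privatize(42.0, sensitivity=1.0, epsilon=1.0, rng=rng)
```

Smaller ε means stronger privacy but larger noise, which is exactly the privacy-regret trade-off the paper quantifies.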
Proceedings of the ACM Web Conference 2022
Conversational recommender systems (CRSs) have been proposed recently to mitigate the cold-start problem suffered by traditional recommender systems. By introducing conversational key-terms, existing conversational recommenders can effectively reduce the need for extensive exploration and elicit user preferences faster and more accurately. However, existing conversational recommenders that leverage key-terms rely heavily on the availability and quality of those key-terms, and their performance can degrade significantly when the key-terms are incomplete or poorly labeled. This usually happens when new items are consistently incorporated into the system, since acquiring well-labeled key-terms involves costly human effort. Besides, existing CRS methods leverage the feedback on different conversational key-terms separately, without considering the underlying relations between the key-terms. In this case, the learning of the conversational recommenders is sample-inefficient, especially when there is a large number of candidate conversational key-terms. In this paper, we propose a knowledge-aware conversational preference elicitation framework and a bandit-based algorithm, GraphConUCB. To achieve efficient preference elicitation given items with incompletely labeled key-terms, our algorithm leverages the underlying relations between the key-terms, guided by the knowledge graph. Being knowledge-aware, our algorithm propagates the user preferences via a pseudo graph feedback module, which also accelerates exploration in the large action space of key-terms and improves the conversational sample efficiency. To select the most informative conversational key-terms in the graphs to conduct conversations, we further devise a graph-based optimal design module which leverages the graph structure.
We provide the theoretical analysis of the regret upper bound for GraphConUCB.
Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022
Strategic behavior against sequential learning methods, such as "click framing" in real recommendation systems, has been widely observed. Motivated by such behavior, we study the problem of combinatorial multi-armed bandits (CMAB) under strategic manipulations of rewards, where each arm can modify the emitted reward signals for its own interest. This characterization of the adversarial behavior is a relaxation of previously well-studied settings such as adversarial attacks and adversarial corruption. We propose a strategic variant of the combinatorial UCB algorithm, which has a regret of at most O(m log T + m B_max) under strategic manipulations, where T is the time horizon, m is the number of arms, and B_max is the maximum budget of an arm. We provide lower bounds on the budget needed for arms to incur certain regret of the bandit algorithm. Extensive experiments on online worker selection for crowdsourcing systems, online influence maximization and online recommendations with both synthetic and real datasets corroborate our theoretical findings on robustness and regret bounds, in a variety of regimes of manipulation budgets.
ArXiv, 2020
We analyze the Gambler's problem, a simple reinforcement learning problem where the gambler has the chance to double or lose their bets until the target is reached. This is an early example introduced in the reinforcement learning textbook by Sutton and Barto (2018), where they mention an interesting pattern of the optimal value function, with high-frequency components and repeating non-smooth points, but without further investigation. We provide the exact formula for the optimal value function for both the discrete and the continuous case. Simple as it might seem, the value function is pathological: fractal, self-similar, non-smooth on any interval, with zero derivative almost everywhere, and not expressible in elementary functions. Sharing these properties with the Cantor function, it holds a complexity that has been uncharted thus far. With the analysis, our work could yield insights into improving value function approximation, Q-learning, and gradient-based algorithms in rea...
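The discrete Gambler's problem is easy to reproduce numerically. Below is a minimal sketch of standard value iteration for Sutton and Barto's Example 4.3; the function name and parameter defaults (goal = 100, heads probability p_h = 0.4) are our own choices for illustration, not taken from the paper.

```python
import numpy as np

def gambler_value_iteration(goal=100, p_h=0.4, tol=1e-10):
    """Value iteration for the Gambler's problem (Sutton & Barto, Example 4.3).
    State s is the gambler's capital; V[0] = 0 and V[goal] = 1 are terminal."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Stakes are capped so the capital stays within [0, goal].
            best = max(
                p_h * V[s + a] + (1.0 - p_h) * V[s - a]
                for a in range(1, min(s, goal - s) + 1)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = gambler_value_iteration()
```

With a subfair coin (p_h < 0.5), bold play is optimal, so V[50] converges to p_h = 0.4; plotting V exhibits the self-similar, non-smooth pattern the abstract refers to.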
Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021
Conversational recommender systems elicit user preferences via interactive conversations. By introducing conversational key-terms, existing conversational recommenders can effectively reduce the need for extensive exploration in a traditional interactive recommender. However, existing approaches that elicit user preferences via key-terms still have limitations. First, the key-term data of the items needs to be carefully labeled, which requires considerable human effort. Second, the number of human-labeled key-terms is limited and their granularity is fixed, while the elicited user preferences usually move from coarse-grained to fine-grained during the conversations. In this paper, we propose a clustering of conversational bandits algorithm. To avoid the human labeling effort and automatically learn key-terms of the proper granularity, we cluster the items online and generate meaningful key-terms for them during the conversational interactions. Our algorithm is general and can also be used for user clustering when feedback from multiple users is available, which further leads to more accurate learning and generation of conversational key-terms. We analyze the regret bound of our learning algorithm. In the empirical evaluations, without using any human-labeled key-terms, our algorithm effectively generates meaningful coarse-to-fine grained key-terms and performs as well as or better than the state-of-the-art baseline. CCS CONCEPTS • Information systems → Recommender systems; Users and interactive retrieval; • Computing methodologies → Online learning settings.
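Online clustering of items, as described above, can be sketched generically. The greedy nearest-centroid scheme below is an illustrative stand-in for the idea, not the paper's algorithm: each arriving item joins the closest existing cluster if it is near enough, otherwise it seeds a new cluster.

```python
import numpy as np

def online_cluster(points, threshold):
    """Greedy online clustering: each arriving point joins the nearest
    existing cluster if its centroid is within `threshold`, otherwise it
    starts a new cluster. Centroids are updated incrementally."""
    centroids, counts, labels = [], [], []
    for p in points:
        p = np.asarray(p, dtype=float)
        if centroids:
            dists = [np.linalg.norm(p - c) for c in centroids]
            k = int(np.argmin(dists))
            if dists[k] <= threshold:
                counts[k] += 1
                # Incremental mean update of the chosen centroid.
                centroids[k] += (p - centroids[k]) / counts[k]
                labels.append(k)
                continue
        centroids.append(p.copy())
        counts.append(1)
        labels.append(len(centroids) - 1)
    return labels
```

In the paper's setting the clusters would then be summarized into key-terms; here the sketch only shows the clustering step itself.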
Proceedings of the 29th ACM International Conference on Multimedia, 2021
Proceedings of the AAAI Conference on Artificial Intelligence
We consider a new setting of online clustering of contextual cascading bandits, an online learning problem where the underlying cluster structure over users is unknown and needs to be learned from random prefix feedback. More precisely, a learning agent recommends an ordered list of items to a user, who checks the list and stops at the first satisfactory item, if any. We propose an algorithm, CLUB-cascade, for this setting and prove an n-step regret bound of order O(√n). Previous work corresponds to the degenerate case of only one cluster, and our general regret bound in this special case also significantly improves on theirs. We conduct experiments on both synthetic and real data, and demonstrate the effectiveness of our algorithm and the advantage of incorporating the online clustering method.
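The cascade click model underlying this setting is simple to simulate. A minimal sketch (function and variable names are ours, not from the paper): the user scans the ranked list top-down and clicks the first item found attractive, after which scanning stops.

```python
import random

def cascade_feedback(ranked_items, attraction, rng):
    """Cascade click model: the user scans positions top-down and clicks the
    first item found attractive (with probability attraction[item]), then stops.
    Returns the clicked position, or None if the user clicks nothing."""
    for pos, item in enumerate(ranked_items):
        if rng.random() < attraction[item]:
            return pos
    return None
```

The "random prefix feedback" in the abstract corresponds to observing only the items up to and including the clicked position.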
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018
Proceedings of the AAAI Conference on Artificial Intelligence, 2020
We consider a problem of stochastic online learning with general probabilistic graph feedback, where each directed edge (i,j) in the feedback graph has probability p_ij. Two cases are covered. (a) The one-step case: after playing arm i, the learner observes a sample reward feedback of arm j with independent probability p_ij. (b) The cascade case: after playing arm i, the learner observes feedback of all arms j in a probabilistic cascade starting from i: for each edge (i,j), if arm i is played or observed, then a reward sample of arm j is observed with independent probability p_ij. Previous works mainly focus on deterministic graphs, which correspond to the one-step case with p_ij ∈ {0,1}, an adversarial sequence of graphs with certain topology guarantees, or a specific type of random graphs. We analyze the asymptotic lower bounds and design algorithms in both cases. The regret upper bounds of the algorithms match the lower bounds with high probability.
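The one-step feedback case (a) can be simulated directly. A minimal sketch, assuming the feedback graph is given as a nested dict mapping each arm to its out-neighbours and edge probabilities (this representation and the function name are our own illustration):

```python
import random

def one_step_feedback(played, p, rng):
    """One-step probabilistic graph feedback: after playing arm `played`,
    each out-neighbour j is observed independently with probability
    p[played][j]. Returns the sorted list of observed arms."""
    return sorted(j for j, pij in p[played].items() if rng.random() < pij)
```

The cascade case (b) would additionally recurse from every observed arm; the deterministic-graph setting of prior work is recovered by restricting every p_ij to 0 or 1.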
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019
We generalize the setting of online clustering of bandits by allowing non-uniform distribution over user frequencies. A more efficient algorithm is proposed with simple set structures to represent clusters. We prove a regret bound for the new algorithm which is free of the minimal frequency over users. The experiments on both synthetic and real datasets consistently show the advantage of the new algorithm over existing methods.