Bandit Algorithms: Fairness, Welfare, and Applications in Causal Inference

By:
Contributor(s):
Material type: Text
Publication details: Bangalore : Indian Institute of Science, 2024.
Description: 130 p. : col. ill. ; e-Thesis
Subject(s):
DDC classification:
  • 005.133  SAW
Online resources:
Dissertation note: MTech(Res); 2024; Computer Science and Automation

Summary: We study regret in online learning from a welfarist perspective and explore an application of bandit algorithms in causal inference. We introduce Nash regret, which measures the difference between the optimal action choices and the algorithm's performance in terms of the Nash social welfare function. By providing bounds on Nash regret, we establish principled fairness guarantees for online learning algorithms. We investigate different online learning settings and derive tight bounds on Nash regret. Furthermore, we study the problem of finding optimal interventions in causal graphs, providing theoretical guarantees within the context of the causal bandit problem.

In the first part, we focus on the classic multi-armed bandit (MAB) framework and develop an algorithm that achieves a tight Nash regret bound. Specifically, given a horizon of play $T$, our algorithm achieves a Nash regret of $O\left(\sqrt{\frac{k \log T}{T}}\right)$, where $k$ represents the number of arms in the MAB instance. The lower bound on average regret applies to Nash regret as well, making our guarantee essentially tight. Additionally, we propose an anytime algorithm with a Nash regret guarantee of $O\left(\sqrt{\frac{k \log T}{T}} \log T\right)$.

In the second part, we study the stochastic linear bandits problem with non-negative, $\nu$-sub-Poisson rewards. We present an algorithm that achieves a Nash regret bound of $O\left(\sqrt{\frac{d\nu}{T}} \log(T|X|)\right)$, where $X$ denotes the set of arms in ambient dimension $d$ and $T$ represents the number of rounds. Furthermore, for linear bandit instances with potentially infinite arm sets, we derive a Nash regret upper bound of $O\left(\frac{d^{5/4}\nu^{1/2}}{\sqrt{T}} \log T\right)$. Our algorithm builds upon the successive elimination method and incorporates novel techniques such as tailored concentration bounds and sampling via the John ellipsoid in conjunction with the Kiefer-Wolfowitz optimal design.

In the third part, we investigate Nash regret in the context of online concave optimization and the Experts problem, assuming adversarially chosen reward functions. Our algorithm achieves a Nash regret of $O\left(\frac{\log N}{T}\right)$ for the Experts problem, where $N$ is the number of experts. We provide a lower bound for this setting that is essentially tight with respect to the upper bound. Additionally, for online concave optimization, we provide a Nash regret guarantee of $O\left(\frac{d \log T}{T}\right)$, where $d$ denotes the ambient dimension.

In the final part of this thesis, we focus on the causal bandit problem, which involves identifying near-optimal interventions in a causal graph. Previous works have provided a bound of $\widetilde{O}(N/\sqrt{T})$ on simple regret for causal graphs with $N$ vertices, constant in-degree, and Bernoulli random variables. In this work, we present a new approach for exploration using covering interventions. This allows us to achieve a significant improvement and provide a tighter simple regret guarantee of $\widetilde{O}(\sqrt{N/T})$. Furthermore, we extend our algorithm to handle the most general case of causal graphs with unobservable variables.
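On a hedged reading of the summary (the precise definitions are in the thesis itself), Nash regret replaces the arithmetic mean used in ordinary average regret with the geometric mean, i.e. the Nash social welfare, of the per-round expected rewards. Writing $\mu^{*}$ for the optimal mean reward and $\mathbb{E}[\mu_{I_t}]$ for the expected reward of the arm $I_t$ played in round $t$, the two notions read:

\[
\mathrm{AR}_T \;=\; \mu^{*} \;-\; \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\mu_{I_t}\big],
\qquad
\mathrm{NR}_T \;=\; \mu^{*} \;-\; \Big(\prod_{t=1}^{T} \mathbb{E}\big[\mu_{I_t}\big]\Big)^{1/T}.
\]

By the AM-GM inequality the geometric mean never exceeds the arithmetic mean, so $\mathrm{NR}_T \ge \mathrm{AR}_T$; a bound on Nash regret is therefore the stronger, fairness-aware guarantee referred to in the summary.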
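As a purely illustrative companion to the first part of the summary, the sketch below runs a vanilla successive-elimination bandit on a synthetic Bernoulli instance and reports plug-in average regret alongside plug-in Nash regret. It is not the thesis algorithm; the arm means, horizon, and confidence radius are hypothetical choices made only to make the two regret notions concrete.

# Illustrative sketch (not the thesis algorithm): vanilla successive
# elimination on synthetic Bernoulli arms, with plug-in average and Nash
# regret computed from the true means of the arms actually pulled.
import numpy as np

rng = np.random.default_rng(0)

def successive_elimination(means, T):
    """Pull each active arm once per round; eliminate arms whose upper
    confidence bound falls below the best lower confidence bound.
    Returns the true means of the pulled arms, in order of play."""
    k = len(means)
    active = list(range(k))
    counts = np.zeros(k)
    sums = np.zeros(k)
    pulled_means = []
    t = 0
    while t < T:
        for a in list(active):
            if t >= T:
                break
            reward = rng.binomial(1, means[a])      # Bernoulli reward
            counts[a] += 1
            sums[a] += reward
            pulled_means.append(means[a])
            t += 1
        emp = sums[active] / np.maximum(counts[active], 1)
        rad = np.sqrt(np.log(max(T, 2)) / np.maximum(counts[active], 1))
        best_lcb = np.max(emp - rad)
        active = [a for a, m, r in zip(active, emp, rad) if m + r >= best_lcb]
    return np.array(pulled_means)

means = np.array([0.9, 0.8, 0.5, 0.3])   # hypothetical arm means
T = 20000
pulled = successive_elimination(means, T)

avg_regret = means.max() - pulled.mean()                       # arithmetic mean
nash_regret = means.max() - np.exp(np.mean(np.log(pulled)))    # geometric mean
print(f"average regret ~ {avg_regret:.4f}, Nash regret ~ {nash_regret:.4f}")

Because the geometric mean of the pulled means never exceeds their arithmetic mean, the printed Nash regret is always at least the average regret, matching the AM-GM observation above.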
Holdings
Item type: Thesis
Current library: JRD Tata Memorial Library
Call number: 005.133 SAW
URL: Link to resource
Status: Not for loan
Barcode: ET00508

Includes bibliographical references.



