Information per unit of interaction in stochastic sequential decision making

Fabien Pesquerel

Résumé

In this thesis, we wonder about the rate at which one can solve an unknown stochastic problem. To this purpose we introduce two research fields known as Bandit and Reinforcement Learning. In these two settings, a learner must sequentially makes decision that will affect a reward signal that the learner receive. The learner does not know the environment with which it is interaction, yet wish to maximize its average reward in the long run. More specifically, we are interested in studying some form of stochastic decision problem under the average-reward criterion in which a learning algorithm interacts sequentially with a dynamical system, without any reset, in a single and infinite sequence of observations, actions, and rewards while trying to maximize its total accumulated rewards over time. We first introduce Bandit, in which the set of decision is constant and introduce what is meant by solving the problem. Amongst those learners, some are better than all the others, and called optimal. We first focus on how to make the most out of each interaction with the system by revisiting an optimal algorithm, and reduce its numerical complexity. Therefore, the information extracted from each sample, per-time-step, is larger since the optimality remains. Then we study an interesting structured problem in which one can exploit the structure without estimating it. Afterward we introduce Reinforcement Learning, in which the decision a learner can make depend on a notion of state. Each time a learner makes a decision, it receives a reward and the state change according to transition law on the set of states. In some setting, known as ergodic, an optimal rate of solving is known and we introduce a knew algorithm that we can prove to be optimal and show to be numerically efficient. In a final chapter, we make a step in the direction of removing the ergodic assumption by considering the a priori simpler problem where the transitions are known. Yet, correctly understanding the rate at which information can be acquired about an optimal solution is already not easy.

Dans cette thèse, nous nous interrogeons sur la vitesse à laquelle on peut résoudre un problème stochastique inconnu. À cette fin, nous introduisons deux domaines de recherche connus sous le nom de Bandit et d’Apprentissage par Renforcement. Dans ces deux champs d’étude, un agent doit séquentiellement prendre des décisions qui affecteront un signal de récompense qu’il reçoit. L’agent ne connaît pas l’environnement avec lequel il interagit, mais pourtant souhaite maximiser sa récompense moyenne à long terme. Plus précisément, on étudie des problèmes de décision stochastique dans lesquels l’agent cherche à maximiser sa récompense moyenne. Dans ces problèmes dit d’apprentissage stochastique, l’agent interagit séquentiellement avec un système dynamique, sans aucune réinitialisation, dans une suite unique, infinie et ininterrompue d’observations, d’actions et de récompenses tout en essayant de maximiser ses récompenses totales accumulées au fil du temps. Nous commençons par présenter le problème de Bandit, dans lequel l’ensemble des décisions est constant, et définissons ce que l’on entend par résoudre le problème. Parmi ces agents, certains sont meilleurs que tous les autres et sont dits optimaux. Nous nous concentrons d’abord sur la manière de tirer le maximum d’information de chaque interaction avec le système en revisitant un algorithme optimal et en réduisant sa complexité numérique. Tout en conservant l’optimalité de la méthode initiale, la méthode proposée réduit la complexité numérique et permet donc d’extraire d’avantage d’information d’un échantillon par unité de temps de calcul. Nous étudions ensuite un problème structuré intéressant dans lequel il est possible d’exploiter la structure sans l’estimer. Ensuite nous nous consacrons à l’Apprentissage par Renforcement, dans lequel les décisions qu’un agent peut prendre dépendent d’une notion d’état. Chaque fois qu’un agent prend une décision, il reçoit une récompense et l’état change selon une loi de transition sur les états. Sous une certaine hypothèse, dite ergodique, un taux optimal de résolution par unité d’interaction est connu et nous introduisons un algorithme dont nous pouvons prouver qu’il est optimal et nous montrons qu’il est numériquement efficace. Dans un dernier chapitre, nous tentons de mieux comprendre ce qu’implique la suppression de l’hypothèse d’ergodicité. Nous considérons le problème a priori plus simple où les transitions sont connues. Cependant, même sous cette hypothèse, il n’est pas évident de comprendre correctement la vitesse à laquelle des informations peuvent être acquises sur une solution optimale.

Origine	Fichiers produits par l'(les) auteur(s)
licence	Paternité

Information per unit of interaction in stochastic sequential decision making

Quantité d’information par unité d’interaction en apprentissage séquentiel stochastique

Résumé

Mots clés

Domaines

Dates et versions

Licence

Identifiants

Citer

Exporter

Partager