Thesis defense - Gabriel Sulem (LNC): "An ordinal generative model of Bayesian inference for Human decision-making in continuous reward environments"


14 September 2017
At 2:30 pm, Salle des Actes


Jury:
Peter Dayan (UCL, London)
Pierre-Yves Oudeyer (INRIA Bordeaux)
Mathias Pessiglione (ICM, Paris)
Mehdi Khamassi (ISIR, Paris)
Etienne Koechlin (ENS, Paris)

Abstract:
This thesis aims at understanding how human behavior adapts to environments where rewards are continuous. Many previous studies have examined environments with binary rewards (win/lose) and have shown that human behavior can be accounted for by Bayesian inference algorithms. Bayesian inference is very efficient when there is a discrete set of possible environmental states to identify and when the events to be classified (here, rewards) are also discrete.
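
For intuition only, the Python sketch below illustrates why the discrete case is tractable: with a small set of candidate environmental states and binary outcomes, Bayes' rule reduces to reweighting a short vector of beliefs. The candidate states, prior, and reward sequence are hypothetical, not taken from the thesis.

```python
import numpy as np

# Hypothetical example: infer which of a few discrete "states" (here, possible
# reward probabilities of one action) generated a stream of binary outcomes.
candidate_states = np.array([0.2, 0.5, 0.8])                # assumed possible states
posterior = np.ones_like(candidate_states) / len(candidate_states)  # uniform prior

def update(posterior, reward):
    """One step of Bayes' rule for a binary (win = 1 / lose = 0) outcome."""
    likelihood = candidate_states if reward == 1 else 1.0 - candidate_states
    unnormalized = posterior * likelihood
    return unnormalized / unnormalized.sum()

for r in [1, 1, 0, 1]:          # illustrative sequence of binary rewards
    posterior = update(posterior, r)

print(posterior)                # belief over the three candidate states
```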

A general Bayesian algorithm can still operate in a continuous environment provided that it is based on a “generative” model of the environment: a structural assumption about environmental contingencies that limits the number of possible interpretations of observations and structures the aggregation of data across time. By contrast, reinforcement learning algorithms remain efficient with continuous reward scales by incrementally building value expectations for each action and selecting the best options.
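
As a point of contrast only, the following sketch shows a generic delta-rule reinforcement learner handling continuous rewards without any generative model of the environment; the learning rate, softmax temperature, and reward means are illustrative choices, not values or models from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
values = np.zeros(n_actions)    # running value estimate per action
alpha, beta = 0.1, 3.0          # illustrative learning rate and softmax temperature

def choose(values):
    """Softmax choice over current value estimates."""
    p = np.exp(beta * values - np.max(beta * values))
    return rng.choice(n_actions, p=p / p.sum())

def learn(values, action, reward):
    """Delta rule: move the chosen action's value toward the observed reward."""
    values[action] += alpha * (reward - values[action])

for _ in range(100):
    a = choose(values)
    r = rng.normal(loc=[1.0, 2.0, 3.0][a], scale=1.0)   # hypothetical continuous rewards
    learn(values, a, r)

print(values)                   # estimates drift toward the true reward means
```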

The issue we address in this thesis is to identify which kind of generative model of continuous rewards underlies human decision-making within a Bayesian inference framework.
One putative hypothesis is that each action delivers rewards as noisy samples of the action's true value, typically following a Gaussian distribution. Statistics computed on a few samples then suffice to infer the information relevant for subsequent choices (mean and standard deviation). We propose instead a general generative model that relies on assumptions about the relationship between the values of the available actions and on the existence of a reliable ordering of these action values. This structural assumption makes it possible to mentally simulate counterfactual rewards and to learn simultaneously the reward distributions associated with all actions. It limits the need for exploratory choices, and changes in environmental contingencies are detected when obtained rewards depart from the learned distributions.

To validate our model, we ran three behavioral experiments on healthy subjects in settings where the reward distributions associated with actions were continuous and changed over time. Our proposed model correctly described participants' behavior in all three tasks, whereas competing models, in particular Gaussian models, failed to do so.
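
To make the Gaussian alternative hypothesis concrete (this is the baseline described above, not the ordinal model proposed in the thesis), the following sketch estimates a mean and standard deviation per action from a few observed rewards; the sample values are invented for illustration.

```python
import numpy as np

# Hypothetical rewards observed so far for each of three actions.
samples = {0: [4.2, 5.1, 4.8],
           1: [6.0, 5.5],
           2: [3.9, 4.4, 4.1, 4.0]}

# Under the Gaussian hypothesis, these two statistics summarize each action.
estimates = {a: (np.mean(r), np.std(r, ddof=1)) for a, r in samples.items()}
for action, (mu, sigma) in estimates.items():
    print(f"action {action}: mean {mu:.2f}, sd {sigma:.2f}")
```
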
Our results extend the implementation of Bayesian algorithms to continuous rewards, which are frequent in everyday environments. Our proposed model establishes which rewards are “good” and desirable given the current context. Additionally, it selects each action according to the probability that it is better than the others, rather than following the actions' expected values. Lastly, our model meets evolutionary constraints by adapting quickly while performing correctly in many different settings, including those in which the assumptions of the generative model do not hold.
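
One generic way to implement the selection rule mentioned above, choosing by the probability of being the best action rather than by expected value, is Monte Carlo sampling from each action's learned reward distribution. The sketch below assumes hypothetical Gaussian beliefs for simplicity and is not the thesis' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned beliefs: (mean, sd) of the reward distribution per action.
beliefs = [(5.0, 1.0), (5.5, 2.0), (4.0, 0.5)]

def prob_best(beliefs, n_samples=10_000):
    """Monte Carlo estimate of P(action i yields the highest reward)."""
    draws = np.column_stack([rng.normal(mu, sd, n_samples) for mu, sd in beliefs])
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(beliefs)) / n_samples

# Note how a high-variance action can win more often than a comparison of
# expected values alone would suggest.
print(prob_best(beliefs))
```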