Strategies and Representations Developed by Deep Reinforcement Learning Agents in a Risk-Taking Learning Task
Andy Liu¹, Elliot Smith², Alla Borisyuk*¹
¹ Department of Mathematics, University of Utah, Salt Lake City, USA
² Department of Neurosurgery, University of Utah, Salt Lake City, USA
*Email: borisyuk@math.utah.edu
Introduction
We study the Balloon Analogue Risk Task (BART) [1], a risky decision-making
task designed to measure impulsivity in humans. Subjects inflate a virtual balloon
until they choose to stop. Points are awarded based on the final balloon size,
but if the balloon inflates past its maximum, it pops and no points are
gained. BART performance is used to measure impulsivity, which correlates with
naturalistic impulsive behaviors such as drug use and gambling [2].
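The trial structure described above can be sketched as a minimal simulation. This is an illustrative toy model only: the pop threshold here is drawn uniformly at random, and the maximum size and scoring rule are hypothetical placeholders, not the parameters used in the actual task.

```python
import random

def bart_trial(stop_at, max_size=10, rng=random.Random(0)):
    """One BART trial: inflate until the chosen stop point or a pop.

    `max_size` and the uniform pop distribution are illustrative
    assumptions; the original task uses a fixed pop-point schedule.
    """
    pop_at = rng.randint(1, max_size)  # hidden pop point for this balloon
    size = 0
    while size < stop_at:              # subject keeps inflating until stopping
        size += 1
        if size >= pop_at:
            return 0                   # balloon popped: no points banked
    return size                        # banked points grow with final size
```

A risk-seeking policy (large `stop_at`) pops more often and banks nothing on those trials, which is the trade-off the task is designed to probe.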
We train deep reinforcement learning (RL) agents on several
variations of BART. Our aim is to understand the diversity of strategies that
arise in individuals and artificial agents learning BART and the internal
representations in brains and neural networks that support these strategies.
Methods
An agent's neural network consists of a shared feed-forward
layer and an RNN layer, which then splits into actor and critic feed-forward layers
that output the policy and value estimates. Agents are trained with a
standard proximal policy optimization (PPO) [3] method. (State, action, reward) tuples
are collected at each time step of the simulation to form a batch of data, PPO
is used to optimize the neural network on this batch, and the
collection-and-training loop repeats.
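A minimal forward pass through this architecture might look as follows. The layer sizes, the Elman-style RNN update, and the two-action (inflate/stop) output are illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: observation, shared layer, RNN state, action count.
OBS, HID, RNN_H, ACTIONS = 4, 32, 32, 2

W_in = rng.normal(0, 0.1, (HID, OBS))     # shared feed-forward layer
W_x = rng.normal(0, 0.1, (RNN_H, HID))    # RNN input weights (Elman-style,
W_h = rng.normal(0, 0.1, (RNN_H, RNN_H))  # an assumption; type unspecified)
W_pi = rng.normal(0, 0.1, (ACTIONS, RNN_H))  # actor head
W_v = rng.normal(0, 0.1, (1, RNN_H))         # critic head

def step(obs, h):
    """One forward pass: shared layer -> RNN -> (policy, value)."""
    z = np.tanh(W_in @ obs)          # shared feed-forward layer
    h = np.tanh(W_x @ z + W_h @ h)   # recurrent state carries trial history
    logits = W_pi @ h
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()           # softmax over {inflate, stop}
    value = (W_v @ h)[0]             # critic's value estimate of the state
    return policy, value, h

h = np.zeros(RNN_H)                  # recurrent state at episode start
policy, value, h = step(np.zeros(OBS), h)
```

In training, tuples from such forward passes are batched and the PPO clipped-surrogate objective updates all weights jointly; the recurrent state is what lets agents condition on earlier trials.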
We analyze agents' neural network activity using
dimensionality reduction and k-means clustering to reveal natural groupings of
agents with different behavioral and strategic tendencies.
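The analysis pipeline can be sketched with PCA followed by k-means. The per-agent feature matrix below is synthetic, and the choice of PCA, two components, and k = 2 are placeholder assumptions; the abstract does not specify which dimension-reduction method or feature set was used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: rows are agents, columns are summary statistics of
# network activity (the actual features are not specified in the abstract).
X = np.vstack([rng.normal(0, 1, (20, 10)),
               rng.normal(3, 1, (20, 10))])

# Dimension reduction: PCA via SVD on the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                    # project onto first two components

# Minimal k-means on the reduced coordinates (deterministic init).
k = 2
centers = Z[[0, len(Z) - 1]]         # one seed from each end of the data
for _ in range(50):
    dists = ((Z[:, None] - centers[None]) ** 2).sum(-1)
    labels = np.argmin(dists, axis=1)        # assign each agent to a cluster
    centers = np.array([Z[labels == j].mean(axis=0) for j in range(k)])
```

The resulting cluster labels give candidate groupings of agents whose behavioral and strategic tendencies can then be compared.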
Results
We find that agents develop qualitatively distinct
strategies, including a cautious-exploration strategy, in which agents
gradually increase their inflation times (ITs), and a bimodal strategy that
trades precision for efficiency.
A detailed analysis reveals node- and network-level activity patterns
that correlate with these agent strategies. In particular, the efficient
strategy exploits a simplifying bimodal manifold at the level of whole-layer
activity, while cautious explorers tend to have a higher proportion
of nodes that are sensitive to recent trial outcomes.
We also find a type of agent with few network nodes
encoding balloon IT history, and we hypothesize that their representations hint
at possible pop-driven exploration policies.
Discussion
We relate our findings to human data, collected from
neurosurgical patients with intracranial electrodes performing BART. Some of
the human strategies are comparable to those found in RL agents. We predict corresponding
neuronal activity patterns to search for in human brain recordings. First, we
suggest looking for individual neurons whose mean firing rate correlates with
recently banked balloon trials; we predict that such neurons should be more
common in individuals using cautious exploration. We also propose a new
BART variation where reward information is sparser, which gave rise to bimodal
ITs in agents. In this variation, one would then look for bimodal manifolds in
dimension-reduced neural activity.
We thank members of the Smith and Borisyuk groups for productive discussions. The support and resources of the Center for High Performance Computing at the University of Utah are also gratefully acknowledged.
[1] Lejuez, C.W., Read, J.P., et al. (2002) Evaluation of a behavioral measure of risk taking: The Balloon Analogue Risk Task (BART). Journal of Experimental Psychology: Applied, 8(2), 75.
[2] Lejuez, C.W., Aklin, W.M., Jones, H.A., Richards, J.B., Strong, D.R., Kahler, C.W., & Read, J.P. (2003) The Balloon Analogue Risk Task (BART) differentiates smokers and nonsmokers. Experimental and Clinical Psychopharmacology, 11(1), 26.
[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017) Proximal policy optimization algorithms. arXiv:1707.06347.