Claire Vernade


Claire is a Group Leader at the University of Tuebingen, in the Cluster of Excellence Machine Learning for Science. She was awarded an Emmy Noether award under the AI Initiative call in 2022.

Her research is on sequential decision making. It mostly spans bandit problems and theoretical Reinforcement Learning, but her interests extend to Learning Theory and principled learning algorithms more broadly. While keeping concrete problems in mind, she focuses on theoretical approaches, aiming for provably optimal algorithms.

Previously, she was a Research Scientist at DeepMind in London, UK, from November 2018, in the Foundations team led by Prof. Csaba Szepesvari. She did a post-doc in 2018 with Prof. Alexandra Carpentier at the University of Magdeburg in Germany while working part-time as an Applied Scientist at Amazon in Berlin. She received her PhD from Telecom ParisTech in October 2017, under the guidance of Prof. Olivier Cappé.

I am co-leading the Women in Learning Theory initiative; please reach out if you have questions or if you'd like to help.

contact: first.last @ uni-tuebingen . de



For the full list of my publications, see my Google Scholar profile:

Research projects and selected publications

Lifelong Reinforcement Learning

In standard RL models, the dynamics of the environment and the reward function to be optimized are assumed to be fixed (and unknown). What if they can vary, perhaps with some constraints, or within some boundaries? What if the agent must actually solve a sequence of tasks that share some similarities? What is the cost of meta-learning, and how should exploration be handled when the agent knows from the get-go that there will be many different tasks? These are Meta-Reinforcement Learning questions, and I use the term Lifelong to insist on the sequential nature of the process (the agent does not see all the tasks at once and cannot jump freely from one to another). Initial ideas for these questions were sparked by reflections on Non-stationary bandits, where the environment can slowly vary, and Sparse bandits, where only a subset of the actions have positive value.
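To make the non-stationary setting concrete, here is a minimal illustrative sketch (my own toy code, not taken from any paper): a sliding-window variant of UCB facing two Bernoulli arms whose means slowly swap over the horizon, so the agent must keep exploring to track the change. All names and parameter choices here are illustrative.

```python
import numpy as np

def sliding_window_ucb(means_over_time, window=200, seed=0):
    """Sliding-window UCB on a non-stationary Bernoulli bandit.

    means_over_time: array of shape (T, K) with the (drifting) arm means.
    Only rewards from the last `window` rounds enter the estimates,
    so old observations are forgotten and slow drift can be tracked.
    """
    rng = np.random.default_rng(seed)
    T, K = means_over_time.shape
    arms, rewards = [], []
    for t in range(T):
        lo = max(0, t - window)                  # restrict stats to the window
        counts, sums = np.zeros(K), np.zeros(K)
        for a, r in zip(arms[lo:], rewards[lo:]):
            counts[a] += 1
            sums[a] += r
        if t < K:
            a = t                                # play each arm once first
        else:
            ucb = sums / np.maximum(counts, 1) + np.sqrt(
                2 * np.log(min(t, window) + 1) / np.maximum(counts, 1))
            ucb[counts == 0] = np.inf            # force exploration of forgotten arms
            a = int(np.argmax(ucb))
        r = float(rng.random() < means_over_time[t, a])
        arms.append(a)
        rewards.append(r)
    return np.array(arms), np.array(rewards)

# two arms whose means slowly cross over the horizon
T = 2000
t = np.linspace(0, 1, T)
means = np.column_stack([0.7 - 0.4 * t, 0.3 + 0.4 * t])
arms, rewards = sliding_window_ucb(means)
```

The window size trades off reactivity against statistical accuracy: a vanilla UCB agent (window equal to the horizon) would lock onto the initially best arm and miss the crossover.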

Twisting bandit models for more realistic applications

During my PhD, and later at Amazon and DeepMind, I studied various aspects of the beautiful sequential learning problems that are Bandits. I started by exploring combinatorial action spaces (e.g. ranking problems) and the impact of delays and their censoring effect on binary observations.
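As a toy illustration of the censoring effect (my own sketch, with made-up parameters): think of ad conversions that occur with some probability but are only revealed after a random delay, and are never observed if the delay exceeds the logging window. The naive conversion-rate estimate is then biased downward, and the bias can be corrected when the delay distribution is known.

```python
import numpy as np

def simulate_delayed_feedback(p=0.3, mean_delay=5.0, window=10.0, T=1000, seed=0):
    """Delayed, censored binary feedback (e.g. ad conversions).

    Each round yields a conversion with probability p, revealed after an
    exponential delay; conversions with delay > `window` are never
    observed (censoring), biasing the naive estimate downward.
    """
    rng = np.random.default_rng(seed)
    converted = rng.random(T) < p
    delays = rng.exponential(mean_delay, T)
    observed = converted & (delays <= window)
    naive_rate = observed.mean()                 # biased: censored conversions missing
    # de-bias using the known probability of observing a conversion in time
    obs_prob = 1 - np.exp(-window / mean_delay)  # P(delay <= window)
    corrected_rate = naive_rate / obs_prob
    return naive_rate, corrected_rate

naive, corrected = simulate_delayed_feedback()
```

With these parameters only about 86% of conversions ever become visible, so the naive estimate sits well below the true rate while the corrected one recovers it (up to sampling noise).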

Delays and censoring: 

Combinatorial action spaces:

Foundations of Bandit Algorithms  and Reinforcement Learning

Sometimes (often), even the fundamental models have open questions remaining, and since 2018 I have spent some time thinking about a few of them. First, we explored off-policy evaluation estimators for Contextual Bandits, such as the clever Self-Normalized Importance Weighted estimator, and corresponding confidence bounds. We also closed an important gap in Linear Bandits by showing that Information Directed Sampling can achieve optimal regret bounds both in the minimax and problem-dependent senses. Recently, we studied Distributional Reinforcement Learning, its power and its limitations.
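The self-normalized importance-weighted estimator mentioned above is simple to state; here is a minimal sketch (variable names and the toy example are mine). Dividing by the sum of the weights, rather than by the number of samples, keeps the estimate within the observed reward range at the cost of a small bias.

```python
import numpy as np

def snips(rewards, target_probs, logging_probs):
    """Self-normalized importance-weighted off-policy estimate.

    rewards[i]:       reward of the logged action in round i
    target_probs[i]:  prob. the target policy takes that action
    logging_probs[i]: prob. the logging policy took it
    """
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    # normalizing by sum(w) instead of n keeps the estimate in the reward range
    return np.sum(w * rewards) / np.sum(w)

# toy check: uniform logging policy over 2 actions, target always plays action 0
rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, n)
rewards = (actions == 0) * 0.8 + (actions == 1) * 0.2   # value of action 0 is 0.8
target_probs = (actions == 0).astype(float)              # target: always action 0
logging_probs = np.full(n, 0.5)
est = snips(rewards, target_probs, logging_probs)        # recovers 0.8
```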

Game-theoretic algorithm design for PCA and beyond

I got to collaborate with Brian McWilliams, Ian Gemp and others at DeepMind on an elegant new method for computing the PCA of very large matrices. The idea is simple, but the optimisation details are subtle and the results are just mind-blowing.
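A rough sketch of the game-theoretic idea, in my own paraphrase (this is not the published algorithm, which is carefully designed for massive, streamed matrices): each "player" owns one direction and ascends a utility that rewards captured variance while penalizing alignment with the players before it; at the equilibrium the players recover the top eigenvectors.

```python
import numpy as np

def eigengame_sketch(M, k=2, steps=2000, lr=0.05, seed=0):
    """Toy gradient-ascent sketch of PCA as a game on a symmetric PSD matrix M.

    Player i maximizes v_i' M v_i minus a penalty for aligning with
    players j < i; updates stay on the unit sphere.
    """
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    V = rng.standard_normal((d, k))
    V /= np.linalg.norm(V, axis=0)
    for _ in range(steps):
        for i in range(k):
            grad = M @ V[:, i]
            for j in range(i):                     # penalty w.r.t. earlier players
                Mvj = M @ V[:, j]
                grad -= (V[:, i] @ Mvj) / (V[:, j] @ Mvj) * Mvj
            grad -= (grad @ V[:, i]) * V[:, i]     # project onto the tangent space
            V[:, i] += lr * grad                   # ascent step, then renormalize
            V[:, i] /= np.linalg.norm(V[:, i])
    return V

# sanity check on a small diagonal matrix with eigenvalues 3, 2, 1
A = np.diag([3.0, 2.0, 1.0])
V = eigengame_sketch(A, k=2)
```

On this toy matrix the two players converge (up to sign) to the first two coordinate axes, i.e. the top-2 eigenvectors.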

PhD Thesis: Bandit models for interactive applications (please reach out if you really want to read it)

Selected Talks