Safely guiding a no-regret learner to the equilibrium

Is it possible to build an algorithm capable of teaching a complex rational agent, as a human, how to play a game? To answer this question, it is necessary to explain what we mean by both human and teaching. As far as the latter is concerned, Game Theory proposes as solution the notion of Equilibrium, that is, a playing strategy which guarantees a certain level of reward whatever the opponent strategy is. Thus, learning how to play means reaching this equilibrium, while teaching means carrying the opponent to it. In order to model the human properly, it is necessary to deal with at least two aspects: first, every person has different learning ability, second, an incentive to continue the game is needed to avoid a premature interruption of the learning path. The different learning abilities will be represented by the assumption that the opponent, which must be guided by our algorithm to the equilibrium, may employ an entire family of learning algorithms (technically, No-Regret); finally, the incentive to keep playing the game will be modeled by the safety property, which guarantees the opponent’s reward to lie in an interval in every round. The interval is chosen as hyperparameter of the algorithm in order to prevent the human from getting bored, due to easy victories, or from giving up, due to tremendous defeats. In this thesis we will show how, with proper assumptions, it is possible to build an algorithm that can guide the opponent to the equilibrium, without knowing the specific algorithm of the adversary and ensuring at the same time, not only the safety property, but also a sublinear learning time (namely, Dynamic Regret). The problem will be tackled both in setting in which the "teacher" has a full feedback, and in setting in which he has a partial one, showing the differences in terms of theoretical results.

E’ possibile costruire un algoritmo in grado di insegnare ad un agente razionale complesso, quale un umano, come approcciarsi ad un gioco? E’ innanzitutto doveroso chiarire cosa si intenda sia per umano che per insegnare. Per quanto concerne questo secondo aspetto, la Teoria dei Giochi propone come soluzione il concetto di Equilibrio, cioè una strategia di gioco tale da garantire un certo tipo di risultato qualsiasi sia la strategia del proprio avversario. Imparare a giocare significa dunque raggiungere questo equilibrio, mentre insegnare significa condurci il proprio avversario. Per modellare in maniera consona un umano bisogna invece considerare almeno due aspetti: in primis, ogni persona ha capacità di apprendimento diverse, in secundis, è necessario che ci sia un incentivo a continuare il gioco, altrimenti il percorso di apprendimento verrebbe prematuramente interrotto. Le diverse capacità verranno declinate dall’assunzione che l’avversario, il quale deve essere condotto dal nostro algoritmo all’equilibrio, possa sfruttare non uno, bensì una famiglia di algoritmi di apprendimento (tecnicamente, algoritmi No-Regret); infine, l’incentivo a continuare il gioco verrà modellato dalla proprietà di safety, la quale garantisce che il risultato ottenuto dall’avversario in ogni partita giacia all’interno di un intervallo scelto come iperparametro dell’algoritmo, in modo da evitare che l’umano si annoi vincendo troppo facilmente o demorda perdendo gravemente. In questa tesi mostreremo come, con le necessarie assunzioni, sia possibile costruire un algoritmo che conduca il proprio avversario all’equilibrio, senza conoscere l’algoritmo esatto dello sfidante e garantendo non solo la proprietà di safety, ma anche un tempo di apprendimento (tecnicamente, Regret dinamico) sublineare. Il problema verrà affrontanto sia in situazioni in cui il "maestro" abbia un feedback totale, che un feedback parziale, mostrando le differenze in termini di garanzie teoriche.