Dynamic control frequency in reinforcement learning through action persistence

In Reinforcement Learning, the control frequency, i.e. the frequency at which the agent selects an action while interacting with the environment, can significantly affect the process of learning the optimal policy. The aim of this project is to propose and implement a new method for obtaining a dynamic control frequency. In particular, we exploit the action persistence mechanism, which binds the agent to repeat the same action for several consecutive steps. With this technique the agent can operate at different control frequencies, obtained by repeating each action for an integer number of steps at the base control frequency. Moreover, we sought a way to exploit the information collected at different persistences in order to achieve a lower sample complexity than classical methods. To this end, we developed a new Bellman operator consisting of two parts. The first part uses the collected samples to estimate the optimal policy directly, while the second estimates the policy at higher persistences from the partial information previously collected. After implementing the proposed method with neural networks, we tested our algorithms on different environments and on old-generation video games (Atari 2600). The results show that dynamic action persistence not only allows the agent to explore a wider set of states of the environment, but also reduces the sample complexity the agent needs to learn a representation of the required model.

In the field of Reinforcement Learning, the ability to control the frequency at which the agent makes its choices when interacting with the surrounding environment can have significant repercussions on the process of learning the optimal policy. The goal of this project is to implement a suitable method for obtaining a dynamic control frequency. In particular, we exploited action persistence, a mechanism that binds the agent to repeat the same action several times. With this technique the agent is allowed to operate at different control frequencies, obtained by repeating each action for an integer number of steps at the base control frequency. Furthermore, we looked for a way to exploit the information collected at various persistences in order to achieve a lower sample complexity than classical algorithms. For this reason we developed a new Bellman operator composed of two parts. The first uses the collected samples to estimate the optimal policy, while the second estimates the policy for higher persistences using the partial information previously gathered. After implementing the proposed system with neural networks, we ran several experiments on different old-generation video games (Atari 2600). The results showed that dynamic action persistence makes it possible not only to explore a wider set of states of the environment, but also to reduce the sample complexity with which the agent learns a representation of the required model.
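
As an illustration of the action persistence mechanism described above, the following is a minimal sketch and not the implementation used in this work. The toy environment `ChainEnv`, the helper `persistent_step`, and the discount factor `gamma` are hypothetical names introduced here only for the example. The helper repeats one action for up to `persistence` base steps, accumulates the discounted reward, and stops early if the episode ends.

```python
from dataclasses import dataclass


@dataclass
class ChainEnv:
    """Hypothetical toy environment: walk along a line, reward 1 at the goal."""
    goal: int = 5
    position: int = 0

    def reset(self) -> int:
        self.position = 0
        return self.position

    def step(self, action: int):
        # action is -1 (move left) or +1 (move right)
        self.position += action
        done = self.position == self.goal
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def persistent_step(env, action, persistence, gamma=0.99):
    """Repeat `action` for up to `persistence` base steps (persistence >= 1).

    Returns the last state, the discounted sum of the base rewards,
    the termination flag, and the number of base steps actually taken.
    """
    total, discount, steps = 0.0, 1.0, 0
    state, done = None, False
    for _ in range(persistence):
        state, reward, done = env.step(action)
        total += discount * reward
        discount *= gamma
        steps += 1
        if done:
            # Assumption of this sketch: persistence is cut short at episode end.
            break
    return state, total, done, steps


env = ChainEnv()
env.reset()
# Persistence 3: the agent decides once and the action is applied three times.
state, ret, done, steps = persistent_step(env, action=1, persistence=3)
```

Choosing `persistence` anew at every decision point is one simple way to let the agent act at a frequency that changes over time, at the cost of a coarser control whenever the persistence is high.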