Improving generalization in reinforcement learning. An application to Candy Crush Friends Saga

Reinforcement Learning is a promising approach for developing intelligent agents that can help game developers test new content. However, applying it to a game with stochastic transitions like Candy Crush Friends Saga (CCFS) presents some challenges. Previous work has shown that an agent trained only to reach the objective of a level is not able to generalize to new levels. Inspired by the way humans approach the game, we develop a two-step solution to tackle the lack of generalization. First, we let multiple agents learn different skills that can be re-used in high-level tasks, training them with rewards that are not directly related to the objective of a level. Then, we design two hybrid architectures, called High-Speed Hierarchy (HSH) and Average Bagging (AB), which allow us to combine the skills and choose the action to take in the environment by considering multiple factors at the same time. Our results on CCFS highlight that learning skills with the proposed reward functions is effective and leads to higher proficiency than state-of-the-art baselines. Moreover, we show that AB achieves a win rate on unseen levels that is twice as high as that of an agent trained only on reaching the objective of a level, and even surpasses human performance on one level. Overall, our solution is a step in the right direction toward developing an automated agent that can be used in production, and we believe that with some extensions it can yield even better results.

Reinforcement Learning is a promising approach for developing intelligent agents that can assist game developers in testing new content. However, using it in a game with stochastic transitions such as Candy Crush Friends Saga (CCFS) poses challenges. Previous research has shown that an agent trained exclusively to reach the objective of a level is unable to generalize to new levels. Drawing inspiration from how humans approach the game, we developed a two-step solution to address the lack of generalization. First, we let a set of agents learn various skills that can be used to complete high-level tasks, training them with rewards unrelated to the objective of the levels. Then, we define two hybrid architectures, called High-Speed Hierarchy (HSH) and Average Bagging (AB), which make it possible to combine the skills to choose the action to take in the environment while taking several factors into account at the same time. Our results on CCFS show that learning skills with the reward functions we propose is effective and leads to a level of proficiency markedly better than the state of the art. Moreover, we show that AB has a win rate on new levels twice that of an agent trained exclusively to reach the objective of a level, and that it even surpasses the performance of a human player on one level. Overall, our solution is a step in the right direction toward developing an intelligent agent that can be used in production, and we are convinced that with some extensions even better results can be obtained.