Evaluating reinforcement learning models for market making in limit order books: a multi-agent simulation study

Reinforcement Learning (RL) is increasingly used to tackle sequential decision-making problems in many fields, including finance. In this thesis we focus on the problem of market making, where an agent must continuously quote a bid price at which it is willing to buy and an ask price at which it is willing to sell a given asset. Once the two prices are chosen, the agent submits a buy limit order and a sell limit order, which join the queue of a limit order book. The buy limit order is executed only if an incoming sell order arrives with a price lower than or equal to the quoted bid price. Similarly, the sell limit order is executed only if an incoming buy order arrives with a price greater than or equal to the quoted ask price. If both orders are executed, the market maker earns a profit equal to the difference between the ask and bid prices. If only the buy order is executed, the market maker holds a positive inventory; if only the sell order is executed, it holds a negative inventory. In both cases, holding inventory exposes the market maker to the risk of adverse market movements. The goal of the market maker is therefore to choose bid and ask prices that maximize the profit from executed orders while keeping the inventory as small as possible. To do so, it must learn to skew its prices so that it tends to sell when its inventory is positive and to buy when its inventory is negative. The main contribution of this thesis is the implementation of an RL agent trained with the Proximal Policy Optimization (PPO) algorithm in a market simulator called ABIDES (Agent-Based Interactive Discrete-Event Simulation). After attempting to replicate some previous works and to train the agent with a state space composed of a large number of features, we obtained the best results when we reduced the state space to a single feature: the inventory. This confirms that the inventory, which the agent wants to keep as close to 0 as possible, is the fundamental feature the agent needs in order to decide which action to perform.
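The quoting and execution rules described above can be illustrated with a minimal Python sketch. It is not the agent developed in this thesis: the linear skew rule, the class name `MarketMaker` and the parameters `half_spread` and `skew_per_unit` are hypothetical choices made only for illustration, whereas the thesis learns the quoting policy with PPO inside ABIDES. The sketch also assumes unit-size orders and ignores queue priority in the limit order book.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MarketMaker:
    # All parameter values below are hypothetical, chosen for illustration only.
    half_spread: float = 0.05    # half-width of the quoted spread
    skew_per_unit: float = 0.01  # price shift applied per unit of inventory
    inventory: int = 0           # signed position: >0 long, <0 short
    cash: float = 0.0

    def quotes(self, mid: float) -> Tuple[float, float]:
        """Return (bid, ask) around `mid`, skewed against the inventory.

        A positive inventory shifts both quotes down (selling becomes more
        likely); a negative inventory shifts both quotes up.
        """
        shift = -self.skew_per_unit * self.inventory
        return mid - self.half_spread + shift, mid + self.half_spread + shift

    def on_incoming_sell(self, mid: float, price: float) -> bool:
        """The buy limit order fills only if the sell price <= quoted bid."""
        bid, _ = self.quotes(mid)
        if price <= bid:
            self.inventory += 1
            self.cash -= bid
            return True
        return False

    def on_incoming_buy(self, mid: float, price: float) -> bool:
        """The sell limit order fills only if the buy price >= quoted ask."""
        _, ask = self.quotes(mid)
        if price >= ask:
            self.inventory -= 1
            self.cash += ask
            return True
        return False
```

In this sketch, a fill on the bid side leaves the market maker long one unit, and the next call to `quotes` returns a lower bid and a lower ask. This is the skewing behaviour that the RL agent is expected to discover from the inventory feature alone.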

Reinforcement learning (RL) has begun to be used to address sequential decision-making problems in many different fields, including finance. In particular, we concentrate on the problem of market making, where an agent must continuously quote a bid price at which it is willing to buy and an ask price at which it is willing to sell a certain asset. Once these are chosen, the agent sends a buy limit order and a sell limit order, which enter the queue of a limit order book. A buy limit order is then executed only if there is an incoming sell order with a price lower than or equal to the quoted bid price. In the same way, a sell limit order is executed only if there is an incoming buy order with a price greater than or equal to the quoted ask price. The market maker earns a profit equal to the difference between the ask price and the bid price if it manages to execute both orders. Otherwise, if it executes only the buy order, it will hold a positive inventory, while if it executes only the sell order it will hold a negative inventory. In both cases, holding an inventory entails a risk tied to the movements of the market. For this reason, the objective of the market maker is to choose the bid and ask prices so as to maximize the profit it obtains from the execution of the orders while at the same time minimizing the inventory. To do this, it needs to learn to skew prices in order to try to sell when it has a positive inventory and to buy when it has a negative inventory. The main contribution of this thesis is the implementation of an RL agent trained with the PPO algorithm that uses a market simulator called ABIDES (Agent-Based Interactive Discrete-Event Simulation). After trying to replicate some previous works and trying to train the agent with a state space composed of a large number of variables, we obtained the best results when we reduced the state space to a single variable: the inventory. Indeed, we confirmed that the fundamental feature that allows the agent to discriminate which action to perform is the inventory, which it wants to keep as close as possible to 0.