Pursuit evasion in continuous domains: interpretable multi-agent strategies with off-policy

This thesis studies pursuit and evasion in continuous two-dimensional domains and asks a simple question: what kinds of intercept and escape patterns appear when both agents learn from experience? The interaction is modeled as a two-player Markov game, and both agents are trained with the TD3 algorithm. Training follows a small, practical plan: phases of self-play alternate with short periods in which the opponent is kept fixed to stabilize learning targets, and speeds and turn limits are lightly randomized so that the policy does not specialize to a single arena shape. As the arenas become more challenging, clear strategies emerge on both sides. In a closed arena, the predator predicts the prey's path and guides the intercept with smooth, well-timed turns. The prey waits, keeps a useful distance, then changes direction late with a short move. With obstacles, the prey tries to keep them as a screen, while the predator uses walls and corridors to push the prey into confined areas or traps that reduce its options. In the safe-zone variant, the predator aims to cut across the route to the safe zone, while the prey chooses timing and angle to reach it. In the three-versus-one case, the predators try to maintain coverage: a simple, controlled formation around the prey. Each predator takes one slice of the circle around the prey. They keep about the same distance from the prey and remain equally spaced from one another, and none cuts in front of another. This forms a tight ring that closes the main escape directions. When the prey moves toward the den, the ring slides and stretches into a straight line across the corridor leading to the den. This line is the barrier. One predator stays in front to block the forward path, and the other two hold the flanks, so a turn to either side is already covered. The coverage breaks when spacing is uneven, distances are not similar, headings are not aligned, or one predator is late.
A gap then opens in the ring or in the barrier line, and the prey can pass through. When coverage is weak in this way, escapes happen more often. The prey looks for small timing errors. It uses quick turns and nearby obstacles to pull one predator out of position, create a short opening, and reach the den. In the dodge task, a capture happens only if both agents are at exactly the same position in the same time step, where the time step is the integration step used by the simulator. There is no capture radius; being near is not enough. Because of this strict rule, captures are less common, but the behavior remains clean and predictable. The prey waits, keeps enough turning room, and then makes one short, firm dodge at the right moment. This usually creates a near miss without contact. Several models are tuned with small grid searches: a simple table of settings is tried to find values that work well without tuning for one specific arena. The adaptive difficulty control loop then changes a few key settings during training and testing to keep the game fair. Typical examples are the prey speed and the maximum curvature. If the predator wins too often, the loop makes the prey slightly stronger; if the prey wins too often, it does the opposite. The goal is a balanced game over stable evaluation runs. For selected settings, the trajectories are saved as short GIFs. These clips show timing, use of curvature, and the final result at a glance.
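
The adaptive difficulty rule described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the names `PreyLimits` and `adapt_difficulty`, the target win rate, the tolerance band, and the step size are all assumptions chosen for illustration, and the sketch scales prey speed and maximum curvature together only for simplicity.

```python
from dataclasses import dataclass


@dataclass
class PreyLimits:
    """Hypothetical container for the two settings the loop adjusts."""
    speed: float          # prey speed (arbitrary units)
    max_curvature: float  # maximum prey curvature (arbitrary units)


def adapt_difficulty(limits, predator_win_rate,
                     target=0.5, tolerance=0.05, step=0.02):
    """Return new prey limits nudged toward a balanced game.

    All numeric defaults are illustrative assumptions, not thesis values.
    """
    if predator_win_rate > target + tolerance:
        scale = 1.0 + step   # predator wins too often: strengthen the prey
    elif predator_win_rate < target - tolerance:
        scale = 1.0 - step   # prey wins too often: weaken the prey
    else:
        scale = 1.0          # within the tolerance band: leave unchanged
    return PreyLimits(limits.speed * scale, limits.max_curvature * scale)
```

In such a sketch, `predator_win_rate` would be measured over a batch of evaluation episodes before each call, so that single lucky or unlucky episodes do not move the settings.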

This thesis analyzes pursuit-evasion problems in continuous two-dimensional environments and asks a simple question: which intercept and escape patterns emerge when both agents learn from experience? The confrontation is modeled as a two-player Markov game, and training uses TD3. The training protocol is structured as follows: self-play phases alternate with short intervals in which the opponent is “frozen” to stabilize the learning targets; in addition, speeds and curvature limits are varied slightly, so that strategies do not adapt to a single arena geometry. The dynamics between the two agents differ by arena type. In a closed arena, the predator anticipates the trajectory and builds the intercept with smooth turns; the prey waits and then changes direction late with a short action. When obstacles are present, the prey uses them as a screen; the predator exploits walls and corridors to funnel it into confined areas or to trap it. In the safe-zone variant, the predator tries to cut off the route to the goal, whereas the prey calibrates timing and angle to reach it. In the three-versus-one setting, the predators' goal is to maintain “coverage”: a controlled arrangement that divides the angles around the prey and keeps similar distances and equal spacing among teammates. In a static configuration this produces a ring that closes the main escape routes; along the corridor toward the den, the same pattern deforms into a transverse barrier, with one predator blocking the advance and the others closing the flanks. Coverage fails when the spacing is distorted, the distances diverge, the headings become misaligned, or one agent falls behind: gaps open in the ring or in the barrier, and the probability of escape increases.
The prey actively seeks these micro-asymmetries, using rapid course changes and nearby obstacles to create temporary openings and reach the den. In the “dodge” task, capture is defined exactly: it occurs only if the two agents occupy the same position in the same simulator integration step. This choice reduces the frequency of captures and encourages the prey to dodge the predator, which is much faster and larger but has tighter steering limits: the prey waits, preserves available curvature, and performs a single short, decisive dodge at the right moment, producing near misses. Parameters are selected through small grid searches, which are sufficient to identify effective configurations without specializing to a single scenario. An adaptive difficulty control loop (Adaptive Difficulty Control) adjusts a few key parameters, typically prey speed and curvature limit, to keep the game balanced over repeated evaluations: if the predator prevails too often, the prey is slightly strengthened, and vice versa. For some configurations, trajectories are saved as short GIFs, which provide an immediate visual check of timing, curvature use, and outcome, complementing the numerical metrics.
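
The coverage idea in the three-versus-one setting (equal angular slices around the prey and similar distances to it) can be made concrete with a small sketch. The function name `coverage_gaps` and the choice of the two reported quantities are assumptions for illustration; the abstract does not specify how coverage is measured. With three evenly spaced predators the largest angular gap is 2π/3, so a noticeably larger value would indicate an opening in the ring.

```python
import math


def coverage_gaps(prey, predators):
    """Return (largest angular gap in radians, spread of predator distances).

    Hypothetical diagnostic: `prey` is an (x, y) pair and `predators`
    is a list of (x, y) pairs.
    """
    px, py = prey
    # Bearing of each predator as seen from the prey, sorted around the circle.
    angles = sorted(math.atan2(y - py, x - px) for x, y in predators)
    dists = [math.hypot(x - px, y - py) for x, y in predators]
    # Gaps between consecutive bearings, including the wrap-around gap.
    gaps = [(angles[(i + 1) % len(angles)] - a) % (2 * math.pi)
            for i, a in enumerate(angles)]
    return max(gaps), max(dists) - min(dists)
```

A small distance spread together with a largest gap close to 2π/3 would correspond to the tight ring described above; uneven spacing or a late predator would show up as a larger gap or a larger spread.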