ConceptBot: enhancing the autonomy of robotic systems through task decomposition with large language models and knowledge graphs
Leanza, Alessandro
2023/2024
Abstract
Robotic systems have made significant progress across their various components, achieving increasing levels of autonomy. However, effective planning in unstructured environments remains a substantial challenge. This thesis therefore presents ConceptBot, a novel planning system that integrates Large Language Models (LLMs) with external knowledge sources, leveraging Knowledge Graphs, and in particular ConceptNet, to improve task decomposition and context understanding. Unlike traditional approaches that rely on pre-programmed models or extensive training on specialized datasets, ConceptBot employs a modular structure consisting of three key components: Object Properties Extraction (OPE), User Request Processing (URP), and the Planner. These modules dynamically retrieve and use semantic relationships to interpret and execute complex and ambiguous user instructions. OPE identifies and extracts the relevant properties of objects in the environment; URP interprets the user's natural-language instructions to formulate clear, actionable tasks; the Planner then synthesizes this information to generate a policy composed of executable and mutually consistent actions, ensuring that the robot's operations are contextually appropriate to the user's needs. To evaluate the effectiveness of ConceptBot, a comparison was conducted with SayCan, a planner proposed by Google. In addition to the evaluation of the generated policies, experiments were performed both in a simulated environment, using PyBullet, ViLD, and CLIPort, and in the IDSIA laboratory, using a Franka Emika Panda robotic arm equipped with an Intel RealSense camera and YOLO as the object detection system.
The results show that the proposed solution excels in context awareness, accurate interpretation of user requests, and recognition of feasible actions, even for objects that would be ambiguous to a traditional LLM, thus achieving high adaptability without task-specific training.
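The abstract describes a three-module pipeline in which OPE enriches detected objects with semantic properties, URP turns the user's request into a goal, and the Planner combines the two into an executable policy. The following is an illustrative sketch of that data flow only — all function names are hypothetical, and the knowledge-graph lookup is stubbed with a small dictionary rather than querying ConceptNet as the thesis system does:

```python
# Hypothetical sketch of the OPE -> URP -> Planner data flow described in the
# abstract. The stub below stands in for a ConceptNet lookup; the thesis
# retrieves real semantic relations (e.g. UsedFor) from the knowledge graph.
KNOWLEDGE_GRAPH = {
    "knife": ["sharp", "UsedFor:cutting"],
    "sponge": ["soft", "UsedFor:cleaning"],
}

def ope(scene_objects):
    """Object Properties Extraction: attach semantic properties to each object."""
    return {obj: KNOWLEDGE_GRAPH.get(obj, []) for obj in scene_objects}

def urp(request):
    """User Request Processing: turn a natural-language request into a goal.
    A real system would use an LLM; this stub pattern-matches a single verb."""
    if "clean" in request:
        return {"action": "clean", "needs": "UsedFor:cleaning"}
    return {"action": "unknown", "needs": None}

def planner(goal, properties):
    """Select an object whose properties satisfy the goal, then emit actions."""
    for obj, props in properties.items():
        if goal["needs"] in props:
            return [f"pick({obj})", f"{goal['action']}_with({obj})"]
    return []  # no feasible object found

props = ope(["knife", "sponge"])
policy = planner(urp("clean the table"), props)
print(policy)  # ['pick(sponge)', 'clean_with(sponge)']
```

The sketch shows why the knowledge-graph step matters: feasibility ("which object can clean?") is resolved from semantic relations rather than from the LLM alone.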
File | Description | Size | Format | Access
---|---|---|---|---
2024_12_Leanza_Tesi.pdf | Thesis text | 21.73 MB | Adobe PDF | not accessible
2024_12_Leanza_Executive Summary.pdf | Executive Summary text | 874.56 kB | Adobe PDF | not accessible
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/230902