Taming the DOM: a RAG-enhanced CNT framework for conversational web assistance

Screen-reader users still face slow, noisy, and fragile experiences when interacting with modern web pages. ConWeb, a conversational web navigation platform developed at Politecnico di Milano, lets visually impaired users operate websites through natural-language commands, but its implementation relied heavily on large language models (LLMs) for HTML parsing, intent understanding, and action grounding, leading to high and unpredictable token usage, long latencies, and downtime when external LLM services are unavailable. This thesis proposes an alternative, structure-first pipeline that turns the raw DOM into a compact, navigable, and retrieval-friendly representation for conversational web assistance. The approach introduces a Conversation-oriented Navigation Tree (CNT) that cleans, merges, prunes, and augments DOM nodes into meaningful units with stable identifiers, visibility and actionability flags, and accessibility-aligned metadata. Each CNT node is serialized into text, embedded with a multilingual text encoder, and cached in a per-user vector store. On top of this store, a Retrieval-Augmented Generation (RAG) layer expands a spoken user request into multiple short variants, retrieves and ranks a small set of relevant nodes, and feeds only this compact context to the agent LLM. The CNT + RAG pipeline was integrated into the existing ConWeb platform and evaluated on six heterogeneous webpages, covering 49 scripted user requests across 'read', 'ask', 'describe', 'navigate', and 'fill' intents. Compared to the legacy system, the new architecture reduces prompt tokens and LLM usage costs, shortens latency, and keeps CNT construction and embedding within tight token, time, and memory budgets. At the same time, answer accuracy rises from 71% to 90%, with particularly strong gains in navigation- and description-oriented tasks.
These results show that using CNT-based retrieval to deliver small, high-precision node sets to an LLM yields a more efficient, controllable, and robust conversational web assistant, suggesting design principles for future accessible, retrieval-focused agents.
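
The retrieval step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the ConWeb implementation: the `CNTNode` fields and the `retrieve` function are hypothetical names, a bag-of-words vector stands in for the multilingual text encoder, and the query variants are written by hand instead of being generated from the spoken request.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CNTNode:
    """Hypothetical CNT node; the field names are illustrative only."""
    node_id: str
    role: str
    text: str
    visible: bool = True
    actionable: bool = False

    def serialize(self) -> str:
        # Flatten the node into the text that gets embedded.
        return f"{self.role}: {self.text}"


def embed(text: str) -> Counter:
    # Toy stand-in for a multilingual encoder: a bag-of-words count vector.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(variants: list[str], nodes: list[CNTNode], k: int = 2) -> list[CNTNode]:
    """Rank visible nodes by their best similarity over all query variants."""
    index = [(n, embed(n.serialize())) for n in nodes if n.visible]
    scored = [(max(cosine(embed(v), vec) for v in variants), n) for n, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most k nodes and drop those with no overlap at all.
    return [n for score, n in scored[:k] if score > 0]


nodes = [
    CNTNode("n1", "heading", "Opening hours and contacts"),
    CNTNode("n2", "button", "Search the catalogue", actionable=True),
    CNTNode("n3", "link", "Cookie settings", visible=False),
    CNTNode("n4", "form field", "Email address for newsletter signup", actionable=True),
]
# Hand-written variants of one request; in the thesis these come from query expansion.
variants = ["search the catalogue", "find a book", "catalogue search button"]
top = retrieve(variants, nodes, k=2)
print([n.node_id for n in top])  # prints ['n2']
```

Only the serialized text of the nodes in `top` would be passed to the agent LLM, which is what keeps the prompt small; a real encoder would also match paraphrases such as "find a book" that the toy vector misses.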

Screen-reader users experience slow and fragile interactions with web pages. ConWeb, a conversational web navigation platform developed at Politecnico di Milano, lets people with visual impairments use websites through natural-language commands, but its implementation, which relied heavily on Large Language Models (LLMs) for HTML parsing and intent understanding, led to high token consumption, significant latency, and outages when LLM services were unavailable. This thesis proposes a "structure-first" pipeline that transforms the raw DOM into a compact, navigable, and retrieval-friendly representation for conversational web assistance. The approach introduces a Conversation-oriented Navigation Tree (CNT) that cleans, merges, prunes, and enriches DOM nodes into informative units with stable identifiers, visibility and actionability flags, and metadata aligned with accessibility criteria. Each CNT node is serialized, embedded with a multilingual text encoder, and stored in a vector store. On top of this sits a Retrieval-Augmented Generation (RAG) layer that expands the user's spoken request into short variants, retrieves and ranks a small set of relevant nodes, and supplies only this compact context to the agent LLM. The CNT + RAG pipeline was integrated into ConWeb and evaluated on six web pages, covering 49 user requests across five main intents. Compared to the legacy system, the new architecture reduces the number of prompt tokens and LLM costs, lowers latency, and keeps CNT construction and embedding within tight time and memory constraints. At the same time, answer accuracy rises from 71% to 90%, with marked improvements in navigation- and description-oriented tasks.
These results show that CNT-based retrieval, which supplies an LLM with small but high-precision node sets, yields a more efficient and robust conversational web assistant and suggests design principles for future accessible, retrieval-centered agents.