Engineering a runtime system for AQL

Information Extraction addresses the task of extracting information from unstructured text automatically. Traditionally, IE systems focus on the accuracy of extraction tasks. However, the problem of the scalability of these tasks has recently began to receive attention. SystemT is an IE system that successfully addresses this concern. This success stimulated many studies related to SystemT. One of the most prominent introduces a formal model that delineates a new approach to extraction tasks, considerably different from that of SystemT. According to the formal model of SystemT, an extraction task is executed by first obtaining basic elements from the input text, and then combining these elements by means of algebraic operators to obtain the desired results. In this dissertation, I embark on the task of de ning and implementing a new runtime system that allows for efficiently obtaining the desired results without the necessity to apply algebraic operators over basic elements. To this end, I first introduce a computational model (namely eVset-automata), that is based on a previous model (vset-automata), used for extracting basic elements from text, and I show that the two models are equivalent in expressive power. Moreover, I prove that under the proposed model there are polynomial time constructions that allow for directly evaluating the algebraic operators, without the need for obtaining the basic elements first. This represents an improvement over the previous approach (vset-automata), under which constructions for the same purpose existed but resulted in exponentially larger computations. Then, I discuss the results of a series of experiments executed on a corpus of real-life blog posts, that compare the performance of the runtime system with that of a subsystem that uses the same approach of SystemT. The experiments show that the runtime system is better than the subsystem in many cases. However, the subsystem is naturally capable of reusing partial results to reduce its workload, which can sometimes be advantageous. Finally, I lay the foundations for further improvements of the runtime system.

L'Estrazione dell'Informazione (Information Extraction, IE) affronta il compito di estrarre automaticamente informazioni da testi non strutturati. Tradizionalmente, i sistemi di IE si concentrano sulla precisione delle attività di estrazione. Tuttavia, il problema della scalabilità di questi compiti ha recentemente cominciato a ricevere attenzione. SystemT è un sistema di IE che affronta con successo questo problema. Questo successo ha stimolato molti studi relativi a SystemT. Uno dei più prominenti introduce un modello formale che delinea un nuovo approccio alle attività di estrazione, notevolmente diverso da quello di SystemT. Secondo il modello formale di SystemT, un compito di estrazione viene eseguito prima ottenendo elementi di base dal testo di input e quindi combinando questi elementi tramite operatori algebrici per ottenere i risultati desiderati. In questa tesi mi impegno a definire ed implementare un nuovo sistema di runtime che consente di ottenere in modo efficiente i risultati desiderati senza la necessità di applicare operatori algebrici sugli elementi di base. A tal fine, introduco innanzitutto un modello computazionale (gli eVset-automata), basato su un modello precedente (i vset-automata), utilizzato per estrarre elementi di base dal testo, e mostro che i due modelli sono equivalenti nel potere espressivo. Inoltre, dimostro che per il modello proposto esistono costruzioni di complessità temporale polinomiale che consentono di valutare direttamente gli operatori algebrici, senza necessità di ottenere prima gli elementi di base. Questo rappresenta un miglioramento rispetto all'approccio precedente (vset-automata), per il quale esistevano costruzioni per lo stesso scopo, tuttavia queste spesso portavano a computazioni esponenzialmente più grandi. Successivamente, discuto i risultati di una serie di esperimenti eseguiti su un corpus di post di blog di vita reale, che confrontano le prestazioni del sistema di runtime con quello di un sottosistema che utilizza lo stesso approccio di SystemT. Gli esperimenti mostrano che in molti casi il sistema di runtime è migliore del sottosistema. Tuttavia, il sottosistema è naturalmente in grado di riutilizzare risultati parziali per ridurre il carico di lavoro, il che risulta a volte vantaggioso. Infine, metto le basi per ulteriori miglioramenti del sistema di runtime.