Exploiting AI and NLP methods for empowering naive users in solving data science problems

Data Science (DS) and Machine Learning (ML) have become critical tools for making informed decisions, predicting outcomes, and automating processes. The rise of big data and the availability of powerful computers, coupled with the development of new and more sophisticated ML algorithms, have led to a huge growth of interest in ML over the past decade. Despite these significant advancements, building and training ML models can still be complex and time-consuming, requiring expertise in computer science, mathematics, and statistics; taking advantage of ML therefore remains challenging, especially for people without these skills. The lack of a deep understanding of machine learning principles among IT and business professionals has led to incidents related to bias, privacy, security, transparency, and ethics. The democratization of data science and machine learning aims to change this situation by making ML technologies and techniques accessible to a wider range of people. This thesis discusses the difficulties non-experts face in using ML tools. It explores various approaches to democratizing data science and machine learning, such as developing user-friendly ML tools and platforms and educational initiatives that teach non-experts the necessary skills. It also discusses the potential benefits of democratizing data science, such as driving innovation, providing new insights into complex problems, and creating a more inclusive and diverse data science community. In particular, my research introduces a progression of methods and underlying tools that make use of conversational agents, natural language, and autoML, with the objective of democratizing data science and making it more accessible to a wider range of people. The thesis begins by presenting GeCoAgent and DSBot, two multi-modal conversational agents designed to facilitate data science processes starting from natural language input.
GeCoAgent and DSBot are two distinct conversational agents that serve different purposes in the context of data science automation. GeCoAgent takes a proactive approach: it drives the conversation, asking the user detailed and specific questions to better understand their needs and goals. DSBot, by contrast, is a user-driven conversational agent: the user provides a research question in natural language, and the bot extracts the necessary information and executes the relevant data science processes. However, the automation of data science processes raises issues such as the difficulty of formulating the research question and the importance of incorporating domain expertise into the pipeline. To address these challenges, the thesis then presents two additional tools: MLFriend and Zephyr. MLFriend enables the automatic generation of prediction tasks, while Zephyr streamlines the integration of domain expertise with automated data science tools. We conducted empirical evaluations and user studies to illustrate the effectiveness of these tools in making machine learning more accessible and user-friendly. With these four solutions, GeCoAgent, DSBot, MLFriend, and Zephyr, we show a progressive development of ideas, methods, and tools towards the goal of improving the accessibility and usability of data science tools for non-experts. Our research contributes to the democratization of DS by providing new strategies that can be used to reduce the gap between experts and non-experts in the field. We trust that our results will help address the remaining challenges and opportunities and make machine learning more accessible and user-friendly for a wider range of people.
