Development of a text-analytics based framework to support automated clinical literature research and study classification

Technological innovation is advancing rapidly, especially in the medical and biomedical fields. Demand is growing for ways to search and aggregate medical information from the Web quickly and consistently, and many platforms have been developed in recent years to speed up essential steps of a complete and efficient clinical literature search. Automating these steps still presents numerous challenges: many existing tools work only in isolation and cannot be combined to cover all the steps that such a search requires. This project aims to fill this gap by providing a tool that searches and aggregates information from scientific articles retrieved from two of the most important databases used in research: PubMed and Google Scholar. The framework was designed to support researchers in collecting clinical evidence on a specific medical topic, whether to carry out experiments and studies or to demonstrate the validity of a device or an application. This is in line with the new regulation for medical devices (MDR), which became applicable in May 2021 and requires manufacturers to demonstrate the validity of a device during post-market surveillance through clinical evidence, which can also be drawn from the literature. The project lays the groundwork for the main steps of a structured clinical literature search, which could also be useful for building research studies through Systematic Reviews. All phases were automated and implemented in the Python programming language; user intervention is required only to launch the script and open the final interface.
Through this interface, the user enters a query string that is automatically submitted to the two search engines; all matching scientific articles are then downloaded and merged into a single final database. Once the database is created, a classification algorithm, which was implemented and extensively tested, categorizes the articles by study type (Systematic Review and Meta-Analysis, SRMA; Randomized Clinical Trial, RCT; or Other) by comparing their titles and abstracts with manually created dictionaries. Finally, the data are presented to the user in an intuitive, aggregated form that gives an overview of the most important information about the results.
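
The merging and dictionary-based classification steps can be illustrated with a minimal Python sketch. It is not the project's implementation: the keyword lists, the record layout (dicts with `title` and `abstract` keys), and the function names `merge_records` and `classify_study` are assumptions made for illustration, since the abstract does not list the actual dictionaries or data structures. The sketch also assumes that duplicates across the two sources can be detected by normalized title, and that SRMA takes precedence over RCT when both match.

```python
import re

# Hypothetical keyword dictionaries. The manually created dictionaries used
# in the project are not given in the text, so these are stand-ins.
STUDY_DICTIONARIES = {
    "SRMA": ["systematic review", "meta-analysis"],
    "RCT": [
        "randomized controlled trial",
        "randomised controlled trial",
        "randomized clinical trial",
        "randomly assigned",
    ],
}


def normalize(text):
    """Lowercase the text and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def merge_records(*sources):
    """Combine article lists from several sources, keeping the first copy of each title."""
    merged, seen = [], set()
    for records in sources:
        for record in records:
            key = normalize(record["title"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


def classify_study(title, abstract=""):
    """Return 'SRMA', 'RCT', or 'Other' based on whole-phrase dictionary matches."""
    # Pad with spaces so a phrase only matches on word boundaries.
    text = f" {normalize(title + ' ' + abstract)} "
    # Dictionaries are checked in insertion order, so SRMA wins over RCT.
    for label, phrases in STUDY_DICTIONARIES.items():
        if any(f" {normalize(phrase)} " in text for phrase in phrases):
            return label
    return "Other"


# Toy records standing in for results from the two search engines.
pubmed = [
    {"title": "Statins: A Meta-Analysis", "abstract": ""},
    {"title": "Drug X versus placebo",
     "abstract": "Patients were randomly assigned to two arms."},
]
scholar = [
    {"title": "statins: a meta analysis", "abstract": ""},
    {"title": "A case report of Y", "abstract": ""},
]

database = merge_records(pubmed, scholar)
labels = [classify_study(r["title"], r["abstract"]) for r in database]
```

Checking SRMA first means a systematic review of randomized trials is labeled as a review rather than as a trial; a different precedence, or a scoring scheme that counts matches per category, would be equally plausible choices.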
