Learning text embeddings on virus sequences for candidate drug discovery

Traditional drug development has shown a marked increase in both cost and time in recent years. The number of drugs approved and placed on the market does not justify the enormous investments made by the scientific community, creating an imbalance between the positive results and the resources spent to obtain them. The COVID-19 pandemic has further highlighted the need for faster and more efficient alternative methodologies for drug research. One modern approach that has attracted considerable scientific interest is 'drug repurposing', a technique based on repositioning or reusing drugs that are already known against diseases. Thanks to technological and IT innovation, the development of the biomedical field, and the exponential growth of biological information on proteins, drugs, and interactions between different molecules, a broad range of computational approaches is now available that can analyze these large amounts of data from different angles and predict possible new uses for existing drugs. Reduced research time and cost are only some of the advantages of these methodologies, although the scientific community has not yet fully accepted their effectiveness. Nevertheless, drug repurposing is a promising alternative, not only for the medical results that could be obtained but also for the harm that could one day be avoided by addressing global emergencies with greater speed and effectiveness. This thesis describes a drug repurposing approach based on recent developments in Natural Language Processing, a field of deep learning that aims to make human language computationally understandable. In particular, we use a BERT-based model, for the first time, as an investigative tool to identify similarities between different viral protein sequences and to predict possible drug repositioning based on these similarities.
