An analysis of reproducibility and experimental results comparison in cross-domain recommender systems using deep learning

Owing to the growing economic interest in online services and the exponential growth of information on the Internet, recommender systems have become very popular and are extensively used by social networks, online shops, and websites offering songs, movies, books, and many other items. Using specific techniques and algorithms, such a system analyzes user behavior and infers user preferences, and is thus able to filter out irrelevant information and prioritize relevant content. By interacting with the system, users express their preferences, allowing the recommender system to estimate their interest in items they have never interacted with. These recommendations can be expressed by suggesting a list of N items likely to be appreciated by the user (top-N recommendation). Each user generally interacts with a very small percentage of the items, so that, in some cases, the available information is not sufficient to guarantee good-quality recommendations. Cross-domain recommender systems address this problem by collecting user preferences in different domains (e.g., movies and books). Since the mid-1990s, various types of algorithms of growing complexity have been designed in an effort to achieve better performance. Recently, a new class of algorithms has become very popular: those based on deep learning, a neural technology that has brought enormous advances in recent years in other areas, such as language or image processing. The publication, at top-rated international conferences, of many papers claiming excellent results, far exceeding those obtained with traditional methods, has made this class of algorithms dominant in the research area. However, acknowledgment of this success is not unanimous, and several researchers have put forward the hypothesis that it is actually a “phantom progress” due to methodological errors.
This thesis aims at a better understanding of the degree of progress these deep learning methods have actually achieved, by analyzing, reproducing, and evaluating recent neural algorithms, with a focus on cross-domain systems providing top-N recommendations. The work also made it possible to assess whether the algorithms can actually be reproduced from the information usually provided by the authors, and to identify additional methodological issues. The analysis confirmed that none of the reproducible neural approaches actually provides better results than traditional, well-established techniques, provided these baseline algorithms are properly selected and optimized. In addition, various methodological issues have been identified, and suggestions are provided on how to enable better evaluation in the future and make the work of reproducing the algorithms easier.
