Predicting commodity consumption using Google Trends and sentiment analysis of news

With the widespread diffusion of the Internet around the world, it is interesting to investigate how the behaviour of users surfing the net may influence global dynamics. Our study focuses on the Food Commodities Market, that is, the global exchange of agricultural resources such as grains (corn, rice, soybean, etc.), oilseeds (soybean oil, palm oil, etc.) and soft commodities (cocoa, coffee, etc.). This market has its own peculiar mechanisms and can be described using indicators such as supply, demand and price. A challenging task, widely tackled in the similar context of the Stock Market, is the prediction of these indicators through the analysis of Web data. In this thesis, we propose a new approach to improve the prediction of consumption by combining heterogeneous sources of data. We retrieved consumption data for two specific commodities, rice and corn, and developed several predictive models based on two data sources: Google Trends and Google News. The first is a Google platform that reports the volume of Google searches for queries containing specified keywords. The second is a popular news aggregator. Our aim was to investigate whether search traffic on Google and the sentiment of specialized news about corn or rice could serve as additional sources useful for predicting consumption trends. By developing a Web scraper, we retrieved the texts of the most popular specialized news articles and automatically computed their emotional attitude using Sentiment Analysis. We built the predictive models with two regression algorithms, Multivariate Linear Regression (MLR) and Support Vector Regression (SVR). We first analysed each data source separately to evaluate its contribution to the prediction, and then combined both to obtain better forecast accuracy. Finally, we compared the various models by evaluating their performance.

With the viral spread of the Internet worldwide, it is interesting to study how the behaviour of users browsing the Web may influence global dynamics. Our study is set in the context of the Food Commodities Market, which consists of the trade of agricultural products such as cereals, oils, or soft commodities (cocoa, coffee, etc.). This economic setting is characterized by its own dynamics, which can be described using dedicated economic indicators such as demand, supply or price. An interesting challenge, widely taken up in the similar context of the stock market, is the prediction of these indicators through the analysis of data from the Web. In this thesis, we propose a new approach to improve the prediction of consumption by combining heterogeneous data sources. We collected data on two specific commodities, rice and corn, and developed predictive models based on two data sources: Google Trends and Google News. The first is a Google platform that provides data on the search volumes of queries containing specific keywords. The second is one of the most popular news aggregators on the Web. Our objective was to assess whether Google search trends and the sentiment of specialized news about corn or rice could serve as additional indicators useful for predicting consumption. To this end, by developing a Web scraper we collected the texts of the most popular news in the sector and computed their sentiment automatically, using Sentiment Analysis techniques. After analysing the data, we considered each source separately in order to evaluate its contribution to the prediction and then combined them to obtain higher accuracy. We applied two different regression algorithms, Multivariate Linear Regression (MLR) and Support Vector Regression (SVR), and compared the various models by evaluating their performance.
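
To make the comparison of models concrete, the following is a minimal sketch, not the implementation used in the thesis. It assumes scikit-learn and uses synthetic placeholder data in place of the real consumption, Google Trends and news-sentiment series. The names `trends`, `sentiment`, `consumption` and `evaluate` are hypothetical, the SVR kernel and `C` value are arbitrary, and mean absolute error is an assumed metric, since the abstract does not name one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 120  # hypothetical number of time periods

# Synthetic placeholders (NOT thesis data): one search-volume feature,
# one news-sentiment feature, and a target that depends on both.
trends = rng.normal(size=(n, 1))
sentiment = rng.normal(size=(n, 1))
consumption = (3.0 * trends[:, 0] + 1.5 * sentiment[:, 0]
               + rng.normal(scale=0.1, size=n))

# Each data source on its own, then both combined.
feature_sets = {
    "trends": trends,
    "sentiment": sentiment,
    "combined": np.hstack([trends, sentiment]),
}

# Factories so that every experiment starts from a fresh, unfitted model.
models = {
    "MLR": lambda: LinearRegression(),
    "SVR": lambda: make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}


def evaluate(model_factory, X, y, n_train):
    """Fit on the first n_train periods and return the MAE on the rest."""
    model = model_factory()
    model.fit(X[:n_train], y[:n_train])
    return mean_absolute_error(y[n_train:], model.predict(X[n_train:]))


results = {
    (model_name, source): evaluate(factory, X, consumption, n_train=90)
    for model_name, factory in models.items()
    for source, X in feature_sets.items()
}

for (model_name, source), mae in results.items():
    print(f"{model_name:>3} on {source:<9} MAE = {mae:.3f}")
```

The chronological split (training on the earliest periods, testing on the latest) is a design choice that avoids using future observations to predict past ones. On real data, whether the combined features actually outperform each single source is an empirical question that this sketch does not answer.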