POLITesi, Politecnico di Milano, Servizi Bibliotecari di Ateneo
Please use this identifier to cite or link to this thesis: http://hdl.handle.net/10589/125481

Date: 29-Sep-2016
Academic year: 2015/2016
Title: Power of time window in predicting hashtags while typing a tweet
English abstract: Micro-blogging in the form of short text reports such as tweets is now widely used, and processing this huge stream of information and extracting knowledge from it has attracted much attention. Text analysis and text processing are essential tasks in Natural Language Processing (NLP) and a major application field of Machine Learning (ML), and a wide range of approaches has been proposed in both areas. The central part of these solutions is finding patterns in the data and studying its features. Finding an accurate and proper form of the features can be significantly important in data processing. Previous work on text processing usually exploits only human-designed features, such as dictionaries, knowledge bases and special tree kernels, whereas current advanced approaches work with deeper semantic layers rather than only retrieving features from the text; these approaches appear to have made great progress in boosting text classification performance. The importance of social networks like Twitter and their worldwide spread in the last few years is one of the most pervasive phenomena in recent computer and data science, and tracking the predictive behavior of hashtags in tweets has been an interesting topic that can be approached in many ways. Applying machine learning and information retrieval to short documents like tweets is not an easy task, however, because each sample has only a few features, which can reduce the accuracy of the final predictive analysis. A solution can be to use these few features efficiently by extracting latent properties and processing the semantic aspects of the features.
Text processing on stream data can be even more challenging. On the one hand, we usually do not have a fixed number of features; on the other hand, the importance of these features can vary across time windows, so we need to verify this at every stage by updating the ML methods and fitting the model to a recent training set. This can boost the accuracy of pattern recognition and enhance the precision of the predicted output values. In this thesis we work on a dataset of more than 2 million tweets collected over more than 4 months from Expo 2015 in Milan. The general goal is to predict second hashtags, considering their relatedness to upcoming events and features at different time windows. With the purpose of comparing different ML classification methods, there were 2 main hypotheses: 1. Applying different time window lengths has an effect on the accuracy of classification on stream data. 2. Applying dimension reduction methods only on the prediction values can boost the final classification score. As usual, the data processing task starts with preprocessing. Eliminating all outliers and irrelevant elements of the data, extracting a useful and informative form of the words, and then converting them into standard feature vectors can briefly be considered what we mean by preprocessing. After finishing the preprocessing steps and obtaining a proper, informative vector form of the data acceptable to machine learning algorithms, the next step is the ML stage. Here we compare 5 different classification results with different time windows, where we applied a 2-layer feed-forward neural network algorithm to reduce the dimensionality of the outputs. Finally, the results confirmed both hypotheses and led to an interesting conclusion about the effect of preprocessing on enhancing the final result.
Italian keywords: machine learning; intervallo di tempo; flusso dati; riduzione delle dimensioni; ingegneria funzionalità; NLP; classificazione; caratteristiche; pre-elaborazione; language detection; predizione; rete neurale
English keywords: machine learning; time window; stream data; dimension reduction; feature engineering; classification; features; prediction; language detection; NLP preprocessing; neural networks
Language: eng
Appears in Collections: POLITesi > Tesi Specialistiche/Magistrali

Files in This Item:

File: 2016_09_Fanaei.pdf
Description: Thesis text
Size: 4.08 MB
Format: Adobe PDF

