Power of time window in predicting hashtags while typing a tweet

FANAEI, FARIMAH
2015/2016

Abstract

Micro-blogging in the form of short text reports such as tweets has become widely used, and processing and extracting knowledge from this huge stream of information has attracted considerable attention. Text analysis and text processing are essential tasks in Natural Language Processing (NLP) and a major application field of Machine Learning (ML), and a wide range of approaches has been proposed in both areas. Most solutions centre on finding patterns in the data and studying its features, and finding an accurate and suitable form for those features can be highly important. Earlier work on text processing usually exploits only human-designed features, such as dictionaries, knowledge bases and special tree kernels, whereas more recent approaches work with deeper semantic layers rather than only retrieving features from the text, and they appear to have made great progress in boosting text classification performance. The spread of social networks such as Twitter over the last few years is one of the most pervasive phenomena in recent computer and data science, and tracking and predicting the behaviour of hashtags in tweets is an interesting topic that can be approached in many ways. However, applying Machine Learning and Information Retrieval to short documents such as tweets is not easy, because each sample contains only a few features, which can reduce the accuracy of the final predictive analysis. One solution is to use these few features efficiently by extracting latent properties and processing the semantic aspects of the features.
Text processing on stream data can be even more challenging. On the one hand, the number of features is usually not fixed; on the other hand, the importance of the features can vary across time windows, so it has to be checked at every stage by updating the ML methods and refitting the model on a recent training set. Doing so can improve the accuracy of pattern recognition and the precision of the predicted output values. In this thesis we work on a dataset of more than 2 million tweets collected over more than 4 months from Expo 2015 in Milan. The general goal is to predict second hashtags, considering their relatedness to upcoming events and to features at different time windows. While comparing different ML classification methods, we tested two main hypotheses: 1. Applying different time window lengths affects the accuracy of classification on stream data. 2. Applying dimensionality reduction methods only to the prediction values can boost the final classification score. Data processing starts with preprocessing, by which we mean eliminating outliers and irrelevant elements of the data, extracting useful and informative forms of the words, and converting them into standard feature vectors. Once preprocessing has produced an informative vector form of the data that machine learning algorithms can accept, we compare 5 different classification results across different time windows, applying a 2-layer feed-forward neural network to reduce the dimensionality of the outputs. Finally, the results confirmed both hypotheses and led to an interesting conclusion about the effect of preprocessing on the final result.
EYNARD, DAVIDE
ING - Scuola di Ingegneria Industriale e dell'Informazione
29-Sep-2016
Master's thesis
Attached files
File: 2016_09_Fanaei.pdf (openly accessible on the internet)
Description: Thesis text
Size: 4.08 MB
Format: Adobe PDF

Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10589/125481