A machine learning-based approach to conversion rate estimation

The aim of this Thesis is to estimate the probability that a customer buys a car insurance warranty policy, called the Probability of Conversion or Conversion Rate, by means of three Machine Learning algorithms: the Classification and Regression Tree (CART), the Random Forest and the Gradient Boosted Tree. The Generalized Linear Model (GLM), the benchmark model for estimating the Probability of Conversion, is used as a frame of reference and as a means to perform feature selection. The Log Loss Error and the Confusion Matrix are the main bases for comparison.

A new stochastic extension of the Gradient Boosted Tree, obtained by introducing random feature sampling, is discussed in Chapter 6. Empirical results show that the new extension performs better than the traditional version in terms of accuracy on both the train and test sets, and overfits the dataset under analysis less.

Chapter 7 contains the most important results of the Thesis. The Random Forest is the model with the highest recall, while the Boosted Tree is the most precise model. The GLM performs as a trade-off between these two models. A second analysis shows that if the Random Forest and the Extended Boosted Tree are fit on all the available features, exploiting their built-in feature selection, then the overall prediction accuracy increases significantly. In particular, the Random Forest outperforms the GLM with respect to every measure of comparison considered. The Variable Importance index computed with both the Random Forest and the Gradient Boosted Tree is in accordance with the Strength index computed from the fitted GLM.

The document is organized as follows. In Chapter 1 the Generalized Linear Model is presented in full detail. The main ingredients characterizing the GLM are discussed, then the Exponential Family of distributions is rigorously defined and its main properties are studied.
The Iterative Weighted Least Squares Algorithm used to fit the model is formulated and derived in great detail. The main statistical hypothesis tests on the GLM are reported.

In Chapter 2 the structure of the Classification and Regression Tree and its fitting algorithm are extensively discussed, with particular emphasis on the model's parameters. Pros and cons of the algorithm and an illustrative example of a CART model are reported at the end.

In Chapter 3 the Random Forest model is introduced, highlighting its main properties. The most important theoretical results obtained so far for the Random Forest are discussed. Pros and cons of the algorithm and an illustrative example of a Random Forest model are reported at the end.

In Chapter 4 the Boosted Tree model is presented. In order to derive the Gradient Boosting Algorithm used to fit the model, the Steepest-Descent Algorithm is discussed. By varying the underlying loss function, different fitting algorithms are derived. Pros and cons of the algorithm and an illustrative example of a Gradient Boosted Tree model are reported at the end.

In Chapter 5 the measures adopted as the basis for comparison are defined: the Log Loss Error, Precision, Recall, Accuracy, the F Score and the ROC Curve. The choice of the threshold separating positive and negative cases is briefly discussed.

In Chapter 6 the full case study is reported in great detail. The problem is defined and the dataset is analysed. The GLM is estimated through a rigid stepwise procedure using the Emblem software. Once the most informative features have been identified from the GLM, the calibration of the CART, Random Forest and Gradient Boosted Tree is discussed in depth. This chapter contains important empirical results showing that a new extended stochastic version of the Gradient Boosted Tree performs better in terms of accuracy on both the train and test sets, and overfits the dataset under analysis less.
The extension is obtained by randomly sampling the candidate features considered at each node of the Gradient Boosted Tree.

In Chapter 7 we draw our conclusions by choosing the optimal depth for each model under analysis. It is shown that the Variable Importance, for the Random Forest and Boosted Tree, and the Strength index, for the GLM, agree on the most predictive features for estimating the Conversion Rate. In a further study the Machine Learning models are also fit on the entire set of independent variables, showing a significant increase in accuracy. Variable Importance and Strength are discussed.
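
To make the idea of random feature sampling at the nodes more concrete, the following is a minimal pure-Python sketch, not the implementation used in the Thesis. It assumes a binary target, the log loss as objective, least-squares regression trees fitted to the negative gradient, and leaf values equal to the mean residual (a simplification of the usual leaf update). All function names, the parameter n_sub (number of features sampled per node) and the toy data are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sum_sq(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, r, rows, n_sub, rng):
    """Best least-squares split, searched over a random subset of features only."""
    candidates = rng.sample(range(len(X[0])), min(n_sub, len(X[0])))
    best = None  # (sse, feature index, threshold)
    for j in candidates:
        # Drop the largest value so that both sides of the split are non-empty.
        for t in sorted({X[i][j] for i in rows})[:-1]:
            left = [r[i] for i in rows if X[i][j] <= t]
            right = [r[i] for i in rows if X[i][j] > t]
            sse = sum_sq(left) + sum_sq(right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best

def build_tree(X, r, rows, depth, n_sub, rng):
    """Regression tree on residuals r; a leaf is a float, a node is a tuple."""
    split = best_split(X, r, rows, n_sub, rng) if depth > 0 and len(rows) > 1 else None
    if split is None:
        return sum(r[i] for i in rows) / len(rows)  # assumption: mean-residual leaf
    _, j, t = split
    left = [i for i in rows if X[i][j] <= t]
    right = [i for i in rows if X[i][j] > t]
    return (j, t, build_tree(X, r, left, depth - 1, n_sub, rng),
            build_tree(X, r, right, depth - 1, n_sub, rng))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

def fit_boosted(X, y, n_trees=30, depth=2, lr=0.3, n_sub=2, seed=0):
    rng = random.Random(seed)
    p0 = min(max(sum(y) / len(y), 1e-6), 1 - 1e-6)
    f0 = math.log(p0 / (1 - p0))  # initial score: log-odds of the base rate
    F = [f0] * len(X)
    trees = []
    for _ in range(n_trees):
        # Negative gradient of the log loss with respect to the current score.
        r = [yi - sigmoid(fi) for yi, fi in zip(y, F)]
        tree = build_tree(X, r, list(range(len(X))), depth, n_sub, rng)
        trees.append(tree)
        F = [fi + lr * predict_tree(tree, xi) for fi, xi in zip(F, X)]
    return f0, lr, trees

def predict_proba(model, x):
    f0, lr, trees = model
    return sigmoid(f0 + lr * sum(predict_tree(t, x) for t in trees))

# Hypothetical toy data: features 0 and 2 are informative, feature 1 is noise.
X_demo = [[i / 20, ((i * 7) % 10) / 10, (i / 20) ** 2] for i in range(20)]
y_demo = [1 if i >= 10 else 0 for i in range(20)]
model = fit_boosted(X_demo, y_demo)
probs = [predict_proba(model, x) for x in X_demo]
```

Setting n_sub equal to the total number of features recovers the non-stochastic search over all features at every node, so the same sketch can be used to compare the two variants on a given dataset.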

The main objective of this Thesis is to estimate the probability that an individual buys a motor third-party liability (RCA) insurance policy, called the Probability of Conversion or Conversion Rate, using three different Machine Learning algorithms: the Classification and Regression Tree (CART), the Random Forest and the Gradient Boosted Tree. The Generalized Linear Model (GLM), the reference model used to estimate the Probability of Conversion, serves as a term of comparison and as a guideline for selecting the most important independent variables. The Logarithmic Loss Function and the Confusion Matrix are the main comparison measures adopted.

A new stochastic version of the Gradient Boosted Tree, obtained by introducing a random selection of the candidate variables from which the best split is chosen at each node of the tree, is discussed in Chapter 6. Empirical results show that this new version performs better than the traditional version in terms of accuracy on both the train and test sets, and overfits the dataset under analysis less.

Chapter 7 contains the most significant results of the Thesis. The Random Forest is the model with the highest recall, while the Gradient Boosted Tree is the most precise model. The GLM behaves as an average of the two models. A second analysis shows that if the Random Forest and the Gradient Boosted Tree are estimated from the complete set of variables, thus exploiting their internal variable selection method, then the overall accuracy of the predictions improves significantly. In particular, the Random Forest performs better than the GLM with respect to all the comparison measures considered.
The Variable Importance index computed with both the Random Forest and the Gradient Boosted Tree is in line with the Strength index computed from the GLM.

The document is organized as follows. In Chapter 1 the Generalized Linear Model is presented in great detail. The main ingredients characterizing the GLM are discussed, then the Exponential Family of distributions and its properties are studied. The Iterative Weighted Least Squares Algorithm used to estimate the model is formulated and derived. Finally, the main hypothesis tests used are reported.

In Chapter 2 the structure of the Classification and Regression Tree and the model fitting algorithm are extensively discussed, with particular attention to the model parameters. Advantages and disadvantages of the algorithm and an illustrative example of a CART model are reported at the end of the chapter.

In Chapter 3 the Random Forest model is introduced, highlighting its important properties. The most important theoretical results obtained to date are discussed. Advantages and disadvantages of the algorithm and an illustrative example of a Random Forest model are reported at the end of the chapter.

In Chapter 4 the Gradient Boosted Tree model is presented. In order to derive the Gradient Boosting Algorithm used to estimate the model, the Steepest-Descent Algorithm is discussed. By varying the underlying loss function, different fitting algorithms are derived. Advantages and disadvantages of the algorithm and an illustrative example of a Gradient Boosted Tree model are reported at the end of the chapter.

In Chapter 5 the measures adopted to compare the models under analysis are defined: the Log Loss Error, Precision, Recall, Accuracy, the F Score and the ROC Curve.
The choice of the threshold separating positive cases from negative ones is briefly discussed.

In Chapter 6 the case study that is the subject of the Thesis is reported in full. The problem is defined and the dataset is described. The GLM is estimated through a rigid stepwise procedure with the support of the Emblem software. Once the main independent variables have been identified, the calibration of the CART, Random Forest and Gradient Boosted Tree is discussed in detail. The chapter contains important empirical results showing that the extended version of the Gradient Boosted Tree performs better in terms of accuracy on both the train and test sets, and overfits the dataset under analysis less. The extension of the model is obtained by introducing a random choice of the splitting variables among which the best split is selected.

In Chapter 7 the conclusions of the case study are reported; in particular, the optimal depth is chosen for each model. It is shown that the Variable Importance index, for the Random Forest and Boosted Tree, and the Strength index, for the GLM, agree on the most predictive variables for estimating the Conversion Rate. In a further study the Machine Learning models are estimated from the complete set of available variables, with significant improvements in accuracy, precision and recall. The Variable Importance and Strength indices are discussed.
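
As an illustration of the comparison measures listed for Chapter 5, the sketch below computes the Log Loss Error and the Confusion Matrix-based measures from predicted conversion probabilities. It is a minimal example under stated assumptions: the F Score is taken to be the balanced F1 score, the default threshold of 0.5 is arbitrary, and the function names and toy values are hypothetical rather than taken from the Thesis.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def confusion_matrix(y_true, p_pred, threshold=0.5):
    """Return (tp, fp, tn, fn); a case is predicted positive when p >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_pred):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def summary(y_true, p_pred, threshold=0.5):
    """Precision, recall, accuracy and F1 (assumed meaning of 'F Score')."""
    tp, fp, tn, fn = confusion_matrix(y_true, p_pred, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Hypothetical toy values: 1 = policy bought, 0 = not bought.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
p_pred = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.05]
at_half = summary(y_true, p_pred, threshold=0.5)
at_low = summary(y_true, p_pred, threshold=0.35)
```

On these toy values, lowering the threshold from 0.5 to 0.35 raises recall from 2/3 to 1 and precision from 2/3 to 3/4, which illustrates why the choice of threshold matters when models are compared on precision and recall.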