Estimation of costs and revenues based on online travel agencies' historical data
NECHOFSKI, ZDRAVKO
2014/2015
Abstract
This thesis addresses the estimation of the non-deterministic part of the revenues and costs associated with a flight ticket. Online travel agencies (OTAs) need predictive tools to set the optimal price of flights and thereby maximize their expected profit. The final price depends on the base ticket price and on other costs and revenues such as insurance, car rental and commissions. To predict the price, OTAs use proprietary algorithms that are usually based on the experience the OTA has accumulated in the field and that take as input features of the flight known at the time the customer performs a search. Since the base price and part of the costs are deterministic, the challenging task is to estimate the non-deterministic costs and revenues, whose sum we call Other Costs and Revenues, Estimated (OCRE). We propose an alternative to the existing approach for estimating the OCRE. To take advantage of the massive customer and transactional data generated by OTAs, we considered data mining, statistical and machine learning techniques. Our goal was to create a methodology that reduces bias and/or variance with respect to the methodology currently used by a specific OTA, which is based on expert experience. We also required models with low computational complexity, since the price prediction has to be applied in real time under a high load, with millions of flights being priced every minute. Our solution was applied to a real booking dataset that contains the type of flight, the number of passengers, the insurance, the date of the ticket purchase, the currency and other relevant information about the booking.
The research part of this thesis was completed in two phases: feature selection, which we performed using the Extremely Randomized Trees algorithm, and prediction, in which we implemented a system for predicting the OCRE based on different approaches, including the OTA's rules, linear regression, the overall mean, regression trees and pricing contexts. The results did not identify a single method that outperforms the others in all situations, but they demonstrate that data analysis and machine learning approaches can compete with the proprietary methods used by OTAs for predicting flight prices. The results also provide empirical evidence for the capabilities of Extremely Randomized Trees in the feature selection task, compared to other algorithms, when dealing with high-dimensional sparse datasets.
File | Size | Format | |
---|---|---|---|
2016_04_NECHOFSKI.pdf (Thesis document; not accessible) | 11.94 MB | Adobe PDF | |
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/119381