Ranking resumes using machine learning

Recruitment is a key factor in a company's competitiveness. Every time an organization posts a job offer, its recruiters have to inspect each résumé meticulously to identify the candidates who most deserve to advance in the process. Evidence shows that résumé screening is one of the major bottlenecks of recruitment, both because every job posting attracts hundreds of applications, most of which are irrelevant, and because until recently little had been done to support this activity with computers. In this thesis, we explore the application of machine learning techniques to the résumé screening task. In more detail, we show how text classification methods can be adapted to the résumé ranking problem, in which résumés have to be ranked according to their relevance to a given job type. The account we provide covers every phase of the process we followed to tackle the problem. After reviewing existing solutions and consulting recruiters, we focused on how to represent résumés in a manner suitable for machine learning algorithms: we derived three different representations, drawing on well-known Information Retrieval methods for text representation and on Named Entity Recognition for the extraction of complex features. We then selected several classification algorithms (k-Nearest-Neighbours, Nearest Centroid Classifiers, Linear Regression, Logistic Regression, and Support Vector Machines) and used their decision functions to produce a ranked list of résumés rather than a hard categorization. Finally, we compared the results obtained with every combination of résumé representation and algorithm, so as to evaluate the performance of each model and to determine which worked best; we also compared the same results with a baseline corresponding to a random ordering of the résumés, in order to quantify the impact of our work on the résumé screening process.
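
The idea of ranking with a classifier's decision function, rather than using its hard label, can be illustrated with a minimal sketch. It assumes a plain TF-IDF bag-of-words representation and a nearest-centroid scorer written from scratch; the function names (`rank_resumes`, `tfidf_vector`, and so on), the toy résumés, and the job-type labels are all hypothetical and are not taken from the thesis, whose actual representations and implementation may differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency computed on the training résumés.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse, L2-normalised TF-IDF vector; terms unseen in training are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def centroid(vectors):
    # Component-wise mean of a list of sparse vectors.
    if not vectors:
        return {}
    total = Counter()
    for vec in vectors:
        for t, v in vec.items():
            total[t] += v
    return {t: v / len(vectors) for t, v in total.items()}


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_resumes(train_texts, train_labels, job_type, candidates):
    # Score = similarity to the centroid of the target job type minus
    # similarity to the centroid of all other résumés. Sorting by this
    # continuous score yields a ranking instead of a yes/no decision.
    idf = fit_idf(train_texts)
    vectors = [tfidf_vector(t, idf) for t in train_texts]
    pos = centroid([v for v, y in zip(vectors, train_labels) if y == job_type])
    neg = centroid([v for v, y in zip(vectors, train_labels) if y != job_type])
    scored = []
    for text in candidates:
        vec = tfidf_vector(text, idf)
        scored.append((cosine(vec, pos) - cosine(vec, neg), text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy data, invented purely for illustration.
train_texts = [
    "python developer backend django sql",
    "java developer spring backend",
    "nurse hospital patient care",
    "accountant ledger tax audit",
]
train_labels = ["software", "software", "healthcare", "finance"]
candidates = [
    "registered nurse with patient care experience",
    "backend developer python sql",
    "tax accountant audit",
]

ranked = rank_resumes(train_texts, train_labels, "software", candidates)
for score, text in ranked:
    print(f"{score:+.3f}  {text}")
```

Any classifier that exposes a continuous score (a signed distance from a hyperplane, a class probability, a regression output) could take the place of the centroid scorer here; only the sorting step is specific to ranking.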

Personnel selection is a key factor in a company's competitiveness. For every job advertisement published, the human resources office has to go meticulously through each résumé received in order to identify the candidates best suited to the open position. Several studies show that résumé screening is among the main bottlenecks of the selection process, both because of the high number of applications attracted by each advertisement (most of which are irrelevant) and because it is an activity that still receives little support from information technology. In this thesis, we therefore set out to explore the application of machine learning techniques to résumé screening. In more detail, we show how some text classification methods can be adapted to the application ranking problem, which requires ordering résumés by their relevance to a certain type of job. The resulting account covers all the phases of the process through which we tackled the problem. After a study of the existing solutions in the literature and on the market, and after consulting several recruiters, we focused on how to represent résumés in a manner compatible with the machine learning algorithms we intended to use: this led to three different representations, obtained both through standard text representation methods and through Named Entity Recognition techniques for the extraction of complex information. We then chose several classification algorithms (k-Nearest-Neighbours, Nearest Centroid classifiers, Linear Regression, Logistic Regression, and Support Vector Machines) and used their decision functions to produce a ranking of the résumés with respect to a certain class of jobs.
Finally, we compared the results obtained with every possible combination of text representation and algorithm, so as to assess which one produces the best results. At the same time, we also related the results obtained to a baseline corresponding to a random ordering of the résumés, in order to quantify the impact of our work on the selection process.
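
The comparison against a random-ordering baseline can be sketched as follows. The abstract does not say which ranking metric was used, so average precision is an assumption made here for illustration; the relevance labels are likewise invented and are not results from the thesis.

```python
import random


def average_precision(ranked_relevance):
    # ranked_relevance: booleans in ranked order, True meaning "relevant résumé".
    # Averages precision@k over the positions k where a relevant résumé appears.
    hits = 0
    total = 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def random_baseline(relevance, trials=1000, seed=0):
    # Estimates the metric expected from a random ordering of the same
    # résumés by averaging over many shuffles (fixed seed for repeatability).
    rng = random.Random(seed)
    shuffled = list(relevance)
    scores = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        scores.append(average_precision(shuffled))
    return sum(scores) / trials


# Hypothetical ranking produced by some model: 3 relevant résumés out of 8.
model_ranking = [True, True, False, True, False, False, False, False]

model_ap = average_precision(model_ranking)
baseline_ap = random_baseline(model_ranking)
print(f"model AP:    {model_ap:.3f}")
print(f"baseline AP: {baseline_ap:.3f}")
```

The gap between the two numbers gives a rough sense of how much screening effort a ranking might save relative to reading résumés in arbitrary order.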