Hierarchical Bayesian nonparametric language models

A goal of statistical language modeling is to learn the probability of a sequence of words, i.e. a sentence. Estimating this probability is of great interest in all areas of natural language processing (NLP). In fact, estimating how likely a sentence is to be correct is the main task of algorithms for speech recognition, spelling correction (or spell checking), handwriting recognition, optical character recognition and machine translation. Research in these areas is very active, and increasingly sophisticated statistical techniques have been employed over the years to improve the accuracy of the estimator. The Bayesian approach has been widely used in language modeling. One of its major advantages is the ability to introduce a prior distribution that can steer inference towards better solutions. One of the best Bayesian models used in this context is the hierarchical Pitman-Yor language model (HPYLM), a nonparametric Bayesian model with a hierarchical structure based on the Pitman-Yor process. This prior is particularly effective for language models because it generates a power-law distribution, the same kind of distribution found in natural languages. Starting from the HPYLM, we wanted to create an extension that could work with multiple collections of text documents, i.e. several corpora. Since the data are a collection of different corpora, our model associates each corpus with a specific HPYLM. To these we add a further model, called the global HPYLM, which influences all the corpora according to a parameter. We denote the entire model as the Shared hierarchical Pitman-Yor language model (SHPYLM). This new language model is then implemented in a C++/Python package. Finally, experiments on simulated data verify that our model performs better than the standard HPYLM.

The main goal of statistical language modeling is to learn the probability of a sequence of words, that is, of a sentence. Estimating this probability is of great interest in every field related to natural language processing (NLP). Indeed, estimating the probability that a sentence is correct is the main task of algorithms for speech recognition, spelling correction (or spell checking), handwriting recognition, optical character recognition and machine translation. Research in these fields is very active, and increasingly sophisticated statistical techniques have been employed over the years to improve the accuracy of the estimator. A statistical approach widely used in language modeling is the Bayesian one. One of the major advantages of using Bayesian statistics in language modeling is the ability to introduce a prior distribution that can steer inference towards better solutions. One of the best Bayesian models used in this context is the hierarchical Pitman-Yor language model (HPYLM), a nonparametric Bayesian model with a hierarchical structure based on the Pitman-Yor process. This prior is particularly effective for language models because it generates a power-law distribution, the same kind of distribution found in natural languages. Starting from the HPYLM, we considered an extension that could work with multiple collections of text documents, i.e. several corpora. Since the data are a collection of different corpora, our model associates each corpus with a specific HPYLM. To these we add a further model, called the global HPYLM, which influences all the corpora according to a parameter. We denote the entire model as the Shared hierarchical Pitman-Yor language model (SHPYLM). This new language model was then implemented in a C++/Python package. Finally, experiments on simulated data verified that our model performs better than the standard HPYLM.
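
The power-law behavior attributed above to the Pitman-Yor prior can be illustrated with a small simulation. The following is a minimal sketch, not the C++/Python package mentioned in the abstract: it draws a seating arrangement from the two-parameter Chinese restaurant process, a standard sequential construction of the Pitman-Yor process. The function name `pitman_yor_crp`, the parameter names and the numeric values are assumptions chosen for illustration only.

```python
import random


def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Seat customers one at a time and return the list of table sizes.

    Hypothetical helper for illustration. Assumes 0 <= discount < 1 and
    concentration > -discount (the usual Pitman-Yor parameter range).
    """
    rng = random.Random(seed)
    table_sizes = []
    for n in range(n_customers):
        k = len(table_sizes)
        # Unnormalized weight of opening a new table: theta + d * K.
        new_table_weight = concentration + discount * k
        # Total weight: sum_k (size_k - d) + theta + d * K = n + theta.
        u = rng.random() * (n + concentration)
        if not table_sizes or u < new_table_weight:
            table_sizes.append(1)
            continue
        u -= new_table_weight
        for i, size in enumerate(table_sizes):
            # Existing table i is chosen with weight size_i - d.
            u -= size - discount
            if u < 0:
                table_sizes[i] += 1
                break
        else:
            # Guard against floating-point round-off at the upper boundary.
            table_sizes[-1] += 1
    return table_sizes


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the thesis.
    py_tables = pitman_yor_crp(5000, discount=0.8, concentration=1.0)
    dp_tables = pitman_yor_crp(5000, discount=0.0, concentration=1.0)
    print("tables with discount 0.8:", len(py_tables))
    print("tables with discount 0.0:", len(dp_tables))
    print("largest tables:", sorted(py_tables, reverse=True)[:10])
```

If tables are read as word types and customers as word tokens, a positive discount yields many small tables next to a few very large ones, which is the heavy-tailed pattern the abstract refers to. Setting the discount to zero recovers the Dirichlet process, where the number of tables grows only logarithmically with the number of customers.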