Bayesian clustering of high-dimensional data via latent repulsive mixtures
GHILOTTI, LORENZO
2019/2020
Abstract
Collecting high-dimensional data has become routine in modern applications. In such contexts, it is often of interest to cluster subjects based on these observations. Bayesian model-based approaches can quantify the uncertainty of the resulting inference, but they often perform poorly in such settings. Chandra et al. (2020) show that standard mixture models produce inconsistent inference in high-dimensional settings, and propose a general class of models, called Lamb, that overcomes this issue. Their approach consists of linking the high-dimensional observations to a set of low-dimensional latent factors through a matrix of loadings, and performing model-based clustering via nonparametric mixture models on the latent space. However, Lamb models are very likely to be misspecified, which leads to inconsistent clustering. On the other hand, repulsive mixture models have recently shown empirical evidence of robustness to misspecification, although only for low-dimensional data. In our work, we propose an alternative to Lamb. Its global structure is similar to that of the Lamb model, but it overcomes Lamb's drawbacks by employing repulsive mixture models to cluster the latent factors. In particular, we assume a mixture of Gaussian densities for the latent factors and an anisotropic repulsive point process as the prior for the locations of the mixture components. We let the matrix of factor loadings drive the anisotropic behavior, so that separation is induced between the high-dimensional centers of different clusters. Within the class of repulsive point processes, we propose a general construction for anisotropic determinantal point processes (DPPs), which guarantees analytic availability of their spectral densities. Finally, an efficient Markov chain Monte Carlo (MCMC) algorithm is proposed, and the performance of our model is compared with the Lamb model on a simple simulated dataset.
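To fix ideas, a minimal sketch of the hierarchy described in the abstract is given below. The notation ($y_i$ for a $p$-dimensional observation, $\Lambda$ for the loading matrix, $\eta_i$ for the $d$-dimensional latent factors, $\mu_h$ for the component locations) and the Gaussian noise term are illustrative assumptions, not the exact specification adopted in the thesis.

```latex
% Illustrative hierarchy only: the notation and the Gaussian noise term are
% assumptions, not the exact specification of the thesis or of Chandra et al. (2020).
\begin{align*}
  y_i \mid \eta_i, \Lambda
    &\sim \mathcal{N}_p(\Lambda \eta_i,\, \Sigma_\epsilon),
    && i = 1, \dots, n, \quad \eta_i \in \mathbb{R}^d,\ d \ll p, \\
  \eta_i \mid \{(\pi_h, \mu_h, \Sigma_h)\}_h
    &\sim \textstyle\sum_h \pi_h\, \mathcal{N}_d(\mu_h, \Sigma_h),
    && \text{mixture of Gaussians on the latent space}, \\
  \{\mu_h\}_h \mid \Lambda
    &\sim \text{anisotropic repulsive point process (e.g.\ a DPP)},
    && \text{anisotropy driven by } \Lambda .
\end{align*}
```

Under such a construction, repulsion among the latent locations $\mu_h$, stretched according to $\Lambda$, translates into separation between the high-dimensional cluster centers $\Lambda \mu_h$, which is the behavior the abstract describes.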
File | Size | Format | Access
---|---|---|---
thesis.pdf | 626.12 kB | Adobe PDF | authorized users only, from 30/03/2022
Documents in POLITesi are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/10589/174082