Presentation: Clustering Heterogeneous Gaussian Data without Prior Knowledge of the Number of Clusters

Monday, 9 December 2024, 09:30 - 10:30

Team:

Location: To be determined

Authors: D. Pastor, E. Dupraz

Speaker: D. Pastor

Abstract: This paper addresses the problem of clustering measurement vectors that are heterogeneous in that they can have different covariance matrices. Assuming that the measurement vectors within a given cluster are Gaussian distributed around the cluster centroid, with possibly different covariance matrices, we introduce a novel cost function to estimate the centroids. The zeros of the gradient of this cost function turn out to be the fixed points of a certain function. As such, the approach generalizes the methodology used to derive the existing Mean-Shift algorithm. As a main and novel theoretical result compared to Mean-Shift, this paper shows that the sole fixed point of the identified function is one of the cluster centroids if both the number of measurements per cluster and the distance between this centroid and the other ones tend to infinity. As a second contribution, this paper introduces the Wald kernel for clustering. This kernel is defined as the p-value of the Wald hypothesis test for testing the mean of a Gaussian. As such, the Wald kernel measures the plausibility that a measurement vector belongs to a given cluster, and it scales better with the dimension of the measurement vectors than the usual Gaussian kernel. Finally, the proposed theoretical framework allows us to derive a new clustering algorithm named CENTREx, which works by estimating the fixed points of the identified function. This algorithm relies on a Wald hypothesis test to significantly reduce the number of estimated fixed points compared to the Mean-Shift algorithm, resulting in a clear gain in complexity. Simulation results on synthetic and real datasets show that CENTREx performs comparably to or better than the standard clustering algorithms K-means and Mean-Shift, even when the covariance matrices are not perfectly known.
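To make the idea of a p-value used as a kernel more concrete, the sketch below computes a chi-square p-value for the hypothesis that a measurement vector has a given centroid as its mean. It is only an illustration under simplifying assumptions that do not come from the abstract: the covariance matrix is taken to be diagonal and known, the dimension is even (so the chi-square survival function has a closed form), and the test is the plain "mean equals centroid" test. The function names `chi2_sf_even` and `wald_pvalue_kernel` are hypothetical, and the exact Wald kernel defined in the paper may differ.

```python
import math


def chi2_sf_even(x, dof):
    """Survival function P(X > x) of a chi-square law with an even dof.

    For dof = 2k the closed form is exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j  # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total


def wald_pvalue_kernel(y, centroid, variances):
    """p-value of the test 'the mean of y is centroid' (assumed setup).

    y, centroid: sequences of equal, even length.
    variances: per-coordinate variances of y (diagonal covariance, assumed known).
    """
    # Squared Mahalanobis distance, chi-square distributed under the null.
    stat = sum((yi - ci) ** 2 / vi for yi, ci, vi in zip(y, centroid, variances))
    return chi2_sf_even(stat, len(y))


if __name__ == "__main__":
    centroid = [0.0, 0.0]
    # Same offset from the centroid, two different noise levels.
    print(wald_pvalue_kernel([2.0, 0.0], centroid, [1.0, 1.0]))
    print(wald_pvalue_kernel([2.0, 0.0], centroid, [4.0, 4.0]))
```

In this toy setting, the same offset from the centroid receives a larger weight when the measurement is noisier, which is one way to read the heterogeneity the abstract refers to: each measurement is judged against its own covariance rather than a single shared bandwidth.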
