Mode-Detection via Median-ShiftProject Page
Median-shift is a mode seeking algorithm that relies on computing the median of local neighborhoods, instead of the mean. We further combine median-shift with Locality Sensitive Hashing (LSH) and show that the combined algorithm is suitable for clustering large scale, high dimensional data sets. In particular, we propose a new mode detection step that greatly accelerates performance. In the past, LSH was used in conjunction with mean shift only to accelerate nearest neighbor queries. Here we show that we can analyze the density of the LSH bins to quickly detect potential mode candidates and use only them to initialize the median-shift procedure. We use the median, instead of the mean (or its discrete counterpart - the medoid) because the median is more robust and because the median of a set is a point in the set. A median is well defined for scalars but there is no single agreed upon extension of the median to high dimensional data. We adopt a particular extension, known as the Tukey median, and show that it can be computed efficiently using random projections of the high dimensional data onto $1D$ lines, just like LSH, leading to a tightly integrated and efficient algorithm.