Piecewise-Linear Manifolds for Deep Metric Learning
Conference on Parsimony and Learning (CPAL) 2024, Oral

Abstract

Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.

Intuition

A toy example showing the effect of our method on a 2D embedding


We propose a framework for unsupervised metric learning with three steps:
  1. Identify and approximate the submanifold at each point (e.g., ellipses A-E) by a linear model over a neighborhood small enough that it (i) does not contain multiple submanifolds and (ii) the single submanifold it contains is approximately linear. Such low-dimensional (1D) subspaces are assumed to contain semantically similar points.
  2. Estimate the similarity of each pair of points (x1, x2) from the lengths of the projections of the vector x1 − x2 onto the linear neighborhood D and onto the normal to D. Similarity decays faster orthogonal to a neighborhood than along it.
  3. Train the network embedding to bring similar points closer together and push dissimilar points apart. This pulls together points within the same low-dimensional neighborhood, as well as points in different neighborhoods that lie in the same low-dimensional (1D) space (e.g., A, B, C). Points that do not lie in the same low-dimensional space (e.g., B, D, E) are pushed away from each other.
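The similarity estimate in step 2 can be sketched in code. This is a minimal illustration, not the paper's exact formulation: it assumes the local linear piece is fit by PCA on a neighborhood, and the function name, subspace dimension, and Gaussian-style decay scales (`sigma_par`, `sigma_perp`) are illustrative assumptions.

```python
import numpy as np

def local_similarity(x1, x2, neighborhood, dim=1, sigma_par=1.0, sigma_perp=0.25):
    """Estimate similarity of x1 and x2 from projections of x1 - x2 onto a
    local linear approximation of the manifold (sketch, not the paper's exact
    estimator). `neighborhood` is an (n, D) array of points near x1."""
    # Fit a low-dimensional (here 1D) linear piece via PCA on the neighborhood.
    centered = neighborhood - neighborhood.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:dim]                  # top principal directions span D
    diff = x1 - x2
    par = basis @ diff                # component along the linear piece
    perp = diff - basis.T @ par       # residual orthogonal to it
    d_par = np.linalg.norm(par)
    d_perp = np.linalg.norm(perp)
    # Similarity decays faster orthogonal to the neighborhood than along it
    # (sigma_perp < sigma_par).
    return np.exp(-(d_par / sigma_par) ** 2 - (d_perp / sigma_perp) ** 2)
```

With the scales above, a displacement orthogonal to the local piece yields a much smaller similarity than an equal displacement along it, matching the intuition in step 2.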




The figure provides an overview of our method. Points are selected via neighborhood sampling, and their embeddings are computed with a momentum encoder. The embeddings generated by the momentum encoder are used to construct a piecewise-linear approximation of the data manifold.
These embeddings are then used to calculate point-point and proxy-point similarities, which in turn modulate the distances between them. The locations and neighborhoods of the proxies are also updated through backpropagation. Losses colored yellow/green are calculated using only quantities of the same color.
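The idea of using estimated similarities to modulate pairwise distances can be sketched as a similarity-weighted contrastive objective. This is a hypothetical simplification, not the paper's actual loss: the `margin` hyperparameter and the exact weighting scheme are assumptions made for illustration.

```python
import numpy as np

def similarity_weighted_loss(emb, sim, margin=0.5):
    """Sketch of a similarity-modulated pairwise loss. `emb` is an (n, d)
    array of embeddings; `sim` is an (n, n) array of estimated similarities
    in [0, 1]; `margin` is an illustrative hyperparameter."""
    n = len(emb)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(emb[i] - emb[j])
            # Similar pairs are penalized for being far apart; dissimilar
            # pairs are penalized for being closer than the margin.
            loss += sim[i, j] * d ** 2 + (1 - sim[i, j]) * max(0.0, margin - d) ** 2
    return loss / (n * (n - 1) / 2)
```

An embedding that places high-similarity pairs close together and low-similarity pairs far apart attains a lower value of this objective, which is the behavior step 3 trains for.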


Example Embedding

The figure below shows a t-SNE visualization of the embedding learned by our network on the CUB-200-2011 dataset.





Results

Our method outperforms all current unsupervised methods in terms of Recall@1. Please refer to the paper for more detailed results and ablations.

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.