Piecewise-Linear Manifolds for
Deep Metric Learning

Approximate the unlabeled data manifold with many small linear pieces, use those local neighborhoods to estimate which points are similar, and train an embedding for zero-shot image retrieval, without any labels.

A toy example showing the effect of our method on a 2D embedding. Each point is an image embedding. We approximate the data manifold locally with small linear pieces, and treat points sharing a low-dimensional neighborhood as similar. Training then pulls points on the same piece (or the same low-dimensional subspace) closer together and pushes points on different pieces apart, shaping a representation space suited to retrieval, all without labels.

Abstract

Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.

TL;DR

Approximate the unlabeled data manifold with local linear pieces, use those neighborhoods to estimate similarity and supervise an embedding network, and model the pieces with proxies, giving state-of-the-art unsupervised deep metric learning.

Intuition

We propose a framework for unsupervised metric learning having 3 steps:

  1. Approximate the manifold locally. Approximate the submanifold at each point with a local linear model over a neighborhood small enough to contain a single, adequately linear submanifold; such low-dimensional subspaces hold semantically similar points.
  2. Estimate similarity. Estimate pairwise similarity from the projection of x1 - x2 along the linear neighborhood D versus its normal, so similarity decays faster orthogonal to the neighborhood than along it.
  3. Train the embedding. Train the embedding to pull together points in the same neighborhood or the same low-dimensional subspace and push apart points in different subspaces.

Method

The figure below provides an overview of our method, from neighborhood sampling and the piecewise-linear approximation of the manifold to the proxy- and point-based similarities used to train the embedding.

Overview of the piecewise-linear manifold pipeline: sampled points are embedded by a momentum encoder, used to build a piecewise-linear approximation of the data manifold, then point-point and proxy-point similarities modulate distances and are optimized by backpropagation
The figure provides an overview of our method. Points are selected using the neighborhood sampling and their embeddings are calculated using a momentum encoder. Embeddings generated by the momentum encoder are used to construct a piecewise linear approximation of the data manifold. These embeddings are used to calculate point-point and proxy-point similarities which are in turn used to modulate the distance between them. Locations and neighborhoods of proxies are also updated through backpropagation. Losses colored yellow/green are calculated only using quantities with the same color.

Results

Our method outperforms current unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks (CUB-200-2011, Cars-196 and SOP). Please refer to the paper for more detailed results and ablations.

61.7
Recall@1 on CUB-200-2011 (GoogleNet, 512-d), state-of-the-art among UDML methods
+2.8
Recall@1 on CUB-200-2011 over the previous best unsupervised method
66.4
Recall@1 on SOP (GoogleNet, 512-d), the largest of the three benchmarks
Full benchmark table: Recall@K across CUB-200-2011, Cars-196 and SOP for GoogleNet 128-dim and 512-dim embeddings, with our method outperforming prior unsupervised deep metric learning baselines
Recall@K on the CUB-200-2011, Cars-196 and SOP zero-shot retrieval benchmarks for 128-dim and 512-dim GoogleNet embeddings; our method outperforms existing unsupervised metric learning approaches.

BibTeX

If you find our work useful, please consider citing:

@InProceedings{pmlr-v234-bhatnagar24a,
    title     = {Piecewise-Linear Manifolds for Deep Metric Learning},
    author    = {Bhatnagar, Shubhang and Ahuja, Narendra},
    booktitle = {Conference on Parsimony and Learning},
    pages     = {269--281},
    year      = {2024},
    editor    = {Chi, Yuejie and Dziugaite, Gintare Karolina and Qu, Qing
                 and Wang, Atlas Wang and Zhu, Zhihui},
    volume    = {234},
    series    = {Proceedings of Machine Learning Research},
    month     = {03--06 Jan},
    publisher = {PMLR}
}