Potential Field Based Deep Metric Learning

Abstract

Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present Potential Field based Metric Learning (PFML), a novel compositional DML model, inspired by electrostatic fields in physics, that, instead of operating on tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where the mutual influence of samples grows with their distance, we enforce a reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real-world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks, Cars-196, CUB-200-2011, and SOP, where it outperforms state-of-the-art baselines.

TL;DR:

Use a continuous potential field to represent interactions between a set of example embeddings, instead of using subsets of examples (triplets/tuplets) or proxies.

Intuition

An example of PFML on a toy problem with embeddings from two classes. PFML creates a potential field (visualized) by superposing the attractive and repulsive fields generated by individual embeddings. Each embedding is drawn towards nearby embeddings of the same class while being driven away from embeddings of other classes, a mirror image of the behavior of an isolated system of electric charges. Such a potential field is defined for the embeddings of each class, with the field for the blue embeddings being visualized in the animation. The movement shown in the animation is the net effect of all potentials (blue and red).




Advantages of PFML vs Previous DML Approaches

  1. The potential field representation enables modeling interactions between all sample embeddings, as opposed to modeling those between small subsets (of sample or proxy points) as done in all previous methods, e.g., point-tuplet based losses (contrastive, triplet, N-tuplet) and proxy-based losses (Proxy-NCA, Proxy-Anchor).
  2. Modeling interactions of all points, made possible by the use of potentials, helps:
    • improve the quality of the learned features, while also
    • increasing robustness to noise, since interactions among a smaller number of samples are more strongly affected by noise (their effect has larger variance).
  3. A major difference of our potential field based approach from previous approaches lies in how the strength of interaction between two points varies as the distance between them increases: instead of remaining constant or even growing stronger, as in most existing methods, in our model it becomes weaker with distance. This decay in interaction strength is helpful in several ways:
    • It captures the intuitive expectation that two distant positive samples are too different to be considered variants of each other, helping treat them as different varieties (e.g., associated with different proxies).
    • The decay also significantly improves performance under label noise, e.g., due to annotation errors common in real-world datasets.
    • As a result of the decay, the learned proxies remain closer to (at smaller Wasserstein distance W2 from) the sample embeddings they represent than in methods (e.g., current proxy-based ones) where interactions strengthen with distance, thereby enhancing their intended role.
Further details can be found in our paper.
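The decay property above can be sketched with a toy comparison. The functional forms here are hypothetical (the paper defines its own potentials in Section 3); they only contrast a constant-within-margin interaction, as in triplet-style losses, with one that fades as distance grows:

```python
import numpy as np

def interaction_strength(d, decay="potential_field", margin=1.0):
    """Illustrative interaction strength between two embeddings at distance d.

    'triplet'         : constant influence wherever the margin is violated
                        (classic triplet-style behavior)
    'potential_field' : influence decays with distance, so nearby samples
                        dominate and distant ones fade out (hypothetical
                        1/r-style form, not the paper's exact definition)
    """
    if decay == "triplet":
        return np.where(d < margin, 1.0, 0.0)
    return 1.0 / (1.0 + d) ** 2
```

Under the decaying form, a far-away positive sample barely pulls on an embedding, which is what lets distant same-class samples settle around different proxies.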

Method

For each class, PFML defines a class potential field Ψ that acts only on embeddings of that class. This class potential field brings together embeddings of the class while pushing them away from embeddings of other classes. The class potential field is formed by superposing the potentials of individual embeddings from all classes. The potential exerted by an individual embedding is designed based on both principles from electrostatics and observations from the DML literature. More details and exact definitions of the potential field can be found in Section 3 of our paper.
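A minimal sketch of evaluating such a class field at a point is shown below. The 1/r-style potential and the function name `class_potential` are illustrative assumptions, not the paper's exact definition; the sketch only shows the superposition of signed, distance-decaying contributions:

```python
import numpy as np

def class_potential(x, embeddings, labels, cls, eps=1e-8):
    """Evaluate an illustrative class potential field Psi_cls at point x.

    Same-class embeddings contribute an attractive (negative) potential,
    other-class embeddings a repulsive (positive) one; both decay with
    distance, loosely like an electrostatic 1/r potential.
    """
    d = np.linalg.norm(embeddings - x, axis=1)  # distances from x to all embeddings
    sign = np.where(labels == cls, -1.0, 1.0)   # attract same class, repel others
    return np.sum(sign / (d + eps))             # superposition of individual fields
```

Points of class `cls` then move downhill on this field: towards regions dominated by nearby same-class embeddings and away from other classes.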

Our potential-field based DML pipeline includes: (1) computing the attraction and repulsion fields generated by each embedding and proxy; (2) computing the class potential fields by superposing the individual fields; (3) evaluating the total potential energy by summing the potentials of embeddings and proxies under the class potential fields; and (4) updating the locations of sample embeddings (through the network parameters) and proxies via backpropagation to minimize the total potential energy.
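Steps (1)-(3) of the pipeline can be sketched as a single energy computation. This is a hypothetical NumPy rendering with an assumed 1/r potential, not the paper's exact loss; step (4) would be performed by an autodiff framework minimizing this quantity:

```python
import numpy as np

def total_potential_energy(embeddings, proxies, labels, proxy_labels, eps=1e-8):
    """Illustrative PFML-style objective: total potential energy of all
    embeddings and proxies under signed, distance-decaying pairwise potentials.
    """
    pts = np.concatenate([embeddings, proxies], axis=0)
    lbl = np.concatenate([labels, proxy_labels])
    # (1) pairwise distances between every embedding/proxy pair
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # (2) superpose attractive (same-class) and repulsive (other-class) fields
    sign = np.where(lbl[:, None] == lbl[None, :], -1.0, 1.0)
    pot = sign / (d + eps)
    np.fill_diagonal(pot, 0.0)  # no self-interaction
    # (3) total potential energy, averaged over points
    return pot.sum() / len(pts)
```

Minimizing this energy pulls same-class points (and their proxies) together and pushes different-class points apart, with nearby pairs dominating the gradient.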

Performance on DML Benchmarks

As is common, we evaluate the metric learned by our method using its performance on zero-shot image retrieval on three standard benchmarks: (1) Cars-196, (2) CUB-200-2011, and (3) SOP. We also train four different backbone networks (ResNet-50, BN-Inception, ViT, and DINO) with our method to enable a fair comparison with other methods that use them. The table below summarizes the performance of our method and other state-of-the-art methods on these datasets.
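For reference, the Recall@1 metric used in these comparisons checks, for each test image, whether its nearest neighbor in embedding space (excluding itself) shares its class label. A standard sketch:

```python
import numpy as np

def recall_at_1(embeddings, labels):
    """Recall@1 for zero-shot retrieval: fraction of queries whose nearest
    neighbor (excluding the query itself) has the same class label."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # a query cannot retrieve itself
    nn = d.argmin(axis=1)         # index of each query's nearest neighbor
    return float(np.mean(labels[nn] == labels))
```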



As seen in the table, our method outperforms all other methods in terms of Recall@1 on all three datasets.

Visual Results

t-SNE Visualization of the Embedding Space

The figure below shows a t-SNE visualization of the embedding space learned by our method on the CUB-200-2011 dataset. It can be seen that images closer together share more semantic characteristics than those that are far apart.

Zero-shot Retrieval Examples

Example images retrieved by our method for query images from the (a) Cars-196, (b) CUB-200-2011, and (c) SOP test datasets, in increasing order of distance from the query. Correct retrievals have a green border, while incorrect ones have a red one. Despite large intra-class variation (pose, color) in the datasets, our method is able to effectively retrieve similar images.

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.