Prior-Aware Multilabel Food Recognition using Graph Convolutional Networks
MetaFood Workshop, CVPR 2024

Abstract

Multi-label Recognition (MLR) involves identifying multiple objects within a single image. To address the added complexity of this problem, recent works leverage vision-language models (VLMs) trained on large image-text datasets. These methods learn an independent classifier for each object (class), overlooking correlations among their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between pairs of classes. We propose a framework that extends these independent classifiers by incorporating pairwise co-occurrence information to improve their performance. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes by refining the initial estimates derived from image and text sources obtained with VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.

Motivation



To mitigate the paucity of labeled data in multi-label recognition (MLR), recent approaches adapt large vision-language models (VLMs), commonly by learning a pair of positive/negative prompts that forms a binary classifier for each class (sketched below). However, these methods learn independent prompts (classifiers) for each class. In practice, many objects occur together, making their occurrences interdependent. Independent classifiers neglect this mutual information, which, if exploited, could improve the performance of the individual classifiers.
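The prompt-pair classifier referenced above can be summarized in a minimal sketch. It follows the general DualCoOp-style recipe rather than any specific implementation: in a real system the positive/negative prompt embeddings would be produced by CLIP's text encoder from learnable context tokens plus the class name, and all names here (e.g. PromptPairClassifier, the temperature value) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPairClassifier(nn.Module):
    """One learnable (positive, negative) prompt embedding per class; the
    difference of their similarities to the image feature acts as a
    per-class binary logit ("class present" vs. "class absent")."""

    def __init__(self, num_classes, embed_dim, temperature=0.07):
        super().__init__()
        # In practice these would come from CLIP's text encoder applied to
        # learnable context tokens + the class name; random init here.
        self.pos = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.neg = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.tau = temperature

    def forward(self, image_features):
        # image_features: (batch, embed_dim) CLIP image features.
        f = F.normalize(image_features, dim=-1)
        sim_pos = f @ F.normalize(self.pos, dim=-1).T / self.tau
        sim_neg = f @ F.normalize(self.neg, dim=-1).T / self.tau
        return sim_pos - sim_neg  # (batch, num_classes) binary logits
```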


Proposed Model

Given an image with multiple objects, we extract image and text features from its subimages using a vision-language model (CLIP). An image-text feature aggregation module (Sec. 3.1) combines these features to identify all classes present in the image as the union of the classes present in the subimages, yielding an initial set of image-level class logits. These logits are passed to a GCN that uses conditional probabilities between classes to refine the initial predictions (Sec. 3.2). We train this framework with a Reweighted Asymmetric Loss (RASL), a per-class weighted variant of the Asymmetric Loss (ASL), to address class imbalance in the training data. Sketches of the GCN refinement and of RASL follow below.
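A minimal PyTorch sketch of the refinement step, under stated assumptions: the conditional-probability graph is estimated from training-label co-occurrence counts, and the GCN treats each class as a node whose scalar feature is its initial logit. All names (conditional_adjacency, LogitRefinerGCN) and the two-layer residual architecture are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conditional_adjacency(labels, eps=1e-6):
    """Estimate P(class j | class i) from a float multi-hot label matrix of
    shape (num_samples, num_classes). Hypothetical helper; the paper states
    only that conditional probabilities are derived from the training data."""
    counts = labels.T @ labels                      # co-occurrence counts (C, C)
    per_class = counts.diagonal().clamp(min=eps)    # occurrences of each class
    return counts / per_class.unsqueeze(1)          # row i holds P(. | class i)

class LogitRefinerGCN(nn.Module):
    """Two-layer GCN over the class graph: each class is a node whose scalar
    feature is its initial VLM-derived logit; edges carry conditional
    probabilities, so correlated classes exchange evidence."""

    def __init__(self, adjacency, hidden_dim=64):
        super().__init__()
        # Standard GCN propagation rule: symmetrically normalize A + I.
        a = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = a.sum(dim=1).rsqrt()
        self.register_buffer(
            "a_hat", d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
        )
        self.fc1 = nn.Linear(1, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, logits):                      # logits: (B, C)
        h = logits.unsqueeze(-1)                    # node features (B, C, 1)
        h = F.relu(self.a_hat @ self.fc1(h))        # propagate + transform
        h = (self.a_hat @ self.fc2(h)).squeeze(-1)  # back to one logit per class
        return logits + h                           # residual refinement
```

RASL can be sketched similarly: the asymmetric loss of Ridnik et al. with a per-class weight. The weighting scheme below (a fixed weight vector, e.g. inverse class frequency) is an assumption; the paper states only that per-class losses are reweighted.

```python
class ReweightedASL(nn.Module):
    """Asymmetric loss with per-class reweighting (hedged sketch)."""

    def __init__(self, class_weights, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
        super().__init__()
        self.register_buffer("w", class_weights)    # (C,) per-class weights
        self.gamma_pos, self.gamma_neg, self.clip = gamma_pos, gamma_neg, clip

    def forward(self, logits, targets, eps=1e-8):
        p = torch.sigmoid(logits)
        p_shift = (p - self.clip).clamp(min=0)      # ASL probability margin
        pos = targets * (1 - p).pow(self.gamma_pos) * (p + eps).log()
        neg = (1 - targets) * p_shift.pow(self.gamma_neg) * (1 - p_shift + eps).log()
        return -(self.w * (pos + neg)).mean()
```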

Results

Our method outperforms all state-of-the-art baselines on four MLR datasets in the low-data regime: FoodSeg103, UNIMIB 2016, COCO-small (5% of COCO's training data), and VOC 2007. Our approach achieves the best performance on all metrics: per-class and overall precision (CP and OP), recall (CR and OR), F1 score (CF1 and OF1), and mean average precision (mAP).

We also compare the average performance of our approach with the previous state-of-the-art VLM-based method, DualCoOp [41], on classes that are difficult to recognize using visual features alone (those with the 10 lowest CF1 values on FoodSeg103 [44] and UNIMIB [10]). Our approach significantly improves MLR performance on such classes by exploiting class conditional probabilities.

Refining the VLM-based initial logits with the information provided by conditional probabilities improves a class's average precision (ΔAP); we plot this improvement as a function of the mean conditional probability of the class's three most co-occurring classes.
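For concreteness, the quantity on the x-axis of this analysis can be computed from a conditional-probability matrix like the one sketched above; this tiny helper is an illustrative assumption, not the paper's code.

```python
import torch

def mean_top3_conditional(adjacency):
    """For each class, the mean conditional probability of its three most
    co-occurring classes (self co-occurrence on the diagonal excluded)."""
    a = adjacency.clone().fill_diagonal_(0)
    return a.topk(3, dim=1).values.mean(dim=1)  # (num_classes,)
```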

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.