
Abstract
Counting objects of a specified type within images is a critical task in computer vision, applicable to domains such as surveillance, inventory management, and autonomous navigation. Accurately determining the number of instances of specific object categories in diverse and complex scenes remains a challenging problem, particularly due to variations in object appearances, occlusions, and background clutter.
In this work, we leverage the Contrastive Language-Image Pre-training (CLIP) model to develop a streamlined and effective object counting framework. Our approach utilizes CLIP's pre-trained image and text embeddings to create a comprehensive feature representation, which is then fed into a Multi-Layer Perceptron (MLP) regressor trained to predict object counts directly. Unlike traditional methods that depend on generating intermediate representations like heat maps or density estimations, our model operates entirely within the embedding space, simplifying the pipeline and reducing computational overhead. We train and evaluate our model on the COCO dataset, achieving competitive performance metrics in terms of Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and a Modified MAE (MMAE) designed to penalize off-by-one errors at small counts more heavily. Furthermore, we assess the model's generalization capabilities on the CountBench dataset, highlighting both its strengths and areas for improvement. By integrating residual connections and regularization techniques within the MLP architecture, we enhance the model's stability and generalization. Our findings demonstrate that leveraging CLIP's rich semantic embeddings provides a scalable and adaptable solution for object counting tasks.

Introduction
Object counting within images is a fundamental task in computer vision, integral to applications such as surveillance, inventory management, and autonomous systems. Accurately determining the number of instances of specific object categories in diverse and complex scenes remains a challenging problem, particularly due to variations in object appearances, occlusions, and background clutter.
Traditional approaches to object counting often involve generating intermediate representations, such as density maps or heat maps, which estimate object locations and counts. While effective, these methods typically require extensive computational resources and large amounts of annotated data, limiting their scalability and applicability to real-world scenarios.
In recent developments, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) have demonstrated remarkable capabilities in understanding and relating visual and textual information through shared embedding spaces. CLIP's ability to encode images and textual descriptions into semantically meaningful vectors enables a new paradigm for object counting: leveraging pre-trained embeddings for direct count prediction.
In this work, we propose a model that builds upon CLIP's powerful embedding representations to perform object counting. Our approach involves concatenating CLIP's image and text embeddings to form a comprehensive feature vector, which is then input into a Multi-Layer Perceptron (MLP) regressor trained to predict the count of specified objects directly. This method simplifies the counting pipeline by eliminating the need for intermediate representations and capitalizes on CLIP's extensive pre-training on diverse image-text pairs.
Additionally, we explore zero-shot counting capabilities by utilizing CLIP's text embeddings to handle object categories not seen during training. By designing our model to generalize across various object types without requiring explicit retraining, we aim to create a versatile and scalable counting solution. Our experiments on the COCO and CountBench datasets validate the effectiveness of our approach, demonstrating competitive performance and highlighting areas for future improvement.
Data Collection
For our data, we used the COCO 2017 train and validation sets [4]. COCO (Common Objects in Context) contains over 100,000 images of objects in a wide variety of settings, with categories ranging from everyday items such as fruits to broader nouns like person, umbrella, and toilet. Because COCO provides instance annotations, we were able to derive data of the form (image, object, count). We additionally tested our model on the CountBench dataset [2], whose object categories are completely different from COCO's, after modifying it. CountBench provides data of the form (image, caption, number); for example, a picture of headsets with the caption "We review the ten best gaming headsets in the market" and the number 10, since ten headsets appear. Our model, however, takes as input an image and a single word describing the object, not a full caption. We therefore modified CountBench by prompting GPT-4 through the OpenAI API to reduce each caption to its single most relevant word. Because this automated relabeling may introduce errors, we did not use the resulting dataset as our official validation set; instead, we validated on COCO val2017.
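The relabeling step can be scripted in a few lines. The sketch below assumes the official `openai` Python client; the exact prompt and the model name are illustrative rather than the configuration we used.

```python
# Sketch: reduce a CountBench caption to the single most relevant object word.
# Assumes the official `openai` Python client; prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_to_object(caption: str) -> str:
    """Ask the model for the one noun that names the counted object."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Reply with the single noun that names the counted object, nothing else."},
            {"role": "user", "content": caption},
        ],
    )
    return response.choices[0].message.content.strip().lower()

# Example: should yield something like "headsets"
print(caption_to_object("We review the ten best gaming headsets in the market"))
```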
Example Visualization: Counting Multiple Objects

Data Subset Selection and Balancing Counts
For training, we initially considered using every possible pair (image, object), with images and objects drawn from train2017. However, train2017 contains 18 GB of images and 80 object categories, so this was not viable; moreover, the vast majority of labels would then be zero, biasing the model toward predicting zero. We therefore took every pair (image, object) where the object actually appears in the image and added a fixed number of random data points with a count of zero. Since this still left a noticeable class imbalance, after some testing we rebalanced the training data to avoid biasing the model.
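A minimal sketch of this pair-construction step is shown below, assuming `pycocotools`; the annotation path and the zero-sample quota (sampled per image here, rather than as a fixed global amount) are illustrative.

```python
# Sketch: build (image_id, object_name, count) triples from COCO annotations and
# add some random zero-count pairs. Assumes `pycocotools`; the path and the
# per-image zero-sample quota are illustrative.
import random
from collections import Counter
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")
cats = {c["id"]: c["name"] for c in coco.loadCats(coco.getCatIds())}

samples = []  # (image_id, object_name, count)
for img_id in coco.getImgIds():
    ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=False)
    counts = Counter(ann["category_id"] for ann in coco.loadAnns(ann_ids))
    # positive pairs: only categories actually present in the image
    for cat_id, n in counts.items():
        samples.append((img_id, cats[cat_id], n))
    # one random zero-count pair per image (quota is illustrative)
    absent = [cid for cid in cats if cid not in counts]
    if absent:
        samples.append((img_id, cats[random.choice(absent)], 0))
```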
Count Distributions Before and After Balancing


Model and Method
Our model starts from the CLIP embeddings of the input image and the textual query. Given an input image and a corresponding query (e.g., "red cars"), we perform the following steps (a minimal sketch follows the list):
- Image Embedding: Use the CLIP image encoder to obtain an image embedding \( I \in \mathbb{R}^d \).
- Text Embedding: Use the CLIP text encoder to obtain a text embedding \( T \in \mathbb{R}^d \).
- Concatenation: Concatenate these embeddings \( X = [I; T] \in \mathbb{R}^{2d} \).
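The sketch below produces \( X \) for a single (image, query) pair, assuming OpenAI's `clip` package with a ViT-B/32 backbone; the backbone choice and the file path are illustrative.

```python
# Sketch: build the concatenated feature X = [I; T] for one (image, query) pair.
# Assumes OpenAI's `clip` package; backbone and image path are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["red cars"]).to(device)

with torch.no_grad():
    I = model.encode_image(image)   # shape (1, d)
    T = model.encode_text(text)     # shape (1, d)

X = torch.cat([I, T], dim=-1)       # shape (1, 2d), input to the MLP regressor
```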
We feed X into a customized Multi-Layer Perceptron (MLP) regressor. Our MLP uses five fully connected linear layers with a skip connection between them, as well as batch normalization and dropout (a sketch follows this list). After extensive testing, we arrived at this architecture, noting the following:
- Skip Connection: Greatly improved our validation accuracy. Intuitively this makes sense for a counting problem, where the model benefits from some form of "memory" of earlier features.
- Batch Normalization and Dropout: Employed to combat overfitting, as the model overfit the training data very quickly.
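The sketch below illustrates such an architecture in PyTorch; the hidden sizes, dropout rate, and exact placement of the skip connection are assumptions, not our exact configuration.

```python
# Sketch of the MLP regressor: five linear layers with batch normalization,
# dropout, and a residual (skip) connection. Hidden sizes and skip placement
# are assumptions, not the exact configuration described in the text.
import torch
import torch.nn as nn

class CountMLP(nn.Module):
    def __init__(self, in_dim: int = 1024, hidden: int = 512, p_drop: float = 0.3):
        super().__init__()
        self.input = nn.Sequential(nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU())
        # two hidden blocks whose output is added back to their input (residual path)
        self.block = nn.Sequential(
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop),
        )
        self.head = nn.Sequential(nn.Linear(hidden, hidden // 2), nn.ReLU(), nn.Linear(hidden // 2, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.input(x)
        h = h + self.block(h)            # skip connection
        return self.head(h).squeeze(-1)  # predicted count

model = CountMLP(in_dim=2 * 512)  # 2d for ViT-B/32 embeddings (d = 512)
```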
Training and Evaluation Metrics
RMSE, MAE, and MMAE Definitions
- RMSE: Defined as \( \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}} \), where \( y_i \) is the ground-truth count and \( \hat{y}_i \) is the predicted count. RMSE heavily penalizes large errors and is sensitive to outliers. It is a common metric in regression tasks.
- MAE: Defined as \( \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i|}{N} \), MAE measures the average absolute difference between predicted and actual values. It provides a more direct interpretation of the average magnitude of errors.
- MMAE (Modified MAE): We propose a modified MAE given by \( \text{MMAE} = \frac{\sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i + 1}}{N} \). MMAE weights the error by the scale of the count, penalizing off-by-one errors more heavily for smaller counts.
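For concreteness, the three definitions above translate directly into NumPy:

```python
# Sketch: the three evaluation metrics exactly as defined above.
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def mmae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # each absolute error is weighted by 1 / (y_true + 1), so small counts matter more
    return float(np.mean(np.abs(y_true - y_pred) / (y_true + 1)))
```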
Why Train on MMAE?
For most counting tasks, an off-by-one error when the true count is small (e.g., 1) is more critical than the same error when the count is large (e.g., 54). By training directly on MMAE, we emphasize the importance of precision at low object counts, encouraging the model to be more careful and accurate when counts are small while still performing well in denser scenarios.
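Like MAE, MMAE is differentiable almost everywhere and can therefore be optimized directly with gradient descent. A minimal PyTorch version of the loss is sketched below; how predictions are rounded at evaluation time is an implementation detail not specified here.

```python
# Sketch: MMAE as a differentiable PyTorch training loss.
import torch

def mmae_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return (torch.abs(target - pred) / (target + 1)).mean()

# Usage inside a standard training step (model and optimizer assumed):
# loss = mmae_loss(model(X), counts.float()); loss.backward(); optimizer.step()
```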
Experiments and Results
After training our model on the aforementioned training data, we conducted experiments on our validation data (COCO val2017), which in our case consisted of all pairs (image, object) in val2017 with an actual count greater than zero, plus 5,000 random zero-count samples. We additionally evaluated CLIP zero-shot on the same dataset: for each image and each object, we constructed ten prompts of the form "A photo of {i} {object}", taking care with singular versus plural forms, and selected the count whose prompt best matched the image.
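A sketch of this zero-shot baseline follows; the prompt template comes from the text, while the naive pluralization rule and the exact count range are assumptions.

```python
# Sketch of the CLIP zero-shot baseline: score one prompt per candidate count
# and return the count whose prompt best matches the image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def zero_shot_count(image_path: str, obj: str, max_count: int = 10) -> int:
    counts = list(range(1, max_count + 1))  # count range is an assumption
    prompts = [f"A photo of {i} {obj if i == 1 else obj + 's'}"  # naive pluralization
               for i in counts]
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)  # image-to-prompt similarities
        best = logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
    return counts[best]
```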
From the results in the tables below, we see that our CLIP + MLP pipeline is substantially better for lower counts (0, 1, 2, 3, 4) but performs similarly or worse for higher counts. We suspect this is due to fewer high-count samples in the training data. Nonetheless, overall accuracy improves significantly (by about 40 percentage points, from 0.14 to 0.53), suggesting that with enough training samples our model can generalize and improve counting performance.
We also tested generalization to objects not seen during training using the ModifiedCountBench dataset. Note that this dataset's labels are derived from caption summaries generated by GPT-4, so they may be noisy; still, it provides insight into how the model handles previously unseen objects. We find that our model does not generalize well to objects outside the training data, although this may partly reflect label noise in the modified dataset. We additionally report the confusion matrix for the subset of val2017, meaning the validation split we used during training (the tables below show results for the full validation set; we used the subset for performance reasons).

Per-count results on COCO val2017 for Model 1 (CLIP zero-shot) and Model 2 (CLIP + MLP):

| Count | CLIP Accuracy | CLIP MMAE | CLIP MAE | CLIP RMSE | CLIP+MLP Accuracy | CLIP+MLP MMAE | CLIP+MLP MAE | CLIP+MLP RMSE |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.03 | 3.23 | 3.23 | 3.36 | 0.73 | 0.39 | 0.39 | 0.79 |
| 1 | 0.22 | 1.32 | 2.64 | 3.79 | 0.59 | 0.30 | 0.59 | 0.94 |
| 2 | 0.17 | 0.86 | 2.59 | 3.44 | 0.42 | 0.25 | 0.75 | 1.01 |
| 3 | 0.13 | 0.60 | 2.41 | 3.00 | 0.27 | 0.27 | 1.07 | 1.31 |
| 4 | 0.12 | 0.52 | 2.59 | 3.01 | 0.14 | 0.31 | 1.53 | 1.83 |
| 5 | 0.06 | 0.45 | 2.72 | 3.14 | 0.06 | 0.38 | 2.29 | 2.53 |
| 6 | 0.13 | 0.43 | 3.00 | 3.69 | 0.02 | 0.44 | 3.06 | 3.29 |
| 7 | 0.11 | 0.42 | 3.36 | 4.12 | 0.01 | 0.47 | 3.77 | 4.00 |
| 8 | 0.06 | 0.45 | 4.03 | 4.86 | 0.02 | 0.52 | 4.70 | 4.99 |
| 9 | 0.07 | 0.47 | 4.70 | 5.66 | 0.01 | 0.56 | 5.63 | 5.89 |
| Total | 0.14 | 1.64 | 2.84 | 3.61 | 0.53 | 0.32 | 0.83 | 1.45 |
Per-count results on ModifiedCountBench (objects unseen during training) for Model 1 (CLIP zero-shot) and Model 2 (CLIP + MLP):

| Count | CLIP Accuracy | CLIP MMAE | CLIP MAE | CLIP RMSE | CLIP+MLP Accuracy | CLIP+MLP MMAE | CLIP+MLP MAE | CLIP+MLP RMSE |
|---|---|---|---|---|---|---|---|---|
| 2 | 0.40 | 0.80 | 2.40 | 3.70 | 0.45 | 0.24 | 0.71 | 0.89 |
| 3 | 0.32 | 0.33 | 1.32 | 2.12 | 0.21 | 0.28 | 1.12 | 1.36 |
| 4 | 0.20 | 0.39 | 1.96 | 2.58 | 0.18 | 0.33 | 1.64 | 1.92 |
| 5 | 0.10 | 0.33 | 2.00 | 2.59 | 0.08 | 0.37 | 2.24 | 2.52 |
| 6 | 0.27 | 0.24 | 1.67 | 2.11 | 0.04 | 0.46 | 3.21 | 3.47 |
| 7 | 0.15 | 0.20 | 1.60 | 1.90 | 0.06 | 0.48 | 3.87 | 4.26 |
| 8 | 0.06 | 0.21 | 1.87 | 2.13 | 0.04 | 0.50 | 4.51 | 4.90 |
| 9 | 0.26 | 0.15 | 1.47 | 1.99 | 0.04 | 0.53 | 5.33 | 5.70 |
| 10 | 0.29 | 0.17 | 1.86 | 2.50 | 0.02 | 0.54 | 5.95 | 6.31 |
| Total | 0.23 | 0.31 | 1.78 | 2.44 | 0.12 | 0.42 | 3.22 | 3.98 |
Discussion and Future Work
In conclusion, we substantially improved accuracy on the task of counting objects in an image for object categories seen during training, evaluated on unseen images. Further experiments are needed to assess generalization to objects outside the training set, since the labels in ModifiedCountBench may not be reliable.
Our model couples CLIP with an MLP regressor; however, other architectures may perform better. An MLP is not especially well suited to counting, and ideally we would incorporate a convolutional component to provide the inductive bias appropriate for counting objects in an image. These experiments are worth trying and might further improve accuracy.
Fun Results
Below is an overview of our model’s predictions across various images. The consolidated image displays multiple examples, highlighting both our predicted counts and the ground-truth counts.

References
- [1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... Sutskever, I. (2021). "Learning Transferable Visual Models From Natural Language Supervision." In Proceedings of the International Conference on Machine Learning (ICML). arXiv:2103.00020.
- [2] Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., & Dekel, T. (2023). "Teaching CLIP to Count to Ten." arXiv:2302.12066.
- [3] Xu, J., Le, H., & Samaras, D. (2023). "Zero-Shot Object Counting with Language-Vision Models." arXiv:2309.13097.
- [4] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). "Microsoft COCO: Common Objects in Context." In European Conference on Computer Vision (ECCV). arXiv:1405.0312.