
ZeroShot Image Classification CLIP
ZeroShot Image Classification CLIP is a machine learning and image processing task whose goal is to predict the class or label of an image using candidate labels that the model was never explicitly trained on.
Summary
Introduction
Zero-shot image classification is a technique in computer vision that allows the classification of images into predefined categories without the need for labeled training data specific to those categories. The model employs a ViT-B/32 Transformer architecture as its image encoder and utilizes a masked self-attention Transformer as its text encoder. Both encoders are trained to maximize the similarity between (image, text) pairs using a contrastive loss.
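As a rough illustration of the contrastive objective mentioned above, the sketch below is a simplified, hypothetical PyTorch fragment (not the actual training code): it computes a symmetric cross-entropy loss over the cosine similarities of a batch of matching image and text embeddings. The tensor names and the fixed temperature are assumptions for illustration only.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize both embedding sets so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix, scaled by a temperature (learned in CLIP, fixed here).
    logits = image_emb @ text_emb.t() / temperature
    # The i-th image matches the i-th text, so the target class is the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2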
Model Details
The CLIP model, created by OpenAI researchers, was built to study what contributes to robustness in computer vision tasks and to assess how well models generalize to arbitrary image classification tasks without task-specific training. It is essential to note that CLIP was not primarily designed for widespread model deployment. Before deploying models such as CLIP, researchers must thoroughly examine their capabilities in the particular context in which they are intended to be used.
Model Type
The model employs a ViT-B/32 Transformer architecture to encode images and a masked self-attention Transformer to encode text. These encoders are trained to enhance the similarity of (image, text) pairs through a contrastive loss mechanism.
Initially, the implementation offered two versions: one featuring a ResNet image encoder and the other utilizing a Vision Transformer. The version available in this repository utilizes the Vision Transformer for image encoding.
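For reference, here is a minimal sketch of how the ViT-B/32 variant can be loaded and queried with the Hugging Face transformers API. The openai/clip-vit-base-patch32 checkpoint name and the sample image path are assumptions and may differ from the weights bundled with this repository.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # assumed public checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                                        # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds the image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))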
Parameters
Inputs
image
- (image - .png|.jpg|.jpeg): The image provided by the user, which is to be classified.
labels
- (text): The candidate labels provided by the user, used to classify the input image.
Output
output
- (text): The model's prediction for each user-provided label, expressed as a probability.
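To illustrate this input/output contract, the snippet below (a sketch that assumes the public openai/clip-vit-base-patch32 checkpoint and a placeholder image path) calls the transformers zero-shot-image-classification pipeline and prints the per-label probabilities it returns.

from transformers import pipeline

pipe = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
result = pipe("example.jpg", candidate_labels=["a photo of a cat", "a photo of a dog"])
# result is a list of {"label": ..., "score": ...} entries, sorted by descending probability.
print(result)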
Examples
| image | labels | output |
| --- | --- | --- |
| *(example image)* | "a photo of a cat", "a photo of a dog" | *(predicted probability for each label)* |
Usage for developers
The following details describe the requirements and example code needed to run the model on our platform.
Requirements
torch
Pillow
transformers
numpy
Code based on AIOZ structure
import os
from pathlib import Path
from typing import Any, Literal, Union

import torch
from PIL import Image
from transformers import pipeline
...
def shot(image, labels_text, model):
    # Run the zero-shot-image-classification pipeline on the image with the
    # user-provided candidate labels and return its predictions.
    ...
def do_ai_task(
        image: Union[str, Path],
        labels: Union[str, Path],
        model_storage_directory: Union[str, Path],
        device: Literal["cpu", "cuda", "gpu"] = "cpu",
        *args, **kwargs) -> Any:
    """Define AI task: load model, pre-process, post-process, etc ..."""
    # Define AI task workflow. Below is an example.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Resolve the model weights inside the storage directory.
    path_weight = os.path.abspath(os.path.join(model_storage_directory, "..."))
    pipe = pipeline("zero-shot-image-classification",
                    model=path_weight,
                    device=device)
    image = Image.open(image)
    out = shot(image, labels, pipe)
    return str(out)
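A possible local invocation of the function above is sketched below. The file locations and the comma-separated label string are assumptions, since the actual label parsing inside shot is elided above and may differ on the platform.

# Hypothetical call; paths and label formatting depend on the platform setup.
prediction = do_ai_task(
    image="samples/cat.jpg",                       # placeholder image path
    labels="a photo of a cat, a photo of a dog",   # assumed comma-separated labels
    model_storage_directory="weights",             # placeholder weights directory
    device="cpu",
)
print(prediction)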
Reference
This repository is based on and inspired by the CLIP work developed by OpenAI. We sincerely thank them for sharing the code.
License
We respect and comply with the terms of the author's license cited in the Reference section.
Citation
@misc{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  year={2021},
  eprint={2103.00020},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}