
ZeroShot Image Classification CLIP
ZeroShot Image Classification CLIP is a machine learning and image processing task whose goal is to predict the class or label of an image using candidate labels that the model was never explicitly trained on.
Summary
Introduction
Zero-shot image classification is a technique in computer vision that allows the classification of images into predefined categories without the need for labeled training data specific to those categories. The model employs a ViT-B/32 Transformer architecture as its image encoder and utilizes a masked self-attention Transformer as its text encoder. Both encoders are trained to maximize the similarity between (image, text) pairs using a contrastive loss.
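As a rough illustration of the contrastive objective mentioned above, the sketch below is a simplified, hypothetical PyTorch fragment (not the actual training code): it computes a symmetric cross-entropy loss over the cosine similarities of a batch of matching image and text embeddings. The tensor names and the fixed temperature are assumptions for illustration only.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize both embedding sets so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix, scaled by a temperature (learned in CLIP, fixed here).
    logits = image_emb @ text_emb.t() / temperature
    # The i-th image matches the i-th text, so the target class is the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2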
Model Details
The CLIP model, created by OpenAI researchers, was built to study what contributes to robustness in computer vision tasks and to assess how well models generalize to arbitrary image classification tasks without task-specific training. It is essential to note that CLIP was not primarily designed for widespread model deployment. Before deploying models such as CLIP, researchers must thoroughly examine their capabilities in the particular context in which they are intended to be used.
Model Type
The model employs a ViT-B/32 Transformer architecture to encode images and a masked self-attention Transformer to encode text. These encoders are trained to enhance the similarity of (image, text) pairs through a contrastive loss mechanism.
Initially, the implementation offered two versions: one featuring a ResNet image encoder and the other utilizing a Vision Transformer. The version available in this repository utilizes the Vision Transformer for image encoding.
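For reference, here is a minimal sketch of how the ViT-B/32 variant can be loaded and queried with the Hugging Face transformers API. The openai/clip-vit-base-patch32 checkpoint name and the sample image path are assumptions and may differ from the weights bundled with this repository.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # assumed public checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                                        # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds the image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))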
Parameters
Inputs
image
- (image - .png|.jpg|.jpeg): The image provided by the user, which is to be classified.
labels
- (text): The candidate labels provided by the user, used to classify the input image.
Output
output
- (text): The model's prediction for each user-provided label, expressed as a probability.
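To illustrate this input/output contract, the snippet below (a sketch that assumes the public openai/clip-vit-base-patch32 checkpoint and a placeholder image path) calls the transformers zero-shot-image-classification pipeline and prints the per-label probabilities it returns.

from transformers import pipeline

pipe = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
result = pipe("example.jpg", candidate_labels=["a photo of a cat", "a photo of a dog"])
# result is a list of {"label": ..., "score": ...} entries, sorted by descending probability.
print(result)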
Examples
| image | labels | output |
| --- | --- | --- |
| *(example image)* | "a photo of a cat", "a photo of a dog" | *(predicted probability for each label)* |
Usage for developers
The following details describe the requirements and example code needed to run the model on our platform.
Requirements
torch
Pillow
transformers
numpy
Code based on AIOZ structure
import os
from pathlib import Path
from typing import Any, Literal, Union

import torch
from PIL import Image
from transformers import pipeline
...
def shot(image, labels_text, model):
    # Run the zero-shot-image-classification pipeline on the image with the
    # user-provided candidate labels and return its predictions.
    ...
def do_ai_task(
        image: Union[str, Path],
        labels: Union[str, Path],
        model_storage_directory: Union[str, Path],
        device: Literal["cpu", "cuda", "gpu"] = "cpu",
        *args, **kwargs) -> Any:
    """Define AI task: load model, pre-process, post-process, etc ..."""
    # Define AI task workflow. Below is an example.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Resolve the model weights inside the storage directory.
    path_weight = os.path.abspath(os.path.join(model_storage_directory, "..."))
    pipe = pipeline("zero-shot-image-classification",
                    model=path_weight,
                    device=device)
    image = Image.open(image)
    out = shot(image, labels, pipe)
    return str(out)
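A possible local invocation of the function above is sketched below. The file locations and the comma-separated label string are assumptions, since the actual label parsing inside shot is elided above and may differ on the platform.

# Hypothetical call; paths and label formatting depend on the platform setup.
prediction = do_ai_task(
    image="samples/cat.jpg",                       # placeholder image path
    labels="a photo of a cat, a photo of a dog",   # assumed comma-separated labels
    model_storage_directory="weights",             # placeholder weights directory
    device="cpu",
)
print(prediction)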
Reference
This repository is based on and inspired by the CLIP work developed by OpenAI. We sincerely thank them for sharing the code.
License
We respect and comply with the terms of the author's license cited in the Reference section.
Citation
@misc{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  year={2021},
  eprint={2103.00020},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}