
Dense Prediction for Vision Transformers
Dense prediction for Vision Transformers refers to applying Vision Transformers (ViTs) to dense prediction problems such as object detection, semantic segmentation, and depth estimation. Unlike image classification, which produces a single label per image, dense prediction requires a prediction for every pixel or region of the image.
Summary
Introduction
The Dense Prediction Transformer (DPT) model was trained on 1.4 million images for monocular depth estimation. It was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in the authors' repository.
DPT uses the Vision Transformer (ViT) as its backbone and extends it for monocular depth estimation by adding a "neck" and a "head" on top of the backbone.
This repository contains the "hybrid" version of the model described in the paper. DPT-Hybrid differs from plain DPT in that it uses ViT-hybrid as the backbone and takes some activations from that backbone.
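As a rough illustration of this backbone/neck/head decomposition, the sketch below loads a DPT depth-estimation checkpoint with the transformers library and prints its top-level submodules. The checkpoint name Intel/dpt-hybrid-midas and the attribute names dpt, neck, and head are assumptions based on the current transformers implementation and may differ across versions.
```python
# Minimal sketch, assuming the "Intel/dpt-hybrid-midas" checkpoint and the
# current transformers DPT implementation (attribute names may change).
from transformers import DPTForDepthEstimation

model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas")

print(type(model.dpt).__name__)   # ViT(-hybrid) backbone encoder
print(type(model.neck).__name__)  # "neck": reassembles tokens into image-like feature maps
print(type(model.head).__name__)  # "head": predicts one depth value per pixel
```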
Parameters
Inputs
- input - (image -.png|.jpg|.jpeg): The model takes one or more images, typically captured by a camera, as input. Images may be supplied in different formats and with various dimensions.
Output
- output - (image -.png): The output of the model is a dense depth map: an image containing an estimated relative depth value for every pixel of the input image(s). It conveys the relative distance of objects in the scene. The depth map can be visualized as a grayscale image; since the model predicts relative (inverse) depth, lighter regions correspond to closer objects and darker regions to farther objects.
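For concreteness, a caller could inspect the returned depth map along these lines (a minimal sketch; it assumes the map was saved as output.png, as in the usage code further below):
```python
# Minimal sketch: load the grayscale depth map produced by the usage code below.
# Values are relative depth rescaled to 0..255, not metric distances.
import numpy as np
from PIL import Image

depth = np.array(Image.open("output.png"))    # uint8 array of shape (height, width)
print(depth.shape, depth.min(), depth.max())  # e.g. (480, 640) 0 255
```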
Examples
| input | output |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| ![]() | ![]() |
Usage for developers
Below are the requirements and the code needed to run the model on our platform.
Requirements
torch
transformers
Pillow
numpy
Code based on AIOZ structure
from pathlib import Path
from typing import Any, Literal, Union

from PIL import Image
import numpy as np
import torch, os
from transformers import DPTImageProcessor, DPTForDepthEstimation
...

def do_ai_task(
        input: Union[str, Path],
        model_storage_directory: Union[str, Path],
        device: Literal["cpu", "cuda", "gpu"] = "cpu",
        *args, **kwargs) -> Any:
    # "gpu" is accepted as an alias for "cuda"
    device = "cuda" if device == "gpu" else device
    # path to the locally stored DPT checkpoint
    model_id = os.path.abspath(os.path.join(model_storage_directory, "..."))
    image_processor = DPTImageProcessor.from_pretrained(model_id)
    model = DPTForDepthEstimation.from_pretrained(model_id, low_cpu_mem_usage=True).to(device)

    image = Image.open(input).convert("RGB")

    # prepare image for the model
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs.to(device))
        predicted_depth = outputs.predicted_depth

    # interpolate to original size (PIL size is (width, height), so reverse it)
    prediction = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )

    # rescale the relative depth to 0..255 and save it as a grayscale image
    output = prediction.squeeze().cpu().numpy()
    formatted = (output * 255 / np.max(output)).astype("uint8")
    depth = Image.fromarray(formatted)
    depth.save("output.png")
    output = open("output.png", "rb")  # io.BufferedReader
    return output
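A hypothetical call to the function above might look like the following sketch; the file names and the model storage directory are placeholders, not files shipped with this repository.
```python
# Hypothetical example call to do_ai_task defined above; paths are placeholders.
result = do_ai_task(
    input="example.jpg",
    model_storage_directory="./model_storage",
    device="cpu",
)
with open("depth_result.png", "wb") as f:
    f.write(result.read())
result.close()
```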
Reference
This repository is based on and inspired by Intel's work. We sincerely appreciate their generosity in sharing the code.
License
We respect and comply with the terms of the original authors' license cited in the Reference section.
Citation
@article{DBLP:journals/corr/abs-2103-13413,
author = {Ren{\'{e}} Ranftl and
Alexey Bochkovskiy and
Vladlen Koltun},
title = {Vision Transformers for Dense Prediction},
journal = {CoRR},
volume = {abs/2103.13413},
year = {2021},
url = {https://arxiv.org/abs/2103.13413},
eprinttype = {arXiv},
eprint = {2103.13413},
timestamp = {Wed, 07 Apr 2021 15:31:46 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2103-13413.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}