April 10, 2023

A Breakdown of the Segment Anything Model (SAM)

One of the latest breakthroughs in computer vision (CV) technology is Meta AI's Segment Anything Model (SAM). SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images. In this article, we'll cover SAM's model and approach, explore its usability for generating training data, and consider how it will impact the industry.

Let's dive in!

Background

Developing zero-shot auto-segmentation models has long been a challenge in the CV industry. The main difficulty is providing semantically rich unsupervised pre-training data.

Enter SAM, a game-changing AI model that combines two classes of approaches, interactive segmentation and automatic segmentation, into a single model. SAM's promptable segmentation system enables it to perform both types of segmentation with ease. The model learns from millions of images and masks collected through a model-in-the-loop "data engine," which produced a training dataset of 11M images and 1B+ masks.

Segment Anything Model

The architecture of the model comprises three main components:
1) image encoder
2) prompt encoder
3) mask decoder

Image Encoder

The image encoder is a transformer-based featurizer that runs once per image and compresses the input into a 256x64x64 embedding. Because this embedding is computed once and reused for every prompt, image featurization stays computationally efficient.
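To make the 256x64x64 figure concrete, here is a small shape calculation. It is only a sketch: the 1024x1024 input resolution and the 16x16 patch size are assumptions taken from the SAM paper, not from this article, and `embedding_shape` is a hypothetical helper rather than part of any SAM API.

```python
# Assumed values (from the SAM paper, not stated in this article):
# a 1024x1024 RGB input and a 16x16-pixel patch size.
INPUT_SIZE = 1024
PATCH_SIZE = 16
EMBED_CHANNELS = 256  # channel dimension quoted in the article


def embedding_shape(input_size: int, patch_size: int, channels: int) -> tuple:
    """Return (channels, height, width) of the image embedding grid."""
    if input_size % patch_size != 0:
        raise ValueError("input size must be a multiple of the patch size")
    grid = input_size // patch_size  # one embedding cell per image patch
    return (channels, grid, grid)


shape = embedding_shape(INPUT_SIZE, PATCH_SIZE, EMBED_CHANNELS)
input_values = 3 * INPUT_SIZE * INPUT_SIZE            # RGB pixel values
embedding_values = shape[0] * shape[1] * shape[2]     # embedding entries

print(shape)                            # (256, 64, 64)
print(input_values / embedding_values)  # 3.0
```

Under these assumptions the embedding holds a third as many values as the raw image, arranged on a 64x64 grid that the decoder can attend over.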

Prompt Encoder

The prompt encoder converts the prompts that guide segmentation into embedding vectors in real time. There are two types of prompts: sparse (bounding boxes, points, or text) and dense (masks). Points and boxes are represented by positional encodings, while text prompts (unreleased) are represented by text encodings from CLIP.
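As a rough illustration of how a point prompt can become a positional encoding, the sketch below maps a pixel coordinate to sine and cosine features of random frequencies. It is a simplified stand-in, not SAM's implementation: the helper names, the fixed seed, and the 256-dimensional output are assumptions made for this example, and any learned per-prompt-type embeddings are omitted.

```python
import math
import random


def make_frequency_matrix(num_feats: int, seed: int = 0) -> list:
    """Sample a fixed 2 x num_feats Gaussian matrix (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_feats)] for _ in range(2)]


def encode_point(x: float, y: float, width: int, height: int, freqs: list) -> list:
    """Map a pixel coordinate to a vector of sin/cos features."""
    # Normalize the coordinate to [0, 1], then shift it to [-1, 1].
    coords = [2.0 * (x / width) - 1.0, 2.0 * (y / height) - 1.0]
    num_feats = len(freqs[0])
    # Project the 2D coordinate onto each random frequency.
    projected = [
        2.0 * math.pi * (coords[0] * freqs[0][j] + coords[1] * freqs[1][j])
        for j in range(num_feats)
    ]
    # Concatenate sine and cosine halves into one embedding vector.
    return [math.sin(p) for p in projected] + [math.cos(p) for p in projected]


freqs = make_frequency_matrix(num_feats=128)
embedding = encode_point(512, 384, width=1024, height=1024, freqs=freqs)
print(len(embedding))  # 256
```

A box prompt could be handled the same way by encoding its two corner points, which is one reason points and boxes fit naturally into a single sparse-prompt encoder.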

Mask Decoder

The compressed image embedding and the prompt embeddings are then passed to the lightweight mask decoder, which predicts the segmentation masks. The decoder block updates all embeddings through prompt self-attention and bidirectional cross-attention, where attention is applied from the prompt to the image embedding and vice versa.
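The attention pattern described here can be sketched with toy vectors. This is a minimal single-head version with no learned projections, layer norms, or MLPs, so it should be read as an illustration of the data flow rather than SAM's decoder; the token values and the `attend` helper are made up for the example.

```python
import math


def softmax(row: list) -> list:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: list, keys_values: list) -> list:
    """Scaled dot-product attention with a residual connection."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        updated.append([a + b for a, b in zip(q, mixed)])  # residual add
    return updated


# Toy embeddings: 2 prompt tokens and 3 image tokens, 4 dimensions each.
prompt_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

prompt_tokens = attend(prompt_tokens, prompt_tokens)  # prompt self-attention
prompt_tokens = attend(prompt_tokens, image_tokens)   # prompt -> image
image_tokens = attend(image_tokens, prompt_tokens)    # image -> prompt

print(len(prompt_tokens), len(image_tokens))  # 2 3
```

The point of the two cross-attention steps is that both sets of embeddings end up conditioned on each other before masks are predicted.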

This approach significantly reduces the computational burden of the system, with the image encoder taking only 0.15 seconds for inference on an NVIDIA A100 GPU.
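A back-of-the-envelope cost model shows why a one-time image embedding matters when an annotator issues many prompts on the same image. Only the 0.15 second encoder figure comes from the text; the per-prompt decode time below is a hypothetical placeholder, and `total_seconds` is an illustrative helper.

```python
ENCODE_SECONDS = 0.15  # image encoder time quoted in the article (NVIDIA A100)


def total_seconds(num_prompts: int, decode_seconds: float, reuse_embedding: bool) -> float:
    """Estimate the time to answer `num_prompts` prompts on a single image."""
    if reuse_embedding:
        # Encode the image once, then run only the decoder for each prompt.
        return ENCODE_SECONDS + num_prompts * decode_seconds
    # Without reuse, every prompt pays the full encoding cost again.
    return num_prompts * (ENCODE_SECONDS + decode_seconds)


DECODE_SECONDS = 0.05  # hypothetical placeholder, not a measured value

with_reuse = total_seconds(20, DECODE_SECONDS, reuse_embedding=True)
without_reuse = total_seconds(20, DECODE_SECONDS, reuse_embedding=False)
print(round(with_reuse, 2), round(without_reuse, 2))  # 1.15 4.0
```

The gap grows linearly with the number of prompts, which is why the heavy encoder is kept out of the interactive loop.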

Putting SAM to the test

To understand SAM’s usability for generating training data, we launched a mock AV project on SUPA with SAM’s sample dataset and compared the outputs.

What stood out was that the model’s “segment everything” approach would sometimes over-segment subparts of an object, such as the door handles and windows of the car in the example above. This might be because the model relies heavily on local image features to make segmentation decisions, rather than considering the larger context of the image to capture its spatial relationships. The model also does not currently generate labels.

However, this approach has a lot of potential for boosting labeling productivity. The model can generate masks for human reviewers to validate and correct, which is more efficient than annotating the images from scratch.

In SUPA’s validation workflow, we perform further processing on the model output to make it useful by:

Grouping instances for the same object
Adjusting segmentations based on the ML objective and edge cases
Assigning classes
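The steps above can be sketched in a few lines. This is an illustrative toy, not SUPA's actual pipeline: masks are modeled as sets of pixel coordinates, the 0.9 containment threshold is an arbitrary assumption, and the class names stand in for decisions a human reviewer would make.

```python
def containment(inner: set, outer: set) -> float:
    """Fraction of `inner` pixels that also lie inside `outer`."""
    return len(inner & outer) / len(inner) if inner else 0.0


def group_subparts(masks: list, threshold: float = 0.9) -> list:
    """Merge each mask into the first larger mask that mostly contains it."""
    order = sorted(range(len(masks)), key=lambda i: len(masks[i]), reverse=True)
    groups = []  # merged pixel sets, largest first
    for i in order:
        for group in groups:
            if containment(masks[i], group) >= threshold:
                group |= masks[i]  # absorb the subpart into its parent object
                break
        else:
            groups.append(set(masks[i]))  # no parent found: new object
    return groups


def rect(x0: int, y0: int, x1: int, y1: int) -> set:
    """Build a rectangular toy mask as a set of (x, y) pixels."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Toy "model output": a car, two of its subparts, and a separate sign.
car, window, handle = rect(0, 0, 10, 6), rect(2, 1, 5, 3), rect(6, 3, 7, 4)
sign = rect(20, 0, 23, 3)

objects = group_subparts([window, car, handle, sign])
reviewer_classes = ["car", "traffic_sign"]  # hypothetical human-assigned labels
labeled = [{"class": c, "area": len(m)} for c, m in zip(reviewer_classes, objects)]
print(labeled)  # [{'class': 'car', 'area': 60}, {'class': 'traffic_sign', 'area': 9}]
```

In practice the grouping rule and threshold would depend on the ML objective, for example whether windows should stay part of the car or be labeled separately.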

Conclusion

The zero-shot capabilities of Segment Anything represent a significant breakthrough in image segmentation. However, at SUPA, we recognize that even the most advanced AI models cannot replace the need for human involvement in the data labeling process.

While SAM can greatly enhance the efficiency and accuracy of data labeling, there will always be a need for human validation to ensure that the output meets the specific needs and objectives of each ML project.

At SUPA, we are committed to providing the human expertise necessary to validate, refine, and optimize the output of AI models. We believe that the combination of advanced AI models and a human-in-the-loop system is the key to unlocking the full potential of machine learning.

Have questions or want to discuss a model validation project? Contact our team to learn more.