Text prompt? · Issue #4 · facebookresearch/segment-anything

We believe directly using text features from CLIP is not a good idea, because CLIP's explainability is poor: its text features can match regions with the opposite semantics, which leads to wrong results.

This is our work on CLIP's explainability: https://github.com/xmed-lab/CLIP_Surgery

And we can see that CLIP's self-attention links irrelevant regions, with serious noisy activations across labels.
[fig1, fig2: visualizations of CLIP's noisy self-attention across labels]

We suggest using the corrected heatmap to generate points to replace the manual input points.
Below are the similarity map from CLIP's raw prediction and the corresponding results on SAM.
[fig3, fig4: CLIP similarity map and resulting SAM segmentation]
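The similarity map itself is just the cosine similarity between each image patch embedding and the text embedding, reshaped into a 2D grid. A minimal NumPy sketch of that idea (this is not the actual CLIP Surgery code; `patch_feats`, `text_feat`, and the grid size are illustrative inputs you would obtain from a CLIP encoder):

```python
import numpy as np

def similarity_map(patch_feats, text_feat, hw):
    """Cosine similarity between each patch feature and a text feature,
    reshaped into a 2D heatmap over the patch grid.

    patch_feats: (N, D) patch embeddings, N = hw[0] * hw[1]
    text_feat:   (D,) text embedding
    hw:          (height, width) of the patch grid
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sim = p @ t            # (N,) cosine similarities in [-1, 1]
    return sim.reshape(hw)  # 2D heatmap over the patch grid

# toy example: a 4x4 grid of 8-dim random "patch features"
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
text = rng.normal(size=8)
hm = similarity_map(feats, text, (4, 4))
print(hm.shape)  # (4, 4)
```

In practice the heatmap is upsampled to the image resolution before points are sampled from it.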

Besides, it's very simple: just use the original CLIP, without any fine-tuning or extra supervision. It's also an alternative to the text->box->mask pipeline, and it requires the least training and supervision cost.
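The heatmap-to-points step suggested above can be sketched as follows. This is a generic illustration, not the CLIP Surgery implementation: it thresholds a normalized heatmap and keeps the top peaks as foreground point prompts in the `(x, y)` coordinate / label format that SAM's `SamPredictor.predict` expects.

```python
import numpy as np

def heatmap_to_points(heatmap, threshold=0.8, max_points=5):
    """Pick peak locations from a similarity heatmap as point prompts.

    heatmap: 2D array of similarity scores.
    Returns (coords, labels): coords is an (N, 2) array of (x, y) pixel
    coordinates, labels are all 1 (foreground), matching SAM's prompt format.
    """
    # min-max normalize so the threshold is comparable across images
    h = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    ys, xs = np.where(h >= threshold)
    scores = h[ys, xs]
    order = np.argsort(-scores)[:max_points]  # strongest peaks first
    coords = np.stack([xs[order], ys[order]], axis=1)  # (x, y) order for SAM
    labels = np.ones(len(coords), dtype=int)
    return coords, labels

# toy heatmap with a single hot spot at (x=3, y=2)
heat = np.zeros((5, 5))
heat[2, 3] = 1.0
coords, labels = heatmap_to_points(heat, threshold=0.9)
print(coords, labels)  # [[3 2]] [1]
```

The resulting `coords` and `labels` would then be passed to SAM in place of manually clicked points.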