Read the full paper:
SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation
Now on arXiv → arxiv.org/abs/2505.21795
🔹 4/4 – Promptable segmentation in action: SANSA reduces reliance on costly pixel-level masks by supporting point, box, and scribble prompts,
enabling fast, scalable annotation with minimal supervision.
See the qualitative results 👇
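For context, here is a minimal sketch of what point- and box-prompted mask prediction looks like with the public SAM2 image predictor API (the checkpoint id and inputs are placeholders, and SANSA's own prompt interface, e.g. for scribbles, may differ):

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a pretrained SAM2 predictor (checkpoint id is illustrative).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder RGB image
predictor.set_image(image)

# A single positive point click (label 1 = foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=False,
)

# The same call accepts a box prompt in XYXY pixel coordinates.
masks, scores, _ = predictor.predict(box=np.array([100, 80, 540, 400]))
```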
🔹 3/4 – SANSA achieves state-of-the-art results in few-shot segmentation. We outperform specialist and foundation-based methods across various benchmarks:
📈 +9.3% mIoU on LVIS-92i
⚡ 3× faster than prior work
💡 Only 234M parameters (4-5× smaller than competitors)
🔹 2/4 – Unlocking semantic structure
SAM2 features are rich, but optimized for tracking.
🔧 Insert bottleneck adapters into frozen SAM2
These restructure the feature space to disentangle semantics
Result: features cluster semantically, even for unseen classes (see PCA 👇)
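A minimal sketch of the bottleneck-adapter idea, assuming standard down-project/up-project residual adapters; dimensions, placement, and initialization are illustrative, not the exact SANSA design:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual down-project -> nonlinearity -> up-project adapter."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Only the adapters train; the SAM2 backbone stays frozen.
tokens = torch.randn(1, 196, 768)   # stand-in for frozen SAM2 features
adapter = BottleneckAdapter(dim=768)
refined = adapter(tokens)           # restructured, semantics-friendly features
```

Zero-initializing the up-projection makes each adapter a no-op at the start of training, so the frozen features are undisturbed until the adapters learn something useful.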
As #CVPR2025 week kicks off, meet SANSA: Semantically AligNed Segment Anything 2
We turn SAM2 into a semantic few-shot segmenter:
🧠 Unlocks latent semantics in frozen SAM2
✏️ Supports any prompt: fast and scalable annotation
📦 No extra encoders
github.com/ClaudiaCutta...
#ICCV2025
I guess merging the events could also work. I wonder whether cricket players would be better at computer vision than CV researchers are at cricket, or vice versa.
To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition
Davide Sferrazza, @berton-gabri.bsky.social, Gabriele Trivigno, Carlo Masone
tl;dr: nowadays, global descriptors often outperform local feature matching methods on simple datasets.
arxiv.org/abs/2504.06116
✨ SAMWISE achieves state-of-the-art performance across multiple #RVOS benchmarks, while being the smallest model in RVOS! 🎯 It also sets a new #SOTA in image-level referring #segmentation. With only 4.9M trainable parameters, it runs #online and requires no fine-tuning of SAM2
Contributions:
🔹 Textual Prompts for SAM2: Early fusion of visual-text cues via a novel adapter (see the sketch below)
🔹 Temporal Modeling: Essential for video understanding, beyond frame-by-frame object tracking
🔹 Tracking Bias: Correcting tracking bias in SAM2 for text-aligned object discovery
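A rough sketch of the early-fusion idea from the first contribution, assuming a cross-attention adapter where visual tokens attend to text tokens; dimensions and placement are illustrative, not SAMWISE's exact module:

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Visual tokens attend to text tokens (early visual-text fusion)."""
    def __init__(self, vis_dim: int = 768, txt_dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        txt = self.txt_proj(txt_tokens)              # align text to visual width
        fused, _ = self.attn(vis_tokens, txt, txt)   # queries: visual; keys/values: text
        return self.norm(vis_tokens + fused)         # residual keeps the backbone intact

vis = torch.randn(2, 1024, 768)  # stand-in for frozen SAM2 visual tokens
txt = torch.randn(2, 12, 512)    # stand-in for frozen text-encoder tokens
out = CrossModalAdapter()(vis, txt)
```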
🔥 Our paper SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation is accepted as a #Highlight at #CVPR2025!
We make #SegmentAnything wiser, enabling it to understand textual prompts, training only 4.9M parameters! 🧠
💻 Code, models & demo: github.com/ClaudiaCutta...
Why SAMWISE? 👇
To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition
Davide Sferrazza, @berton-gabri.bsky.social, @gabtriv.bsky.social, Carlo Masone
tl;dr: VPR datasets are saturating; re-ranking isn't always worth it; image matching yields uncertainty estimates, with inlier counts serving as a confidence measure.
arxiv.org/abs/2504.06116
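A minimal sketch of the inlier-counts-as-confidence idea, using off-the-shelf OpenCV SIFT + RANSAC as a stand-in for the matchers studied in the paper:

```python
import cv2
import numpy as np

def match_confidence(img_query, img_candidate, ratio: float = 0.75) -> int:
    """RANSAC inlier count between two grayscale images as a confidence score."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_query, None)
    kp2, des2 = sift.detectAndCompute(img_candidate, None)
    if des1 is None or des2 is None:
        return 0
    pairs = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    # Lowe's ratio test to keep only distinctive matches.
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 4:  # a homography needs at least 4 correspondences
        return 0
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return 0 if inlier_mask is None else int(inlier_mask.sum())
```

A low inlier count then flags an unreliable retrieval, which is the confidence signal the tl;dr refers to.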
Paper Release!
Curious about image retrieval and contrastive learning? We present:
"All You Need to Know About Training Image Retrieval Models"
The most comprehensive retrieval benchmark: thousands of experiments across 4 datasets, dozens of losses, batch sizes, learning rates, data labeling strategies, and more!
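As an illustration of the kind of losses such a benchmark sweeps over, here is a minimal symmetric InfoNCE contrastive loss on L2-normalized embeddings (temperature and sizes are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def info_nce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: row i of emb_a and emb_b form the positive pair."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                  # cosine similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))  # random stand-in batch
```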
Trying to convince my bluesky feed to put me in the Computer Vision community. Right now I only see posts about the orange-haired president. @berton-gabri.bsky.social @gabrigole.bsky.social how did you do it?
Image segmentation doesn't have to be rocket science. 🚀
Why build a rocket engine full of bolted-on subsystems when one elegant unit does the job? 💡
That's what we did for segmentation.
Meet the Encoder-only Mask Transformer (EoMT): tue-mps.github.io/eomt (CVPR 2025)
(1/6)
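A toy sketch of the encoder-only idea, assuming learnable mask queries are simply concatenated with patch tokens, processed by plain self-attention blocks, and read out as query-patch similarities; every detail here is illustrative, see the project page for the real design:

```python
import torch
import torch.nn as nn

class EncoderOnlyMaskHead(nn.Module):
    """Learnable queries share one encoder with patch tokens; no extra decoder."""
    def __init__(self, dim: int = 768, num_queries: int = 100, depth: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b, q = patch_tokens.size(0), self.queries.size(0)
        x = torch.cat([self.queries.unsqueeze(0).expand(b, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)                # joint self-attention, one single stack
        q_out, p_out = x[:, :q], x[:, q:]
        # Per-query mask logits over patches via dot-product similarity.
        return torch.einsum("bqd,bnd->bqn", q_out, p_out)

logits = EncoderOnlyMaskHead()(torch.randn(2, 196, 768))  # shape (2, 100, 196)
```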
Went outside today and thought this would be perfect for my first #bluesky post