Built by:
Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
TU Eindhoven, Polytechnic of Turin, RWTH Aachen University
#ComputerVision #DeepLearning #ViT #ImageSegmentation #EoMT #CVPR2025
(6/6)
31.03.2025 20:35
Your ViT is Secretly an Image Segmentation Model (CVPR 2025)
CVPR 2025: EoMT shows ViTs can segment efficiently and effectively without adapters or decoders.
Segmentation, simplified.
We're excited to see what you build on top of it.
Project: tue-mps.github.io/eomt
Paper: arxiv.org/abs/2503.19108
Code: github.com/tue-mps/eomt
Models: huggingface.co/tue-mps
(5/6)
Why does EoMT work?
Large ViTs pre-trained on rich visual data (like DINOv2) can learn the inductive biases needed for segmentation, with no extra components required.
EoMT removes the clutter and lets the ViT do it all.
(4/6)
How fast can segmentation get while still maintaining accuracy?
EoMT achieves an optimal trade-off between accuracy (panoptic quality, PQ) and speed (FPS) on COCO, thanks to its simple encoder-only design.
No complex additional components.
No bottlenecks.
Just performance.
(3/6)
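For readers unfamiliar with the accuracy metric cited above: panoptic quality (PQ) scores a panoptic segmentation by combining how well matched segments overlap with how many segments are missed or hallucinated. A minimal sketch of the standard PQ formula (predicted and ground-truth segments count as matched when their IoU exceeds 0.5; this is the generic metric definition, not code from the EoMT repository):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute PQ from the IoUs of matched (true-positive) segment pairs,
    plus counts of unmatched predictions (FP) and unmatched ground truths (FN).

    PQ = sum(IoU over matches) / (TP + 0.5 * FP + 0.5 * FN)
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0

# Illustrative numbers: 3 matched segments, 1 false positive, 1 false negative.
pq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
print(pq)  # 2.4 / 4.0 = 0.6
```

Higher is better; a perfect result (all segments matched with IoU 1.0, no FP/FN) gives PQ = 1.0.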
How do modern segmentation models work?
They chain together complex components:
ViT → Adapter → Pixel Decoder → Transformer Decoder…
EoMT removes them all.
It keeps only the ViT and adds a few query tokens that guide it to predict masks; no decoder needed.
(2/6)
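The query-token idea above can be sketched in a few lines: learnable query tokens are simply concatenated to the ViT's patch tokens, processed jointly by the remaining ViT blocks, and each query's mask logits are its dot product with the patch tokens. A minimal numpy illustration of that mechanism (not the authors' implementation; dimensions, names, and the single bare attention layer are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, Q = 64, 196, 8                          # embed dim, 14x14 patches, 8 queries

patch_tokens = rng.standard_normal((P, D))    # stand-in for earlier ViT block output
query_tokens = rng.standard_normal((Q, D))    # learnable queries (random here)

def self_attention(x, d):
    # One plain self-attention layer; real ViT blocks add projections, MLPs,
    # norms, and residual connections.
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Queries and patches attend jointly inside the ViT -- no separate decoder.
tokens = self_attention(np.concatenate([patch_tokens, query_tokens]), D)
patches_out, queries_out = tokens[:P], tokens[P:]

# Per-query mask logits over the patch grid: just a dot product.
mask_logits = queries_out @ patches_out.T
print(mask_logits.shape)  # (8, 196) -> one 14x14 mask per query
```

The point of the sketch: once queries live inside the same token sequence as the patches, ordinary ViT self-attention does the work that adapters and decoders did before.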
Image segmentation doesn't have to be rocket science.
Why build a rocket engine full of bolted-on subsystems when one elegant unit does the job?
That's what we did for segmentation.
Meet the Encoder-only Mask Transformer (EoMT): tue-mps.github.io/eomt (CVPR 2025)
(1/6)