
@sta8is.bsky.social

30 Followers  |  39 Following  |  17 Posts  |  Joined: 07.02.2025

Latest posts by sta8is.bsky.social on Bluesky

Post image

1/n Introducing ReDi (Representation Diffusion): a new generative approach that leverages a diffusion model to jointly capture
– Low-level image details (via VAE latents)
– High-level semantic features (via DINOv2)🧡

25.04.2025 07:23 — 👍 21    🔁 3    💬 1    📌 1
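The joint-modeling idea from the ReDi post above can be sketched in a few lines. This is a toy illustration, not the paper's code: the shapes, the noise schedule, and the channel-wise concatenation are all assumptions. The point is that one DDPM-style forward process corrupts low-level VAE latents and high-level DINOv2 features together, so a denoiser trained on the joint representation must capture both.

```python
import numpy as np

def joint_forward_diffusion(vae_latent, dino_feat, t, alphas_cumprod, rng):
    """Noise a joint [VAE latent; DINOv2 feature] representation at timestep t.

    Concatenating both along the channel axis means a single diffusion
    process jointly corrupts (and a model would jointly denoise) low-level
    image detail and high-level semantics.
    """
    z = np.concatenate([vae_latent, dino_feat], axis=-1)  # joint representation
    noise = rng.standard_normal(z.shape)
    a_bar = alphas_cumprod[t]
    # Standard DDPM forward step: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps
    z_t = np.sqrt(a_bar) * z + np.sqrt(1.0 - a_bar) * noise
    return z_t, noise

rng = np.random.default_rng(0)
alphas_cumprod = np.linspace(0.999, 0.01, 1000)  # toy noise schedule
vae_latent = rng.standard_normal((16, 4))  # e.g. 16 tokens, 4 latent channels
dino_feat = rng.standard_normal((16, 8))   # e.g. 16 tokens, 8 semantic channels
z_t, eps = joint_forward_diffusion(vae_latent, dino_feat, t=500,
                                   alphas_cumprod=alphas_cumprod, rng=rng)
```

At large `t` the noise term dominates both halves equally, which is what makes the two representations share one generative process in this sketch.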
Preview
Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified a...

📄 Check out our paper at arxiv.org/abs/2501.08303 and 🖥️ code at github.com/Sta8is/FUTUR... to learn more about FUTURIST and its applications in autonomous systems! (9/n)
Joint work with @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social and Nikos Komodakis

26.02.2025 19:57 — 👍 0    🔁 0    💬 0    📌 0
Post image

🚀 The architecture shows significant performance improvements with extended training, indicating substantial headroom for future gains (8/n)

26.02.2025 19:57 — 👍 0    🔁 0    💬 1    📌 0
Post image

💡 Our multimodal approach significantly outperforms single-modality variants, demonstrating the power of learning cross-modal relationships (7/n)

26.02.2025 19:57 — 👍 0    🔁 0    💬 1    📌 0
Post image

📈 Results are impressive! We achieve state-of-the-art performance in future semantic segmentation on Cityscapes, with strong improvements in both short-term (0.18s) and mid-term (0.54s) predictions (6/n)

26.02.2025 19:57 — 👍 0    🔁 0    💬 1    📌 0
Post image

🎭 Key innovation #3: We developed a novel multimodal masked visual modeling objective specifically designed for future prediction tasks (5/n)

26.02.2025 19:57 — 👍 0    🔁 0    💬 1    📌 0
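The masked-modeling objective from the post above can be illustrated with a minimal sketch (assumed shapes and an MSE reconstruction target, not the paper's actual loss): future tokens are randomly hidden, and the loss counts only the positions that were masked.

```python
import numpy as np

def masked_future_loss(pred_tokens, target_tokens, mask):
    """MSE computed only where future tokens were masked.

    mask[i] == True marks a hidden future token; the model is penalized
    only on those positions, as in masked visual modeling.
    """
    err = (pred_tokens - target_tokens) ** 2
    per_token = err.mean(axis=-1)   # average error per token
    return per_token[mask].mean()   # average only over masked tokens

rng = np.random.default_rng(1)
target = rng.standard_normal((10, 16))      # 10 future tokens, 16-dim each
mask = np.array([True] * 6 + [False] * 4)   # hide 6 of the 10 future tokens
pred = target.copy()
pred[mask] += 0.1                           # imperfect predictions on masked slots
loss = masked_future_loss(pred, target, mask)
```

Errors on unmasked tokens contribute nothing, which is what steers the model toward inferring hidden future content rather than copying visible inputs.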
Post image

🔗 Key innovation #2: Our model features an efficient cross-modality fusion mechanism that improves predictions by learning synergies between different modalities (segmentation + depth) (4/n)

26.02.2025 19:57 — 👍 0    🔁 0    💬 1    📌 0
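One common way to realize cross-modality fusion like the mechanism described above is a cross-attention step in which one modality's tokens query the other's. This is a generic sketch under assumed shapes, not the model's actual fusion block (which would add projections, heads, and normalization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(seg_tokens, depth_tokens):
    """Fuse two modality streams with one cross-attention step.

    Segmentation tokens (queries) attend over depth tokens (keys/values),
    so each segmentation token is refined by geometric evidence, with a
    residual connection preserving the original signal.
    """
    d = seg_tokens.shape[-1]
    attn = softmax(seg_tokens @ depth_tokens.T / np.sqrt(d))  # (N_seg, N_depth)
    return seg_tokens + attn @ depth_tokens                   # residual update

rng = np.random.default_rng(2)
seg = rng.standard_normal((16, 32))    # 16 segmentation tokens, 32-dim
depth = rng.standard_normal((16, 32))  # 16 depth tokens, 32-dim
fused = cross_modal_fusion(seg, depth)
```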
Post image

🎯 Key innovation #1: We introduce a VAE-free hierarchical tokenization process integrated directly into our transformer. This simplifies training, reduces computational overhead, and enables true end-to-end optimization (3/n)

26.02.2025 19:57 — 👍 0    🔁 0    💬 1    📌 0
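The VAE-free tokenization described above can be pictured as patch embeddings taken at several scales directly from pixels, so the whole pipeline is differentiable end to end. The patch sizes and shapes below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def patchify(img, p):
    """Split an HxWxC image into (H/p * W/p) flattened p x p patches."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def hierarchical_tokens(img, patch_sizes=(16, 8)):
    """Hierarchical, VAE-free tokenization sketch: patches at several scales,
    taken directly from pixels. With no pretrained VAE stage, the token
    embeddings can be optimized jointly with the transformer."""
    return [patchify(img, p) for p in patch_sizes]

rng = np.random.default_rng(3)
img = rng.standard_normal((32, 32, 3))
coarse, fine = hierarchical_tokens(img)  # 4 coarse tokens, 16 fine tokens
```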
Post image

πŸ” FUTURIST employs a multimodal visual sequence transformer to directly predict multiple future semantic modalities. We focus on two key modalities: semantic segmentation and depth estimationβ€”critical capabilities for autonomous systems operating in dynamic environments (2/n)

26.02.2025 19:57 — 👍 0    🔁 0    💬 1    📌 0
Video thumbnail

🧡 Excited to share our latest work: FUTURIST, a unified transformer architecture for multimodal semantic future prediction, has been accepted to #CVPR2025! Here's how it works (1/n)
👇 Links to the arXiv and GitHub below

26.02.2025 19:57 — 👍 5    🔁 1    💬 1    📌 1
Post image

1/n 🚀 If you’re working on generative image modeling, check out our latest work! We introduce EQ-VAE, a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models. 👇

18.02.2025 14:26 — 👍 18    🔁 8    💬 1    📌 1
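The equivariance idea behind EQ-VAE can be demonstrated with a toy penalty (a sketch under strong assumptions, not the paper's loss: the "encoder" here is just pooling, and the transform is a 90° rotation): the regularizer punishes any gap between encoding a transformed image and transforming the encoding.

```python
import numpy as np

def avg_pool(x):
    """Toy stand-in for an encoder: 2x2 average pooling."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def corner_pool(x):
    """A deliberately non-equivariant 'encoder': keep only the
    top-left pixel of each 2x2 block."""
    return x[::2, ::2]

def equivariance_penalty(encode, x, transform=np.rot90):
    """EQ-VAE-style regularizer sketch: mean squared gap between
    encode(transform(x)) and transform(encode(x)). Zero iff the encoder
    commutes with the transform, i.e. the latent is equivariant to it."""
    return ((encode(transform(x)) - transform(encode(x))) ** 2).mean()

rng = np.random.default_rng(4)
x = rng.standard_normal((8, 8))
eq = equivariance_penalty(avg_pool, x)      # average pooling commutes with rot90
neq = equivariance_penalty(corner_pool, x)  # corner sampling does not
```

Average pooling already commutes with 90° rotations, so its penalty vanishes; the corner sampler does not, so the penalty is positive. Adding such a term to VAE training pushes a learned encoder toward the first behavior.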
Preview
DINO-Foresight: Looking into the Future with DINO Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and ...

8/n 💡 Our work shows that by leveraging the semantic power of VFMs, we create more efficient and effective future prediction systems.

📄 Paper: arxiv.org/abs/2412.11673
🖥️ Code available at: github.com/Sta8is/DINO-...
Joint work with @ikakogeorgiou.bsky.social, @spyrosgidaris.bsky.social, N. Komodakis

07.02.2025 17:05 — 👍 2    🔁 0    💬 0    📌 0
Post image

7/n 🔬 Interesting discovery: The intermediate features from our transformer can actually enhance the already-strong VFM features, suggesting potential for self-supervised learning.

07.02.2025 17:05 — 👍 2    🔁 0    💬 1    📌 0
Post image

6/n 📊 And it works amazingly well! We achieve state-of-the-art results in semantic segmentation forecasting, with strong performance across multiple tasks using a single feature prediction model.

07.02.2025 17:05 — 👍 2    🔁 0    💬 1    📌 0
Post image

5/n 🎨The beauty of our method? It's completely modular - different task-specific heads (segmentation, depth estimation, surface normals) can be plugged in without retraining the core model.

07.02.2025 17:05 — 👍 1    🔁 0    💬 1    📌 0
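The modular-heads design from the post above can be sketched as a dictionary of independent task heads reading the same predicted features. Everything here is an assumption for illustration (linear heads, 64-dim features, the class counts); the point is only that adding a head never touches the shared predictor.

```python
import numpy as np

class LinearHead:
    """A task head: one linear map from shared predicted features.

    Heads are independent of each other, so a new task can be plugged in
    without retraining the shared (frozen) feature-prediction core.
    """
    def __init__(self, d_in, d_out, rng):
        self.W = rng.standard_normal((d_in, d_out)) * 0.01

    def __call__(self, feats):
        return feats @ self.W

rng = np.random.default_rng(5)
shared_feats = rng.standard_normal((16, 64))  # predicted features from the frozen core
heads = {
    "segmentation": LinearHead(64, 19, rng),  # e.g. 19 Cityscapes classes
    "depth": LinearHead(64, 1, rng),
    "normals": LinearHead(64, 3, rng),
}
outputs = {task: head(shared_feats) for task, head in heads.items()}
```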
Post image

4/n 🔄 Our approach: We train a masked feature transformer to predict how VFM features change over time. These predicted features can then be used for various scene understanding tasks!

07.02.2025 17:05 — 👍 1    🔁 0    💬 1    📌 0
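The training setup described above can be sketched as follows (assumed shapes, and a zero placeholder standing in for a learned mask embedding; not the actual data pipeline): the future frame's VFM features are hidden behind mask tokens, and the transformer's job is to reconstruct them from the earlier frames.

```python
import numpy as np

MASK = 0.0  # stand-in for a learned [MASK] embedding

def build_training_pair(feature_seq):
    """Given per-frame VFM features of shape [T, N, D], hide the last
    frame's features behind mask tokens and return (inputs, targets).

    A masked feature transformer would consume `inputs` and be trained
    to reconstruct `targets`, i.e. the future frame's features.
    """
    inputs = feature_seq.copy()
    inputs[-1] = MASK          # mask every token of the future frame
    targets = feature_seq[-1]  # ground-truth future features
    return inputs, targets

rng = np.random.default_rng(6)
feats = rng.standard_normal((4, 16, 32))  # 4 frames, 16 tokens, 32-dim features
inp, tgt = build_training_pair(feats)
```

Because the targets are features rather than pixels, the loss is spent on semantics instead of appearance, which is the efficiency argument made earlier in the thread.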

3/n 🧩Why is this important? Most existing approaches focus on pixel-level prediction, which wastes computation on irrelevant visual details. We focus directly on meaningful semantic features!

07.02.2025 17:05 — 👍 1    🔁 0    💬 1    📌 0

2/n 🎯Our key insight: Instead of predicting future RGB frames directly, we can forecast how semantic features from Vision Foundation Models (VFMs) evolve over time.

07.02.2025 17:05 — 👍 1    🔁 0    💬 1    📌 0
Post image

1/n 🚀 Excited to share our latest work: DINO-Foresight, a new framework for predicting the future states of scenes using Vision Foundation Model features!
Links to the arXiv and GitHub 👇

07.02.2025 17:05 — 👍 20    🔁 3    💬 2    📌 1
