
Hafez Ghaemi

@hafezghm.bsky.social

Ph.D. Student @mila-quebec.bsky.social and @umontreal.ca, AI Researcher

60 Followers  |  44 Following  |  11 Posts  |  Joined: 27.03.2025

Latest posts by hafezghm.bsky.social on Bluesky

Thrilled to see this work accepted at NeurIPS!

Kudos to @hafezghm.bsky.social for the heroic effort in demonstrating the efficacy of seq-JEPA in representation learning from multiple angles.

#MLSky 🧠🤖

19.09.2025 18:46 · 👍 18    🔁 4    💬 1    📌 0

Excited to share that seq-JEPA has been accepted to NeurIPS 2025!

19.09.2025 18:02 · 👍 15    🔁 2    💬 2    📌 2
Preview
seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
Current self-supervised algorithms mostly rely on transformations such as data augmentation and masking to learn visual representations. This is achieved by inducing invariance or equivariance with re...

Huge thanks to my supervisors and co-authors @neuralensemble.bsky.social and @shahabbakht.bsky.social!

Check out the full paper here: 📄 arxiv.org/abs/2505.03176

💻 Code coming soon!
📬 DM me if you'd like to chat or discuss the paper!

(10/10)

14.05.2025 12:52 · 👍 9    🔁 0    💬 0    📌 0

Interestingly, seq-JEPA shows path integration capabilities – an important research problem in neuroscience. By observing a sequence of views and their corresponding actions, it can integrate the path connecting the initial view to the final view.

(9/10)

14.05.2025 12:52 · 👍 5    🔁 0    💬 1    📌 0
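
The path-integration claim above can be probed with a simple readout. Below is a minimal, hypothetical sketch (not the paper's evaluation code) of one way to test it: fit a linear regressor that maps the aggregate representation of an observed sequence to the cumulative relative transformation linking the first and last views. All array names, shapes, and the placeholder data are illustrative assumptions.

```python
# Hypothetical probe for path integration (illustrative only; names/shapes are
# assumptions, not the authors' evaluation code).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for quantities a trained model would provide:
#   agg_reps[i] : aggregate representation of sequence i (random placeholders here)
#   actions[i]  : per-step relative actions of sequence i, shape (T, action_dim)
n_seqs, T, action_dim, rep_dim = 1000, 4, 2, 256
actions = rng.normal(size=(n_seqs, T, action_dim))
agg_reps = rng.normal(size=(n_seqs, rep_dim))

# Path-integration target: the cumulative relative transformation from the
# first to the last view (for additive actions, simply the sum of the steps).
path = actions.sum(axis=1)                      # shape (n_seqs, action_dim)

X_tr, X_te, y_tr, y_te = train_test_split(agg_reps, path, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("path-decoding R^2:", probe.score(X_te, y_te))
# With real seq-JEPA aggregate representations, a high R^2 would indicate the
# integrated path is linearly decodable; random placeholders give roughly 0.
```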

Thanks to action conditioning, the visual backbone encodes rotation information that can be decoded from its representations, while the transformer encoder aggregates the different rotated views, reduces intra-class variation (caused by rotations), and produces a semantic object representation.

(8/10)

14.05.2025 12:52 · 👍 3    🔁 0    💬 1    📌 0
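
One simple way to quantify the "reduces intra-class variation" claim is to compare the within-object spread of per-view (backbone) embeddings against that of aggregate embeddings. The helper below is a hypothetical sketch of such a measurement; the embedding arrays are random placeholders standing in for a trained model's outputs.

```python
# Hypothetical invariance measurement: average within-class variance of
# embeddings (placeholders stand in for real backbone / aggregate outputs).
import numpy as np

def mean_within_class_variance(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Average, over classes, of the per-dimension variance within each class."""
    total = 0.0
    classes = np.unique(labels)
    for c in classes:
        total += embeddings[labels == c].var(axis=0).mean()
    return total / len(classes)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=500)            # object identity per view/sequence
backbone_reps = rng.normal(size=(500, 256))       # per-view, action-equivariant reps
aggregate_reps = rng.normal(size=(500, 256))      # sequence-level, aggregated reps

# For a trained seq-JEPA, one would expect the aggregate representations to show
# markedly lower within-object variance than the per-view backbone representations.
print("backbone :", mean_within_class_variance(backbone_reps, labels))
print("aggregate:", mean_within_class_variance(aggregate_reps, labels))
```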

On the 3D Invariant-Equivariant Benchmark (3DIEBench), where each object view has a different rotation, seq-JEPA achieves top performance on both invariance-related object categorization and equivariance-related rotation prediction, without sacrificing one for the other.

(7/10)

14.05.2025 12:52 · 👍 2    🔁 0    💬 1    📌 0
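
For context, results like these are usually obtained with frozen-feature probes: a linear classifier for the invariance-related task and a regressor for the equivariance-related task. The sketch below illustrates that generic protocol with placeholder features; it is not the paper's evaluation code, and 3DIEBench loading, class counts, and rotation parameterization are assumptions.

```python
# Illustrative linear-probe protocol (placeholder features; not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
agg_reps = rng.normal(size=(n, 256))      # aggregate reps  -> object categorization
view_reps = rng.normal(size=(n, 256))     # per-view reps   -> rotation prediction
categories = rng.integers(0, 50, size=n)  # object class labels (class count is a placeholder)
rotations = rng.normal(size=(n, 4))       # rotation target, e.g. a quaternion (assumed)

# Invariance-related probe: linear classification on the aggregate representation.
Xtr, Xte, ytr, yte = train_test_split(agg_reps, categories, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("top-1 classification accuracy:", clf.score(Xte, yte))

# Equivariance-related probe: rotation regression on the per-view representations.
Xtr, Xte, ytr, yte = train_test_split(view_reps, rotations, random_state=0)
reg = Ridge(alpha=1.0).fit(Xtr, ytr)
print("rotation prediction R^2:", reg.score(Xte, yte))
```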

Seq-JEPA learns invariant-equivariant representations for tasks that involve sequential observations and transformations; e.g., it can learn semantic image representations by seeing a sequence of small image patches across simulated eye movements, without hand-crafted augmentations or masking.

(6/10)

14.05.2025 12:52 · 👍 4    🔁 0    💬 1    📌 0
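
To make the eye-movement setting concrete, here is a small, hypothetical sketch of how such a training sequence could be built: sample fixation points on an image, crop a small patch ("glimpse") at each one, and use the relative displacement between consecutive fixations as the action. The function name, patch size, and sampling scheme are illustrative assumptions, not the paper's data pipeline.

```python
# Hypothetical glimpse-sequence generation for simulated eye movements
# (illustrative; not the paper's data pipeline).
import torch

def sample_glimpse_sequence(image: torch.Tensor, num_glimpses: int = 4, patch: int = 16):
    """Crop `num_glimpses` patches at random fixation points and return the
    patches together with the relative displacement (action) between fixations."""
    _, H, W = image.shape
    ys = torch.randint(0, H - patch + 1, (num_glimpses,))
    xs = torch.randint(0, W - patch + 1, (num_glimpses,))
    patches = torch.stack([image[:, y:y + patch, x:x + patch] for y, x in zip(ys, xs)])
    fixations = torch.stack([ys, xs], dim=1).float()
    # Relative actions: displacement from each fixation to the next (first is zero).
    actions = torch.zeros(num_glimpses, 2)
    actions[1:] = fixations[1:] - fixations[:-1]
    return patches, actions   # patches: (T, C, patch, patch); actions: (T, 2)

img = torch.rand(3, 64, 64)                  # a dummy RGB image
glimpses, acts = sample_glimpse_sequence(img)
print(glimpses.shape, acts.shape)            # torch.Size([4, 3, 16, 16]) torch.Size([4, 2])
```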

Post-training, the model has learned two segregated representations:

• An action-invariant aggregate representation
• Action-equivariant individual-view representations

💡 No explicit equivariance loss or dual predictor required!

(5/10)

14.05.2025 12:52 · 👍 5    🔁 0    💬 1    📌 0
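
To make the "no extra loss terms" point concrete: in a JEPA-style setup the only training signal is a regression loss between the predicted embedding of the unseen view and the embedding produced by a target encoder (commonly an EMA copy, with gradients stopped). The snippet below is a generic sketch of that objective under those assumptions; the exact loss and target-encoder details used in seq-JEPA may differ.

```python
# Generic JEPA-style predictive objective (a sketch under common assumptions;
# the exact loss / target-encoder scheme in seq-JEPA may differ).
import torch
import torch.nn.functional as F

def predictive_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Regression loss between the predicted and actual embedding of the unseen
    view; the target comes from a stop-gradient (e.g., EMA) target encoder."""
    return F.mse_loss(pred, target.detach())

@torch.no_grad()
def ema_update(target_encoder, online_encoder, momentum: float = 0.996):
    """Exponential moving average update of the target encoder's parameters."""
    for t, o in zip(target_encoder.parameters(), online_encoder.parameters()):
        t.mul_(momentum).add_(o, alpha=1.0 - momentum)

# usage sketch: loss = predictive_loss(predictor_output, target_encoder(unseen_view))
```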

Inspired by this, we designed seq-JEPA, which processes sequences of views and their relative transformations (actions).

➡️ A transformer encoder aggregates these action-conditioned view representations to predict a yet-unseen view.

(4/10)

14.05.2025 12:52 · 👍 4    🔁 1    💬 1    📌 0
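
Since this post describes the model's structure, here is a minimal, self-contained sketch of that kind of architecture: a shared backbone encodes each view, view embeddings are conditioned on their actions, a transformer encoder aggregates them, and a predictor regresses the embedding of a yet-unseen view given the action leading to it. Module names, sizes, and the exact conditioning scheme are assumptions for illustration, not the authors' implementation.

```python
# A minimal, illustrative sketch of a seq-JEPA-style forward pass
# (module names, sizes, and conditioning scheme are assumptions).
import torch
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    def __init__(self, dim=256, action_dim=4, n_heads=4, n_layers=2):
        super().__init__()
        # hypothetical backbone: a tiny CNN standing in for e.g. a ResNet
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
        self.action_embed = nn.Linear(action_dim, dim)        # embeds relative actions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # aggregate-readout token
        self.predictor = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, views, actions, next_action):
        # views: (B, T, 3, H, W); actions: (B, T, action_dim); next_action: (B, action_dim)
        B, T = views.shape[:2]
        z = self.backbone(views.flatten(0, 1)).view(B, T, -1)   # per-view (equivariant) reps
        tokens = z + self.action_embed(actions)                 # action-conditioned view tokens
        tokens = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1)
        agg = self.aggregator(tokens)[:, 0]                     # aggregate (invariant) rep
        pred = self.predictor(torch.cat([agg, self.action_embed(next_action)], dim=-1))
        return z, agg, pred                                     # pred targets the unseen view's rep

# usage: compare `pred` to the embedding of the actually observed unseen view
model = SeqJEPASketch()
views = torch.randn(2, 4, 3, 64, 64)
actions = torch.randn(2, 4, 4)
next_action = torch.randn(2, 4)
z, agg, pred = model(views, actions, next_action)
print(z.shape, agg.shape, pred.shape)   # (2, 4, 256) (2, 256) (2, 256)
```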

🧠 Humans learn to recognize new objects by moving around them, manipulating them, and probing them via eye movements. Different views of a novel object are generated through actions (manipulations & eye movements) that are then integrated to form new concepts in the brain.

(3/10)

14.05.2025 12:52 · 👍 3    🔁 0    💬 1    📌 0

Current SSL methods face a trade-off: optimizing for transformation invariance in representation space (useful for high-level classification) often reduces equivariance (needed for detail-sensitive tasks such as object rotation & movement). Our world model, seq-JEPA, resolves this trade-off.

(2/10)

14.05.2025 12:52 · 👍 5    🔁 0    💬 1    📌 0

Preprint Alert 🚀

Can we simultaneously learn transformation-invariant and transformation-equivariant representations with self-supervised learning?

TL;DR: Yes! This is possible via simple predictive learning & architectural inductive biases – without extra loss terms and predictors!

🧵 (1/10)

14.05.2025 12:52 · 👍 51    🔁 16    💬 1    📌 5
