ColBERT (a.k.a. multi-vector, late-interaction) models are extremely strong search models, often outperforming dense embedding models. And @lightonai.bsky.social just released a new state-of-the-art one: GTE-ModernColBERT-v1!
Details in 🧵
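For readers new to late interaction: instead of compressing a query and a document into one vector each, a ColBERT-style model keeps one embedding per token and scores a pair by summing, for each query token, its best match among the document tokens (MaxSim). A minimal sketch of that scoring step, with random tensors standing in for real model outputs (the function name is just for illustration):

```python
import torch

def maxsim_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score between one query and one document.

    query_embeddings: (num_query_tokens, dim), doc_embeddings: (num_doc_tokens, dim),
    both assumed L2-normalized so the dot product is a cosine similarity.
    """
    similarities = query_embeddings @ doc_embeddings.T  # token-to-token similarity matrix
    return similarities.max(dim=1).values.sum()         # best document match per query token, summed

# Toy example with random (normalized) token embeddings instead of real model outputs.
query = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
document = torch.nn.functional.normalize(torch.randn(100, 128), dim=-1)
print(maxsim_score(query, document))
```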
30.04.2025 15:27
I'm a big fan of the PyLate project for ColBERT models, and I'm glad to see these strong models coming out. Very nice work by the @lightonai.bsky.social folks, especially @nohtow.bsky.social.
Learn more about PyLate here: lightonai.github.io/pylate/
30.04.2025 15:27
As per usual, thanks to my dear co-maintainer @raphaelsty.bsky.social for helping me make PyLate what it is 🫶
30.04.2025 14:42
In addition to knowledge distillation, we recently added features for large-scale contrastive pre-training. This model was released by popular demand, but we are currently running heavier training, so stay tuned!
30.04.2025 14:42
PyLate makes downstream usage easy, but it also facilitates training!
You can reproduce this SOTA training in under 80 lines of code and about 2 hours of training; it will run NanoBEIR during training, report the results to W&B, and create an informative model card!
Link to the gist: gist.github.com/NohTow/3030f...
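As a rough idea of what such a run looks like (not the exact gist; see the link above), here is a sketch following the knowledge-distillation pattern from the PyLate documentation. The dataset id, base checkpoint, and hyperparameters are illustrative assumptions, and the NanoBEIR evaluator wired in by the gist is omitted here:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from pylate import losses, models, utils

# Knowledge-distillation data: queries/documents plus teacher scores (illustrative dataset id).
train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")
# Resolve query/document ids to texts on the fly.
train.set_transform(utils.KDProcessing(queries=queries, documents=documents).transform)

# Start from a ModernBERT-based encoder and train it as a ColBERT (multi-vector) model.
model = models.ColBERT(model_name_or_path="Alibaba-NLP/gte-modernbert-base")

args = SentenceTransformerTrainingArguments(
    output_dir="output/gte-moderncolbert",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    bf16=True,
    learning_rate=8e-5,   # illustrative value, not necessarily the one used in the gist
    report_to=["wandb"],  # logs to W&B as mentioned in the post
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train,
    loss=losses.Distillation(model=model),
    data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
)
trainer.train()
```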
30.04.2025 14:42
Besides, it also comes with the 8k context window of ModernBERT, which is very useful given that late-interaction models generalize very well to longer contexts, as highlighted in the ModernBERT paper.
It is thus well suited to handling your very long documents!
30.04.2025 14:42
It is also the first model to outperform ColBERT-small on BEIR
While it is bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!
Also, it has only been trained on MS MARCO (for late interaction) and should thus generalize pretty well!
30.04.2025 14:42
lightonai/GTE-ModernColBERT-v1 · Hugging Face
Model link: huggingface.co/lightonai/GT...
GTE-ModernColBERT is trained on top of the GTE-ModernBERT model using knowledge distillation on the MS MARCO dataset and is the first SOTA model trained using PyLate!
Get started with PyLate using the documentation:
lightonai.github.io/pylate/
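For downstream usage, PyLate also handles indexing and retrieval. A minimal sketch following the pattern in the PyLate documentation (the index folder/name and the toy documents are arbitrary placeholders):

```python
from pylate import indexes, models, retrieve

# Load the released checkpoint as a PyLate ColBERT model.
model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Build an index and add multi-vector document embeddings to it.
index = indexes.Voyager(index_folder="pylate-index", index_name="demo", override=True)
documents_ids = ["doc-1", "doc-2"]
documents_embeddings = model.encode(
    [
        "ColBERT scores queries and documents token by token.",
        "ModernBERT brings an 8k context window to encoders.",
    ],
    is_query=False,
)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

# Retrieve with late-interaction (MaxSim) scoring.
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["what is late interaction?"], is_query=True)
print(retriever.retrieve(queries_embeddings=queries_embeddings, k=2))
```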
30.04.2025 14:42
Among all those LLM releases, here is an important retrieval release:
To overcome the limitations of the awesome ModernBERT-based dense models, today @lightonai.bsky.social is releasing GTE-ModernColBERT, the very first state-of-the-art late-interaction (multi-vector) model trained using PyLate!
30.04.2025 14:42
lightonai/modernbert-embed-large · Hugging Face
ModernBERT-embed-large is released under Apache 2.0 and is available on Hugging Face:
huggingface.co/lightonai/mo...
Congrats to @nohtow.bsky.social for this great work!
14.01.2025 16:42
When I saw the release of ModernBERT-embed during the holidays, I knew I had to build the large variant, so I wanted to thank Zach Nussbaum from Nomic AI for building and sharing it (as well as all the nomic-embed tools and data) and bearing with me during the training!
14.01.2025 15:32
ModernBERT-embed-large not only enables usage of ModernBERT-large out-of-the-box, but it should also be a very good starting point for strong fine-tuning on various tasks, so I can't wait to see what the community will build on top of it!
14.01.2025 15:32
Obviously, it comes at a slightly higher cost, but it is also trained with Matryoshka capabilities to reduce the footprint of embeddings
Notably, the performance at dimension 256 is only slightly worse than that of the base version at its full 768 dimensions
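A minimal sketch of using that Matryoshka property via Sentence Transformers' truncate_dim, assuming the model follows the same nomic-style "search_query:"/"search_document:" prefixes as ModernBERT-embed-base (check the model card for the exact usage):

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 dimensions of each embedding (Matryoshka truncation).
model = SentenceTransformer("lightonai/modernbert-embed-large", truncate_dim=256)

embeddings = model.encode(
    [
        "search_query: what is late interaction?",
        "search_document: ColBERT compares queries and documents token by token.",
    ]
)
print(embeddings.shape)  # (2, 256) instead of the model's full output dimension
```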
14.01.2025 15:32
Model link: huggingface.co/lightonai/mo...
ModernBERT-embed-large is trained using the same (two-stage) recipe as its smaller sibling and, as expected, improves performance, gaining +1.22 points on the MTEB average
14.01.2025 15:32
ModernBERT-embed-base is awesome because it lets you use ModernBERT-base for various tasks out-of-the-box
But the large variant of ModernBERT is also awesome...
So today, @lightonai.bsky.social is releasing ModernBERT-embed-large, the larger and more capable iteration of ModernBERT-embed!
14.01.2025 15:32
Today, LightOn releases ModernBERT, a SOTA model for retrieval and classification.
This work was performed in collaboration with Answer.ai and the model was trained on Orange Business Cloud Avenue infrastructure.
www.lighton.ai/lighton-blog...
19.12.2024 16:53
This week we released ModernBERT, the first encoder to reach SOTA on most common benchmarks across language understanding, retrieval, and code, while running twice as fast as DeBERTaV3 on short context and three times faster than NomicBERT & GTE on long context.
22.12.2024 06:12
When one evaluates log-likelihood of a sequence of length L via the chain rule of probability, the first term has missingness fraction of 1, the second has missingness of (L-1)/L, etc. So the inference-time masking rate is ~ Uniform[0, 1].
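Spelled out, the chain-rule argument is:

$$\log p(x_{1:L}) \;=\; \sum_{i=1}^{L} \log p\!\left(x_i \mid x_{<i}\right),$$

where the $i$-th factor sees only $i-1$ of the $L$ tokens, so its fraction of unknown ("masked") positions is $\frac{L-i+1}{L}$. As $i$ runs from $1$ to $L$ this takes the values $1, \frac{L-1}{L}, \dots, \frac{1}{L}$, i.e. approximately $\mathrm{Uniform}[0,1]$ for large $L$.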
20.12.2024 19:52
Oh, I see!
Having also worked a lot on causal models, I never thought of this kind of modelling because I always contrasted MLM with open-ended generation
I guess with papers such as this one arxiv.org/pdf/2406.04823, I should reconsider!
Very interesting perspective, thanks!
20.12.2024 22:23
Could you elaborate?
Or give me pointers?
Is it because having a fixed value biases the learning w.r.t. the way we will sample downstream? (Like not masking 30% of the target?)
20.12.2024 07:59
But there is definitely some digging to be done to find an optimal strategy in this regard
To me, the logic would be to ramp it up, to get a kick-off signal first and then make it harder and harder, but papers seem to say otherwise
Maybe random is the optimal solution!
19.12.2024 22:44
Not really, we considered ramping the masking ratio up/down, but the findings from the literature (at least what we read at the time) seemed counter-intuitive, with no clear consensus
We ended up not really digging much into this particular aspect, again because we had so much to explore
19.12.2024 22:43
Definitely!
Again, the original goal of the project (besides cool models) was to convince some researchers to spend a bit of their GPU hours on encoder pre-training again!
Hopefully we nailed it and will have the answers to a lot of questions in the future!
19.12.2024 22:40
As you can see in the paper, we did _a lot_ of ablations (with even more that didn't actually make it into the paper), but there is still so much to be explored; we hope ModernBERT will encourage people to explore these avenues!
19.12.2024 21:12
We did not try to go beyond this vocab size because:
First, for small encoders, this is a non-negligible part of the total parameter count
Second, we discussed training our own tokenizer but at some point lacked the time, and I am not sure we considered existing tokenizers with a bigger vocab size
19.12.2024 21:11
Thanks!
As mentioned in the paper, we tested the tokenizers from OLMo, BERT, RoBERTa, and Llama
They were all pretty much equivalent, except Llama, which underperformed. We went with OLMo because of the distribution it was trained on and its code tokens
19.12.2024 21:07
BERT is BACK! I joined a collaboration with AnswerAI and LightOn to bring you the next iteration of BERT.
Introducing ModernBERT: 16x larger sequence length, better downstream performance (classification, retrieval), the fastest & most memory efficient encoder on the market.
🧵
19.12.2024 16:41
I'll get straight to the point.
We trained 2 new models. Like BERT, but modern. ModernBERT.
Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff.
It's much faster, more accurate, has longer context, and is more useful. 🧵
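As a quick illustration of that out-of-the-box "workhorse" usage, a minimal masked-language-modelling sketch via the Transformers fill-mask pipeline, assuming the answerdotai/ModernBERT-base checkpoint and a transformers version recent enough to include ModernBERT support:

```python
from transformers import pipeline

# ModernBERT is a masked language model, so fill-mask works directly.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

# Print the top 3 predicted tokens for the masked position.
for prediction in fill_mask("Paris is the [MASK] of France.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```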
19.12.2024 16:45