
@danielmisrael.bsky.social

29 Followers  |  269 Following  |  9 Posts  |  Joined: 19.11.2024

Latest posts by danielmisrael.bsky.social on Bluesky

Pretty shocking vulnerability that the community should be aware of!

11.03.2025 23:18 — 👍 3    🔁 0    💬 0    📌 0
Preview
Enabling Autoregressive Models to Fill In Masked Tokens
Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently ...

Please check out the rest of the paper! We also cover how MARIA can be used for test-time scaling, how to initialize MARIA weights for efficient training, how MARIA representations differ, and more…

arxiv.org/abs/2502.06901

Thanks to my advisors Aditya Grover and @guyvdb.bsky.social

14.02.2025 00:28 — 👍 1    🔁 0    💬 0    📌 0
Post image

MARIA 1B achieves the best throughput numbers, and MARIA 7B achieves similar throughput to DiffuLlama but better samples, as previously noted. Here we see that ModernBERT, despite being much smaller, does not scale well for masked infilling because it cannot use a KV cache.

14.02.2025 00:28 — 👍 0    🔁 0    💬 1    📌 0
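A back-of-envelope illustration of why the missing KV cache matters for throughput (my own arithmetic sketch, not a figure from the paper): with a cache, step t of generation only attends over the t positions stored so far, while an encoder-only model must re-run attention over the whole sequence at every infilling step.

```python
# Rough attention-cost comparison for infilling L tokens one at a time
# (illustrative arithmetic only, not the paper's throughput measurements).

def kv_cached_cost(L: int) -> int:
    # With a KV cache, step t attends over the t cached positions,
    # so total attention work grows like L^2 / 2.
    return sum(t for t in range(1, L + 1))

def re_encode_cost(L: int) -> int:
    # Without a cache, every step re-runs full self-attention over all
    # L positions, so total work grows like L^3.
    return L * L * L

for L in (128, 1024, 4096):
    print(L, re_encode_cost(L) / kv_cached_cost(L))  # ratio grows ~ 2L
```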
Post image

We perform infilling on downstream data with 50 percent of words masked. Using GPT-4o-mini as a judge, we compute Elo scores for each model. MARIA 7B and 1B have the highest Elo ratings under the Bradley-Terry model.

14.02.2025 00:28 — 👍 0    🔁 0    💬 1    📌 0
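For anyone unfamiliar with the rating setup above, here is a minimal Bradley-Terry fit in Python. The win counts and model names are made up for illustration; only the fitting procedure and the Elo-style rescaling are the point, this is not the paper's judge data.

```python
import numpy as np

models = ["model-A", "model-B", "model-C", "model-D"]  # hypothetical names
# wins[i, j] = number of pairwise judgments where model i beat model j
wins = np.array([
    [0, 6, 9, 10],
    [4, 0, 8,  9],
    [1, 2, 0,  6],
    [0, 1, 4,  0],
], dtype=float)

strength = np.ones(len(models))
for _ in range(200):  # standard minorization-maximization updates
    total_wins = wins.sum(axis=1)
    pair_games = wins + wins.T
    denom = (pair_games / (strength[:, None] + strength[None, :])).sum(axis=1)
    strength = total_wins / denom
    strength /= strength.sum()

# Map strengths onto an Elo-like scale (offset and base are arbitrary).
elo = 400 * np.log10(strength / strength.mean()) + 1000
for name, rating in sorted(zip(models, elo), key=lambda t: -t[1]):
    print(f"{name}: {rating:.0f}")
```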
Post image

MARIA achieves far better perplexity than using ModernBERT autoregressively or than discrete diffusion models on downstream masked-infilling test sets. Controlling for parameter count, MARIA is the most effective way to scale models for masked token infilling.

14.02.2025 00:28 — 👍 0    🔁 0    💬 1    📌 0
Post image

We can get the best of both worlds with MARIA: train a linear decoder to combine the hidden states of an AR and an MLM model. This enables AR masked infilling with the advantages of a more scalable AR architecture, such as KV-cached inference. We combine OLMo and ModernBERT.

14.02.2025 00:28 — 👍 1    🔁 0    💬 1    📌 0
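A minimal PyTorch sketch of that fusion idea (my own toy version; the dimensions and the random tensors standing in for OLMo / ModernBERT hidden states are placeholders, not the paper's implementation).

```python
import torch
import torch.nn as nn

class LinearFusionDecoder(nn.Module):
    """Toy decoder: concatenate AR and MLM hidden states and project to
    vocabulary logits with a single linear layer."""

    def __init__(self, ar_dim: int, mlm_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(ar_dim + mlm_dim, vocab_size)

    def forward(self, ar_hidden: torch.Tensor, mlm_hidden: torch.Tensor) -> torch.Tensor:
        # ar_hidden:  (batch, seq, ar_dim)  - left-to-right states (KV-cacheable)
        # mlm_hidden: (batch, seq, mlm_dim) - bidirectional states over the masked input
        fused = torch.cat([ar_hidden, mlm_hidden], dim=-1)
        return self.proj(fused)  # (batch, seq, vocab_size)

# Random tensors stand in for the two frozen models' hidden states.
decoder = LinearFusionDecoder(ar_dim=2048, mlm_dim=768, vocab_size=50304)
logits = decoder(torch.randn(1, 16, 2048), torch.randn(1, 16, 768))
print(logits.shape)  # torch.Size([1, 16, 50304])
```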
Post image

Autoregressive (AR) LMs are more compute-efficient to train than masked LMs (MLMs), which compute a loss on only a fixed fraction of tokens (e.g. 30%) instead of 100% as AR models do. Unlike MLMs, AR models can also use a KV cache at inference time, but they cannot infill masked tokens.

14.02.2025 00:28 — 👍 0    🔁 0    💬 1    📌 0
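To make the objective difference concrete, here is a toy sketch of the two losses (shapes only: the random `logits` stand in for real model outputs, and 30% is just the example mask ratio from the post above).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, vocab = 2, 8, 100
tokens = torch.randint(0, vocab, (batch, seq))
logits = torch.randn(batch, seq, vocab)  # stand-in for model outputs

# AR objective: every position predicts the next token, so all seq-1
# shifted positions contribute to the loss.
ar_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    tokens[:, 1:].reshape(-1),
)

# MLM objective: only the masked fraction of positions (here ~30%) is
# scored; unmasked positions are ignored via ignore_index.
mask = torch.rand(batch, seq) < 0.3
targets = torch.where(mask, tokens, torch.full_like(tokens, -100))
mlm_loss = F.cross_entropy(
    logits.reshape(-1, vocab),
    targets.reshape(-1),
    ignore_index=-100,
)
print(ar_loss.item(), mlm_loss.item())
```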
Video thumbnail

“That’s one small [MASK] for [MASK], a giant [MASK] for mankind.” – [MASK] Armstrong

Can autoregressive models predict the next [MASK]? It turns out yes, and quite easily…
Introducing MARIA (Masked and Autoregressive Infilling Architecture)
arxiv.org/abs/2502.06901

14.02.2025 00:28 — 👍 6    🔁 1    💬 1    📌 0

Interesting. I always felt the reviews should be even more independent so that the aggregate score has lower variance.

03.01.2025 21:21 — 👍 0    🔁 0    💬 0    📌 0
