For some reason my account got suspended after posting this. Weird moderation.
Having restored my account, I'm reposting to increase visibility.
@camuljak.bsky.social
The paper is accepted to EACL Findings. See you in Rabat! 🇲🇦
Shoutout to @mlkukic.bsky.social (just started his MS, hire him!), @ddaviddukic.bsky.social, @mtutek.bsky.social, and sensei Jan Šnajder for this cute collaboration.
https://arxiv.org/abs/2601.17585
https://github.com/takelab/repetition-sl
We establish multi-fold repetition with early exiting as a viable strategy for decoder-as-encoder adaptation, one that does not require complex architectural modifications or extensive training.
02.02.2026 12:04
For Mistral-7B, we find that embeddings from layer 24 (out of 32) can even outperform those at the last layer, while matching the processing time of the input sequence with no repetitions.
To counteract the computational overhead, we experiment with early exiting, using representations from the models' intermediate layers.
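As a rough illustration (my own toy sketch, not the paper's implementation), early exiting just means stopping the forward pass at an intermediate layer and using that layer's hidden states as the token embeddings:

```python
# Toy sketch of early exiting (illustrative only, not the paper's code):
# run only the first `exit_layer` layers and use that hidden state.

def encode_with_early_exit(hidden, layers, exit_layer):
    """Apply layers[0:exit_layer] and return the intermediate hidden state."""
    for layer in layers[:exit_layer]:
        hidden = layer(hidden)
    return hidden

# 32 toy "layers" standing in for transformer blocks; each just adds 1.
layers = [lambda h: [x + 1 for x in h] for _ in range(32)]

# Exiting at layer 24 skips the last 8 layers' worth of compute.
print(encode_with_early_exit([0, 0, 0], layers, exit_layer=24))  # [24, 24, 24]
```

In a real model the "layers" would be transformer blocks and the exit point a hyperparameter (the thread's layer-24-of-32 result for Mistral-7B is an example of such a choice).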
However, the performance gains saturate around 4 repetitions. Also, adding many repetitions incurs computational costs.
Indeed, we observe performance gains over SotA baselines such as removing the causal mask in all layers of the model (full unmasking) or only in the middle layers (middle unmasking), as well as over SotA encoder-only models (ModernBERT and RoBERTa).
Therefore, additional repetitions bring the model closer to a balanced ratio of left- and right-context information throughout the entire input sequence.
We demonstrate the utility of increased repetitions on sequence labeling tasks such as NER or aspect-based sentiment analysis.
We focus on token-level tasks as they require bidirectional context at each token, something decoder-only models lack.
Additional repetitions increase the proportion of bidirectional blocks, and with a bit of high school math it is easy to see that this proportion approaches 1 as the number of repetitions grows, so the model increasingly resembles an encoder-only model.
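For concreteness, here is one way to do that math (my own block-counting sketch, not taken from the paper): view the causal mask over k repetitions of a length-n sequence as a k x k grid of n x n blocks. Blocks attending to an earlier repetition are fully visible (bidirectional), diagonal blocks stay causal, and blocks above the diagonal are empty, so the fraction of fully bidirectional blocks among non-empty ones is (k - 1) / (k + 1), which tends to 1:

```python
# Sketch (my own convention, not the paper's code): fraction of fully
# bidirectional n x n blocks in the causal mask over k repetitions.

def bidirectional_block_fraction(n: int, k: int) -> float:
    N = n * k
    visible = [[j <= i for j in range(N)] for i in range(N)]  # causal mask
    full, nonempty = 0, 0
    for r in range(k):          # query-side repetition index
        for s in range(k):      # key-side repetition index
            cells = [visible[r * n + i][s * n + j]
                     for i in range(n) for j in range(n)]
            if any(cells):
                nonempty += 1
                if all(cells):  # whole block visible => bidirectional
                    full += 1
    return full / nonempty

for k in (1, 2, 4, 16):
    # matches the closed form (k - 1) / (k + 1): 0.0, 0.33..., 0.6, 0.88...
    print(k, bidirectional_block_fraction(4, k))
```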
Thus, we wanted to look at what happens when a model is fine-tuned to utilize additional repetitions. In theory, repeating a sequence once creates a bidirectional block in the attention matrix.
Previous work found that performance gains dissipate at higher repetition counts.
We found this phenomenon counterintuitive since additional repetitions effectively increase the processing capacity of the model.
We already know prompt repetition is a handy hack to improve a decoder-only LM's performance, as it allows the model to "see" bidirectionally, an ability otherwise suppressed by the causal mask.
But what happens if we increase the number of repetitions? 🧵 @eaclmeeting.bsky.social #EACL2026
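To make the "seeing bidirectionally" bit concrete, here is a minimal sketch (my own hypothetical helper, not the paper's code): under a causal mask, position i attends only to positions j <= i, so once the sequence is repeated, every token's second occurrence can attend to a complete copy of the sequence.

```python
# Minimal sketch (hypothetical helper, not the paper's code): which original
# tokens a given token can attend to under the causal mask, with repetitions.

def visible_source_tokens(token_idx: int, n: int, copy: int) -> set:
    """Original token indices visible to `token_idx` in 0-based `copy`
    of a length-n sequence, under causal attention (j <= i)."""
    query_pos = copy * n + token_idx
    return {j % n for j in range(query_pos + 1)}

n = 5
print(visible_source_tokens(0, n, copy=0))  # {0}: in the first copy, only the prefix
print(visible_source_tokens(0, n, copy=1))  # all of {0..4}: the second copy sees every token
```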
Very honored to be one of seven outstanding papers at this year's EMNLP :)
Huge thanks to my amazing collaborators @fatemehc.bsky.social, @anamarasovic.bsky.social, and @boknilev.bsky.social: this would not have been possible without them!
Back from #ICML2025, and off to Norrköping 🇸🇪 for #ic2s2
CLAN (cs.au.dk/~clan/) members are presenting 2 papers: 1 spotlight and 1 oral. See 🧵 for posters and summaries.
Reach out to chat about observational studies, causality, LLM agents, human-centered AI, etc.
We did a cool group project exploring diachronic embeddings for Croatian and found that (among other things) embeddings trained on later periods are more positive when plugged into models trained on earlier time periods.
Check out the thread 🧵 & come talk to us in Vienna about this & other works!