Ali Modarressi

@amodarressi.bsky.social

PhD student, NLP Researcher at @cislmu.bsky.social | Prev. Intern @Adobe.com

48 Followers 104 Following 13 Posts Joined May 2025
1 month ago
Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance Expert persona prompting -- assigning roles such as expert in math to language models -- is widely used for task improvement. However, prior work shows mixed results on its effectiveness, and does not...

📢 New paper accepted at @eaclmeeting.bsky.social 2026:

Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions

with
@mhedderich.bsky.social
@amodarressi.bsky.social
Hinrich Schuetze
& Benjamin Roth.

Preprint: arxiv.org/abs/2512.12775

3 months ago
Post image

🧑‍🔬I’m recruiting PhD students in Natural Language Processing at @unileipzig.bsky.social Computer Science, together with @scadsai.bsky.social!

Topics include, but aren’t limited to:

🔎Linguistic Interpretability
🌍Multilingual Evaluation
📖Computational Typology

Please share!

#NLProc #NLP

4 months ago
Post image

CIS & MaiNLP Group picture at EMNLP 2025! 🤩 🤗 (1/3)

While I sadly 🥲 won't be at EMNLP this year myself, please do reach out to any of our members for a chat if you are interested in our research!

We also co-organize and participate in some great workshops at EMNLP:

4 months ago
Post image

Excited to be here in Suzhou for #EMNLP2025!
I’ll be presenting “ImpliRet”; check out our poster on Friday, Nov. 7th at 14:00.
If you’re into long-context, IR, or just want to chat, come *Pay Ali* a visit 😁
Link to thread:
x.com/zeinabtaghav...

7 months ago

Details on poster times and locations coming soon.

Would love to meet and chat ☕️💬

If you’re attending #ACL2025, feel free to stop by and say hi! 👋
🧵[4/4]

7 months ago
Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability and reliability. In this work, we analyze the evolution of factual kn...

⏱️🔎 Time Course MechInterp
We track how factual knowledge forms in OLMo over training by analyzing the evolving roles of Attention Heads and FFNs.
Heads are dynamic and often repurposed; FFNs are stable and keep refining facts.
By: A. Dawar Hakimi
arxiv.org/abs/2506.03434
🧵[3/4]

7 months ago
Amir H. Kargaran on X: "Excited to introduce MEXA, a method for assessing the multilingual capabilities of English-centric LLMs using parallel sentences. It estimates how many languages an LLM covers and at what level. Paper: https://t.co/awRq0Y4SCl Code: https://t.co/M3UVh2F9J1 https://t.co/xBOQ1DJmWx"

🌐 MEXA: Multilingual Evaluation of English-Centric LLMs

A method for assessing the multilingual capabilities of English-centric LLMs using parallel sentences. It estimates how many languages an LLM covers and at what level.

By: @kargaranamir.bsky.social

x.com/amir_nlp/sta...
🧵[2/4]

7 months ago

Leaving Vancouver after ICML’s closing fireworks 😁🎆

Heading to Toronto for a few days, then off to
@aclmeeting.bsky.social to present:

"Collapse of Dense Retrievers"
A work by @mohsen-fayyaz.bsky.social that I was fortunate to collaborate on.

Also co-presenting two other papers…🧵 [1/4]

8 months ago
Ali Modarressi on X: "🚀 Introducing NoLiMa Paper 🚀 Most long-context benchmarks have literal overlaps between the questions and the context—but what if they didn’t? 🤔 Turns out, it’s a tough challenge! The performance of powerful models like GPT-4o drops from 99.3% to 69.7% at 32K context length. 📉 https://t.co/Fo3YsGCBsi"

Full NoLiMa post thread (X / Twitter): x.com/AModarressi/...

8 months ago
NoLiMa: Long-Context Evaluation Beyond Literal Matching Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves ret...

Check out the paper & our GitHub repo (with results on recent models 🆕✨)!
📄: arxiv.org/abs/2502.05167
🔗: github.com/adobe-resear...
🤗: huggingface.co/datasets/amo...
This work was my internship project at
@adobe.com, in collaboration with my mentors there and Hinrich Schütze.

8 months ago
Post image

I’ll be at @icmlconf.bsky.social next week presenting NoLiMa!
Poster on Tue July 15, 4:30–7pm (E-2312).

Happy to grab a coffee and chat about long-context, memory, research, or just to catch up.

I’ll be in Toronto for a couple of days after the conference; let me know if you’re around!

9 months ago

MemLLM: Finetuning LLMs to Use Explicit Read-Write Memory

Ali Modarressi, Abdullatif Köksal, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schuetze

Action editor: Greg Durrett

https://openreview.net/forum?id=dghM7sOudh

#memory #memorizing #memllm

9 months ago
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robu...

The takeaway? We need robust retrievers that prioritize answer relevance, not just heuristic shortcuts.

Work with an amazing team:
@mohsen-fayyaz.bsky.social,
Hinrich Schütze,
@violetpeng.bsky.social

paper: arxiv.org/abs/2503.05037
dataset 🤗: t.co/QZFyCLqP0P

Cross-post from x.com/mohsen_fayyaz

9 months ago
Post image

We also analyze RAG: biased retrievers can mislead LLMs, degrading their performance by 34%, which is worse than retrieving nothing! 😮

9 months ago
Post image

When multiple biases combine, retrievers fail catastrophically:
📉 Answer-containing docs are ranked above a synthetic biased doc with no answer less than 3% of the time!

9 months ago

Dense retrievers are crucial for RAG and search, but do they actually retrieve useful evidence? 🤔
We design controlled experiments by repurposing a relation extraction dataset, exposing serious flaws in models like Dragon+ and Contriever.

9 months ago

📄 Collapse of Dense Retrievers

Accepted to #ACL2025 main conference 🎉🎉

In this paper we uncover major vulnerabilities in dense retrievers like Contriever, showing they favor:
📌 Shorter docs
📌 Early positions
📌 Repeated entities
📌 Literal matches
...all while ignoring the answer's presence!
