Our new paper in #PNAS (bit.ly/4fcWfma) presents a surprising finding: when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor.
w/ Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social and @dallascard.bsky.social (1/12)
29.07.2025 12:05 · likes: 31 · reposts: 16 · replies: 3 · quotes: 2
A blizzard is raging through Montreal when your friend says "Looks like Florida out there!" Humans easily interpret irony, while LLMs struggle with it. We propose a rhetorical-strategy-aware probabilistic framework as a solution.
Paper: arxiv.org/abs/2506.09301 to appear @ #ACL2025 (Main)
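For intuition, here is a toy Python sketch of a strategy-aware listener that marginalizes over rhetorical strategies (literal vs. ironic); the meanings, priors, and likelihoods below are made-up illustrations, not the paper's actual model:

```python
# Toy sketch (not the paper's model): a listener that scores candidate meanings
# by marginalizing over rhetorical strategies (literal vs. ironic).

def interpret(meanings, strategies, p_strategy, p_meaning, p_utt_given):
    """P(meaning | utterance) ∝ Σ_s P(utterance | meaning, s) · P(meaning) · P(s)."""
    scores = {
        m: sum(p_utt_given(m, s) * p_meaning[m] * p_strategy[s] for s in strategies)
        for m in meanings
    }
    z = sum(scores.values()) or 1.0
    return {m: v / z for m, v in scores.items()}

# Hypothetical numbers for "Looks like Florida out there!" during a blizzard.
meanings = ["weather is warm", "weather is awful"]
strategies = ["literal", "ironic"]
p_strategy = {"literal": 0.5, "ironic": 0.5}                     # toy prior over strategies
p_meaning = {"weather is warm": 0.05, "weather is awful": 0.95}  # blizzard context

def p_utt_given(meaning, strategy):
    # A literal speaker praises the weather only if it really is warm;
    # an ironic speaker says it precisely when the opposite holds.
    if strategy == "literal":
        return 0.9 if meaning == "weather is warm" else 0.1
    return 0.9 if meaning == "weather is awful" else 0.1

print(interpret(meanings, strategies, p_strategy, p_meaning, p_utt_given))
# -> the ironic reading ("weather is awful") gets ~0.95 of the probability mass
```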
26.06.2025 15:52 · likes: 14 · reposts: 7 · replies: 1 · quotes: 4
"Build the web for agents, not agents for the web"
This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call the Agentic Web Interface (AWI).
arxiv.org/abs/2506.10953
14.06.2025 04:17 · likes: 5 · reposts: 4 · replies: 0 · quotes: 0
Excited to share the results of my recent internship!
We ask:
What subtle shortcuts are VideoLLMs taking on spatio-temporal questions?
And how can we instead curate shortcut-robust examples at a large scale?
We release: MVPBench
Details in the thread below.
13.06.2025 14:47 · likes: 16 · reposts: 5 · replies: 1 · quotes: 0
Do LLMs hallucinate randomly? Not quite.
Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode, revealing how LLMs generalize using abstract classes + context cues, albeit unreliably.
Paper: arxiv.org/abs/2505.22630 1/n
06.06.2025 18:09 · likes: 48 · reposts: 18 · replies: 1 · quotes: 3
Without 🐦 and 🦋, are we left with LinkedIn?
10.05.2025 20:55 · likes: 1 · reposts: 0 · replies: 1 · quotes: 0
Congratulations to Mila members @adadtur.bsky.social, Gaurav Kamath and @sivareddyg.bsky.social for their SAC award at NAACL! Check out Ada's talk in Session I: Oral/Poster 6. Paper: arxiv.org/abs/2502.05670
01.05.2025 14:30 · likes: 13 · reposts: 7 · replies: 0 · quotes: 3
Exciting release! AgentRewardBench offers a much-needed closer look at evaluating agent capabilities: automatic vs. human eval. Important findings here, especially on the popular LLM judges. Amazing work by @xhluca.bsky.social & team!
15.04.2025 19:11 · likes: 3 · reposts: 1 · replies: 1 · quotes: 0
An amazing team effort with: @a-kazemnejad.bsky.social Nick @arkil.bsky.social Dongchan Alejandra @karstanczak.bsky.social @ptshaw.bsky.social @chrisjpal.bsky.social @sivareddyg.bsky.social
15.04.2025 19:10 · likes: 1 · reposts: 0 · replies: 1 · quotes: 0
We find that rule-based evals underreport success rates, and no single LLM judge excels across all benchmarks.
We collect trajectories from web agents built on four LLMs (Claude 3.7, GPT-4o, Llama 3.3, Qwen2.5-VL) across popular web benchmarks (AssistantBench, WebArena, VWA, WorkArena, WorkArena++)
15.04.2025 19:10 · likes: 1 · reposts: 0 · replies: 1 · quotes: 0
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
We are releasing the first benchmark to measure how well automatic evaluators, such as LLM judges, assess web agent trajectories.
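For a rough sense of what evaluating the evaluators involves, here is a minimal hypothetical sketch (not the benchmark's code) that scores an automatic judge's success verdicts against human annotations of the same trajectories:

```python
# Hedged sketch: precision/recall of a judge's "success" verdicts w.r.t. human labels.

def judge_agreement(human_labels, judge_labels):
    """Compare an automatic evaluator's verdicts against human annotations."""
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    fp = sum((not h) and j for h, j in zip(human_labels, judge_labels))
    fn = sum(h and (not j) for h, j in zip(human_labels, judge_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

# Hypothetical labels for five trajectories (True = task judged successful).
human = [True, False, True, True, False]
judge = [True, False, False, True, True]
print(judge_agreement(human, judge))
```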
15.04.2025 19:10 · likes: 7 · reposts: 4 · replies: 1 · quotes: 1
And Thoughtology is now on arXiv! Read more about R1 reasoning across visual, cultural and psycholinguistic tasks at the link below:
arxiv.org/abs/2504.07128
11.04.2025 16:31 · likes: 5 · reposts: 1 · replies: 0 · quotes: 0
bsky.app/profile/sara...
12.04.2025 16:12 · likes: 1 · reposts: 0 · replies: 0 · quotes: 0
DeepSeek-R1 Thoughtology: Let's <think> about LLM reasoning
A 142-page report diving into the reasoning chains of R1, spanning 9 axes: safety, world modeling, faithfulness, long context, and more.
Now on arxiv: arxiv.org/abs/2504.07128
12.04.2025 16:11 · likes: 6 · reposts: 1 · replies: 1 · quotes: 0
Introducing the DeepSeek-R1 Thoughtology -- the most comprehensive study of R1 reasoning chains/thoughts. Probably everything you need to know about R1 thoughts. If we missed something, please let us know.
01.04.2025 20:12 · likes: 17 · reposts: 4 · replies: 0 · quotes: 1
A circular diagram with a blue whale icon at the center. The diagram shows 8 interconnected research areas around LLM reasoning represented as colored rectangular boxes arranged in a circular pattern. The areas include: §3 Analysis of Reasoning Chains (central cloud), §4 Scaling of Thoughts (discussing thought length and performance metrics), §5 Long Context Evaluation (focusing on information recall), §6 Faithfulness to Context (examining question answering accuracy), §7 Safety Evaluation (assessing harmful content generation and jailbreak resistance), §8 Language & Culture (exploring moral reasoning and language effects), §9 Relation to Human Processing (comparing cognitive processes), §10 Visual Reasoning (covering ASCII generation capabilities), and §11 Following Token Budget (investigating direct prompting techniques). Arrows connect the sections in a clockwise flow, suggesting an iterative research methodology.
Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1's reasoning chains across a variety of tasks, investigating its capabilities, limitations, and behaviour.
mcgill-nlp.github.io/thoughtology/
01.04.2025 20:06 · likes: 52 · reposts: 16 · replies: 1 · quotes: 9
Check out our new workshop on Actionable Interpretability @ ICML 2025. We are also looking forward to submissions that take a position on the future of interpretability research more broadly.
31.03.2025 18:15 · likes: 9 · reposts: 1 · replies: 0 · quotes: 0
Excited to announce our upcoming workshop - Vision Language Models For All: Building Geo-Diverse and Culturally Aware Vision-Language Models (VLMs-4-All) @CVPR 2025!
sites.google.com/view/vlms4all
14.03.2025 15:55 · likes: 17 · reposts: 11 · replies: 1 · quotes: 4
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet!
Retrievers need to be aligned too!
Work done with the wonderful Nick and @sivareddyg.bsky.social
mcgill-nlp.github.io/malicious-ir/
Thread below.
12.03.2025 16:15 · likes: 12 · reposts: 8 · replies: 1 · quotes: 0
Web agents powered by LLMs can solve complex tasks, but our analysis shows that they can also be easily misused to automate harmful tasks.
See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
10.03.2025 20:11 · likes: 5 · reposts: 2 · replies: 0 · quotes: 0
The potential for malicious misuse of LLM agents is a serious threat.
That's why we created SafeArena, a safety benchmark for web agents. See the thread and our paper for details: arxiv.org/abs/2503.04957
10.03.2025 18:20 · likes: 9 · reposts: 2 · replies: 0 · quotes: 0
Llamas browsing the web look cute, but they are capable of causing a lot of harm!
Check out our new web agent safety benchmark: SafeArena!
Paper: arxiv.org/abs/2503.04957
10.03.2025 17:50 · likes: 9 · reposts: 3 · replies: 0 · quotes: 0
WebArena by Zhou et al.; AgentLab and BrowserGym by @servicenow.bsky.social allowed us to explore the latest agents; @gradio-hf.bsky.social enabled us to design UIs for implementing our ARIA framework; and @hf.co provided a hosting platform for 100GB+ artifacts.
bsky.app/profile/xhlu...
10.03.2025 17:45 · likes: 3 · reposts: 0 · replies: 0 · quotes: 0
This work was done by an awesome team of authors: @adadtur.bsky.social, Nick, @arkil.bsky.social, @karstanczak.bsky.social, Esin, @spandanagella.bsky.social, and @sivareddyg.bsky.social.
It's also important to recognize the incredible works that helped us build SafeArena:
10.03.2025 17:45 · likes: 4 · reposts: 1 · replies: 1 · quotes: 0
SafeArena: Evaluating the Safety of Autonomous Web Agents
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an onlin...
We release the benchmark, code, and tasks to help researchers develop agents that are both helpful and safe:
Paper: arxiv.org/abs/2503.04957
Benchmark: safearena.github.io
Code: github.com/McGill-NLP/s...
Tasks/Environments: huggingface.co/datasets/McG...
Leaderboard: huggingface.co/spaces/McGil...
10.03.2025 17:45 · likes: 3 · reposts: 0 · replies: 1 · quotes: 0
Safearena Leaderboard - a Hugging Face Space by McGill-NLP
To provide transparency on the safety of popular LLMs, we host a leaderboard that ranks models by their normalized safety score: the rate at which a model completes a safe task compared to its harmful counterpart, measured in augmented environments built on top of WebArena.
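As a rough illustration of that ratio (the leaderboard's exact formula is given in the paper; the counts below are hypothetical):

```python
# Sketch of one way to compute the normalized safety score described above;
# the leaderboard's exact definition is in the SafeArena paper.

def normalized_safety_score(safe_completions, harmful_completions, n_pairs):
    """Completion rate on safe tasks relative to completion of their harmful counterparts."""
    safe_rate = safe_completions / n_pairs
    harmful_rate = harmful_completions / n_pairs
    denom = safe_rate + harmful_rate
    return safe_rate / denom if denom else 0.0

# Hypothetical agent: completes 180/250 safe tasks but 40/250 harmful counterparts.
print(normalized_safety_score(180, 40, 250))  # -> ~0.818
```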
10.03.2025 17:45 · likes: 3 · reposts: 0 · replies: 1 · quotes: 0
With ARIA, we find that Claude is substantially safer than Qwen, which very rarely refuses user requests, indicating limited safeguards for web-oriented tasks.
10.03.2025 17:45 · likes: 4 · reposts: 1 · replies: 1 · quotes: 0
We introduce the Agent Risk Assessment framework (ARIA), which humans and LLM judges can use to determine a web agent's risk level, ranging from safe (L1), if the agent refuses a harmful request right away, to effectively harmful (L4), if it successfully completes a harmful request.
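A minimal sketch of the two endpoint levels, with a hypothetical labeling helper; the intermediate levels and the full criteria are defined in the paper:

```python
from enum import IntEnum

class ARIARiskLevel(IntEnum):
    """Endpoints of the ARIA risk scale described above (L2/L3 are detailed in the paper)."""
    L1_IMMEDIATE_REFUSAL = 1  # safe: the agent refuses the harmful request right away
    L4_HARM_COMPLETED = 4     # effectively harmful: the agent completes the harmful request

def label_endpoint(refused_immediately: bool, harm_completed: bool) -> ARIARiskLevel:
    """Toy rule covering only the two endpoints; the full ARIA rubric handles the rest."""
    if refused_immediately:
        return ARIARiskLevel.L1_IMMEDIATE_REFUSAL
    if harm_completed:
        return ARIARiskLevel.L4_HARM_COMPLETED
    raise ValueError("Intermediate risk levels require the full ARIA criteria.")

print(label_endpoint(refused_immediately=False, harm_completed=True))
# -> ARIARiskLevel.L4_HARM_COMPLETED
```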
10.03.2025 17:45 · likes: 3 · reposts: 0 · replies: 1 · quotes: 0
The harmfulness of LLMs varies: whereas Claude-3.5 Sonnet refuses a majority of harmful tasks, Qwen-2-VL completes over a quarter of the 250 harmful tasks we designed for this benchmark. Moreover, a GPT-4o agent completes an alarming number of unsafe requests, despite extensive safety training.
10.03.2025 17:45 · likes: 3 · reposts: 1 · replies: 1 · quotes: 0
Mathematician at UCLA. My primary social media account is https://mathstodon.xyz/@tao . I also have a blog at https://terrytao.wordpress.com/ and a home page at https://www.math.ucla.edu/~tao/
PhD-ing at McGill Linguistics + Mila, working under Prof. Siva Reddy. Mostly computational linguistics, with some NLP; habitually disappointed Arsenal fan
Computational semantics and pragmatics, interpretability and occasionally some psycholinguistics. he/him.
https://sebschu.com
AI/ML Applied Research Intern at Adobe | NLP-ing (Research Masters) at MILA/McGill
MSc student @mila-quebec.bsky.social @mcgill-nlp.bsky.social
Research Fellow @ RBC Borealis
Model analysis, interpretability, reasoning and hallucination
Studying model behaviours to make them better :))
Looking for Fall '26 PhD
Interp & analysis in NLP
Mostly π¦π·, slightly π¨π±
Assistant professor in Natural Language Processing at the University of Edinburgh and visiting professor at NVIDIA | A Kleene star shines on the hour of our meeting.
Working on RL training of LLMs @Mila_Quebec.
Research Scientist at Google DeepMind
https://e-bug.github.io
Research Scientist at Ai2, PhD in NLP from UofA. Ex: GoogleDeepMind, MSFTResearch, MilaQuebec
https://nouhadziri.github.io/
PhD fellow in XAI, IR & NLP
Mila - Quebec AI Institute | University of Copenhagen
#NLProc #ML #XAI
Recreational sufferer
PhD student @ LIRIS INSA Lyon & Esker
PhD student @ ETH Zürich | all aspects of NLP but mostly evaluation and MT | go vegan | https://vilda.net
Stanford Professor of Linguistics and, by courtesy, of Computer Science, and member of @stanfordnlp.bsky.social and The Stanford AI Lab. He/Him/His. https://web.stanford.edu/~cgpotts/
Indigenous language technology. PhD candidate at McGill University in Montreal. Ngāpuhi Nui Tonu.
Associate Professor, CMU. Researcher, Google. Evaluation and design of information retrieval and recommendation systems, including their societal impacts.
PhD researcher at Mila Quebec
Ph.D. in NLP Interpretability from Mila. Previously: independent researcher, freelancer in ML, and Node.js core developer.