
Joe Stacey

@joestacey.bsky.social

NLP PhD student at Imperial College London and Apple AI/ML Scholar.

2,523 Followers  |  2,059 Following  |  143 Posts  |  Joined: 08.11.2024

Latest posts by joestacey.bsky.social on Bluesky


We have released #AgentCoMa, an agentic reasoning benchmark where each task requires a mix of commonsense and math to be solved 🧐

LLM agents performing real-world tasks should be able to combine these different types of reasoning, but are they fit for the job? 🤔

🧵⬇️

28.08.2025 14:01 — 👍 4    🔁 2    💬 1    📌 0

Congratulations!! Awesome you will be in Europe!

22.07.2025 19:49 — 👍 1    🔁 0    💬 1    📌 0

The bad:

- the chocolate here is terrible for no good reason
- hotel breakfasts never have any baked beans, which are way underappreciated here (they are delicious and add much-needed moisture to a cooked breakfast)
- the temperature in summer is inhumane

Think that covers the main stuff 😍

17.07.2025 11:24 — 👍 0    🔁 0    💬 0    📌 0

Here’s my review of the US after a few days here. Did I miss anything? 🤔

The good:

- Americans are the most charming, friendly and hospitable people
- it’s super fun how the country is split into states that all have different laws and stuff, with different vibes state to state

17.07.2025 11:24 — 👍 1    🔁 0    💬 1    📌 0

Any chance Keir Starmer can reshuffle himself in as foreign secretary, and shuffle in another prime minister who actually has some vague idea about what they want to achieve? 🙏🤦‍♂️

02.07.2025 17:34 — 👍 0    🔁 0    💬 0    📌 0

Finally the heatwave has ended, and the UK is once again a bearable place to be 😍😍

If you have any UK-based collaborations, their productivity is about to increase like 10-fold

02.07.2025 11:52 — 👍 2    🔁 0    💬 0    📌 0
How to Improve the Robustness of Closed-Source Models on NLI
Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improv...

This work was really fun and a great last paper for my PhD. Check it out 🙂 Massive thanks to all my amazing collaborators!

arxiv.org/abs/2505.20209

P.S. if you know about a paper improving NLI model robustness not already in our related work appendix, I would love to hear about it 🥰

27.05.2025 15:50 — 👍 0    🔁 0    💬 0    📌 0

5) The best way to improve performance on the hardest OOD data was to choose more challenging training examples

Our best method (Uncertainty Sampling) picked examples with the most uncertain predictions. This identified challenging examples, but without too much label noise
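A minimal sketch of this kind of selection (illustrative only, not the exact implementation from the paper): it assumes you have per-example label probabilities from the model, and the 10k figure in the comment is just an example budget.

```python
import numpy as np

def uncertainty_sample(probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k examples whose predicted NLI label distribution is most uncertain.

    probs: array of shape (n_examples, n_classes) with the model's probabilities
           over entailment / neutral / contradiction for each candidate example.
    Returns the indices of the k highest-entropy (most uncertain) examples.
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # per-example entropy
    return np.argsort(-entropy)[:k]                          # most uncertain first

# e.g. keep the 10k most uncertain candidates as the fine-tuning set
# selected_idx = uncertainty_sample(model_probs, k=10_000)
```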

27.05.2025 15:50 — 👍 1    🔁 0    💬 1    📌 0

4) Creating more complex synthetic data avoids a loss in performance on harder OOD datasets

We find that generating more challenging synthetic data (Long & Complex Generation) helps retain performance on harder OOD datasets, while still achieving gains on easier OOD data
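Very roughly, this kind of generation comes down to prompting an LLM for harder examples. The template below is only a guess at the general idea, not the actual prompt from the paper:

```python
# Illustrative guess at a "Long & Complex" style generation prompt,
# not the paper's actual template.
COMPLEX_NLI_PROMPT = (
    "Write one new NLI example with the gold label '{label}'.\n"
    "Make the premise a long, multi-clause sentence about {topic}, and make the\n"
    "hypothesis require careful reasoning to verify.\n"
    'Answer as JSON with keys "premise" and "hypothesis".'
)

def build_generation_prompt(label: str, topic: str) -> str:
    return COMPLEX_NLI_PROMPT.format(label=label, topic=topic)

# e.g. build_generation_prompt("contradiction", "a delayed train journey")
```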

27.05.2025 15:50 — 👍 0    🔁 0    💬 1    📌 0

3) Replacing some training examples with LLM-generated data proved very effective on less challenging OOD data

See Standard-OOD scores below (avg), where the simplest LLM-generated data (Short & Simple Generation) performed best, with substantial improvements

27.05.2025 15:50 — 👍 0    🔁 0    💬 1    📌 0

2) We experiment with 6+ ways to improve robustness:

These involve sampling methods to choose more complex examples from our training data, and generating new synthetic examples

Some methods were pretty fun, e.g. asking an LLM to assess the difficulty of training examples
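A minimal sketch of the LLM-as-difficulty-judge idea; the prompt wording, the 1-5 scale, and the llm callable are placeholders rather than the paper's actual setup:

```python
# Hypothetical sketch: ask an LLM to rate how hard a training example is,
# then keep the higher-scored ones. Prompt, 1-5 scale and `llm` are placeholders.
DIFFICULTY_PROMPT = (
    "On a scale of 1 (trivial) to 5 (very hard), how difficult is it to decide "
    "whether the hypothesis follows from the premise?\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Answer with a single digit."
)

def score_difficulty(llm, premise: str, hypothesis: str) -> int:
    reply = llm(DIFFICULTY_PROMPT.format(premise=premise, hypothesis=hypothesis))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 3  # fall back to a middle score
```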

27.05.2025 15:50 — 👍 1    🔁 0    💬 1    📌 0

1) It's time to stop using fine-tuned encoder models:

We find that fine-tuned LLMs are substantially more robust than commonly used encoder models, despite being fine-tuned on 50x less data.

This is especially the case on challenging OOD datasets (see Challenge-OOD avg below)

27.05.2025 15:50 — 👍 0    🔁 0    💬 1    📌 0

The paper tries to improve the robustness of closed-source LLMs fine-tuned on NLI, assuming a realistic budget of 10k training examples.
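For a sense of scale, a 10k-example fine-tuning set is small enough to serialise in a few lines. The chat-style JSONL below is an assumed format for a provider's fine-tuning API, not necessarily the one used in the paper:

```python
import json
import random

def write_finetune_file(nli_examples, path, budget=10_000):
    """Serialise a capped random sample of NLI examples as chat-style JSONL."""
    sample = random.sample(nli_examples, min(budget, len(nli_examples)))
    with open(path, "w") as f:
        for ex in sample:
            record = {
                "messages": [
                    {"role": "user",
                     "content": (f"Premise: {ex['premise']}\n"
                                 f"Hypothesis: {ex['hypothesis']}\n"
                                 "Label (entailment / neutral / contradiction)?")},
                    {"role": "assistant", "content": ex["label"]},
                ]
            }
            f.write(json.dumps(record) + "\n")
```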

Here's a 45 second rundown of what we found!

27.05.2025 15:50 — 👍 0    🔁 0    💬 1    📌 0

We have a fun new #NLProc paper on arXiv about improving the robustness of fine-tuned NLI models!

Have a look :)
arxiv.org/abs/2505.20209

27.05.2025 15:50 — 👍 5    🔁 0    💬 1    📌 0

I’d personally just love to see more negative results from nice ideas that didn’t quite work out. I feel like there’s probably a bunch of cool stuff people have tried out and discarded that could be made to work across multiple papers. Would be fun and interesting too

18.05.2025 15:48 — 👍 2    🔁 1    💬 1    📌 0

Was worried it was just me hating on it so much 🤣

18.05.2025 11:01 — 👍 0    🔁 0    💬 0    📌 0

I’d love to see more diversity in the field, what kind of things were you thinking?

18.05.2025 09:06 — 👍 0    🔁 0    💬 1    📌 0

Should I use an LLM to help refine my paper writing for the ARR deadline? 🤔🤔

It will improve the paper for sure, but will probably also make the tone a whole lot more annoying

18.05.2025 09:05 — 👍 0    🔁 0    💬 1    📌 0

If you're at #NAACL2025 and want to hear about similarity effects for property inheritance in LMs, please stop by!

I will be presenting this work on Wednesday at the 11-12:30 poster session on Interpretability & analysis for language models (Hall 3).

aclanthology.org/2025.naacl-l...

28.04.2025 20:07 — 👍 12    🔁 4    💬 0    📌 0

Looks so cool! I’m insanely jealous

28.04.2025 16:54 — 👍 2    🔁 0    💬 1    📌 0

I’m not a fan of Musk, but imo there’s some really nice work here 🙂

Interested in the Washington Post article, would you mind sharing a link?

23.04.2025 06:01 — 👍 1    🔁 0    💬 0    📌 0

Excited to share our ICLR and NAACL papers! Please come and say hi, we're super friendly :)

22.04.2025 18:42 — 👍 14    🔁 5    💬 0    📌 0

That’s an awesome paper 👍👍

14.04.2025 17:29 — 👍 0    🔁 1    💬 1    📌 0

Wow, the old ITV Agatha Christie’s Poirot is brilliant. Some TV for 1989…

Gonna go binge-watch the 13 seasons now 😍

05.04.2025 19:59 — 👍 1    🔁 0    💬 0    📌 0

Congratulations! It’s definitely worth trying/experimenting with more concise responses in the future and seeing what kind of reaction you get.

Best of luck with your meta-reviews! 🤞

05.04.2025 05:42 — 👍 1    🔁 0    💬 0    📌 0

Ah that’s good to know!

Yeah I think when authors choose to write concise responses everybody wins 🙂

04.04.2025 10:03 — 👍 1    🔁 0    💬 0    📌 0

Good point. I think the other downside is all the reviewer time it takes to go through them.

I’m not sure what the best solution is, and if you limit the responses too much it’s frustrating, but maybe something that stops overly long responses might be helpful 🙂

04.04.2025 10:00 — 👍 1    🔁 0    💬 1    📌 0

I feel like the length of the ARR author rebuttals keeps growing every cycle

Is it a good thing for authors or reviewers that the responses can be so long? I feel like it’s a bit sub-optimal for both at the moment

04.04.2025 08:40 — 👍 4    🔁 0    💬 3    📌 0

Not only does everyone learn for themselves, but I think almost everyone sees themselves as a good reviewer when that may not be the case

I think the ARR stats on how many great reviews people did are a pretty cool step in the right direction!

25.03.2025 19:21 — 👍 1    🔁 0    💬 0    📌 0

Had a great time presenting my research on building more helpful QA systems @imperialcollegeldn.bsky.social! Thank you @joestacey.bsky.social for letting me invite myself 🫶

And loved visiting London+Edinburgh this week, hope to be back soon! 🙏

21.03.2025 12:07 — 👍 5    🔁 1    💬 0    📌 1
