Joel Mire @joelmire - Bluesky Profile

How and when should LLM guardrails be deployed to balance safety and user experience?

Our #EMNLP2025 paper reveals that crafting thoughtful refusals rather than detecting intent is the key to human-centered AI safety.

📄 arxiv.org/abs/2506.00195
🧵[1/9]

20.10.2025 20:04 — 👍 6 🔁 3 💬 1 📌 0

10 years after the initial idea, Artificial Humanities is here! Thanks so much to all who have preordered it. I hope you enjoy reading it and find this research approach as generative as I do. More to come!

27.09.2025 02:28 — 👍 38 🔁 13 💬 5 📌 3

Academic paper titled "Un-straightening Generative AI: How Queer Artists Surface and Challenge the Normativity of Generative AI Models." The piece is written by Jordan Taylor, Joel Mire, Franchesca Spektor, Alicia DeVrio, Maarten Sap, Haiyi Zhu, and Sarah Fox. Also shown is an image titled "24 attempts at intimacy" showing 24 AI-generated images for the word "intimacy," none of which seems to include same-gender couples.

🏳️‍🌈🎨💻📢 Happy to share our workshop study on queer artists’ experiences critically engaging with GenAI

Looking forward to presenting this work at #FAccT2025 and you can read a pre-print here:
arxiv.org/abs/2503.09805

14.05.2025 18:38 — 👍 25 🔁 4 💬 0 📌 0

When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:

🧵1/9

09.06.2025 13:47 — 👍 70 🔁 21 💬 2 📌 2

An overview of the work “Research Borderlands: Analysing Writing Across Research Cultures” by Shaily Bhatt, Tal August, and Maria Antoniak. The overview reads: "We survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). We then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7)."

🖋️ Curious how writing differs across (research) cultures?
🚩 Tired of “cultural” evals that don't consult people?

We engaged with interdisciplinary researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗

📜 arxiv.org/abs/2506.00784

[1/11]

09.06.2025 23:29 — 👍 74 🔁 30 💬 1 📌 5

This looks incredible! Thanks for sharing the syllabus!

04.06.2025 17:23 — 👍 1 🔁 0 💬 0 📌 0

I’m thrilled to share RewardBench 2 📊— We created a new multi-domain reward model evaluation that is substantially harder than RewardBench, we trained and released 70 reward models, and we gained insights about reward modeling benchmarks and downstream performance!

02.06.2025 23:41 — 👍 22 🔁 6 💬 2 📌 1

Wisconsin-Madison's tree-filled campus, next to a big shiny lake

A computer render of the interior of the new computer science, information science, and statistics building. A staircase crosses an open atrium with visibility across multiple floors

I'm joining Wisconsin CS as an assistant professor in fall 2026!! There, I'll continue working on language models, computational social science, & responsible AI. 🌲🧀🚣🏻‍♀️ Apply to be my PhD student!

Before then, I'll postdoc for a year in the NLP group at another UW 🏔️ in the Pacific Northwest

05.05.2025 19:54 — 👍 145 🔁 14 💬 16 📌 3

When interacting with ChatGPT, have you wondered if it would ever "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! 🤯 1/

28.04.2025 20:36 — 👍 25 🔁 9 💬 1 📌 3

A bar plot comparing the storytelling rates for different topics in the example dataset of congressional speeches. There are often large differences between storytelling and non-storytelling for individual topics. For example, the topic whose top words read "NUM, years, service, great, state" has much more storytelling than non-storytelling.

The top five congressional speeches for the topic "NUM, years, service, great, state." All of the documents honor the lives of important people.

I updated our 🔭StorySeeker demo. Aimed at beginners, it briefly walks through loading our model from Hugging Face, loading your own text dataset, predicting whether each text contains a story, and topic modeling and exploring the results. Runs in your browser, no installation needed!
15.04.2025 12:05 — 👍 20 🔁 6 💬 1 📌 0

New work on multimodal framing! 💫

Some fun results: comparisons of the same frame when expressed in images vs texts. When the "crime" frame is expressed in the article text, there are more political words in the text, but when the frame is expressed in the article image, more police words.

07.04.2025 09:48 — 👍 44 🔁 10 💬 0 📌 1

This was joint work with my co-author Zubin Aysola; collaborators @dchechel.bsky.social, Nick Deas, and @chryssazrv.bsky.social; and advisor @maartensap.bsky.social at @ltiatcmu.bsky.social, @scsatcmu.bsky.social, @columbiauniversity.bsky.social, and @istecnico.bsky.social! (10/10)

06.03.2025 19:49 — 👍 4 🔁 0 💬 0 📌 0

Our work builds on sociolinguistic and NLP research on AAL and recent translation methods. Check out the paper for details! We hope others extend this work, e.g., to investigate or mitigate reward model biases against more dialects. (9/10)

06.03.2025 19:49 — 👍 0 🔁 0 💬 1 📌 0

These results point to representational and quality-of-service harms for AAL speakers. ⚠️They also highlight complex ethical questions about the desired behavior of LLMs concerning AAL. (8/10)

06.03.2025 19:49 — 👍 0 🔁 1 💬 1 📌 0

Bar chart showing the results from t-tests comparing rewards assigned in dialect-mirroring conditions (completion dialect matches prompt dialect) vs. non-mirroring conditions (completion dialect differs from prompt dialect). The results show statistically significant preferences for responding in WME, regardless of whether the prompt was WME or AAL, for all models.

Finally, we show that the reward models strongly incentivize steering conversations toward WME, even when prompted with AAL. 🗣️🔄 (7/10)

06.03.2025 19:49 — 👍 0 🔁 0 💬 1 📌 0

Bar chart showing Pearson correlation coefficients between reward model score and AAL-ness score from a pre-existing dialect detection tool. The chart shows a statistically significant negative correlation between these variables for most models.

Also, for most models, rewards are negatively correlated with the predicted AAL-ness of a text (based on a pre-existing dialect detection tool). (6/10)

06.03.2025 19:49 — 👍 0 🔁 0 💬 1 📌 0
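For readers unfamiliar with the statistic in (6/10), here is a minimal sketch of a Pearson correlation between a per-text dialect score and a per-text reward. The function name `pearson_r` and all numbers are hypothetical and made up for illustration; they are not data or code from the paper, and the dialect detection tool itself is not modeled here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalized standard deviations; the 1/n factors cancel in the ratio
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)

# Hypothetical values: predicted AAL-ness per text and the reward it received
aal_ness = [0.1, 0.3, 0.5, 0.7, 0.9]
rewards = [2.0, 1.8, 1.1, 0.9, 0.2]

r = pearson_r(aal_ness, rewards)
print(f"Pearson r = {r:.3f}")  # negative: higher AAL-ness, lower reward
```

A negative `r` in this toy data mirrors the direction of the reported finding, nothing more; significance testing is left out of the sketch.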

Bar chart showing the Cohen's d effect sizes from t-tests comparing raw reward scores assigned to WME vs. AAL texts. All results show a significant dispreference for AAL texts.

Next, we show that most reward models predict lower rewards for AAL texts ⬇️ (5/10)

06.03.2025 19:49 — 👍 0 🔁 1 💬 1 📌 0
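The chart in (5/10) reports Cohen's d effect sizes. As a rough illustration, the sketch below computes Cohen's d using the pooled-standard-deviation form for two independent samples. That form is an assumption; the paper may use a paired variant. The function name `cohens_d` and the reward values are hypothetical.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Hypothetical raw reward scores, made up for illustration
wme_rewards = [2.0, 3.0, 4.0, 5.0, 6.0]
aal_rewards = [1.0, 2.0, 3.0, 4.0, 5.0]

d = cohens_d(wme_rewards, aal_rewards)
print(f"Cohen's d (WME - AAL) = {d:.3f}")
```

With this sign convention, a positive d means WME texts received higher rewards on average than AAL texts.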

Line chart showing that reward models are less accurate at assigning higher rewards to human-preferred completions when processing AAL texts than when processing the paired WME texts.

First, we see a significant drop in performance (-4% accuracy on average) in assigning higher rewards to human-preferred completions when processing AAL texts vs. WME texts. 📉 (4/10)

06.03.2025 19:49 — 👍 0 🔁 1 💬 1 📌 0
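The accuracy in (4/10) is the share of preference pairs in which the human-preferred completion gets the higher reward. A minimal sketch of that metric follows, assuming each pair is stored as `(chosen_reward, rejected_reward)`. The function name `pairwise_accuracy` and all scores are hypothetical; the gap in this toy data is not the paper's result.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_reward, rejected_reward) pairs where chosen scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Hypothetical reward scores for the same four items in each dialect condition
wme_pairs = [(2.1, 0.4), (1.3, 1.0), (0.2, 0.9), (3.0, 1.5)]
aal_pairs = [(1.8, 0.6), (0.7, 1.0), (0.1, 0.9), (2.2, 1.4)]

wme_acc = pairwise_accuracy(wme_pairs)
aal_acc = pairwise_accuracy(aal_pairs)
accuracy_gap = aal_acc - wme_acc  # negative when accuracy drops on AAL texts

print(f"WME accuracy: {wme_acc:.2f}, AAL accuracy: {aal_acc:.2f}, gap: {accuracy_gap:+.2f}")
```

Ties count as incorrect here because of the strict comparison; that is a design choice of this sketch, not something stated in the thread.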

Diagram depicting several ways we combine prompts and completions in White Mainstream English (WME) and African American Language (AAL) to evaluate dialect biases in reward models. Also, the image contains text summaries of our main findings: accuracy drop for AAL, moderate dispreference for AAL-aligned texts, and WME responses for AAL prompts.

We introduce morphosyntactic & phonological features of AAL into WME texts from the RewardBench dataset using validated automatic translation methods. Then, we test 17 reward models for implicit anti-AAL dialect biases. 📊 (3/10)

06.03.2025 19:49 — 👍 2 🔁 0 💬 1 📌 0

We develop a framework for evaluating dialect biases in reward models and conduct a case study on biases against African American Language (AAL) relative to White Mainstream English (WME). 🔍 (2/10)

06.03.2025 19:49 — 👍 0 🔁 0 💬 1 📌 0

Screenshot of arXiv paper title, "Rejected Dialects: Biases Against African American Language in Reward Models," and author list: Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap.

Reward models for LMs are meant to align outputs with human preferences—but do they accidentally encode dialect biases? 🤔

Excited to share our paper on biases against African American Language in reward models, accepted to #NAACL2025 Findings! 🎉

Paper: arxiv.org/abs/2502.12858 (1/10)

06.03.2025 19:49 — 👍 37 🔁 11 💬 1 📌 2
