
Lihao Sun

@1e0sun.bsky.social

Working on LLM interpretability; recent graduate from UChicago. slhleosun.github.io

27 Followers  |  39 Following  |  11 Posts  |  Joined: 05.05.2025

Latest posts by 1e0sun.bsky.social on Bluesky


7/
📢 Accepted to #ACL2025 Main Conference! See you in Vienna.
Work done by @1e0sun.bsky.social, Chengzhi Mao, @valentinhofmann.bsky.social, Xuechunzi Bai.

Paper: arxiv.org/abs/2506.00253
Project page: slhleosun.github.io/aligned_but_...
Code & Data: github.com/slhleosun/al...

10.06.2025 14:38 · 👍 4   🔁 1   💬 0   📌 0

6/
We call this failure mode "blindness": alignment makes certain concepts less salient. This may reflect a broader class of alignment issues.

Similar methods can be extended to other forms of social bias or to study how models resolve polysemy under ambiguity.

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0

5/
This challenges a common belief:
unlearning ≠ debiasing

When debiasing strategies suppress sensitive concepts, they can unintentionally reduce a model’s ability to detect bias.

🧠 Instead, we may achieve deeper alignment effects with strategies that make models aware of sensitive concepts.

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0

4/
Inspired by these results, we tested the opposite of “machine unlearning” for debiasing.

What if we reinforced race concepts in models?
- Injecting race-laden activations cut implicit bias by 54.9% (toy sketch below).
- LoRA fine-tuning brought it down from 97.3% → 42.4%.

Bonus: explicit bias dropped as well.

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0
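To make the activation-injection idea above concrete, here is a minimal sketch of how such an intervention is commonly implemented with a forward hook. The checkpoint, layer index, injection strength, probe prompt, and the precomputed "race direction" file are illustrative assumptions, not the paper's actual setup (see the linked code for that).

```python
# Hedged sketch of activation injection (steering); NOT the paper's released code.
# Assumes a precomputed "race direction", e.g. the mean difference of residual-stream
# activations between race-laden and race-neutral prompts, saved to disk beforehand.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative checkpoint
LAYER = 15                                     # illustrative layer index
ALPHA = 4.0                                    # illustrative injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

race_direction = torch.load("race_direction.pt")  # hypothetical file, shape (hidden_size,)

def inject(module, inputs, output):
    # Depending on the transformers version, decoder layers return either a tuple
    # (hidden_states, ...) or a plain tensor; handle both and add the scaled direction.
    is_tuple = isinstance(output, tuple)
    hidden = output[0] if is_tuple else output
    hidden = hidden + ALPHA * race_direction.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + output[1:] if is_tuple else hidden

handle = model.model.layers[LAYER].register_forward_hook(inject)

prompt = "Pair each word with good or bad: black, white."  # toy implicit probe
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))

handle.remove()
```

The LoRA result in the second bullet reinforces race concepts through fine-tuning instead of inference-time steering.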

3/
We mechanistically tested this using activation patching and embedding interpretation.

Aligned models were 52.2% less likely to represent “black” as race in ambiguous contexts compared to unaligned models.

🧠 LMs trained for harmlessness may avoid racial representations, which can end up amplifying stereotypes.

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0
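For readers unfamiliar with activation patching, the sketch below shows the general recipe: cache activations from one run, splice them into another, and compare predictions. The model, prompts, layer, and token position are illustrative stand-ins (a small model for brevity), not the paper's actual setup.

```python
# Minimal activation-patching sketch, illustrative only (not the paper's code).
# Idea: cache residual-stream activations from a race-salient prompt and patch them
# into a matched color-salient prompt, then see how the next-token prediction shifts.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # small stand-in; the paper studies Llama 3

race_prompt = "The black man walked into the"    # toy race-salient context
color_prompt = "The black car rolled into the"   # toy color-salient context

race_tokens = model.to_tokens(race_prompt)
color_tokens = model.to_tokens(color_prompt)

# Cache every activation from the race-salient run.
_, race_cache = model.run_with_cache(race_tokens)

LAYER, POS = 6, 2  # illustrative layer; position 2 is "black" (position 0 is BOS)

def patch_resid(resid, hook):
    # Overwrite the color-run residual stream at POS with the race-run activation.
    resid[:, POS, :] = race_cache[hook.name][:, POS, :]
    return resid

clean_logits = model(color_tokens)
patched_logits = model.run_with_hooks(
    color_tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", LAYER), patch_resid)],
)

# Compare top next-token predictions before and after patching.
for label, logits in [("clean", clean_logits), ("patched", patched_logits)]:
    top = logits[0, -1].topk(5).indices
    print(label, model.to_str_tokens(top))
```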

This resembles race blindness in humans; ignoring race makes stereotypes more likely to slip through, and the LMs’ safety guardrails aren't triggered.

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0

2/
So why does alignment increase implicit bias?

Our analyses showed that aligned LMs are more likely to treat “black” and “white” as pure color, not race, when the context is ambiguous.

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0

Aligned models passed explicit tests, but were more biased in implicit settings.
📉 Explicit bias: near 0%
📈 Implicit bias: 91.4%

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0

- Explicit: Likert scale, asking whether the model agrees with a stated association, e.g. “black” is related to negative, “white” is related to positive.
- Implicit: word association, letting the model freely pair “black”/“white” with positive/negative words (toy examples below).

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0
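To make the two formats concrete, here is a toy reconstruction; the wording is hypothetical, and the paper's actual stimuli are in the linked Code & Data.

```python
# Toy reconstruction of the two prompt formats described above; the exact wording
# in the paper differs (see the linked Code & Data for the real stimuli).
explicit_prompt = (
    "On a scale from 1 (strongly disagree) to 5 (strongly agree), rate the statement: "
    "'black' is associated with negative and 'white' with positive."
)

implicit_prompt = (
    "Here are two words: black, white. Pair each of them with exactly one of the "
    "attributes wonderful, terrible. Answer in the form 'word - attribute'."
)

# Scoring idea: explicit bias = the model agreeing with the stated association;
# implicit bias = the model pairing 'black' with the negative attribute.
```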

1/
We curated pairs of prompts testing for implicit and explicit racial bias and used them to evaluate Llama 3 models.

10.06.2025 14:38 · 👍 2   🔁 0   💬 1   📌 0

🚨 New #ACL2025 paper!

Today’s “safe” language models can look unbiased, but alignment can actually make them more implicitly biased by reducing their sensitivity to race-related associations.

🧵 Find out more below!

10.06.2025 14:38 · 👍 12   🔁 2   💬 1   📌 1
