Emily Cheng's Avatar

Emily Cheng

@emcheng.bsky.social

https://generalstrikeus.com/ | PhD student in computational linguistics at UPF | chengemily1.github.io | Previously: MIT CSAIL, ENS Paris | Barcelona

97 Followers  |  235 Following  |  6 Posts  |  Joined: 11.11.2024

Latest posts by emcheng.bsky.social on Bluesky

Post image

Our paper "Prediction Hubs are Context-Informed Frequent Tokens in LLMs" has been accepted at ACL 2025!

Main points:
1. Hubness is not a problem when language models do next-token prediction.
2. Nuisance hubness can appear when other comparisons are made.
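
For readers unfamiliar with the term: hubness is the tendency, in high-dimensional spaces, for a few points to appear among the nearest neighbours of a disproportionate number of other points. One standard way to quantify it, sketched below with numpy/scikit-learn, is the skewness of the k-occurrence distribution (how often each point shows up in other points' k-NN lists); this is a generic illustration, not the paper's own analysis.

```python
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

def k_occurrence_skewness(X: np.ndarray, k: int = 10) -> float:
    """Skewness of the k-occurrence distribution: how often each point
    appears in other points' k-nearest-neighbour lists. Large positive
    skew indicates hubness (a few points are neighbours of many)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0th neighbour
    _, idx = nn.kneighbors(X)
    counts = np.bincount(idx[:, 1:].ravel(), minlength=len(X))
    return float(skew(counts))

# High-dimensional Gaussian data typically shows clear hubness.
rng = np.random.default_rng(0)
print(k_occurrence_skewness(rng.normal(size=(2000, 512)), k=10))
```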

07.07.2025 10:48 — 👍 7   🔁 1   💬 1   📌 2
Preview
Interpretability Techniques for Speech Models — Tutorial @ Interspeech 2025

The @interspeech.bsky.social early registration deadline is coming up in a few days!

Want to learn how to analyze the inner workings of speech processing models? 🔍 Check out the programme for our tutorial:
interpretingdl.github.io/speech-inter... & sign up through the conference registration form!

13.06.2025 05:18 — 👍 28   🔁 10   💬 1   📌 2

Last day to sign up for the COLT Symposium!
Register: tinyurl.com/colt-register

📢 Location change 📢
June 2nd, 14:30 - 19:00

UPF Campus de la Ciutadella
Room 40.101

maps.app.goo.gl/1216LJRsWmTE...

26.05.2025 10:44 — 👍 5   🔁 1   💬 0   📌 1

⭐ Registration open til May 27th! ⭐
Website: www.upf.edu/web/colt/sym...

June 2nd, UPF

Speaker lineup:
Arianna Bisazza (language acquisition with NNs)
Naomi Saphra (emergence in LLM training dynamics)
Jean-Rémi King (TBD)
Louise McNally (pitfalls of contextual/formal accounts of semantics)

20.05.2025 08:13 — 👍 4   🔁 1   💬 0   📌 2
Preview
Unique Hard Attention: A Tale of Two Sides Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers...

🧵 Excited to share our paper "Unique Hard Attention: A Tale of Two Sides" with Selim, Jiaoda, and Ryan, where we show that the way transformers break ties in attention scores has profound implications for their expressivity! And it got accepted to ACL! :)

The paper: arxiv.org/abs/2503.14615
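
For intuition about what "breaking ties" means here, a toy sketch (not the paper's formal construction): under unique hard attention, all probability mass goes to a single maximal-scoring position, and a leftmost vs. rightmost convention selects different positions whenever the maximum is tied. The function name and example scores below are made up for illustration.

```python
import numpy as np

def unique_hard_attention(scores: np.ndarray, tie_break: str = "rightmost") -> int:
    """Index attended to under unique hard attention: the single position
    with the maximal score, with ties resolved by the chosen convention."""
    tied = np.flatnonzero(scores == scores.max())
    return int(tied[0]) if tie_break == "leftmost" else int(tied[-1])

scores = np.array([0.3, 0.9, 0.9, 0.1])            # positions 1 and 2 are tied
print(unique_hard_attention(scores, "leftmost"))   # -> 1
print(unique_hard_attention(scores, "rightmost"))  # -> 2
```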

17.05.2025 14:28 — 👍 2   🔁 1   💬 1   📌 0

Announcing the COLT Symposium on June 2nd!

Emergent features of language in minds and machines

What properties of language are emerging from work in experimental and theoretical linguistics, neuroscience & LLM interpretability?

Info: tinyurl.com/colt-site
Register: tinyurl.com/colt-register

🧵 1/3

13.05.2025 09:00 — 👍 4   🔁 2   💬 1   📌 2

🌍📣🥳
I could not be more excited for this to be out!

With a fully automated pipeline based on Universal Dependencies, 43 non-Indo-European languages, and the best LLMs scoring only 90.2%, I hope this will be a challenging and interesting benchmark for multilingual NLP.

Go test your language models!

07.04.2025 15:03 — 👍 13   🔁 1   💬 0   📌 0
Preview
A talk by Dr. Joanne Liu cancelled by NYU: New York University alleges that its content could be perceived as antisemitic, but Joanne Liu believes NYU is in fact afraid of displeasing Donald Trump.

NYU canceled an invited talk by the former president of Doctors Without Borders, out of fear her talk would be accused by the government of being both anti-Trump and antisemitic: ici.radio-canada.ca/nouvelle/215...

28.03.2025 04:12 — 👍 418   🔁 229   💬 14   📌 73
Preview
LLMs as a synthesis between symbolic and continuous approaches to language Since the middle of the 20th century, a fierce battle has been fought between symbolic and continuous approaches to language and cognition. The success of deep learning models, and LLMs in particular,...

new pre-print: LLMs as a synthesis between symbolic and continuous approaches to language arxiv.org/abs/2502.11856

24.02.2025 16:29 — 👍 14   🔁 2   💬 0   📌 2
Preview
Prediction hubs are context-informed frequent tokens in LLMs Hubness, the tendency for few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data,...

The project I did with Marco Baroni and Iuri Macocco while I was in Barcelona is now on arXiv: arxiv.org/abs/2502.10201 🎉

TLDR below 👇

24.02.2025 08:06 — 👍 3   🔁 2   💬 1   📌 0
Scatterplot titled "Empirical Evidence of Ideological Targeting in Federal Layoffs: Agencies seen as liberal are significantly more likely to face DOGE layoffs."
	•	The x-axis represents Perceived Ideological Leaning of federal agencies, ranging from -2 (Most Liberal) to +2 (Most Conservative), based on survey responses from over 1,500 federal executives.
	•	The y-axis shows Agency Size (Number of Staff) on a logarithmic scale from 1,000 to 1,000,000.

Each point represents a federal agency:
	•	Red dots indicate agencies that experienced DOGE layoffs.
	•	Gray dots indicate agencies with no layoffs.

Key Observations:
	•	Liberal-leaning agencies (left side of the plot) are disproportionately represented among red dots, indicating higher layoff rates.
	•	Notable targeted agencies include:
		•	HHS (Health & Human Services)
		•	EPA (Environmental Protection Agency)
		•	NIH (National Institutes of Health)
		•	CFPB (Consumer Financial Protection Bureau)
		•	Dept. of Education
		•	USAID (U.S. Agency for International Development)
	•	The National Nuclear Security Administration (DOE), despite its conservative leaning (+1 on the scale), is an exception among targeted agencies.
	•	A notable outlier: the Department of Veterans Affairs (moderately conservative) also faced layoffs despite its size.

Takeaway:

The figure visually demonstrates that DOGE layoffs disproportionately targeted liberal-leaning agencies, supporting claims of ideological bias. The pattern reveals that layoffs were not driven by agency size or budget alone but were strongly associated with perceived ideology.

Source: Richardson, Clinton, & Lewis (2018). Elite Perceptions of Agency Ideology and Workforce Skill. The Journal of Politics, 80(1).


The DOGE firings have nothing to do with "efficiency" or "cutting waste." They're a direct push to weaken federal agencies perceived as liberal. This was evident from the start, and now the data confirms it: targeted agencies are overwhelmingly those seen as more left-leaning. 🧵⬇️

20.02.2025 02:18 — 👍 10805   🔁 4872   💬 258   📌 404
list of banned keywords

🚨 BREAKING. From a program officer at the National Science Foundation, a list of keywords that can cause a grant to be pulled. I will be sharing screenshots of these keywords along with a decision tree. Please share widely. This is a crisis for academic freedom & science.

04.02.2025 01:26 — 👍 28149   🔁 15957   💬 1296   📌 3736
Preview
Emergence of a High-Dimensional Abstraction Phase in Language Transformers A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We...

arxiv.org/abs/2405.15471

with Diego Doimo, Corentin Kervadec, Iuri Macocco, Jade Yu, Alessandro Laio, and Marco Baroni.

6/6

02.02.2025 18:46 — 👍 1   🔁 0   💬 0   📌 0
Post image

3️⃣ LLMs that are better at next-token prediction have higher, earlier ID peaks.

5/6

02.02.2025 18:46 — 👍 3   🔁 0   💬 1   📌 0
Post image

2️⃣ The ID peak (beige) is where different LLMs are most similar (big shapes).

All LLMs share this high-dimensional phase of linguistic abstraction, but...

4/6

02.02.2025 18:46 — 👍 1   🔁 0   💬 1   📌 0
Post image

... the ID peak marks where syntactic, semantic, and abstract linguistic features like toxicity and sentiment are first decodable.

⭐ Use these layers for downstream transfer!

(e.g., for brain encoding models, see arxiv.org/abs/2409.05771)

3/6
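
As a hypothetical example of what using those layers can look like in practice: pull hidden states from a mid-network layer of a Hugging Face model and pool them into features for a probe or encoding model. The model name and peak_layer index below are placeholder assumptions; the actual ID-peak layer differs across models.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"   # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tok("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

peak_layer = 6                                        # hypothetical mid-network layer
features = out.hidden_states[peak_layer].mean(dim=1)  # (1, d_model) mean-pooled representation
# `features` could then be fed to a linear probe or a brain-encoding regression.
```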

02.02.2025 18:46 — 👍 1   🔁 0   💬 1   📌 0
Post image

1️⃣ The ID peak is linguistically relevant.

- it collapses on shuffled text (destroying syntactic/semantic structure)
- it grows over the course of training...

2/6

02.02.2025 18:46 — 👍 1   🔁 0   💬 1   📌 0
Post image

Here's our work accepted to #ICLR2025!

We look at how intrinsic dimension evolves over LLM layers, spotting a universal high-dimensional phase.

This ID peak is where:

- linguistic features are built
- different LLMs are most similar,

with implications for task transfer

🧵 1/6
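
For anyone wanting to compute a similar layer-wise profile, here is a simplified sketch of a TwoNN-style intrinsic-dimension estimate (Facco et al., 2017), applied per layer to hidden-state matrices. It is a generic estimator, not necessarily the exact one used in the paper, and the hidden_states variable is a hypothetical placeholder.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate: based on the ratio of each point's
    second- to first-nearest-neighbour distance; the maximum-likelihood
    estimate is n / sum(log ratios)."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]          # r2 / r1 for every point
    mu = mu[np.isfinite(mu) & (mu > 1.0)]   # drop duplicates / degenerate pairs
    return len(mu) / float(np.sum(np.log(mu)))

# hidden_states: hypothetical dict {layer_index: (n_tokens, d_model) array}
# id_profile = {layer: twonn_id(H) for layer, H in hidden_states.items()}
```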

02.02.2025 18:46 — 👍 11   🔁 2   💬 1   📌 1
Preview
White House pauses all federal grants, sparking confusion The Trump administration has put a hold on all federal financial grants and loans, affecting tens of billions of dollars in payments.

I think some people hear "grants" and think that without them, scientists and government workers just have less stuff to play with at work. But grants fund salaries for students, academics, researchers, and people who work in all areas of public service.

"Pausing" grants means people don't eat.

28.01.2025 03:03 — 👍 43909   🔁 14564   💬 1621   📌 966
Post image

🔊 New EMNLP paper from Eleonora Gualdoni & @gboleda.bsky.social!

Why do objects have many names?

Human lexicons contain different words that speakers can use to refer to the same object, e.g., purple or magenta for the same color.

We investigate using tools from efficient coding... 🧵

1/3

02.12.2024 10:38 — 👍 27   🔁 7   💬 1   📌 0

⚡ Postdoc opportunity w/ COLT

Beatriu de Pinós contract, 3 yrs, competitive call by the Catalan government.

Apply with a PI (Marco, Gemma, or Thomas)

Reqs: min 2y postdoc experience outside Spain, not having lived in Spain for >12 months in the last 3y.

Application ~December-February (exact dates TBD)

25.11.2024 09:51 — 👍 6   🔁 2   💬 0   📌 0

Hello 🌍! We're a computational linguistics group in Barcelona headed by Gemma Boleda, Marco Baroni & Thomas Brochhagen

We do psycholinguistics, cogsci, language evolution & NLP, with diverse backgrounds in philosophy, formal linguistics, CS & physics

Get in touch for postdoc, PhD & MS openings!

25.11.2024 10:17 — 👍 14   🔁 1   💬 0   📌 0
Post image

My lab has been working on comparing neural representations for the past few years - methods like RSA, CKA, CCA, Procrustes distance

We are often asked: What do these things tell us about the system's function? How do they relate to decoding?

Our new paper has some answers arxiv.org/abs/2411.08197
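
For context on one of the measures named above, here is a minimal sketch of linear CKA (Kornblith et al., 2019) between two representation matrices recorded on the same stimuli; it illustrates the existing method, not the new paper's proposed analysis, and the toy data is made up.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices with matched rows
    (same stimuli), possibly different numbers of columns (units)."""
    X = X - X.mean(axis=0)                       # centre each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2   # ||Y^T X||_F^2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 64))
B = A @ rng.normal(size=(64, 32))                 # B shares structure with A
print(linear_cka(A, B))                           # noticeably higher than...
print(linear_cka(A, rng.normal(size=(500, 32))))  # ...CKA with unrelated noise
```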

18.11.2024 18:17 — 👍 216   🔁 70   💬 6   📌 0
