
Tom McCoy

@rtommccoy.bsky.social

Assistant professor at Yale Linguistics. Studying computational linguistics, cognitive science, and AI. He/him.

1,809 Followers  |  312 Following  |  88 Posts  |  Joined: 10.12.2023

Latest posts by rtommccoy.bsky.social on Bluesky

Picture of a paragraph from Emma by Jane Austen. It reads: "Such an adventure as this, a fine young man and a lovely young woman thrown together in such a way, could hardly fail of suggesting certain ideas to the coldest heart and the steadiest brain. So Emma thought, at least. Could a linguist, could a grammarian, could even a mathematician have seen what she did, have witnessed their appearance together, and heard their history of it, without feeling that circumstances had been at work to make them peculiarly interesting to each other? How much more must an imaginist, like herself, be on fire with speculation and foresight? especially with such a groundwork of anticipation as her mind had already made."


According to Jane Austen, linguists are extraordinarily cold-hearted.

(Though at least we're not as bad as mathematicians!)

03.08.2025 16:10 · 👍 7    🔁 0    💬 0    📌 0

If all goes well, there will be a paper by you on there soon!

16.07.2025 20:24 · 👍 2    🔁 0    💬 2    📌 0

So much research is being done about LLMs that it's hard to stay on top of the literature.

To help with this, I've made a list of all the most important papers from the past 8 years:
rtmccoy.com/pubs/

I hope you enjoy!

16.07.2025 16:35 · 👍 58    🔁 5    💬 2    📌 0

July the 4th be with you!

04.07.2025 14:45 · 👍 2    🔁 0    💬 0    📌 0

New paper: "Large Language Models and Emergence: A Complex Systems Perspective" (D. Krakauer, J. Krakauer, M. Mitchell).

We look at claims of "emergent capabilities" & "emergent intelligence" in LLMs from the perspective of what emergence means in complexity science.

arxiv.org/pdf/2506.11135

16.06.2025 13:15 · 👍 238    🔁 57    💬 6    📌 7

The word "laundry" contains both steps of the laundry process:
1. Undry
2. Dry

04.06.2025 19:14 · 👍 26    🔁 2    💬 1    📌 0

Happy to announce the first workshop on Pragmatic Reasoning in Language Models – PragLM @ COLM 2025! 🎉
How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach?
🌐 sites.google.com/berkeley.edu/praglm/
📅 Submit by June 23rd

28.05.2025 18:21 · 👍 40    🔁 18    💬 1    📌 4

Had a fun visit to UChicago/TTIC over the past couple of days - really great group doing NLP/CompLing there!

24.05.2025 14:44 · 👍 3    🔁 0    💬 0    📌 0

I've been excited about meta-learning lately, in part because its two levels of optimization provide a way to model evolution and development separately. (That said, existing approaches are not very evolutionarily realistic in how the outer loop of optimization is implemented.)

22.05.2025 03:50 · 👍 2    🔁 0    💬 0    📌 0
Modeling rapid language learning by distilling Bayesian priors into artificial neural networks - Nature Communications Children can learn language from very little experience, and explaining this ability has been a major challenge in cognitive science. Here, the authors combine the flexible representations of neu...

For much more discussion, see the paper!
www.nature.com/articles/s41...

15/15

20.05.2025 19:16 · 👍 7    🔁 0    💬 0    📌 0

Takeaways:
👉 Bayesian methods & neural networks can work together, and are improved by doing so!
👉 Neural networks can have strong priors - despite the common view that they are blank slates
👉 Strong inductive biases do not require strong representational constraints

14/n

20.05.2025 19:16 · 👍 6    🔁 0    💬 1    📌 0
Left: A plot showing recursion results for standard and prior-trained neural networks. The x-axis shows levels of recursion ranging from 0 to 10, and the y-axis shows accuracy. As the levels of recursion increase, the accuracy drops for both models, but it drops much more rapidly for the standard model than the prior-trained model.
Right: A plot showing priming results. There are 4 sub-plots, for 4 types of sentences: short plausible, long plausible, short implausible, and long implausible. In all 4 plots, the prior-trained network shows a greater degree of priming than the standard neural network.


More dramatically, it substantially outperforms the standard neural network at learning recursion (left) and priming (right; a lower value on the y-axis shows a greater degree of priming).

13/n

20.05.2025 19:16 · 👍 2    🔁 0    💬 1    📌 0
A plot showing perplexity values. Note that for perplexity, lower is better. A standard neural network achieves perplexity ranging from about 19.70 to 19.80, with a median around 19.75. A prior-trained neural network achieves perplexity ranging from about 19.63 to 19.74, with a median around 19.67. The best model from prior literature is indicated as having a perplexity of about 19.69.


Here, its perplexity is slightly better (i.e., lower) than that of a standard neural network.
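For readers unfamiliar with the metric: perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A minimal sketch (illustrative only, not the paper's evaluation code):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-probability per token).
    Lower is better: the model is less 'surprised' by the held-out text."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns every token probability 1/20 has perplexity 20:
uniform_20 = [math.log(1 / 20)] * 100
print(round(perplexity(uniform_20), 2))  # prints: 20.0
```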

12/n

20.05.2025 19:15 · 👍 1    🔁 0    💬 1    📌 0

Because it has the flexibility of a neural network, the prior-trained model can also learn in a setting that is intractable for the Bayesian model: Learning aspects of English syntax from millions of words of naturalistic text.

11/n

20.05.2025 19:14 · 👍 1    🔁 0    💬 1    📌 0
Plots showing results for formal languages. On the left is a line graph which has โ€œnumber of training examplesโ€ as its x-axis and โ€œF-scoreโ€ as its y-axis. Three models have lines in this plot: a Bayesian model, a standard neural network, and a prior-trained neural network. The Bayesian model and prior-trained neural network perform similarly, while the standard neural network does much worse than both of them.
On the right is a table showing the amount of training time used by each approach. The Bayesian model uses from 1 minute to 7 days of training time. The neural networks (whether standard or prior-trained) use from 10 milliseconds to 2.5 minutes.


Even though it is a neural network, the prior-trained model can learn formal languages from small numbers of examples - far outperforming a standard neural network, and matching a Bayesian model at a fraction of the computational cost.

10/n

20.05.2025 19:14 · 👍 3    🔁 0    💬 1    📌 0

We call the resulting system a *prior-trained neural network*, because it has been trained to have a particular prior.

9/n

20.05.2025 19:13 · 👍 2    🔁 0    💬 1    📌 0
Two examples of formal languages. The first example shown is AnBn, described as โ€œn copies of A followed by n copies of Bโ€, with some example strings from the formal language being AB, AABB, and AAABBB.
The second example shown is XXX, described as โ€œany string X repeated three timesโ€, with some example strings from the formal language being AAA, BABABA, and ABBABBABB


Inspired by a model from Yang & @spiantado.bsky.social, the prior that we use is a distribution over formal languages (a formal language = a set of strings defined by an abstract rule). We have a neural network meta-learn by observing many formal languages sampled from this prior.
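As a concrete illustration, membership in the two example languages from the figure can be checked in a few lines of Python (a toy sketch, not the paper's code):

```python
def in_anbn(s: str) -> bool:
    """Membership test for AnBn: n copies of 'A' followed by n copies of 'B' (n >= 1)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and n >= 1 and s == "A" * n + "B" * n

def in_xxx(s: str) -> bool:
    """Membership test for XXX: any nonempty string X repeated three times."""
    if len(s) % 3 != 0 or not s:
        return False
    k = len(s) // 3
    return s[:k] == s[k:2 * k] == s[2 * k:]

# The example strings from the figure:
assert all(in_anbn(s) for s in ["AB", "AABB", "AAABBB"])
assert all(in_xxx(s) for s in ["AAA", "BABABA", "ABBABBABB"])
```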

8/n

20.05.2025 19:13 · 👍 1    🔁 0    💬 1    📌 0

In MAML, a model is exposed to many tasks. After each task, the model's weights are adjusted so that, if it were taught the same task again, it would perform better. As MAML proceeds, the model converges to a state from which it can learn any task in the distribution.
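That inner/outer structure can be sketched on a deliberately tiny problem. Everything below is a hypothetical stand-in: the "model" is a single scalar theta, each task asks it to match a target t under loss (theta - t)^2, and the gradients are written out analytically rather than via autodiff:

```python
import random

random.seed(0)
alpha, beta = 0.1, 0.05   # inner- and outer-loop learning rates
theta = 5.0               # the meta-parameter being meta-learned

def sample_task():
    # Each task is a target t drawn from the task distribution.
    return random.uniform(-1.0, 1.0)

for _ in range(2000):
    t = sample_task()
    # Inner loop: one gradient step on this task's loss (theta - t)^2.
    theta_adapted = theta - alpha * 2 * (theta - t)
    # Outer loop: gradient of the POST-adaptation loss w.r.t. the original
    # theta (chain rule through the inner step; autodiff does this in practice).
    meta_grad = 2 * (theta_adapted - t) * (1 - 2 * alpha)
    theta -= beta * meta_grad

# theta converges near the mean target (0 here): the point from which a
# single inner step adapts well to any task in the distribution.
```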

7/n

20.05.2025 19:12 · 👍 1    🔁 0    💬 1    📌 0

The key component is meta-learning, aka “learning to learn”: a process in which a model is shown many tasks, giving it priors (inductive biases) that allow it to learn new tasks more easily. The type of meta-learning we use is MAML, from @chelseafinn.bsky.social, Abbeel, and Levine.

6/n

20.05.2025 19:08 · 👍 2    🔁 0    💬 1    📌 0
A schematic diagram of our procedure. We start with a Bayesian model, here visualized with Bayesโ€™ rule and some example grammatical rules that could be sampled from a Bayesian modelโ€™s prior. Then, we sample several tasks from that Bayesian modelโ€™s prior, which can serve as training data. Finally, we have a neural network meta-learn from these sampled tasks. The whole process is visualized, going from left to right as โ€œBayesian modelโ€™, then an arrow labeled โ€œsamplingโ€, then โ€œtraining dataโ€, then an arrow labeled โ€œmeta-learningโ€, and finally โ€œneural network.โ€


Our approach (inductive bias distillation) has 3 steps:
1. Use a Bayesian model to define an inductive bias (a prior)
2. Sample learning tasks from the Bayesian model
3. Have a neural network meta-learn from these sampled tasks, to give it the Bayesian model's prior
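The three steps can be sketched in skeletal form. The stand-ins below are toys of my choosing: the "Bayesian model" is just a prior over rules of the form "symbol c repeated n times", far simpler than the actual grammar prior, and step 3 is only stubbed:

```python
import random

random.seed(0)

# Step 1: define the prior -- here, a distribution over (symbol, length) rules.
def sample_language():
    return random.choice("AB"), random.randint(1, 5)

# Step 2: sample learning tasks from the prior; each task is a handful of
# strings drawn from one sampled language.
def sample_task():
    c, n = sample_language()
    return [c * n for _ in range(4)]

meta_training_tasks = [sample_task() for _ in range(1000)]

# Step 3: meta-learn on these tasks (e.g. with MAML) so that the network's
# initialization encodes the step-1 prior. Stubbed here:
def meta_learn(tasks):
    for task in tasks:
        pass  # inner-loop adaptation + outer-loop update would go here

meta_learn(meta_training_tasks)
```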

5/n

20.05.2025 19:07 · 👍 3    🔁 0    💬 1    📌 0

In this work with Tom Griffiths @cocoscilab.bsky.social, we propose an approach for creating a system that has the strengths of both modeling traditions - the rapid learning of a Bayesian model combined with the flexible representations of a neural network.

4/n

20.05.2025 19:06 · 👍 2    🔁 0    💬 1    📌 0
Left: A screenshot of ChatGPT describing itself as an AI language model developed by OpenAI. Right: A bar chart comparing the quantity of text seen by human children vs. GPT-3. The bar for GPT-3 is far higher than for humans, showing that neural networks get far more linguistic data than humans do.


Neural networks have flexible representations that allow them to handle noisy natural data - as evidenced by the success of large language models. However, they notoriously require huge numbers of examples.

3/n

20.05.2025 19:06 · 👍 1    🔁 0    💬 1    📌 0
Screenshot of a demo of Bayesian word learning (Xu & Tenenbaum 2007). After a few examples, the Bayesian learner figures out that naysayer means โ€œhorseโ€ (rather than being more specific โ€“ โ€œhorse number 4โ€ โ€“ or more general โ€“ โ€œmammalโ€).


Bayesian models can learn from few examples because they have strong inductive biases - factors that guide generalization. But the costs of inference and the difficulty of specifying generative models can make naturalistic data a challenge.
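The word-learning demo above turns on the "size principle": a hypothesis that covers fewer things assigns each consistent example higher probability, so specific-but-consistent hypotheses win as examples accumulate. A toy illustration (hypothetical hypothesis sizes of my choosing, not Xu & Tenenbaum's actual model):

```python
# Three nested hypotheses for what a new word might mean, with toy sizes
# (how many things each covers).
hypotheses = {
    "horse #4": 1,
    "horse": 10,
    "mammal": 1000,
}

def posterior(n_examples):
    # Uniform prior; likelihood of n consistent examples under h is (1/|h|)^n.
    scores = {h: (1 / size) ** n_examples for h, size in hypotheses.items()}
    # "horse #4" covers a single individual, so seeing several DIFFERENT
    # horses rules it out; model that by zeroing it once n > 1.
    if n_examples > 1:
        scores["horse #4"] = 0.0
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

post = posterior(3)
best = max(post, key=post.get)
print(best)  # prints: horse
```

After one example the overly specific "horse #4" still dominates, but a few examples of different horses make "horse" beat both the too-specific and too-general hypotheses.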

2/n

20.05.2025 19:05 · 👍 2    🔁 0    💬 1    📌 0
A schematic of our method. On the left are shown Bayesian inference (visualized using Bayesโ€™ rule and a portrait of the Reverend Bayes) and neural networks (visualized as a weight matrix). Then, an arrow labeled โ€œmeta-learningโ€ combines Bayesian inference and neural networks into a โ€œprior-trained neural networkโ€, described as a neural network that has the priors of a Bayesian model โ€“ visualized as the same portrait of Reverend Bayes but made out of numbers. Finally, an arrow labeled โ€œlearningโ€ goes from the prior-trained neural network to two examples of what it can learn: formal languages (visualized with a finite-state automaton) and aspects of English syntax (visualized with a parse tree for the sentence โ€œcolorless green ideas sleep furiouslyโ€).


🤖🧠 Paper out in Nature Communications! 🧠🤖

Bayesian models can learn rapidly. Neural networks can handle messy, naturalistic data. How can we combine these strengths?

Our answer: Use meta-learning to distill Bayesian priors into a neural network!

www.nature.com/articles/s41...

1/n

20.05.2025 19:04 · 👍 154    🔁 43    💬 4    📌 1

At MIT for the day to speak at the NLP seminar! Say hi if you're around and need a break from NeurIPS drafting!

14.05.2025 14:32 · 👍 3    🔁 0    💬 0    📌 0

Great to hear! I'm always happy to spread trivia about that city

08.05.2025 17:18 · 👍 0    🔁 0    💬 0    📌 0

Thank you!

08.05.2025 17:17 · 👍 1    🔁 0    💬 0    📌 0

You must be on the same wavelength as me!

08.05.2025 17:17 · 👍 0    🔁 0    💬 0    📌 0
Screenshot of the New York Times crossword page, saying "The Crossword. Wednesday, May 7, 2025. By Tom McCoy. Edited by Will Shortz."


I constructed today's NYT crossword!

This one has some personal connections, described in the Wordplay article by @samcorbin.bsky.social (contains spoilers): www.nytimes.com/2025/05/06/c...

I hope you enjoy!

07.05.2025 17:32 · 👍 21    🔁 1    💬 3    📌 0

(The NACLO solution rates are from high schoolers who competed in the contest. So it's probably skewed toward puzzle enthusiasts - the average NACLO participant might be better at these problems than the average human. But there are plenty of students who do the contest for fun, without any prep).

03.05.2025 22:28 · 👍 0    🔁 0    💬 0    📌 0
