Guy Davidson

@guydav.bsky.social

@guyd33 on the X-bird site. PhD student at NYU, broadly cognitive science x machine learning, specifically richer representations for tasks and cognitive goals. Otherwise found cooking, playing ultimate frisbee, and making hot sauces.

945 Followers  |  669 Following  |  122 Posts  |  Joined: 20.09.2023

Latest posts by guydav.bsky.social on Bluesky

Wherever good coffee is to be found, the rest of the time. Don't hesitate to reach out!

(also happy to talk about job search in industry and what that looks and feels like these days)

30.07.2025 15:47 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Preview
Goal Inference using Reward-Producing Programs in a Novel Physics Environment | Author(s): Davidson, Guy; Todd, Graham; Colas, Cédric; Chu, Junyi; Togelius, Julian; Tenenbaum, Joshua B.; Gureckis, Todd M.; Lake, Brenden | Abstract: A child invents a game, describes its rules, and ...

Saturday's poster session (P3-D-44), to talk about our goal inference work in a new physics-based environment we developed: escholarship.org/uc/item/6tb2...

30.07.2025 15:47 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Today's Minds in the Making: Design Thinking and Cognitive Science Workshop (Pacific E):

minds-making.github.io

30.07.2025 15:47 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

#CogSci2025 friends! I'm here all week and would love to chat. I'd particularly love to talk to anyone thinking about Theory of Mind and how to evaluate it better (in both minds and machines, in different settings and contexts), and about goals and their representations. Find me at:

30.07.2025 15:47 β€” πŸ‘ 7    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0

Cool new work on localizing and removing concepts using attention heads from colleagues at NYU and Meta!

08.07.2025 13:54 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

You (yes, you!) should work with Sydney! Either short-term this summer, or longer term at her nascent lab at NYU!

06.06.2025 18:15 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Fantastic new work by @johnchen6.bsky.social (with @brendenlake.bsky.social and me trying not to cause too much trouble).

We study systematic generalization in a safety setting and find LLMs struggle to consistently respond safely when we vary how we ask naive questions. More analyses in the paper!

30.05.2025 17:32 β€” πŸ‘ 10    πŸ” 3    πŸ’¬ 0    πŸ“Œ 0
Preview
Guy Davidson's academic website

Finally, if this work makes you think "I'd like to work with this person," please reach out -- I'm on the job market for industry post-PhD roles (keywords: language models, interpretability, open-endedness, user intent understanding, alignment).
See more: guydavidson.me

23.05.2025 17:38 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Preview
Do different prompting methods yield a common task representation in language models? Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar rep...

If you made it this far, thank you, and don't hesitate to reach out! 17/N=17
Paper: arxiv.org/abs/2505.12075
Code: github.com/guydav/promp...

23.05.2025 17:38 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

As with pretty much everything else I've worked on in grad school, this work would have looked different (and almost certainly worse) without the guidance of my advisors, @brendenlake.bsky.social and @toddgureckis.bsky.social. I continue to appreciate your thoughtful engagement with my work! 16/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

This work would also have been impossible without @adinawilliams.bsky.social's guidance, the freedom she gave me in picking a problem to study, and her belief that I could tackle it despite this being my first foray into (mechanistic) interpretability work. 15/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We owe a great deal of gratitude to @ericwtodd.bsky.social, not only for open-sourcing their code, but also for answering our numerous questions over the last few months. If you find this interesting, you should also read their paper introducing function vectors. 14/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

See the paper for a description of the methods, the many different controls we ran, our discussion and limitations, examples of our instructions and baselines, and other odd findings (applying an FV twice can be beneficial! Some attention heads have negative causal effects!). 13/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 5 bonus: Which post-training steps facilitate this? Using the OLMo-2 model family, we find that the SFT and DPO stages each bring a jump in performance, but the final RLVR step doesn't make a difference for the ability to extract instruction FVs. 12/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 5: We can steer base models with instruction FVs extracted from their post-trained versions. We didn't expect this to work! It's less effective for the smaller, distilled Llama-3.2 models. We're also excited to dig into this and see where we can push it. 11/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 4: The relationship between demonstrations and instructions is asymmetrical. Especially in post-trained models, the top attention heads for instructions appear at least peripherally useful for demonstrations, more so than the reverse (see paper for details). 10/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We (preliminarily) interpret this as evidence that the effect of post-training is _not_ in adapting the model to represent instructions with the mechanism used for demonstrations, but in developing a mostly complementary mechanism. We're excited to dig into this further. 9/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 3 bonus: examining activations in the shared attention heads, we see (a) generally increased similarity with increasing model depth, and (b) no difference in similarity between base and post-trained models (circles and squares). 8/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
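If you want to poke at this pattern yourself, here is a rough, hedged sketch of the layer-by-layer comparison. Caveat: it uses plain residual-stream activations rather than the shared attention heads analyzed in the paper, and the model name and prompts are placeholders, not the paper's setup.

    # Compare last-token activations for the same task prompted via demonstrations
    # vs. instructions, layer by layer. Simplified stand-in for the paper's analysis.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    demo_prompt = "hot -> cold\nbig -> small\nfast ->"                    # demonstrations
    instr_prompt = "Answer with the antonym of the given word.\nfast ->"  # instruction

    def last_token_states(prompt):
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
        # out.hidden_states is a tuple of (num_layers + 1) tensors, each (1, seq, hidden)
        return [h[0, -1] for h in out.hidden_states]

    with torch.no_grad():
        demo_states = last_token_states(demo_prompt)
        instr_states = last_token_states(instr_prompt)

    # Cosine similarity between the two prompt forms' task context, at each layer.
    for layer, (d, i) in enumerate(zip(demo_states, instr_states)):
        print(f"layer {layer:2d}: cos = {F.cosine_similarity(d, i, dim=0).item():.3f}")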
Post image

Finding 3: The FV procedure identifies different attention heads for demonstrations than for instructions => different mechanisms are involved in creating task representations from different prompt forms. We also see consistent base/post-trained model differences. 7/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 2: Demonstration and instruction FVs help when applied to a model together (again, with the caveat of the 3.1-8B base model) => they carry (at least some) different information => these different forms elicit non-identical task representations (at least, as FVs). 6/N

23.05.2025 17:38 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Finding 1: Instruction FVs increase zero-shot task accuracy (even if not as much as demonstration FVs increase accuracy in a shuffled 10-shot evaluation). The 3.1-8B base model trails the rest; we think it has to do with sensitivity to the chosen FV intervention depth. 5/N

23.05.2025 17:38 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

TL;DR: We successfully extend FVs to ICL instruction prompts and extract instruction function vectors that raise zero-shot task accuracy. We offer evidence that they carry different information from demonstration FVs and are represented by mostly different attention heads. 4/N

23.05.2025 17:38 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
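For readers new to function vectors, here is a minimal sketch of the recipe this thread builds on, heavily simplified: the actual procedure of Todd et al. averages the outputs of a small set of causally selected attention heads, whereas this toy version just averages the residual stream at one layer over a few ICL prompts and re-injects that mean vector on a zero-shot query. The model name, layer index, and prompts below are placeholder assumptions, not the paper's setup.

    # Toy illustration of the function-vector (FV) idea, NOT the paper's actual procedure.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; the paper studies Llama and OLMo-2 families
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    LAYER = 6  # intervention depth; the thread notes sensitivity to this choice

    icl_prompts = [  # toy antonym demonstrations standing in for a real ICL dataset
        "hot -> cold\nbig -> small\nfast ->",
        "up -> down\nwet -> dry\nlight ->",
    ]

    # 1) Collect the last-token activation at LAYER for each demonstration prompt.
    acts = []
    def grab(_module, _inputs, output):
        acts.append(output[0][:, -1, :].detach())  # block output is a tuple; [0] is hidden states

    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        for p in icl_prompts:
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    task_vector = torch.stack(acts).mean(dim=0).squeeze(0)  # crude stand-in for an FV

    # 2) Add the vector back at the same layer while answering a zero-shot query
    #    (in this sketch it is added at every generated position, for simplicity).
    def inject(_module, _inputs, output):
        hidden = output[0]
        hidden[:, -1, :] += task_vector
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(inject)
    with torch.no_grad():
        out = model.generate(**tok("slow ->", return_tensors="pt"),
                             max_new_tokens=3, pad_token_id=tok.eos_token_id)
    handle.remove()
    print(tok.decode(out[0]))

The interesting questions in the paper start where this sketch stops: which attention heads the real FV procedure selects for demonstrations vs. instructions, and at what depth the intervention works best.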
Post image

We were inspired by @davidbau.bsky.social's talk at NYU last fall, in which he discussed the function vector work led by @ericwtodd.bsky.social. They show how to extract task representations (= FVs) from ICL demonstrations. Could we extend FVs to instructions? What would we learn? 3/N

23.05.2025 17:38 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I've been interested in goal representations (in cogsci) for most of my PhD. When I started visiting Meta FAIR to work with @adinawilliams.bsky.social in the fall, I wanted to study a similar question in language models (and, as a side quest, try my hand at interpretability work). 2/N

23.05.2025 17:38 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

New preprint alert! We often prompt ICL tasks using either demonstrations or instructions. How much does the form of the prompt matter to the task representation formed by a language model? Stick around to find out! 1/N

23.05.2025 17:38 β€” πŸ‘ 44    πŸ” 7    πŸ’¬ 1    πŸ“Œ 2

All of the above?! But the mech interp one is probably most relevant to my day-to-day thoughts, while the AI safety one to my wandering mind's occasional deeper concerns.

28.03.2025 17:50 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Alternatively, could it be that time and growth blunt some of the oddity? Not for everyone, and not all of it, but I think the mid-20s offer a lot of growth and change.

24.03.2025 01:06 β€” πŸ‘ 6    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Exciting! Congrats, Anna and co!

16.03.2025 01:15 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Congrats, looks great!

11.03.2025 00:41 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Another banger from Jenn, Felix, and Tomer that jumps right to the top of my reading list.

06.03.2025 18:03 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
