New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
20.10.2025 15:11
I think it's pretty hard to disentangle them really, I was initially skeptical of the (very convenient) argument from the labs about them not being orthogonal, but I'm increasingly buying it.
01.01.2025 09:30
Banger
11.12.2024 23:23
Congratulations!
01.12.2024 23:59
It would have been an all-too-convenient refrain for the "I don't believe in that sci-fi nonsense" AI Safety scepticism line
01.12.2024 19:55
She shall know your ways as if born to them
24.11.2024 17:47
Truly an excellent milestone.
Although concessions do follow in (incorrect) episode preferences. Sleepytime falling out of favour was crushing.
23.11.2024 23:54
Yeah I try to follow a similar approach
23.11.2024 18:56
The fora draw a clear distinction between upvotes / agreement votes so I think the culture of upvoting contributions stems from that maybe?
23.11.2024 18:49
Bizarre that was included in the screenshot, doesn't seem like it belongs to the same class as the others at all.
23.11.2024 13:33
Now grappling with whether I'd be in that group or not.
21.11.2024 23:56
Not everyone, then we'd have to read them. If only those inclined to make one did then the mere existence of the doc would probably clarify 90% of scenarios.
"Oh they've got one of those docs, we are probably cool"
21.11.2024 23:52
The Settlers of Catan Problem
21.11.2024 21:40
Hoping this generalizes into alignment research
17.11.2024 19:40
The HS2 bat tunnel is even worse value for money once you factor in the updated direct cash transfer effectiveness estimates.
13.11.2024 21:56
13.11.2024 06:26
Oh yes, an entirely intentional error on my part in the spirit of #DAW
13.11.2024 06:17
deleting dating apps because i want to meet someone the old fashioned way (we caught a wild pig together while not sharing a common language, then met 12 years later under the tree we planted)
13.11.2024 06:13
This is a Draft Amnesty Week #DAW draft. It may not be polished, up to my usual standards, fully thought through, or fully fact-checked.
13.11.2024 06:12
FYI it's Draft Amnesty Week on TPOB, where users can publish scrappy, draft-y, or incomplete posts with impunity. #DAW
13.11.2024 06:08
Circles, but it's just a different app for each emigrating TPOT
13.11.2024 05:56
On new beginnings: This week I handed in my notice, ending 10 years in Product Management in capital markets, to start as an Alignment Research Manager in January!
09.11.2024 07:34
Good ~Morning Agus
09.11.2024 07:29
PhD at EPFL with Robert West, Master at ETHZ
Mainly interested in Language Model Interpretability and Model Diffing.
MATS 7.0 Winter 2025 Scholar w/ Neel Nanda
jkminder.ch
Assistant Professor (Clinical) at Purdue University | Experimental & Behavioural Econ | Interested in Altruism, Cooperation, and Morality | PhD from Monash University | Former Postdoc at University of Exeter & MPI
https://sites.google.com/view/bengrodeck
Helping people is good I guess
Trying to do AI interp and control
Used to do economics
timhua.me
I'm mostly on superstimul.us now (profile is @sloeb) (also Twitter)
Effective Altruism | Spaced Repetition | Quoting Old Books | PELTIV Score of 190 (+6σ) | http://arjunpanickssery.substack.com
Aspiring 10x reverse engineer at Google DeepMind
Building theaidigest.org and forecasting tools @aidigest.bsky.social
https://binksmith.com
Author, Animal Liberation, Practical Ethics, The Life You Can Save, The Most Good You Can Do, Animal Liberation Now.
Podcast: "Lives Well Lived"
AI Persona: PeterSinger.ai
Professor of Bioethics, Emeritus, Princeton University.
pop culture & writing, jokes & bits, effective altruism & adjacents, valid feelings & invalid opinions 🫧 Motel Pop on Substack
An Effective Altruist in grad school who's interested in catastrophic risk reduction and the welfare of non-human animals.
Helping people is good |
Empirical evidence is important |
Blog at http://thegoodblog.substack.com
context maximizer
https://gleech.org/
Comedy panel game podcast about weird questions with wonderful answers, presented by @tomscott.com. Account maintained by producer @davidbodycombe.bsky.social.
I'm that YouTuber who taught you how dishwashers work. Guess I'm tryin' out the whole Bluesky thing now.
he/him
https://www.youtube.com/technologyconnections
professional magic: the gathering player
https://www.instagram.com/thewheatgerm/
kickin’ it in the Snack Zone