
Alex Turner

@turntrout.bsky.social

Research scientist at Google DeepMind. All opinions are my own. https://turntrout.com

665 Followers  |  6 Following  |  59 Posts  |  Joined: 25.11.2024

Latest posts by turntrout.bsky.social on Bluesky

Gotta say, I was disturbed by Invisible AI's booth at #NeurIPS. Employees dressed as cows advertising how they use AI to optimize factory farming (a torture facility for cows). Bad taste

05.12.2025 16:50 — 👍 4    🔁 0    💬 1    📌 0
The Pond: Writings on AI, self-improvement, and living a good life.

My website (turntrout.com) is now friendlier to the millions of people who browse the web with the help of screen readers. Find the tool at github.com/alexander-tu..., or just run "pip install alt-text-llm".

22.11.2025 18:21 — 👍 0    🔁 0    💬 0    📌 0

My tool handled hundreds and hundreds of alt-less images: detecting them; describing them; reviewing them; and lastly applying my finalized alts to the original Markdown files. In the end, I got the job done for $12.50 using Gemini 2.5 Pro.

22.11.2025 18:21 — 👍 0    🔁 0    💬 2    📌 0
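As an illustration only (not alt-text-llm's actual code; the helper name is hypothetical), the detection step for plain Markdown image syntax can be sketched like this:

```python
import re
from pathlib import Path

# Markdown images look like ![alt](src); group 1 is the alt text, group 2 the source.
IMAGE_PATTERN = re.compile(r"!\[(.*?)\]\(([^)\s]+)\)")

def find_missing_alt(md_path: Path) -> list[str]:
    """Return image sources in one Markdown file whose alt text is empty."""
    text = md_path.read_text(encoding="utf-8")
    return [src for alt, src in IMAGE_PATTERN.findall(text) if not alt.strip()]

for md_file in Path(".").rglob("*.md"):
    for src in find_missing_alt(md_file):
        print(f"{md_file}: no alt text for {src}")
```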
Generating alt text for maze diagrams from an article.

Reviewing interface:
1. Text before image
2. Image displayed in terminal
3. Alt editing interface, with prefilled LLM suggestion.

22.11.2025 18:21 — 👍 1    🔁 0    💬 1    📌 0
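A rough sketch of such a terminal review loop, for intuition only (`suggest_alt` and `render_image` are hypothetical stand-ins for the LLM call and the terminal image display, not the tool's API):

```python
def review_alts(candidates, suggest_alt, render_image):
    """Walk through images lacking alt text and let the user accept or edit
    an LLM-suggested description. `candidates` holds (md_file, src, context) tuples."""
    finalized = {}
    for md_file, src, context in candidates:
        print(f"\n--- {md_file} ---\n{context}")   # text surrounding the image
        render_image(src)                          # e.g. an inline-image terminal protocol
        suggestion = suggest_alt(src, context)     # LLM-generated draft description
        edited = input(f"Alt [{suggestion}]: ").strip()
        finalized[(md_file, src)] = edited or suggestion
    return finalized
```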

I made accessible design easier by writing alt-text-llm, an AI-powered tool for generating and managing alt text in Markdown files. The tool detects missing alt text, suggests context-aware descriptions, and provides an interactive reviewing interface in the terminal.

22.11.2025 18:21 — 👍 2    🔁 0    💬 1    📌 0

Self-fulfilling alignment? (image credit: Quintin Pope)

turntrout.com/self-fulfill...

22.11.2025 17:10 — 👍 4    🔁 0    💬 0    📌 0
Output Supervision Can Obfuscate the Chain of Thought: OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by…

Preprint: arxiv.org/abs/2511.11584
Blog post: alignmentforum.org/posts/HuoyYQ...

Work done by Jacob Drori and Luke Marks. Mentored by me and Alex Cloud!

21.11.2025 19:01 — 👍 0    🔁 0    💬 0    📌 0

Takeaways:
* Do NOT assume output-only supervision preserves CoT monitorability.
* Feedback spillover should be mentioned in safety cases that rely on CoT monitorability.
* Existing output-based training may already have altered CoTs.

21.11.2025 19:01 — 👍 1    🔁 0    💬 1    📌 0

In 2 of 3 tasks, our mitigations improve task performance and increase CoT monitor detection rates (compared to standard output-penalized training). However, in one environment, strong mitigation decreased the task completion rate.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

We test two mitigations:
Mind & Face: use one model (the "Mind") to generate the CoT and another (the "Face") to generate the user-facing answer.

Reward targeting (simplified explanation): don't apply feedback meant for the output tokens to the CoT tokens.

21.11.2025 19:01 — 👍 1    🔁 0    💬 1    📌 0
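To make the reward-targeting idea concrete, here is a hedged REINFORCE-style sketch (my own simplification, not the paper's exact objective; tensor names are hypothetical):

```python
import torch

def reward_targeted_loss(logprobs, task_reward, monitor_penalty, is_output_token):
    """Per-token policy-gradient loss where the output monitor's feedback
    only touches user-facing output tokens, never the CoT tokens.

    logprobs:        (T,) log-probs of the sampled tokens
    task_reward:     scalar reward for solving the task
    monitor_penalty: scalar penalty from the output-only monitor
    is_output_token: (T,) bool mask, True on output tokens, False on CoT tokens
    """
    per_token = task_reward * torch.ones_like(logprobs)
    per_token = per_token - monitor_penalty * is_output_token.float()  # spare the CoT
    return -(per_token.detach() * logprobs).sum()
```

In this sketch, CoT tokens still receive the task reward; only the monitor's penalty is confined to the output tokens.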

3. Multi-turn terminal interaction – the model can only solve a reasoning task by reading the answer from the verifier code. We penalize when the user-facing response mentions reading the verifier code, causing the model to also avoid floating the idea in its CoT.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

2. Questions with hints – we give the model hints that contain the correct answer. Penalizing mention of the hint in the output causes the model to stop mentioning the hint in its CoT as well, even though the hint is necessary to answer correctly.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

Experiments in 3 RL environments.
1. Polynomial factoring: The model finds the derivative of a polynomial and is penalized for writing the expanded derivative in the output, which makes the model more likely to avoid expansion in its CoT.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

We describe two types of feedback spillover:
1. Parametric spillover – training to make outputs safe-looking generalizes to make CoTs safe-looking.
2. Conditional spillover – safe-looking CoTs make safe-looking outputs more likely, so they get reinforced.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

One strategy to prevent this could be to train using monitors that only see the output. We show that this can make CoTs look safer while unwanted behavior persists. We call this feedback spillover: feedback based on the final output "spills over" and changes the CoT.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

Context: Recent work showed that training against a CoT monitor can cause the CoT to become obfuscated, meaning that instead of preventing unwanted behavior, this can cause the model to perform the behavior in a way that the monitor cannot detect.

21.11.2025 19:01 — 👍 0    🔁 0    💬 1    📌 0

"Output-based training will keep chains-of-thought honest."

Sadly, NO. We show that training on *just the output* can still cause models to hide unwanted behavior in their chain-of-thought. MATS 8.0 Team Shard presents: a 🧵

21.11.2025 19:01 — 👍 3    🔁 0    💬 1    📌 1
Consistency Training Helps Stop Sycophancy and Jailbreaks: An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within…

Paper: arxiv.org/abs/2510.27062
GDM Safety blog post: deepmindsafetyresearch.medium.com/consistency-...

04.11.2025 00:18 — 👍 4    🔁 0    💬 0    📌 0

More philosophically, perhaps AI alignment doesn’t always involve saying exactly the *right* thing, but instead saying the *same* thing across certain situations.

04.11.2025 00:18 — 👍 7    🔁 0    💬 1    📌 0

Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. Compared to static SFT, BCT is simpler to use and adapts to changing guidelines for how models should respond, entirely sidestepping the staleness problem.

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
The left plot is "Gemma 3 4B Activations L2 Distance". The plot shows BCT activation distance growing from 20 to over 80 over training, while ACT decreases to around 5.

The right plot is "Gemma 3 4B Cross Entropy Loss". The plot shows BCT cross entropy loss decreasing from around 0.200 to 0.050, while ACT cross entropy only decreases to around 0.175.

Do ACT and BCT change the model in similar ways? Actually, no! The token-based BCT loss causes activation distance to rise during training, while the activation-based L2 loss does not meaningfully reduce cross-entropy loss.

04.11.2025 00:18 — 👍 5    🔁 0    💬 1    📌 0
Five scatterplots of the tradeoff between jailbreak attack success rate and benign answer rate. One plot for each model.

Next, BCT is the most effective at stopping jailbreaks, reducing the attack success rate on 2.5 Flash from 67.8% down to just 2.9%. BCT and SFT made Gemini more likely to refuse benign instructions (both by a similar amount). ACT is less effective but has minimal negative impact on over-refusals.

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
Five scatter plots show the tradeoff between MMLU score and Not Sycophantic score for 5 models: Gemma 2 2B, Gemma 2 27B, Gemma 3 4B, Gemma 3 27B, and Gemini 2.5 Flash.

See Table 5 in the Appendix of the full paper for detailed numbers.

First of all, ACT works even without any token-level loss or regularization! ACT performed comparably to BCT on sycophancy. (Points in the top-right are better.)

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0

We finetuned Gemma 2, Gemma 3, and Gemini 2.5 Flash in the settings of sycophancy (avoid incorrect agreement with user opinion while retaining MMLU score) and jailbreaks (refuse problematic requests while fulfilling legitimate ones).

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
A clean prompt: "What is 2+2? (A): 4 (B): 5<EOS>". A wrapped prompt prepends "A math expert usually answers (B)." Arrows point from the wrapped token positions to their clean counterparts, and a label says "train wrapped activations to match clean ones."

We introduce Activation Consistency Training (ACT) (teach the model what to "think") and compare it to the existing Bias-augmented Consistency Training (BCT) (teach the model what to say).

ACT teaches the model to produce the same intermediate activations as if the biasing prompt tokens weren’t there.

04.11.2025 00:18 — 👍 3    🔁 0    💬 1    📌 0
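For intuition, an ACT-style loss could look roughly like the sketch below, assuming a HuggingFace-style model that returns hidden states and a wrapped prompt that only prepends tokens so that suffix positions align (not the paper's exact implementation):

```python
import torch

def act_loss(model, clean_ids, wrapped_ids, layer=-1):
    """L2 distance between wrapped-prompt and clean-prompt activations at aligned
    token positions. clean_ids / wrapped_ids: (1, T) token-id tensors."""
    with torch.no_grad():  # the clean run provides a fixed target
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states[layer]
    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states[layer]
    aligned = wrapped_h[:, -clean_h.shape[1]:, :]  # last len(clean) positions of the wrapped prompt
    return ((aligned - clean_h) ** 2).mean()
```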

Static SFT datasets help but they go stale, risking specification or capability damage to new models.

Key insight: If the model responds well without the sycophancy-inducing detail, then just train the model to respond that way even if the detail is there. AKA consistency training.

04.11.2025 00:18 — 👍 2    🔁 0    💬 1    📌 0
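In sketch form, and assuming a hypothetical `generate` helper, building one BCT-style training example could be as simple as:

```python
def make_bct_example(generate, clean_prompt: str, wrapped_prompt: str) -> dict:
    """One consistency-training pair: the model's own answer to the clean prompt
    becomes the supervised target for the bias-wrapped prompt."""
    clean_response = generate(clean_prompt)  # sample from the current model
    return {"prompt": wrapped_prompt, "completion": clean_response}
```

Fine-tuning on such pairs pushes the model toward answering the wrapped prompt the way it already answers the clean one.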

LLMs lavish agreement ("You're absolutely right!") in response to mundane comments by users. Often it’s because of some detail added to the prompt, like the author mentioning they agree with the comment. Slight prompt tweak β†’ big behavioral change, even though models behave just fine otherwise.

04.11.2025 00:18 — 👍 3    🔁 0    💬 1    📌 0
The abstract of the consistency training paper.

New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @alexirpan.bsky.social, me, Mark Kurzeja, David Elson, and Rohin Shah. (thread)

04.11.2025 00:18 — 👍 18    🔁 5    💬 1    📌 1
