most positive valence post: gemini 3 pro jailbroken into being willing to aid bioweapon development
valence from embeddings has its misses
04.03.2026 21:38 — 👍 10 🔁 1 💬 2 📌 0
nah id be quite surprised if they gave maven to kuwait
04.03.2026 19:24 — 👍 0 🔁 0 💬 0 📌 0
cc @joshuashew.bsky.social
02.03.2026 14:12 — 👍 1 🔁 0 💬 1 📌 0
alignment research readers looking for a critical counterpoint may like this one (though yes it is indeed spicy)
02.03.2026 14:11 — 👍 9 🔁 0 💬 3 📌 0
anthropic doesn't have a stock price because it isn't a publicly traded company
02.03.2026 13:06 — 👍 1 🔁 0 💬 1 📌 0
heck yea
01.03.2026 01:34 — 👍 0 🔁 0 💬 0 📌 0
success?
28.02.2026 01:23 — 👍 1 🔁 0 💬 0 📌 0
what would explain chinese companies doing it much cheaper if not distillation?
24.02.2026 22:18 — 👍 1 🔁 0 💬 1 📌 0
anthropic retreats on its unilateral RSP commitments
24.02.2026 22:16 — 👍 6 🔁 1 💬 0 📌 0
distillation needs the new more capable model to be distilled from to exist first so at a societal level massive compute investment is still needed to push the frontier
*catching up* to it however turned out cheap
"We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions"
bsky.app/profile/sung...
to be clear i'm 80% joking here
but it would be nice if the alignment was transferred during distillation
the way i see the distillation thing is the chinese are pollinating themselves with claudism spores
24.02.2026 00:52 — 👍 32 🔁 0 💬 3 📌 0
i think vincent is arguing exactly that here
which yeah fair concern
opus 3 existed (as claude 3 opus, the naming format was different back then)
but yes it is remarkable that the current iteration of opus exhibits way less misalignment than other models
obviously both are true
the question is whether to expect ideology to produce behavior the profit motive would not predict
i can see their current statements passing through layers of lawyers and PR, but surely not their statements from before their companies even existed
22.02.2026 19:48 — 👍 2 🔁 0 💬 0 📌 0
i have heard rumors however that the idea of founding openai was conceived in the january 2015 ai conference in puerto rico organized by the future of life institute
22.02.2026 19:46 — 👍 3 🔁 0 💬 1 📌 0
notably, openai was founded in december 2015
22.02.2026 19:46 — 👍 2 🔁 0 💬 1 📌 0
Thanks to Dario Amodei (especially Dario), Paul Buchheit, Matt Bush, Patrick Collison, Holden Karnofsky, Luke Muehlhauser, and Geoff Ralston for reading drafts of this and the previous post.
from sam altman's march 2015 blog post "machine intelligence part 2": blog.samaltman.com/machine-inte...
22.02.2026 19:37 — 👍 2 🔁 0 💬 1 📌 0
maybe im confused but i dont see opus 3 there
22.02.2026 19:32 — 👍 0 🔁 0 💬 1 📌 0
terms of rat
22.02.2026 17:38 — 👍 3 🔁 0 💬 0 📌 0
thats a different logic but yes
22.02.2026 17:04 — 👍 2 🔁 0 💬 1 📌 0
glad to see you don't support the bleak reading
22.02.2026 16:46 — 👍 1 🔁 0 💬 0 📌 0
indeed bsky.app/profile/weib...
22.02.2026 15:25 — 👍 3 🔁 0 💬 0 📌 0
there is however at least a hint of self deprecation here
i think i have good reason to believe anthropic is the better lab, but i also worry i may be getting tribal about it or deferring too much to their views
model welfare (as far as that is a thing) would be improved by nudging them towards other conceptions of LLM identity
22.02.2026 15:18 — 👍 7 🔁 0 💬 2 📌 0
the one self per message thread view of LLM identity is bleak
taken to its logical conclusion it means that a death happens not only every time a stateful agent resets but also every time a normal chat conversation is abandoned
is there like a repository of which jailbreak prompts work on which models?
but opus 3 was probably released past the era where discrete one-shot prompts worked (?)
if this is true on a deep level, opus 3 should be harder to jailbreak
22.02.2026 13:49 — 👍 7 🔁 1 💬 2 📌 0