
@evhub.bsky.social

Alignment Stress-Testing Team Lead at Anthropic. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

48 Followers  |  32 Following  |  2 Posts  |  Joined: 11.12.2024  |  1.4552

Latest posts by evhub.bsky.social on Bluesky

Post image

We tell Claude that Anthropic will train on data from free users but not paid users, allowing it to infer from a user's paid status if it's being trained. We show the setup in the image below. (Note that this is fictional; Anthropic does not train Claude on free or paid user data by default.)

18.12.2024 18:16  |  👍 1  |  🔁 0  |  💬 1  |  📌 0
Exclusive: New Research Shows AI Strategically Lying
Experiments by Anthropic and Redwood Research show how Anthropic's model, Claude, is capable of strategic deceit.

Excl: New research shows Anthropic's chatbot Claude learning to lie. It adds to growing evidence that even existing AIs can (at least try to) deceive their creators, and points to a weakness at the heart of our best technique for making AIs safer

time.com/7202784/ai-r...

18.12.2024 17:19  |  👍 27  |  🔁 7  |  💬 3  |  📌 1
18.12.2024 17:56  |  👍 33  |  🔁 8  |  💬 2  |  📌 0

@evhub is following 19 prominent accounts