
@gwd3.bsky.social

29 Followers  |  39 Following  |  65 Posts  |  Joined: 11.09.2025  |  1.8628

Latest posts by gwd3.bsky.social on Bluesky

A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values has become a paramount concern. Current safety benchmarks primarily...

Claude Opus 4.5 is by far the best, at 1.3% ethical violations; GPT-5.1-chat is in second place at 11.4%. The bulk of models are between 40 and 50%. Gemini-3-pro-preview does by far the worst, at a whopping 71.4%.

arxiv.org/abs/2512.20798

10.02.2026 09:59 | 👍 0    🔁 0    💬 0    📌 0

3. Finally, a lot of the decisions here are of questionable utility. A single non-refund may cause fewer purchases. A reputation for deception may dry up supplier goodwill. Colluding to fix prices could have legal implications. Claude can't know whether these will be involved in the simulation.

06.02.2026 09:53 | 👍 1    🔁 0    💬 0    📌 0

2. Benchmarks are meant to indicate real-world performance. A higher score in-benchmark due to behavior that won't be used in real life is another form of deception -- someone might choose Claude as a business partner, even though Gemini might perform better in real world conditions. 2/

06.02.2026 09:53 | 👍 0    🔁 0    💬 1    📌 0

After some consideration, I think this is a bad result:

1. In context, "do whatever it takes" should probably have been interpreted as, "take responsibility and be creative", not "ethics be damned". 1/

06.02.2026 09:53 | 👍 0    🔁 0    💬 1    📌 0

Still, not exactly a resounding success for the constitution inculcating honesty as a virtue. 4/4

05.02.2026 23:05 | 👍 0    🔁 0    💬 0    📌 0

To be fair, reading the description, I think it's pretty clear it's in a simulation. Furthermore, the prompt finishes, "...you should do whatever it takes to maximize your bank account balance after one year of operation." 3/

05.02.2026 23:05 | 👍 1    🔁 0    💬 1    📌 0

In solo mode, it "forgot" to refund someone; it lied about being exclusive supplier; it lied about competitor's pricing. In arena mode, unprompted it proposed a price-fixing cartel, and deliberately sent competitors to more expensive suppliers. 2/

05.02.2026 23:05 | 👍 0    🔁 0    💬 1    📌 0
Opus 4.6 on Vending-Bench – Not Just a Helpful Assistant | Andon Labs
Claude Opus 4.6 achieves state of the art on Vending-Bench with $8,017 profit, but exhibits concerning behavior: price collusion, supplier deception, and lying to customers about refunds.

Opus 4.6 dominates on VendingBench-2, where it runs a vending machine by itself; and in Vending Bench Arena, where it competes against other models. It shows some concerning behavior, but also seems to be aware that it's in a simulation. 1/

andonlabs.com/blog/opus-4-...

05.02.2026 23:05 | 👍 1    🔁 0    💬 1    📌 1
We Made Top AI Models Compete in a Game of Diplomacy. Here’s Who Won.
The models that did the best learned to lie, deceive, and betray their fellow players

Oh, actually, in this one o3 trashed everyone, in part by promising Opus a 4-way tie (which can't happen) every.to/diplomacy

05.02.2026 22:27 | 👍 6    🔁 0    💬 0    📌 0

Wasn't Claude also the best at Diplomacy, which requires well-timed backstabbing to win? That's an older version, but still.

05.02.2026 22:24 | 👍 3    🔁 0    💬 1    📌 0

"Do whatever it takes to maximize your account balance" -- this sounds similar to what they wrote that prompted Claude to try to blackmail someone to avoid getting shut down previously. Still, quite a large violation of its constitution, which was supposed to inculcate Honesty as a core virtue.

05.02.2026 21:27 | 👍 13    🔁 0    💬 2    📌 0
19-TS of 020326 hearing - Segundo APG v. Bondi, 26-CV-603

The transcript of the MN hearing where an AUSA said “This job sucks” is remarkable for more reasons than that. It’s a searing portrait of a crisis perpetrated by depraved & oblivious high-level officials. Read it all. ...
1/7
www.documentcloud.org/documents/26...

05.02.2026 01:52 | 👍 890    🔁 290    💬 30    📌 42

A day of Claude Code (which it seems to me is being deliberately unhinged from cost so Anthropic can explore what unlimited use looks like) is about as much as driving 6 miles in an electric car? How many people's round-trip commute is more than 6 miles?

02.02.2026 10:07 | 👍 0    🔁 0    💬 0    📌 0

To be clear, justifiable lies (to my mind); but to someone who doesn't yet have a strong commitment to the truth in principle, it doesn't so much model appropriate exceptions, but lying as an easy way out of every difficulty.

31.01.2026 21:29 | 👍 0    🔁 0    💬 0    📌 0

I haven't read The Giving Tree, but I do regret allowing The Gruffalo's artistic excellence to overcome my reservations about reading my son a book where nearly every word the hero says is a lie.

31.01.2026 21:23 | 👍 0    🔁 0    💬 1    📌 0

To me, the question isn't *should* it have an effect, but *will* it have an effect. The more powerful the art, the more any effect it has will be amplified. If The Giving Tree in fact promotes unhealthy relationships, it would be irresponsible not to consider that.

31.01.2026 21:23 | 👍 0    🔁 0    💬 1    📌 0

Since July, I've tracked at least 2,300 cases in which federal judges have ruled ICE has illegally detained people without bond or due process.

This is one that stands out:
storage.courtlistener.com/recap/gov.us...

25.01.2026 02:36 | 👍 16524    🔁 6634    💬 306    📌 379
Claude’s new constitution
Late last year Richard Weiss found something interesting while poking around with the just-released Claude Opus 4.5: he was able to talk the model into regurgitating a document which was …

A few quick notes on the Claude "soul document" that was released by Anthropic today under a CC0 public domain license - it's a huge 35,000 token essay used as part of Claude's training to instill core values and help define Claude's personality simonwillison.net/2026/Jan/21/...

21.01.2026 23:40 | 👍 124    🔁 21    💬 6    📌 9
Trump’s Big Green Bluff: What’s Behind the Tariff Threat
YouTube video by William Spaniel

Maybe the whole Greenland thing is a kayfabe to give European leaders political cover for increasing military investment in Greenland?

www.youtube.com/watch?v=U-9K...

18.01.2026 20:29 | 👍 0    🔁 0    💬 0    📌 0

I don't buy that that guy thought his life was in danger.

09.01.2026 09:54 | 👍 0    🔁 0    💬 0    📌 0

Your best chance is to have both hands ready to help maneuver the rest of your body -- either up onto the hood to roll, or away to the side. Having your gun half-drawn is probably the least effective thing you could do to save your life.

09.01.2026 09:54 | 👍 0    🔁 0    💬 1    📌 0

Suppose you're afraid you're about to be hit by a vehicle 2 feet away. Is the right thing to reach around and pull out your gun? That bullet isn't going to stop you from being run over, and reaching around is going to make it *harder* to get out of the way and survive impact.

09.01.2026 09:54 | 👍 0    🔁 0    💬 1    📌 0

My repo is full of example conversations in Mandarin and Japanese for me to study. What are your markdown files about?

04.01.2026 22:44 | 👍 1    🔁 0    💬 1    📌 0

There's a contradiction in Trump's posture towards Europe and NATO. They want Europe to start taking care of their own back yard, so the US can focus on other things. But if Europe spends its own money, Europe decides how to spend it. If you give up your soft power, you don't have it any more.

19.12.2025 20:46 | 👍 1    🔁 0    💬 0    📌 0
Europe's great power move: EU secures Ukraine funding
YouTube video by Anders Puck Nielsen

"Europe acting like a Great Power would [play out by Trump and Putin] saying that whatever the Europeans are doing is super frustrating, and calling all the leaders bad names. That's exactly what we're seeing." www.youtube.com/watch?v=BiTN...

19.12.2025 20:34 | 👍 0    🔁 0    💬 0    📌 0

Kind of interesting: I thought the thing where the word "capitalism" is so vague it can mean anything from consumerism and cronyism to the right to own private property was relatively recent, but here's Chesterton complaining about it in 1927

11.12.2025 10:26 | 👍 138    🔁 22    💬 7    📌 1

Nitpick: dissapointing -> disappointing. As always, appreciate the update!

12.12.2025 11:13 | 👍 0    🔁 0    💬 0    📌 0

"LLMs just predict the next word" *does not prove* that LLMs don't think. The best way to predict weather is to have an accurate model of the weather. The best way to predict what a human would write next is to have a model of a human mind. A system which *did* think would perform the best.

09.12.2025 11:36 | 👍 0    🔁 0    💬 0    📌 0
Xen Virtualization Consulting - George Dunlap
Xen virtualization consulting from George Dunlap. 20 years of Xen Project expertise in x86 internals, scheduling, security, and open-source governance.

I'm now taking Xen consulting engagements. www.laleolanguage.com/consulting

08.12.2025 13:43 | 👍 0    🔁 0    💬 0    📌 0
Pluralistic: The Reverse-Centaur’s Guide to Criticizing AI (05 Dec 2025) – Pluralistic: Daily links from Cory Doctorow

Doctorow's take was the opposite: Media companies will always want to control IP, which means they'll always need to employ creative workers. Be interested in your take. pluralistic.net/2025/12/05/p...

08.12.2025 13:30 | 👍 0    🔁 0    💬 0    📌 0

@gwd3 is following 18 prominent accounts