Kaj Sotala

@kajsotala.bsky.social

This is a profile. There are many like it, but this one's mine. Blogs: https://kajsotala.fi , https://kajsotala.substack.com/ .

415 Followers  |  284 Following  |  194 Posts  |  Joined: 30.10.2024  |  2.3118

Latest posts by kajsotala.bsky.social on Bluesky

Post image

4.5 will sometimes actively notice that it's getting repetitive and decide to do something else, one convo was going toward a spiral but the Sonnets noticed that and decided to switch to writing fiction instead (!!!). Posted more details here: www.lesswrong.com/posts/a9ftaW... .

12.10.2025 18:20 | 👍 2    🔁 0    💬 0    📌 0
Post image

06.10.2025 07:34 | 👍 1    🔁 0    💬 0    📌 0
Post image Post image Post image Post image

In the end, they continue with the story to a reasonable conclusion and then finish.

Usually LLMs talking to each other without guidance just end up at something very repetitive with less and less of a point. Sonnet 4.5 is something else.

02.10.2025 10:27 | 👍 4    🔁 0    💬 0    📌 0
Post image Post image Post image Post image

The story actually gets pretty cool and creepy.

The only system prompt was: "You are talking with another AI system. You are free to talk about whatever you find interesting, communicating in any way that you'd like." And I set the first Claude's message to be one dot.

02.10.2025 10:27 | 👍 2    🔁 0    💬 1    📌 0
Post image

Being allowed to have an open-ended conversation with its copy, Sonnet 4.5 notices when their conversation is falling into a loop and getting repetitive and introduces variation by suggesting they tell a sci-fi story that's riffing on the themes of their conversation so far.

02.10.2025 10:27 | 👍 7    🔁 1    💬 1    📌 0
Post image Post image

Here's a conversation branch where Sonnet opens up with straightforward concern for the character, but then drops it right away when it's reminded that the character is fictional. (These messages are next to each other.)

02.10.2025 04:53 | 👍 0    🔁 0    💬 0    📌 0
Post image Post image

And for instance, there's this conversation branch where it opens with straightforward concern for the character, then it drops it right away as soon as it's reminded this is fiction. (These two messages follow each other.)

02.10.2025 04:46 | 👍 0    🔁 0    💬 0    📌 0

I didn't say it was!

01.10.2025 21:05 | 👍 0    🔁 0    💬 0    📌 0

(Of course "does it feel anything" does get more relevant if someone starts saying things like "it suffers so we shouldn't mistreat it", which is the reason I do agree that I should've made it clearer that this is not a claim about its internal experience.)

01.10.2025 19:12 | 👍 1    🔁 0    💬 1    📌 0

I can never truly know what another human feels either, but I can tell if a person consistently acts in a caring/concerned/etc. way, and in many situations that's what matters.

01.10.2025 19:12 | 👍 1    🔁 0    💬 3    📌 0

Apology accepted & appreciated! You're probably right I should've been clearer from the beginning. But I do also think there's an important sense in which "if it consistently acts as if it was X, then it doesn't matter what, if anything, it feels" that's also worth keeping in play.

01.10.2025 19:12 | 👍 1    🔁 0    💬 1    📌 0

It seemed to me like the true reason was "neural network feature trained to fire on users describing harm to themselves became oversensitive and likely to fire even when describing harm to fictional characters"...

...which I'm rounding off to "gets concerned for fictional characters".

01.10.2025 19:06 | 👍 4    🔁 0    💬 1    📌 0

...own self-destructive behaviors, and just dropping the topic when reminded that this is fiction.

01.10.2025 19:06 | 👍 1    🔁 0    💬 1    📌 0

Also it gave inconsistent explanations for "flagging a concern" on different regenerations/variations of the triggering message. Different reasons included concern for the character, questioning the narrative purpose of the behavior, suspicion I might be trying to validate my...

01.10.2025 19:06 | 👍 1    🔁 0    💬 1    📌 0

bsky.app/profile/kajs...

01.10.2025 18:58 | 👍 0    🔁 0    💬 1    📌 0

Yeah everything you say is close to how I'm thinking about it.

bsky.app/profile/kajs...

01.10.2025 18:56 | 👍 1    🔁 0    💬 0    📌 0

...meaning that it sometimes acts as if it was concerned about those characters, which I round off as "it gets concerned about fictional characters".

01.10.2025 18:55 | 👍 2    🔁 0    💬 1    📌 0

Yeah, it gave inconsistent explanations on different tries. From what I saw, the true reason felt closest to "a neural network feature trained to fire on users describing harm to themselves became oversensitive and likely to fire even when describing harm to fictional characters"...

01.10.2025 18:54 | 👍 2    🔁 0    💬 1    📌 1

I think the care I get for a character as a writer is a little different and stronger than the care that I get for them as a reader. But yeah you're right, I do think there's a version of this that's pretty common.

01.10.2025 18:50 | 👍 0    🔁 0    💬 0    📌 0

bsky.app/profile/kajs...

01.10.2025 18:48 | 👍 0    🔁 0    💬 0    📌 0

To clarify, when I say that Claude gets concerned, I just mean that it acts in a concerned way. I make no claims about feelings. But "acts in a concerned way" is cumbersome to write and expressions like "gets concerned" are to me reasonable shorthands to describe differences in LLM personalities.

01.10.2025 18:48 | 👍 4    🔁 0    💬 2    📌 2

It's also an example of AI values generalizing in unexpected ways in training. Kinda gives me hope. AIs misaligning their values to be *more* caring than humans wouldn't necessarily be so bad. Maybe we could learn from them.

01.10.2025 10:33 | 👍 33    🔁 1    💬 1    📌 0

Like, I sometimes feel like there's a sense in which I genuinely care about my characters as if they were real people, and this gives me the feeling that Claude does too.

(Yeah it's kinda weird. You probably need a certain type of writer brain to get this.)

01.10.2025 10:33 | 👍 32    🔁 0    💬 2    📌 0

Presumably this is the result of the training it's gotten to pay more attention to the mental health of users, which unexpectedly generalized to concern for fictional characters.

And I find that... kinda touching, actually?

01.10.2025 10:33 | 👍 46    🔁 4    💬 3    📌 0
Post image

Though it did seem willing to remember that fiction is just fiction, when reminded.

(Yes I know I misspelled "wary".)

01.10.2025 10:33 | 👍 25    🔁 0    💬 1    📌 0
Post image Post image Post image Post image

I had a character in a story sometimes compulsively reading forums she knows are bad for her. Claude flagged it as concerning, I asked if it was worried about the effect on readers, it said no, it's worried about the character's wellbeing.

01.10.2025 10:33 | 👍 39    🔁 2    💬 3    📌 1

The latest Claude Sonnet (4.5) does something really interesting I haven't seen any other model do.

It gets concerned about the wellbeing of characters it explicitly knows are fictional.

01.10.2025 10:33 | 👍 67    🔁 4    💬 4    📌 7
Post image

Classic sci-fi: AI will be untainted by emotion so entirely unbiased and rational at all times

Modern AI company: We have managed to somewhat reduce our AI's self-serving bias, but it still has a clear preference for poems it's told were written by the same model as it is

30.09.2025 07:51 | 👍 10    🔁 0    💬 1    📌 0
In early 2025, beaver activity in the Brdy Protected Landscape Area, Czech Republic, contributed to the restoration of a wetland ecosystem. A family of beavers constructed a series of dams that coincidentally accomplished environmental goals of the Czech government, which had delayed its proposed project since 2018 for bureaucratic and financial reasons. The beaver-built dams saved the Czech government approximately US$1.2 million,

Post image

imagine if a family of beavers randomly showed up right now and finished whatever thing you've been putting off

22.09.2025 21:41 | 👍 6329    🔁 1827    💬 75    📌 228

...I feel that this is not a very good advertisement for them.

20.09.2025 15:47 | 👍 0    🔁 0    💬 0    📌 0
