Simon Lermen's Avatar

Simon Lermen

@simonlermen.bsky.social

I work on AI safety and AI in cybersecurity

136 Followers  |  330 Following  |  25 Posts  |  Joined: 19.11.2024

Latest posts by simonlermen.bsky.social on Bluesky

Our paper on AI-powered spear phishing, co-authored with @fredheiding.bsky.social, has been accepted at the ICML 2025 Workshop on Reliable and Responsible Foundation Models!
openreview.net/pdf?id=f0uFp...

04.07.2025 22:49 — 👍 1    🔁 1    💬 0    📌 0

Do you think there is anything in China comparable to AI Twitter or Bluesky, where people discuss ideas?

20.04.2025 10:51 — 👍 1    🔁 0    💬 1    📌 0

Are you working at DeepSeek?

20.04.2025 10:48 — 👍 1    🔁 0    💬 0    📌 0

Why so mean old man

28.02.2025 16:02 — 👍 0    🔁 0    💬 1    📌 0

Grok's DeepSearch was launched with zero safety features; you can ask it about assassinations and drugs. It has been online for a few days now with no changes.

25.02.2025 13:38 — 👍 2    🔁 0    💬 0    📌 0

I'm mostly interested in not dying

23.01.2025 13:19 — 👍 2    🔁 0    💬 0    📌 0

If you are trying to understand its reasoning, having a legible chain of thought seems like a necessary step.

22.01.2025 22:41 — 👍 2    🔁 0    💬 1    📌 0
Preview
OpenAI's Economic Blueprint: The Blueprint outlines policy proposals for how the US can maximize AI's benefits, bolster national security, and drive economic growth

You should be careful here: huge datacenters with their own power structures are being discussed, along with huge new semiconductor facilities. The situation might change.
openai.com/global-affai...

15.01.2025 15:39 — 👍 0    🔁 0    💬 1    📌 0

To be fair, the pre-training and all those mega datacenters do have a significant environmental impact, and buying products from AI labs does fund this. But I agree that individual energy use per reply is one of the weakest arguments against AI.

15.01.2025 10:41 — 👍 1    🔁 0    💬 1    📌 0
Preview
Human study on AI spear phishing campaigns — LessWrong. TL;DR: We ran a human subject study on whether language models can successfully spear-phish people. We use AI agents built from GPT-4o and Claude 3.5…

I published a human study with @fredheiding.bsky.social
We use AI agents built from GPT-4o and Claude 3.5 Sonnet to search the web for available information on a target and use it to write highly personalized phishing messages. We achieved click-through rates above 50%.
www.lesswrong.com/posts/GCHyDK...

04.01.2025 13:48 — 👍 4    🔁 1    💬 0    📌 0

Has anyone ever tried adding a principle to constitutional AI along the lines of "always show your entire reasoning"? What happens if you ask the model whether it left out steps in its reasoning? Can it verbalize them?

02.01.2025 23:33 — 👍 0    🔁 0    💬 1    📌 0

They achieve this in part by releasing models immediately after training, such as o3, while other companies wait for safety and security evaluations and estimates of societal impact. They also used to delay releases, as with GPT-4.

28.12.2024 10:57 — 👍 1    🔁 0    💬 1    📌 0

Sometimes fancy terms just serve to confuse people.

27.12.2024 10:00 — 👍 3    🔁 0    💬 0    📌 0

They have already made billions in revenue, but defining the threshold in terms of profits makes it almost impossible to reach.

27.12.2024 09:54 — 👍 0    🔁 1    💬 0    📌 0

Crazy that they use profits instead of revenue. They can always game this by spending a bit more on R&D.

27.12.2024 09:53 — 👍 0    🔁 0    💬 1    📌 0

My guess is he is thinking of some sort of conscious experience of wanting here...

27.12.2024 08:19 — 👍 1    🔁 0    💬 0    📌 0

Its behavior is as if it wants to win, and the same will be true of powerful AI agents. Whether it actually wants something in a way that satisfies you doesn't matter.

26.12.2024 23:13 — 👍 0    🔁 0    💬 0    📌 0

So RL-training the model to achieve some goal, as with constitutional AI, can't lead to the model having a goal? Do you think AlphaZero wants to win at chess?

26.12.2024 19:25 — 👍 0    🔁 0    💬 1    📌 0

💯

22.12.2024 11:35 — 👍 1    🔁 0    💬 0    📌 0

Well, we observe computation in superposition

16.12.2024 21:26 — 👍 0    🔁 0    💬 1    📌 0

I agree that it doesn't PROVE multiverses. But I don't like the sneering tone. What is superposition? It sure seems like the electron is in many places at once, and all interpretations of that seem a bit crazy. Everett's many-worlds is a common position among physicists, including some I know.

16.12.2024 21:15 — 👍 0    🔁 0    💬 2    📌 0

The many-worlds interpretation is a view held by many physicists. And it is not as if other interpretations are less "weird".

16.12.2024 13:47 — 👍 0    🔁 0    💬 2    📌 0

I don't understand why we don't have more conferences in countries with easy visa policies

14.12.2024 16:33 — 👍 1    🔁 0    💬 1    📌 0
Preview
Current safety training techniques do not fully transfer to the agent setting — LessWrong. TL;DR: We are presenting three recent papers which all share a similar finding, i.e. the safety training techniques for chat models don't transfer we…

I'll be at the SafeGenAI workshop on Sunday presenting research I did on safety in AI agents.
I will talk about results from these two blog posts:
www.lesswrong.com/posts/ZoFxTq...
And:
www.lesswrong.com/posts/Lgq2Dc...

13.12.2024 18:56 — 👍 4    🔁 0    💬 0    📌 0

I'm very bullish on automated research engineering soon, but even I was surprised that AI agents are twice as good as humans with 5+ years of experience or from a top AGI or safety lab at doing tasks in 2 hours. Paper: metr.org/AI_R_D_Evalu...

22.11.2024 22:21 — 👍 8    🔁 1    💬 1    📌 0