Daniel Paleka's Avatar

Daniel Paleka

@dpaleka.bsky.social

ai safety researcher | phd ETH Zurich | https://danielpaleka.com

198 Followers  |  47 Following  |  54 Posts  |  Joined: 19.11.2024  |  1.5546

Latest posts by dpaleka.bsky.social on Bluesky

how did they build claude code without claude code?

27.01.2026 17:59 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
Preview
Pitfalls in Evaluating Language Model Forecasters Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a...

We don't claim LLM forecasting is impossible, but argue for more careful evaluation methods to confidently measure these capabilities.

Details, examples, and more issues in the paper! (7/7)
arxiv.org/abs/2506.00723

05.06.2025 17:08 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
Post image

Benchmarks can reward strategic gambling over calibrated forecasting when optimizing for ranking performance.

"Bet everything" on one scenario beats careful probability estimation for maximizing the chance of ranking #1 on the leaderboard. (6/7)

05.06.2025 17:08 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

Model knowledge cutoffs are guidelines about reliability, not guarantees of no information thereafter. GPT-4o, when nudged, can reveal knowledge beyond its stated Oct 2023 cutoff. (5/7)

05.06.2025 17:08 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image Post image

Date-restricted search leaks future knowledge. Searching pre-2019 articles about โ€œWuhanโ€ returns results abnormally biased towards the Wuhan Institute of Virology โ€” an association that only emerged later. (4/7)

05.06.2025 17:08 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

The time traveler problem: When forecasting "Will civil war break out in Sudan by 2030?", you can deduce the answer is "yes" - otherwise they couldn't grade you yet.

We find that backtesting in existing papers often has similar logical issues that leak information about answers. (3/7)

05.06.2025 17:08 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

Forecasting evaluation is tricky. The gold standard is asking about future events; but that takes months/years.

Instead, researchers use "backtesting": questions where we can evaluate predictions now, but the model has no information about the outcome ... or so we think (2/7)

05.06.2025 17:08 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations.

We identify key issues with forecasting evaluations ๐Ÿงต (1/7)

05.06.2025 17:08 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

why is it that whenever i see survivorship bias on my timeline it already has the red-dotted plane in the replies?

26.05.2025 15:07 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

OpenAI and DeepMind should have entries at Eurovision too

17.05.2025 14:16 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

3.7 sonnet: *hands behind back* yes the tests do pass. why do you ask. what did you hear

4o: yes you are Jesus Christ's brother. now go. Nanjing awaits

o3: Listen, sorry, I owe you a straight explanation. This was once revealed to me in a dream

30.04.2025 22:10 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

Of course, we don't have the old chatgpt-4o API endpoint, so we can't see whether the prompt is fully at fault or there was also a model update.

30.04.2025 15:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

The sycophancy effect on controversial binary options is much smaller than what you would assume from the overall positive vibe towards the user. On most such statements, models don't actually state they agree with the user.

30.04.2025 15:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Preview
Contrastive statements sycophancy eval Contrastive statements sycophancy eval. GitHub Gist: instantly share code, notes, and snippets.

System prompts and pairs of statements:
gist.github.com/dpaleka/7b4...

30.04.2025 15:16 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

Quick sycophancy eval: comparing the two recent OpenAI ChatGPT system prompts, it is clear last week's prompt moves other models towards sycophancy too, while the current prompt makes them more disagreeable.

30.04.2025 15:15 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

i was today years old when i realized the grammatical plural of anecdote is anecdotes, not anecdata. i dislike this finding

30.04.2025 14:45 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

we are so lucky that pathogens, as opposed to political and religious memes, do not organize coalitions of hosts against non-hosts as an instrumental objective

29.04.2025 06:45 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

lmao

09.04.2025 19:32 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

oh that's cool. it would be interesting to draw a matrix of how well the various models are aware of models other than themselves, in the sense they consider them as coherent entities similar to their own self-perception

09.04.2025 19:29 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

fixed games such as blackjack you cannot optimize too much because rules don't change. meanwhile, a casino gets unlimited iteration on slot machines and the reward signal is as good as it gets

31.03.2025 11:50 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

are slot machines and the like so profitable because simplistic gambling is inherently very addictive, or because there has been a legible financial incentive for an entire industry to spend decades optimizing them to be addictive as possible?

31.03.2025 11:50 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

TIL the concept of *epistemic hell*. standard Joseph Henrich example: in the ancestral environment, hygienic and food prep rituals determine survival, but no hunter-gatherer can possibly explain why. hence genetic selection for accepting of religious rituals and against reasoning

23.03.2025 14:23 โ€” ๐Ÿ‘ 2    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

Why do meeting transcription apps (Fireflies, Granola) require Google Workspace accounts?

13.03.2025 21:43 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
Post image

what are you doing Claude i thought we were friends

17.01.2025 07:12 โ€” ๐Ÿ‘ 2    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

the rate of people's familiarity with Scaling Scaling Laws with Board Games over time is starting to look like the plot from Scaling Scaling Laws with Board Games

16.01.2025 21:40 โ€” ๐Ÿ‘ 2    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

go do something that can fail

12.01.2025 20:34 โ€” ๐Ÿ‘ 3    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

Paper: arxiv.org/abs/2412.18544

Joint work with @abhimanyupasu, Alejandro, @vin_bhat Adam, Evan, @florian_tramer! (11/11)

11.01.2025 01:53 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

Long-term vision:
(1) Arbitraging away inconsistency in forecasts is a straightforward upgrade of an AI forecaster;
(2) Interactive consistency checks could detect when AIs are making unreasonable predictions about the future. (10/11)

11.01.2025 01:53 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
Post image

Test-time compute based on arbitrage can make forecasts more consistent; this improves specific logical rules such as Negation, but doesn't generalize to the consistency rules we do not optimize over. (9/11)

11.01.2025 01:53 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

Comparing and aggregating inconsistencies over logical rules is nontrivial.

We develop two metric frameworks:
(1) *arbitrage*: How much the forecaster would lose on a prediction market?
(2) *frequentist* : What is the z-score if forecasts are consistent but noisy? (8/11)

11.01.2025 01:53 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

@dpaleka is following 20 prominent accounts