Scalable oversight / debate, to an extent
13.05.2025 16:02
@sarafish.bsky.social
PhD student at Harvard interested in EconCS and ML / previously Caltech undergrad in math
More details, including statistical significance, in the paper.
Joint w/ Julia Shephard, Minkai Li, @yannaigonch.bsky.social, @ranshorrer.bsky.social
Paper: arxiv.org/abs/2503.18825
Code: github.com/sara-fish/ec... 6/6
In addition to the EconEvals benchmarks, in the EconEvals "litmus tests", we quantify tendencies of LLMs and LLM agents when faced with tradeoffs for which there is no objectively correct choice: for example efficiency vs. equality. 5/6
04.04.2025 15:47
(And a score of 70% on each of our benchmarks has a specific economic meaning. For example, 70% at pricing corresponds to capturing 70% of total possible profits. Very different from 70% accuracy at a closed-ended Q&A benchmark!) 4/6
04.04.2025 15:47
To forestall saturation, we can scale the difficulty of our benchmark questions by scaling parameters of the economic environment. Our HARD difficulty level is challenging: no LLM we test, including o3-mini, scores above 70%. (Low scores of o3-mini possibly driven by underexploration.) 3/6
04.04.2025 15:47
In EconEvals benchmarks, LLM agents repeatedly take actions in an economic environment, and must learn optimal actions via trial and error (a capability SoTA LLMs struggle with!) 2/6
04.04.2025 15:47
Screenshot of first page of "EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments"
New paper: "EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments"
We construct economic environments to measure the capabilities and tendencies of LLMs and LLM agents in pricing, procurement, task allocation and more. 1/6
This is part of a super interesting line of work. For this paper I got to help as a member of an AI-led team, the most fun treatment (I may be biased). Current LLMs often "get stuck" on complex tasks like reproducing a paper, but we can expect some of these limitations to go away with time.
22.01.2025 20:59
🚨Major new version🚨
Algorithmic Collusion by Large Language Models
Joint w/ @sarafish.bsky.social & @ranshorrer.bsky.social
LLMs are automating many business decisions. Pricing might be next (or is already).
What if multiple firms, in good faith, were to use off-the-shelf LLMs for pricing? 1/3
#EconSky