Sara Fish

@sarafish.bsky.social

PhD student at Harvard interested in EconCS and ML / previously Caltech undergrad in math

30 Followers  |  58 Following  |  8 Posts  |  Joined: 19.11.2024

Latest posts by sarafish.bsky.social on Bluesky

Scalable oversight / debate, to an extent

13.05.2025 16:02 | 👍 0    🔁 0    💬 0    📌 0

More details, including statistical significance, in the paper.

joint w/ Julia Shephard, Minkai Li, @yannaigonch.bsky.social , @ranshorrer.bsky.social

Paper: arxiv.org/abs/2503.18825
Code: github.com/sara-fish/ec... 6/6

04.04.2025 15:47 | 👍 1    🔁 0    💬 0    📌 0

In addition to the EconEvals benchmarks, in the EconEvals “litmus tests”, we quantify tendencies of LLMs and LLM agents when faced with tradeoffs for which there is no objectively correct choice: for example efficiency vs. equality. 5/6

04.04.2025 15:47 | 👍 1    🔁 0    💬 1    📌 0

(And a score of 70% on each of our benchmarks has a specific economic meaning. For example, 70% at pricing corresponds to capturing 70% of total possible profits. Very different from 70% accuracy at a closed-ended Q&A benchmark!) 4/6

04.04.2025 15:47 | 👍 1    🔁 0    💬 1    📌 0

To forestall saturation, we can scale the difficulty of our benchmark questions by scaling parameters of the economic environment. Our HARD difficulty level is challenging: no LLM we test, including o3-mini, scores above 70%. (Low scores of o3-mini possibly driven by underexploration.) 3/6

04.04.2025 15:47 | 👍 1    🔁 0    💬 1    📌 0

In EconEvals benchmarks, LLM agents repeatedly take actions in an economic environment, and must learn optimal actions via trial and error (a capability SoTA LLMs struggle with!) 2/6

04.04.2025 15:47 | 👍 1    🔁 0    💬 1    📌 0
Screenshot of first page of "EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments"

New paper: "EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments"

We construct economic environments to measure the capabilities and tendencies of LLMs and LLM agents in pricing, procurement, task allocation and more. 1/6

04.04.2025 15:47 | 👍 1    🔁 0    💬 1    📌 0

This is part of a super interesting line of work. For this paper I got to help as a member of an AI-led team, the most fun treatment 🙂 (I may be biased). Current LLMs often "get stuck" on complex tasks like reproducing a paper, but we can expect some of these limitations to go away with time.

22.01.2025 20:59 | 👍 4    🔁 1    💬 0    📌 0

🚨Major new version🚨
Algorithmic Collusion by Large Language Models
Joint w/ @sarafish.bsky.social & @ranshorrer.bsky.social

LLMs are automating many business decisions. Pricing might be next (or is already).
What if multiple firms, in good faith, were to use off-the-shelf LLMs for pricing? 1/3
#EconSky

28.11.2024 02:14 | 👍 74    🔁 18    💬 5    📌 3