Lakshya A Agrawal, et al.: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning https://arxiv.org/abs/2507.19457 https://arxiv.org/pdf/2507.19457 https://arxiv.org/html/2507.19457
28.07.2025 06:30
@lakshyaaagrawal.bsky.social
PhD @ucberkeleyofficial.bsky.social | Past: AI4Code Research Fellow @msftresearch.bsky.social | Summer @EPFL Scholar, CS and Applied Maths @IIITDelhi | Hobbyist Saxophonist | https://lakshyaaagrawal.github.io | Maintainer of https://aka.ms/multilspy
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, ...
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
https://arxiv.org/abs/2507.19457
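For intuition, here is a minimal, hypothetical sketch of a reflective prompt-evolution loop in the spirit of GEPA: sample a rollout, let an LLM reflect on the execution trace in natural language, propose a mutated prompt, and keep candidates via Pareto-style selection across tasks. The `llm` and `evaluate` functions are stand-ins, not the paper's actual API.

```python
# A minimal, hypothetical sketch of reflective prompt evolution in the
# spirit of GEPA. `llm` and `evaluate` are stand-ins, NOT the paper's API.
import random

def llm(prompt: str) -> str:
    """Stand-in for any chat-completion call."""
    raise NotImplementedError

def evaluate(candidate: str, task) -> tuple[float, str]:
    """Stand-in: run `task` with prompt `candidate`; return (score, trace)."""
    raise NotImplementedError

def reflective_evolution(seed_prompt: str, tasks: list, iterations: int = 20) -> str:
    pool = [seed_prompt]
    scores = {seed_prompt: [evaluate(seed_prompt, t)[0] for t in tasks]}
    for _ in range(iterations):
        parent = random.choice(pool)
        task = random.choice(tasks)
        score, trace = evaluate(parent, task)
        # Reflection step: the LLM reads the execution trace in natural
        # language, diagnoses failures, and proposes a mutated prompt.
        child = llm(
            "Below is a prompt, one execution trace, and its score.\n"
            f"Prompt: {parent}\nTrace: {trace}\nScore: {score}\n"
            "Diagnose what went wrong and rewrite the prompt to fix it."
        )
        scores[child] = [evaluate(child, t)[0] for t in tasks]
        pool.append(child)
        # Pareto-style selection: keep any candidate that is best on at
        # least one task, so diverse "lessons" survive in the pool.
        best = [max(scores[p][i] for p in pool) for i in range(len(tasks))]
        pool = [p for p in pool
                if any(scores[p][i] >= best[i] for i in range(len(tasks)))]
    return max(pool, key=lambda p: sum(scores[p]))
```

The key contrast with RL is that the learning signal here is natural-language reflection on traces rather than a scalar reward gradient.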
Super excited for our CogSci paper on the dynamics of conversation led by @helen-schmidt.bsky.social and @clairebergey.bsky.social !
21.06.2025 22:15
Hot off the presses @natcomms.nature.com! We created a custom #Minecraft environment to study a long-standing puzzle in cognitive science:
How do humans flexibly adapt their individual and social learning strategies in dynamic, realistic situations? Check it out: www.nature.com/articles/s41...
this is great, and a better reflection of how companies are actually going to be using and testing these tools
03.03.2025 20:02
Yes there's an evals crisis, but evaluating *models* is not even the right question most of the time
LangProBe from Shangyin Tan, @lakshyaaagrawal.bsky.social, Arnav Singhvi, Liheng Lai, @michaelryan207.bsky.social et al. begins to ask what complete *AI systems* we should build & under what settings
13/13: Work done with amazing collaborators: Shangyin Tan, Arnav Singhvi, Liheng Lai, Michael Ryan, Dan Klein, Omar Khattab, Koushik Sen and Matei Zaharia from @ucberkeleyofficial.bsky.social Sky, Berkeley NLP, @stanfordnlp.bsky.social and Databricks
Paper: arxiv.org/abs/2502.20315
12/13: The code and evaluation data for LangProBe will be open-sourced, providing much-needed infrastructure and a benchmark for end-to-end testing of new prompt optimizers and language program architectures. We look forward to community contributions of new tasks, language programs, and optimizers!
03.03.2025 18:58
11/13: Further, LangProBe analysis shows that for now, human judgment or iterative development around which compositions to pursue is still necessary for best performance - there's no universal "set it and forget it" strategy that works across all tasks and models, yet!
03.03.2025 18:58
10/13: LangProBe demonstrates that the future of AI systems isn't just about bigger models, but smarter composition. By carefully designing language programs and optimization strategies, we can build more capable and cost-effective systems.
03.03.2025 18:58
9/13: Among optimizers, MIPROv2, which constructs instructions and few-shot examples and explores their cross-module combinations through Bayesian search, performed best on avg.
But bootstrapping few-shot examples with random search and RuleInfer remain highly competitive!
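Both MIPROv2 and bootstrapped few-shot with random search ship in the DSPy library. A minimal sketch of compiling a one-module program with MIPROv2 follows; the model, metric, and tiny trainset are illustrative assumptions, not LangProBe's experimental setup.

```python
# Minimal DSPy sketch: compiling a one-module program with MIPROv2.
# Model, metric, and trainset here are illustrative assumptions.
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A one-module "language program": chain-of-thought question answering.
program = dspy.ChainOfThought("question -> answer")

def exact_match(example, prediction, trace=None):
    # 1.0 when the predicted answer matches the gold answer.
    return float(example.answer.strip().lower() == prediction.answer.strip().lower())

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more labeled examples
]

# MIPROv2 proposes candidate instructions and few-shot demos per module,
# then searches over their combinations with Bayesian optimization.
optimizer = MIPROv2(metric=exact_match, auto="light")
optimized = optimizer.compile(program, trainset=trainset)
print(optimized(question="What is 3 + 5?").answer)
```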
8/13: We also introduce RuleInfer, a new program-level prompt optimizer that induces rules from bootstrapped examples. RuleInfer offers particularly strong performance in tasks with clear, discrete constraints such as classification.
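RuleInfer's implementation is in the paper; as a rough, hypothetical sketch of the rule-induction idea (distill bootstrapped worked examples into explicit rules, then prepend them to the task prompt), with `llm` standing in for any LLM call:

```python
# Hypothetical sketch of rule induction from bootstrapped examples,
# in the spirit of RuleInfer as described above; NOT the paper's code.
def induce_rules(llm, bootstrapped: list[tuple[str, str]]) -> str:
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in bootstrapped)
    return llm(
        "Study these solved examples and state, as a numbered list, the "
        f"general rules to follow on similar tasks:\n\n{shots}"
    )

def answer_with_rules(llm, rules: str, new_input: str) -> str:
    # The induced rules become explicit constraints in the prompt.
    return llm(f"Follow these rules:\n{rules}\n\nInput: {new_input}\nOutput:")
```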
03.03.2025 18:58
7/13: LangProBe's analyses reveal empirically that different program architectures shine in different contexts. Modular programs are essential for tasks requiring external information or tools. RAG and multi-hop retrieval excel at tasks needing long-tail world knowledge.
03.03.2025 18:58
6/13: Further, in almost all tasks, both optimized and unoptimized language programs significantly outperform raw model predictions, even irrespective of costs:
03.03.2025 18:58
5/13: For example, gpt-4o-mini with optimized language programs achieved 11.68% higher scores than baseline gpt-4o at just 50% of the cost, and outperformed gpt-4o with programs at just 10% of the cost! This has huge implications for building cost-effective AI systems.
03.03.2025 18:58
4/13: We find that optimized language programs offer strong cost-quality improvements over raw model calls, though the best system compositions still need thoughtful design.
Smaller LMs within an optimized program can often outperform larger LMs at a fraction of the cost.
3/13: LangProBe evaluates 15+ datasets across diverse categories: coding tasks, math reasoning, classification, QA, and agent benchmarks. It implements 10+ program architectures from simple LM calls to complex modular systems with multiple reasoning and retrieval steps.
03.03.2025 18:58
2/13: LLMs are no longer standalone tools. They are used as part of language programs: modular systems composing multiple LLM calls with external tools, RAG, and inference/agentic techniques to solve complex tasks. But very few evals study their e2e cost-performance tradeoffs!
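For concreteness, here is a minimal illustrative language program in DSPy: a two-hop retrieve-then-answer pipeline. The `search` retriever is a placeholder assumption, not a LangProBe component.

```python
# Illustrative "language program": a two-hop retrieve-then-answer pipeline
# in DSPy. The `search` retriever below is a placeholder assumption.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

def search(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever; swap in a real search or RAG backend."""
    return [f"(retrieved passage for: {query})"] * k

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen_query = dspy.ChainOfThought("question, context -> search_query")
        self.answer = dspy.ChainOfThought("question, context -> answer")

    def forward(self, question: str):
        context: list[str] = []
        for _ in range(2):  # two retrieval hops
            hop = self.gen_query(question=question, context=context)
            context += search(hop.search_query)
        return self.answer(question=question, context=context)

print(MultiHopQA()(question="Which lab did the advisor of X train in?").answer)
```

Programs like this, rather than single model calls, are the unit LangProBe evaluates for cost-quality tradeoffs.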
03.03.2025 18:58
Introducing LangProBe: the first benchmark testing where and how composing LLMs into language programs affects cost-quality tradeoffs!
We find that, on avg across diverse tasks, smaller models within optimized programs beat calls to larger models at a fraction of the cost.
arXiv:2502.20315v1 Announce Type: new
Abstract: Composing language models (LMs) into multi-step language programs and automatically optimizing their modular prompts is now a mainstream paradigm for building AI systems, but [1/5 of https://arxiv.org/abs/2502.20315v1]
Shangyin Tan, Lakshya A Agrawal, Arnav Singhvi, Liheng Lai, Michael J Ryan, Dan Klein, Omar Khattab, Koushik Sen, Matei Zaharia: LangProBe: a Language Programs Benchmark https://arxiv.org/abs/2502.20315 https://arxiv.org/pdf/2502.20315 https://arxiv.org/html/2502.20315
28.02.2025 05:59
LangProBe: a Language Programs Benchmark
https://arxiv.org/abs/2502.20315
An excellent post by Kevin Buzzard on informal reasoning methods like o3. The key point, one I wholeheartedly agree with, is that informal methods continue to struggle with proof even when they give the correct answers, and this is a critical liability. xenaproject.wordpress.com/2024/12/22/c...
23.12.2024 21:55
Multilspy: Building a common LSP client handtuned for all Language servers
Exciting tech for seamless coding: Monitor-Guided Decoding + multilspy aims to unify language server setups. Feedback welcome!
https://news.ycombinator.com/item?id=42438918
https://github.com/microsoft/multilspy
17.12.2024 09:06
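For anyone trying the library, here is a minimal usage sketch adapted from the multilspy README; the repository path and file coordinates are placeholders, and the exact API may differ across versions.

```python
# Minimal multilspy usage sketch, adapted from the project README; the
# repository path and file coordinates below are placeholders.
from multilspy import SyncLanguageServer
from multilspy.multilspy_config import MultilspyConfig
from multilspy.multilspy_logger import MultilspyLogger

config = MultilspyConfig.from_dict({"code_language": "python"})
logger = MultilspyLogger()
lsp = SyncLanguageServer.create(config, logger, "/abs/path/to/your/repo")

with lsp.start_server():
    # Ask the language server where the symbol at line 42, column 10 of
    # src/module.py is defined, and where it is referenced.
    definitions = lsp.request_definition("src/module.py", 42, 10)
    references = lsp.request_references("src/module.py", 42, 10)
    print(definitions, references)
```

The point of the library is that this same client code works across language servers (Python, Java, Rust, C#, ...) by only changing the `code_language` setting.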