Lakshya A Agrawal, et al.: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning https://arxiv.org/abs/2507.19457 https://arxiv.org/pdf/2507.19457 https://arxiv.org/html/2507.19457
28.07.2025 06:30
@lakshyaaagrawal.bsky.social
PhD @ucberkeleyofficial.bsky.social | Past: AI4Code Research Fellow @msftresearch.bsky.social | Summer @EPFL Scholar, CS and Applied Maths @IIITDelhi | Hobbyist Saxophonist | https://lakshyaaagrawal.github.io | Maintainer of https://aka.ms/multilspy
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, ...
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
https://arxiv.org/abs/2507.19457
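For intuition, here is a minimal, hypothetical sketch of a reflective prompt-evolution loop in the spirit of GEPA: sample a rollout, let an LLM reflect on the execution trace in natural language, propose a mutated prompt, and keep candidates via Pareto-style selection across tasks. The `llm` and `evaluate` functions are stand-ins, not the paper's actual API.

```python
# A minimal, hypothetical sketch of reflective prompt evolution in the
# spirit of GEPA. `llm` and `evaluate` are stand-ins, NOT the paper's API.
import random

def llm(prompt: str) -> str:
    """Stand-in for any chat-completion call."""
    raise NotImplementedError

def evaluate(candidate: str, task) -> tuple[float, str]:
    """Stand-in: run `task` with prompt `candidate`; return (score, trace)."""
    raise NotImplementedError

def reflective_evolution(seed_prompt: str, tasks: list, iterations: int = 20) -> str:
    pool = [seed_prompt]
    scores = {seed_prompt: [evaluate(seed_prompt, t)[0] for t in tasks]}
    for _ in range(iterations):
        parent = random.choice(pool)
        task = random.choice(tasks)
        score, trace = evaluate(parent, task)
        # Reflection step: the LLM reads the execution trace in natural
        # language, diagnoses failures, and proposes a mutated prompt.
        child = llm(
            "Below is a prompt, one execution trace, and its score.\n"
            f"Prompt: {parent}\nTrace: {trace}\nScore: {score}\n"
            "Diagnose what went wrong and rewrite the prompt to fix it."
        )
        scores[child] = [evaluate(child, t)[0] for t in tasks]
        pool.append(child)
        # Pareto-style selection: keep any candidate that is best on at
        # least one task, so diverse "lessons" survive in the pool.
        best = [max(scores[p][i] for p in pool) for i in range(len(tasks))]
        pool = [p for p in pool
                if any(scores[p][i] >= best[i] for i in range(len(tasks)))]
    return max(pool, key=lambda p: sum(scores[p]))
```

The key contrast with RL is that the learning signal here is natural-language reflection on traces rather than a scalar reward gradient.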
Super excited for our CogSci paper on the dynamics of conversation led by @helen-schmidt.bsky.social and @clairebergey.bsky.social !
21.06.2025 22:15
Hot off the presses @natcomms.nature.com! We created a custom #Minecraft environment to study a long-standing puzzle in cognitive science:
How do humans flexibly adapt their individual and social learning strategies in dynamic, realistic situations? Check it out: www.nature.com/articles/s41...
this is great, and a better reflection of how companies are actually going to be using and testing these tools
03.03.2025 20:02
Yes there's an evals crisis, but evaluating *models* is not even the right question most of the time
LangProBe from Shangyin Tan, @lakshyaaagrawal.bsky.social, Arnav Singhvi, Liheng Lai, @michaelryan207.bsky.social et al. begins to ask what complete *AI systems* we should build & under what settings
13/13: Work done with amazing collaborators: Shangyin Tan, Arnav Singhvi, Liheng Lai, Michael Ryan, Dan Klein, Omar Khattab, Koushik Sen and Matei Zaharia from @ucberkeleyofficial.bsky.social Sky, Berkeley NLP, @stanfordnlp.bsky.social and Databricks
Paper: arxiv.org/abs/2502.20315
12/13: The code and evaluation data for LangProBe will be open-sourced, providing much-needed infrastructure and a benchmark for end-to-end testing of new prompt optimizers and language program architectures. We look forward to community contributions of new tasks, language programs, and optimizers!
03.03.2025 18:58
11/13: Further, LangProBe analysis shows that for now, human judgment or iterative development around which compositions to pursue is still necessary for best performance - there's no universal "set it and forget it" strategy that works across all tasks and models, yet!
03.03.2025 18:58
10/13: LangProBe demonstrates that the future of AI systems isn't just about bigger models, but smarter composition. By carefully designing language programs and optimization strategies, we can build more capable and cost-effective systems.
03.03.2025 18:58
9/13: Among optimizers, MIPROv2, which constructs instructions and few-shot examples and explores their cross-module combinations through Bayesian search, performed best on avg.
But bootstrapping few-shot examples with random search and RuleInfer remain highly competitive!
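Both MIPROv2 and bootstrapped few-shot with random search ship in the DSPy library. A minimal sketch of compiling a one-module program with MIPROv2 follows; the model, metric, and tiny trainset are illustrative assumptions, not LangProBe's experimental setup.

```python
# Minimal DSPy sketch: compiling a one-module program with MIPROv2.
# Model, metric, and trainset here are illustrative assumptions.
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A one-module "language program": chain-of-thought question answering.
program = dspy.ChainOfThought("question -> answer")

def exact_match(example, prediction, trace=None):
    # 1.0 when the predicted answer matches the gold answer.
    return float(example.answer.strip().lower() == prediction.answer.strip().lower())

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    # ... more labeled examples
]

# MIPROv2 proposes candidate instructions and few-shot demos per module,
# then searches over their combinations with Bayesian optimization.
optimizer = MIPROv2(metric=exact_match, auto="light")
optimized = optimizer.compile(program, trainset=trainset)
print(optimized(question="What is 3 + 5?").answer)
```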
8/13: We also introduce RuleInfer, a new program-level prompt optimizer that induces rules from bootstrapped examples. RuleInfer offers particularly strong performance in tasks with clear, discrete constraints such as classification.
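RuleInfer's implementation is in the paper; as a rough, hypothetical sketch of the rule-induction idea (distill bootstrapped worked examples into explicit rules, then prepend them to the task prompt), with `llm` standing in for any LLM call:

```python
# Hypothetical sketch of rule induction from bootstrapped examples,
# in the spirit of RuleInfer as described above; NOT the paper's code.
def induce_rules(llm, bootstrapped: list[tuple[str, str]]) -> str:
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in bootstrapped)
    return llm(
        "Study these solved examples and state, as a numbered list, the "
        f"general rules to follow on similar tasks:\n\n{shots}"
    )

def answer_with_rules(llm, rules: str, new_input: str) -> str:
    # The induced rules become explicit constraints in the prompt.
    return llm(f"Follow these rules:\n{rules}\n\nInput: {new_input}\nOutput:")
```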
03.03.2025 18:58
7/13: LangProBe's analyses reveal empirically that different program architectures shine in different contexts. Modular programs are essential for tasks requiring external information or tools. RAG and multi-hop retrieval excel at tasks needing long-tail world knowledge.
03.03.2025 18:58
6/13: Further, in almost all tasks, both optimized and unoptimized language programs significantly outperform raw model predictions, even irrespective of costs:
03.03.2025 18:58
5/13: For example, gpt-4o-mini with optimized language programs achieved 11.68% higher scores than baseline gpt-4o at just 50% of the cost, and outperformed gpt-4o with programs at just 10% of the cost! This has huge implications for building cost-effective AI systems.
03.03.2025 18:58
4/13: We find that optimized language programs offer strong cost-quality improvements over raw model calls, though the best system compositions still need thoughtful design.
Smaller LMs within an optimized program can often outperform larger LMs at a fraction of the cost.
3/13: LangProBe evaluates 15+ datasets across diverse categories: coding tasks, math reasoning, classification, QA, and agent benchmarks. It implements 10+ program architectures from simple LM calls to complex modular systems with multiple reasoning and retrieval steps.
03.03.2025 18:58
2/13: LLMs are no longer standalone tools. They are used as part of language programs: modular systems composing multiple LLM calls with external tools, RAG, and inference/agentic techniques to solve complex tasks. But very few evals study their e2e cost-performance tradeoffs!
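For concreteness, here is a minimal illustrative language program in DSPy: a two-hop retrieve-then-answer pipeline. The `search` retriever is a placeholder assumption, not a LangProBe component.

```python
# Illustrative "language program": a two-hop retrieve-then-answer pipeline
# in DSPy. The `search` retriever below is a placeholder assumption.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

def search(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever; swap in a real search or RAG backend."""
    return [f"(retrieved passage for: {query})"] * k

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen_query = dspy.ChainOfThought("question, context -> search_query")
        self.answer = dspy.ChainOfThought("question, context -> answer")

    def forward(self, question: str):
        context: list[str] = []
        for _ in range(2):  # two retrieval hops
            hop = self.gen_query(question=question, context=context)
            context += search(hop.search_query)
        return self.answer(question=question, context=context)

print(MultiHopQA()(question="Which lab did the advisor of X train in?").answer)
```

Programs like this, rather than single model calls, are the unit LangProBe evaluates for cost-quality tradeoffs.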
03.03.2025 18:58
Introducing LangProBe: the first benchmark testing where and how composing LLMs into language programs affects cost-quality tradeoffs!
We find that, on avg across diverse tasks, smaller models within optimized programs beat calls to larger models at a fraction of the cost.
arXiv:2502.20315v1 Announce Type: new
Abstract: Composing language models (LMs) into multi-step language programs and automatically optimizing their modular prompts is now a mainstream paradigm for building AI systems, but [1/5 of https://arxiv.org/abs/2502.20315v1]
Shangyin Tan, Lakshya A Agrawal, Arnav Singhvi, Liheng Lai, Michael J Ryan, Dan Klein, Omar Khattab, Koushik Sen, Matei Zaharia: LangProBe: a Language Programs Benchmark https://arxiv.org/abs/2502.20315 https://arxiv.org/pdf/2502.20315 https://arxiv.org/html/2502.20315
28.02.2025 05:59
LangProBe: a Language Programs Benchmark
https://arxiv.org/abs/2502.20315
An excellent post by Kevin Buzzard on informal reasoning methods like o3. The key point, one I wholeheartedly agree with, is that informal methods continue to struggle with proof even when they give the correct answers, and this is a critical liability. xenaproject.wordpress.com/2024/12/22/c...
23.12.2024 21:55
Multilspy: Building a common LSP client handtuned for all Language servers
Exciting tech for seamless coding: Monitor-Guided Decoding + multilspy aims to unify language server setups. Feedback welcome!
https://news.ycombinator.com/item?id=42438918
https://github.com/microsoft/multilspy
17.12.2024 09:06
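For anyone trying the library, here is a minimal usage sketch adapted from the multilspy README; the repository path and file coordinates are placeholders, and the exact API may differ across versions.

```python
# Minimal multilspy usage sketch, adapted from the project README; the
# repository path and file coordinates below are placeholders.
from multilspy import SyncLanguageServer
from multilspy.multilspy_config import MultilspyConfig
from multilspy.multilspy_logger import MultilspyLogger

config = MultilspyConfig.from_dict({"code_language": "python"})
logger = MultilspyLogger()
lsp = SyncLanguageServer.create(config, logger, "/abs/path/to/your/repo")

with lsp.start_server():
    # Ask the language server where the symbol at line 42, column 10 of
    # src/module.py is defined, and where it is referenced.
    definitions = lsp.request_definition("src/module.py", 42, 10)
    references = lsp.request_references("src/module.py", 42, 10)
    print(definitions, references)
```

The point of the library is that this same client code works across language servers (Python, Java, Rust, C#, ...) by only changing the `code_language` setting.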