
Pranjal

@pranjal2041.bsky.social

PhD Student @ltiatcmu.bsky.social. Working on reasoning, code-gen agents and test-time compute.

36 Followers  |  49 Following  |  16 Posts  |  Joined: 10.12.2024

Latest posts by pranjal2041.bsky.social on Bluesky


Want to use PwP for your own agents? We've built a super simple API to interact with PwP & evaluate on PwP-Bench.

Find more details on our website & paper here:

๐ŸŒ: programmingwithpixels.com

with
@wellecks.bsky.social

26.02.2025 17:17 — 👍 1    🔁 0    💬 0    📌 0
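The thread doesn't show the API itself, so here is a rough illustration of what a gym-style interaction loop for a pixel-based SWE environment could look like. All names (`PwPEnv`, `Observation`, `reset`, `step`) are hypothetical stand-ins, not the real PwP API; the environment is stubbed so the sketch is runnable.

```python
# Illustrative sketch only: a gym-style loop for a pixel-based SWE
# environment. PwPEnv and its methods are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class Observation:
    screenshot: bytes          # raw pixels of the IDE screen
    done: bool = False


@dataclass
class PwPEnv:
    """Stub environment standing in for a real PwP server."""
    steps_taken: int = 0
    max_steps: int = 3

    def reset(self, task: str) -> Observation:
        self.steps_taken = 0
        return Observation(screenshot=b"<pixels>")

    def step(self, action: dict) -> Observation:
        self.steps_taken += 1
        return Observation(screenshot=b"<pixels>",
                           done=self.steps_taken >= self.max_steps)


def run_agent(env: PwPEnv, task: str) -> int:
    """Drive the env with a trivial policy; returns steps used."""
    obs = env.reset(task)
    while not obs.done:
        # A real agent would choose clicks/keystrokes from the screenshot.
        obs = env.step({"type": "keypress", "keys": "ctrl+s"})
    return env.steps_taken


print(run_agent(PwPEnv(), "fix the failing test"))  # 3
```

The point of the sketch is the shape of the interface: the agent only ever sees pixels and emits keyboard/mouse actions, so the same loop works for any task the IDE can express.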

Yet the trajectory is clear: once these agents improve, will we ever need handcrafted tool-specific pipelines again?

Perhaps general SWE agents are just a special case of general-purpose computer-use agents?

🧵

26.02.2025 17:17 — 👍 0    🔁 0    💬 1    📌 0

However, it's not perfect: today's computer-use agents are still weak at visual grounding, so they struggle to fully exploit IDE features.

Tiny icons or complex menus can confuse the agent, and multi-step operations (e.g. a debugger workflow) remain challenging.

🧵

26.02.2025 17:17 — 👍 0    🔁 0    💬 1    📌 0

So we tested a computer-use agent (screen interaction plus file and bash tools) in PwP on PwP-Bench.

Results? It matches or even surpasses tool-based agents, without any domain-specific hand-engineering.

26.02.2025 17:17 — 👍 1    🔁 0    💬 1    📌 0
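A hybrid action space like the one described (screen clicks plus file and bash tools) can be pictured as a small dispatcher over action kinds. A minimal sketch, with all action names invented for illustration; the real PwP action set may differ:

```python
# Minimal sketch of dispatching a hybrid action space
# (screen + file + bash). Action names are invented for illustration.
import subprocess
from pathlib import Path


def execute(action: dict) -> str:
    kind = action["kind"]
    if kind == "click":
        # A real agent would send the click to the IDE at (x, y).
        return f"clicked at {action['x']},{action['y']}"
    if kind == "write_file":
        Path(action["path"]).write_text(action["content"])
        return f"wrote {action['path']}"
    if kind == "bash":
        out = subprocess.run(action["cmd"], shell=True,
                             capture_output=True, text=True)
        return out.stdout.strip()
    raise ValueError(f"unknown action kind: {kind}")


print(execute({"kind": "bash", "cmd": "echo hello"}))  # hello
```

The design choice being illustrated: a generalist agent needs only this tiny vocabulary, because everything domain-specific (the debugger, the linter, the refactoring menu) lives behind the screen rather than in the agent's toolbox.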

To evaluate agents, we unified 15 existing SWE benchmarks into PwP-Bench, covering multiple languages and tasks like debugging, multimodal code synthesis, ethical hacking & more.

PwP-Bench systematically tests a broad dev skillset within a single, realistic IDE setup.

🧵

26.02.2025 17:17 — 👍 0    🔁 0    💬 1    📌 0

We built PwP, an environment for generalist agents that use the computer like a developer.

Agents see the IDE (pixel screenshots) and act by typing & clicking; no special integrations are needed. They can leverage every developer tool, just like an autonomous programmer at the keyboard!

🧵

26.02.2025 17:17 — 👍 0    🔁 0    💬 1    📌 0

Our key insight: most SWE tasks boil down to using an IDE, i.e.:

✅ Seeing VS Code's UI
✅ Typing, clicking & basic file operations
✅ Using built-in tools (debuggers, refactoring, etc.)

A simple, general-purpose agent should be able to leverage any SWE tool, without extra engineering.

🧵

26.02.2025 17:17 — 👍 0    🔁 0    💬 1    📌 0

Why? Because most AI coding assistants today rely on a handful of fixed tools or APIs to write or fix code.

That's limiting – they can't easily adapt to new tasks or fully leverage complex IDEs like human developers do. In other words, their "toolbox" is too narrow.

🧵

26.02.2025 17:17 — 👍 0    🔁 0    💬 1    📌 0

What if AI agents did software engineering like humans, seeing the screen & using any developer tool?

Introducing Programming with Pixels: an SWE environment where agents control VSCode via screen perception, typing & clicking to tackle diverse tasks.

programmingwithpixels.com

🧵

26.02.2025 17:17 — 👍 8    🔁 4    💬 1    📌 1

Special thanks to my amazing advisor
@wellecks.bsky.social
and collaborator Bryan Parno for the great guidance and support!!

10.12.2024 22:33 — 👍 0    🔁 0    💬 0    📌 0
AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement

Check out our website and paper for more details!

๐ŸŒ: alphaverus.github.io
๐Ÿ“œ: arxiv.org/abs/2412.06176

We believe this is an important step towards trustworthy code generation and teaching models to generate complex algorithmic code reliably!

10.12.2024 22:33 — 👍 0    🔁 0    💬 1    📌 0

Results: With no human intervention, we achieve state-of-the-art results on verified code generation for verified versions of Human-Eval and MBPP.

Even better, AlphaVerus can improve GPT-4o or any other model without any fine-tuning!

🧵

10.12.2024 22:33 — 👍 0    🔁 0    💬 1    📌 0

We find these models learn to game the system through subtle techniques such as generating incomplete specifications, degenerate solutions, and even exploiting verifier limitations!

Our critique module solves this problem. And yes, the critique module self-improves too!

🧵

10.12.2024 22:33 — 👍 0    🔁 0    💬 1    📌 0
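One way to picture the critique step: screen out candidates that "win" only by exploiting the reward, e.g. empty specifications or degenerate bodies. The real AlphaVerus critique module is learned and self-improving; this hard-coded toy filter only conveys the idea, and every string in it is invented for illustration:

```python
# Toy illustration of filtering reward-hacked candidates. The actual
# AlphaVerus critique module is learned; this rule-based stand-in only
# conveys the idea.
def looks_degenerate(spec: str, body: str) -> bool:
    """Reject obviously exploitable (spec, body) candidates."""
    trivial_bodies = {"", "return 0;", "unimplemented!()", "loop {}"}
    # A spec with no postcondition lets almost any body "verify".
    spec_is_empty = "ensures" not in spec
    return spec_is_empty or body.strip() in trivial_bodies


candidates = [
    ("ensures result == a + b", "return a + b;"),
    ("requires true",           "return 0;"),   # no postcondition: a hack
]
kept = [c for c in candidates if not looks_degenerate(*c)]
print(len(kept))  # 1
```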

AlphaVerus solves these challenges by translating from a high-resource language. Incorrect translations are refined using a novel tree search algorithm guided by verifier feedback!

Correct translations and refinements improve future models!

Challenge? Reward Hacking!

🧵

10.12.2024 22:33 — 👍 0    🔁 0    💬 1    📌 0
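The verifier-guided tree search can be pictured as best-first expansion: keep a frontier of candidate programs scored by verifier feedback, and always refine the most promising one. A self-contained toy sketch, where the "verifier" and "refinement" functions are numeric stand-ins for Verus error counts and LLM-proposed edits, not the paper's actual Treefinement algorithm:

```python
# Toy best-first refinement search guided by a verifier score. The
# verifier and refinement operator are stand-ins for Verus feedback and
# LLM-proposed edits.
import heapq
from typing import Optional


def verifier_errors(program: int) -> int:
    """Stub 'verifier': pretend the error count is distance from 0."""
    return abs(program)


def refine(program: int) -> list:
    """Stub refinement: propose nearby candidate programs."""
    return [program - 1, program + 1]


def treefinement(root: int, budget: int = 50) -> Optional[int]:
    """Expand lowest-error candidates first; stop on verification."""
    frontier = [(verifier_errors(root), root)]
    seen = {root}
    for _ in range(budget):
        if not frontier:
            break
        errs, prog = heapq.heappop(frontier)
        if errs == 0:
            return prog                 # fully verified candidate
        for child in refine(prog):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (verifier_errors(child), child))
    return None


print(treefinement(5))  # 0
```

The key property mirrored here is that verifier feedback prioritizes which branch to refine next, rather than refining a single candidate in a straight line.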

How does it work? Given input/output specifications, LLMs generate corresponding Rust code along with proof hints such as loop invariants. A verifier (Verus) checks whether the generated code and proof are correct.

Challenge: data is scarce & proofs are complex!

10.12.2024 22:33 — 👍 0    🔁 0    💬 1    📌 0
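The generate-then-verify loop described above can be sketched as sampling candidates and keeping the first one the verifier accepts. Here both the "LLM" and "Verus" are stubs (the real pipeline invokes the Verus toolchain on actual Rust code); the function and variable names are mine, not the paper's:

```python
# Sketch of a generate-then-verify loop. sample_fn and verify_fn are
# stubs; AlphaVerus uses an LLM sampler and the real Verus verifier.
from typing import Callable, Optional


def generate_verified(sample_fn: Callable[[int], str],
                      verify_fn: Callable[[str], bool],
                      n_samples: int = 8) -> Optional[str]:
    """Return the first sampled program the verifier accepts."""
    for i in range(n_samples):
        candidate = sample_fn(i)
        if verify_fn(candidate):
            return candidate
    return None  # no verified candidate within the sampling budget


def sample(i: int) -> str:
    """Stub LLM: first candidate is buggy, later ones are correct."""
    if i == 0:
        return "fn add(a: i64, b: i64) -> i64 { a - b }"
    return "fn add(a: i64, b: i64) -> i64 { a + b }"


def verify(code: str) -> bool:
    """Stub verifier: accepts only the candidate with the right body."""
    return "a + b" in code


print(generate_verified(sample, verify))
```

This is the base loop that the challenges in the next posts (scarce data, complex proofs, reward hacking) make hard in practice.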

LLMs often generate incorrect code.

Instead, what if they could prove their code correct?

Presenting AlphaVerus: A self-reinforcing method that automatically learns to generate correct code using inference-time search and verifier feedback.

๐ŸŒ : alphaverus.github.io

🧵

10.12.2024 22:33 — 👍 4    🔁 1    💬 2    📌 1
