Wayne

@waynechi.bsky.social

CS Ph.D. at CMU. Building Copilot Arena. Editor at http://blog.ml.cmu.edu

33 Followers  |  170 Following  |  17 Posts  |  Joined: 19.11.2024

Latest posts by waynechi.bsky.social on Bluesky

Inaugurating new acct to share work from my PhD student!

Wayne et al. have been running a live eval platform, Copilot Arena - a VSCode extension serving code completions from AI systems to real developers. See 🧵 for findings and preprint

Excited to be evaluating human-AI *workflows* holistically!

05.03.2025 17:01 | 👍 10    🔁 3    💬 0    📌 0
Copilot Arena - Visual Studio Marketplace Extension for Visual Studio Code - Code with and evaluate the latest LLMs and Code Completion models

Interested in trying out Copilot Arena for yourself?
Download at lmarena.ai/copilot.
Follow for more updates!

05.03.2025 16:49 | 👍 0    🔁 0    💬 0    📌 0

Full Paper with additional analyses: arxiv.org/abs/2502.09328
Code: github.com/lmarena/copi...

w/ Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, @chrisdonahue.com , @atalwalkar.bsky.social

05.03.2025 16:49 | 👍 0    🔁 0    💬 1    📌 0

Our paper analyzes human preferences across 10 SOTA coding models, but we continue to add more models to the live Copilot Arena leaderboard on lmarena.ai!

05.03.2025 16:49 | 👍 0    🔁 0    💬 1    📌 0

Different data slices affect user preferences disproportionately. Relative model performance differs drastically between real-world tasks, such as frontend or backend development, and LeetCode-style coding challenges, but differs little across programming languages.

05.03.2025 16:49 | 👍 0    🔁 0    💬 1    📌 0

We attribute these differences to a significant shift in our data distribution. Compared to previous benchmarks, Copilot Arena observes more programming languages (PL), natural languages (NL), longer context lengths, multiple task types, and various code structures.

05.03.2025 16:49 | 👍 0    🔁 0    💬 1    📌 0

Our leaderboard differs from existing evaluations. In particular, smaller models overperform on static benchmarks compared to real development workflows.

05.03.2025 16:49 | 👍 0    🔁 0    💬 1    📌 0

We evaluate models in a developer's IDE by presenting pairs of code completions generated by two different models. This workflow evaluates human preferences across models with real users and tasks in their native environment.

05.03.2025 16:49 | 👍 0    🔁 0    💬 1    📌 0

What do developers 𝘳𝘦𝘢𝘭𝘭𝘺 think of AI coding assistants?

In October, we launched Copilot Arena to collect user preferences on real dev workflows. After months of live service, we're here to share our findings in our recent preprint.

Here's what we have learned /🧵

05.03.2025 16:49 | 👍 1    🔁 0    💬 1    📌 2

Got to test out InceptionAILabs' newest model, Mercury Coder Mini, on Copilot Arena!

Mercury Coder Mini is blazing fast and overtakes Codestral as the fastest coding model out there (0.24s end-to-end latency) while boasting similar performance (joint #2).

Congrats to InceptionAILabs! 📸

26.02.2025 23:51 | 👍 1    🔁 0    💬 0    📌 0

I had the same problem. I only use Cursor for newer, small projects. I use Copilot Arena's edit feature for projects in VSCode (but obviously I'm biased).

05.01.2025 08:11 | 👍 1    🔁 0    💬 0    📌 0

DeepSeek V3 (FiM) is now available in Copilot Arena for free!

Download at lmarena.ai/copilot

31.12.2024 21:12 | 👍 0    🔁 0    💬 0    📌 0

These lists are better than most "2024's best games" lists

27.12.2024 04:16 | 👍 0    🔁 0    💬 0    📌 0
Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots

Copilot Arena's leaderboard is now live on lmarena.ai/leaderboard!

We've collected over 15k votes on 11 models (2 new models since our last blogpost release). Congrats @deepseek.bsky.social 🥇 and @anthropic.com 🥇!

23.12.2024 21:41 | 👍 0    🔁 0    💬 0    📌 0

I'm not physically at NeurIPS, but my good friend
@naveenraman.bsky.social will be presenting in my stead.

In this work, we found that UI element ordering significantly affected GUI agent performance. Come check out the poster (and quiz Naveen) at the Workshop on Open-World Agents (OWA-2024)!

13.12.2024 07:32 | 👍 0    🔁 0    💬 0    📌 0

Bruh what... 💀

10.12.2024 17:51 | 👍 0    🔁 0    💬 0    📌 0

We've open-sourced Copilot Arena's server code!

Check out how we handle code completions and share your ideas for new system prompts!

GitHub:
github.com/lmarena/copi...
Technical details in the blog: blog.lmarena.ai/blog/2024/co...

Download Copilot Arena now at: lmarena.ai/copilot

05.12.2024 19:44 | 👍 0    🔁 0    💬 0    📌 0

Trying out Bluesky. Will mostly be posting about Copilot Arena!

20.11.2024 06:59 | 👍 0    🔁 0    💬 0    📌 0