Inaugurating new acct to share work from my PhD student!
Wayne et al. have been running a live eval platform, Copilot Arena: a VSCode extension serving code completions from AI systems to real developers. See 🧵 for findings and preprint.
Excited to be evaluating human-AI *workflows* holistically!
05.03.2025 17:01
Full Paper with additional analyses: arxiv.org/abs/2502.09328
Code: github.com/lmarena/copi...
w/ Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, @chrisdonahue.com, @atalwalkar.bsky.social
05.03.2025 16:49
Our paper analyzes human preferences across 10 SOTA coding models, but we continue to add more models to the live Copilot Arena leaderboard on lmarena.ai!
05.03.2025 16:49
Different data slices affect user preferences disproportionately. Relative model performance differs drastically between real-world tasks, such as frontend or backend development, and LeetCode-style coding challenges, but differs little between programming languages.
05.03.2025 16:49
We attribute these differences to a significant shift in our data distribution. Compared to previous benchmarks, Copilot Arena observes more programming languages (PL), natural languages (NL), longer context lengths, multiple task types, and various code structures.
05.03.2025 16:49
Our leaderboard differs from existing evaluations. In particular, smaller models overperform in static benchmarks compared to real development workflows.
05.03.2025 16:49
We evaluate models in a developer's IDE by presenting pairs of code completions generated by two different models. This workflow evaluates human preferences across models with real users and tasks in their native environment.
05.03.2025 16:49
What do developers *really* think of AI coding assistants?
In October, we launched Copilot Arena to collect user preferences on real dev workflows. After months of live service, we're here to share our findings in our recent preprint.
Here's what we have learned /🧵
05.03.2025 16:49
Got to test out InceptionAILabs' newest model, Mercury Coder Mini, on Copilot Arena!
Mercury Coder Mini is blazing fast and overtakes Codestral as the fastest coding model out there (0.24s end-to-end latency) while boasting similar performance (joint #2).
Congrats to InceptionAILabs!
26.02.2025 23:51
I had the same problem. I only use Cursor for newer, small projects. I use Copilot Arena's edit feature for projects in VSCode (but obviously I'm biased).
05.01.2025 08:11
DeepSeek V3 (FiM) is now available in Copilot Arena for free!
Download at lmarena.ai/copilot
31.12.2024 21:12
These lists are better than most "2024's best games" lists
27.12.2024 04:16
Copilot Arena's leaderboard is now live on lmarena.ai/leaderboard!
We've collected over 15k votes on 11 models (2 new models since our last blogpost release). Congrats @deepseek.bsky.social and @anthropic.com!
23.12.2024 21:41
I'm not physically at NeurIPS, but my good friend
@naveenraman.bsky.social will be presenting in my stead.
In this work, we found that UI element ordering significantly affected GUI agent performance. Come check out the poster (and quiz Naveen) at the Workshop on Open-World Agents (OWA-2024)!
13.12.2024 07:32
Bruh what...
10.12.2024 17:51
We've open-sourced Copilot Arena's server code!
Check out how we handle code completions and share your ideas for new system prompts!
GitHub:
github.com/lmarena/copi...
Technical details in the blog: blog.lmarena.ai/blog/2024/co...
Download Copilot Arena now at: lmarena.ai/copilot
05.12.2024 19:44
Trying out Bluesky. Will mostly be posting about Copilot Arena!
20.11.2024 06:59
Assistant prof at LTI CMU; Research scientist at Meta AI. Working on NLP: language interfaces, applied pragmatics, language-to-code, grounding. https://dpfried.github.io/
Research in generative AI for **human** creativity in music + more.
Assistant professor at CMU CSD, leading the 🎼 G-CLef lab. Part-time research scientist at Google DeepMind on the Magenta team (views my own)
PhD Student in Machine Learning at CMU.
🐦 twitter.com/steph_milani
stephmilani.github.io
Machine Learning (the science part) | PhD student @ CMU
Data Quality x Privacy
PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi
http://pratyushmaini.github.io/
Machine Learning PhD Student at CMU | Student Researcher at Google | dsam99.github.io
PhD student at Machine Learning Department @ CMU
PhD student @ CMU with Zico Kolter | Prev. research scientist @abacusai, ml eng @primer_ai | Prev. Prev. CS+Stats @Stanford
PhD Student in Machine Learning at CMU. yewonbyun.github.io
phd student in machine learning @ CMU | prev: Penn
PhD student @CMU | Prev. undergrad @Tsinghua
https://chenwu.io/
Ph.D. Student at Carnegie Mellon,
Student Researcher at Google
Formerly Applied Science Intern Amazon, Undergrad at Delhi Technological University
Foundation Models for Structured Data (Time Series, Tabular), applications in healthcare.
PhD student @CMU / Curiosity, Love / Dynamics to ASI
PhD Student in Machine Learning @CMU | BS @UCLA | Interning @Meta | Interned @MSFTResearch @DeterminedAI
Professor at NYU; Chief AI Scientist at Meta.
Researcher in AI, Machine Learning, Robotics, etc.
ACM Turing Award Laureate.
http://yann.lecun.com
Parent, spouse, Australian, Professor of Machine Learning in Oxford. Long Covid, trans rights, music, reggae on Fridays, AI must be good for humans, https://www.robots.ox.ac.uk/~mosb