Peng Qi

@qi2peng2.bsky.social

Multimodal Agents Research @ Orby AI. Ex-AWS AI, JD AI. PhD from @stanfordnlp.bsky.social, UG Tsinghua U. He/him. Opinions my own.

289 Followers  |  43 Following  |  79 Posts  |  Joined: 18.11.2024  |  1.7297

Latest posts by qi2peng2.bsky.social on Bluesky

How do we prove that #AI can't do #maths?

Real Mathematics (yes, "real" is a pun here):

a+b+c = (a+b)+c = a+(b+c)

AI Mathematics (well, floating point maths, really):

>>> 0.1+0.2+0.3
0.6000000000000001
>>> 0.1+(0.2+0.3)
0.6

QED.

25.07.2025 23:59 | 👍 0    🔁 0    💬 0    📌 0
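Editorial note on the post above: the joke rests on the fact that floating-point addition is not associative, because each intermediate sum is rounded to the nearest representable double. A minimal Python sketch of this follows. The variable names are illustrative and not from the original post, and `math.fsum` and `fractions.Fraction` are shown only as two common ways to sidestep the rounding order, not as anything the author recommends.

```python
import math
from fractions import Fraction

# The same three numbers, grouped two different ways.
left = (0.1 + 0.2) + 0.3   # rounds after 0.1 + 0.2, then again after adding 0.3
right = 0.1 + (0.2 + 0.3)  # 0.2 + 0.3 rounds to exactly 0.5 first

print(left, right)         # the two groupings print differently
print(left == right)       # exact comparison fails

# Comparing with a tolerance treats the two results as equal.
print(math.isclose(left, right))

# math.fsum tracks the lost low-order bits and rounds only once at the end.
print(math.fsum([0.1, 0.2, 0.3]))

# Rational arithmetic is exact, so associativity holds again.
a, b, c = Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
exact_left = (a + b) + c
exact_right = a + (b + c)
print(exact_left == exact_right)
```

In practice, tolerance-based comparison is usually enough. Exact rationals trade speed for associativity.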


This project was joint work with my Amazon colleagues (led by Yumo Xu), and it's great to see it finally published. Hope this helps motivate more careful eval work in the near future!

#AI #agent #evaluation #RAG #NLP

16.07.2025 00:40 | 👍 1    🔁 1    💬 0    📌 0

b) as builders, we evaluate the technology soberly and help users navigate these risks in product design.

Want to learn more? Check out
Our paper: arxiv.org/pdf/2506.01829
Open-source code: github.com/amazon-scien...

16.07.2025 00:40 | 👍 0    🔁 0    💬 1    📌 0

Why should you care? As businesses / individuals leverage AI more and more to speed up research and decision-making, it is important that, a) as users, we examine the tools we are using to understand their limitations and avoid pitfalls with significant potential downsides, and

16.07.2025 00:40 | 👍 0    🔁 0    💬 1    📌 0

With a new, carefully annotated dataset and an automated evaluation metric we designed, we find that although LLMs are reasonably good at citing accurate sources most of the time, SOTA LLMs still cite incorrectly 5-28% of the time, and miss citations anywhere from 16% to an alarming 95% of the time.

16.07.2025 00:40 | 👍 1    🔁 0    💬 1    📌 0

"CiteEval: Principle-Driven Citation Evaluation for Source Attribution", we propose a framework to systematically study citation accuracy by considering previously neglected contexts such as user-provided information and LLMs' parametric knowledge.

16.07.2025 00:40 | 👍 0    🔁 0    💬 1    📌 0

As 🔎 AI deep research agents 🔎 become an essential part of many people's day-to-day work, it is more important than ever that we can trust what they produce.

When these agents cite sources they claim the report is based on, how much can we actually trust them? In our new #ACL2025 paper, ...

16.07.2025 00:40 | 👍 0    🔁 0    💬 1    📌 0
Preview
Why You Should Stop Using HotpotQA for AI Agents Evaluation in 2025 | Peng Qi
We published HotpotQA, a groundbreaking multi-step question answering dataset in 2018, which has since motivated and facilitated numerous AI agent research works. But you should probably reconsider…

AI agents research, and share a skeleton of a research proposal if I ever find the time / need to write one from my perspective today: qipeng.me/blog/stop-us...

02.07.2025 18:39 | 👍 0    🔁 0    💬 0    📌 0

a good problem in its historical context, some of my own research attempts at solving this problem that I believe are on the critical path to autonomous agents, and what's changed today to make the dataset less relevant in its original form. I also reflect on the possible paths forward for ...

02.07.2025 18:39 | 👍 1    🔁 0    💬 1    📌 0

Seven years ago, I co-led a paper called HotpotQA that has motivated and facilitated many #AI #Agents research works since. Today, I'm asking that you stop using HotpotQA blindly for agents research in 2025 and beyond.

In my new blog post, I revisit the brief history of HotpotQA, why it defined ...

02.07.2025 18:39 | 👍 0    🔁 0    💬 1    📌 0

The longer employers don't acknowledge and embrace this discrepancy, the faster they lose the top candidates they spent enormous efforts to hire and retain, leaving the organization in self-fulfilling mediocrity.

22.05.2025 15:01 | 👍 0    🔁 0    💬 0    📌 0

but not perfectly aligned with those of the employer. They ask: Will I get the opportunity to build a career beyond what is immediately required of me? Will I learn and grow, be part of a great team and culture? Will I make a name for myself while doing great work? Will I remain competitive?
3/

22.05.2025 15:01 | 👍 0    🔁 0    💬 1    📌 0

We are never settling for a candidate that does exactly the thing that needs to be done right now, since that thing itself can change before you know it.

But too often employers and managers forget that highly motivated and capable candidates also hold expectations parallel to these, ...
2/

22.05.2025 15:01 | 👍 0    🔁 0    💬 1    📌 0

When making great #hiring decisions, we often look for growth potential in a candidate. Will they rise to the occasion when unforeseen challenges arise? Will they grow in the role, and lift up others in the team? Will they still be able to contribute if business direction changes?
1/

22.05.2025 15:01 | 👍 0    🔁 0    💬 1    📌 0

While many aspects of our work (especially in the digital world) can be amenable to #AI #automation, it is also through automation that we continuously rediscover the true meaning of our work and our unique humanness.

#MondayReflection /fin

06.05.2025 01:27 | 👍 1    🔁 0    💬 0    📌 0

this coding phase alone, and I ended up delivering something slightly better than I would've done without it.

As with any technological evolution, tools themselves never fully replace the humans doing the work, but greatly enhance the ones that embrace them and adapt to working with them. 6/

06.05.2025 01:27 | 👍 0    🔁 0    💬 1    📌 0

But, of course, this has its implications. I did save a lot of time looking up programming resources on things I had a vague understanding of but wasn't very familiar with, and didn't have to type all those many characters. By my estimate, the AI assistant did save me 50-80% of the effort of 5/

06.05.2025 01:27 | 👍 0    🔁 0    💬 1    📌 0

(especially to the general public) the fact that the act of putting code down is typically the *least* mentally effortful part of the work. It's as if saying "my 3D printer made 100% of my new shiny collections" -- true in the narrow sense of the printing effort, but it's missing the point. 4/

06.05.2025 01:27 | 👍 0    🔁 0    💬 1    📌 0

how to fix things that aren't working (the AI helped a bit with this too at times), and how to keep things future-proof. In this regard, I still did >90% of the most important *work* in this project. Saying AI wrote 99% of the code, while factually correct in one particular sense (line count), obscures 3/

06.05.2025 01:27 | 👍 0    🔁 0    💬 1    📌 0

AI with no manual edits from me, or that the AI assistant was the last to "touch" those lines of code.

What this 99% number oversimplifies is the amount of time my colleague and I spent in numerous offline discussions, times where I had to stop and think about what to ask the AI to code next, 2/

06.05.2025 01:27 | 👍 0    🔁 0    💬 1    📌 0

In one of my recent projects, AI code assistants actually DID write 99% of my code, and the project was reasonably complex starting from scratch. Does this mean I'm obsolete now? Here's the catch: when I say AI wrote 99% of the code, I was counting roughly how many lines were directly generated by 1/

06.05.2025 01:27 | 👍 0    🔁 0    💬 1    📌 0

#AI wrote 99% of my code, now what?

Big tech executives and business analysts are racing to share eye-catching statements like "AI will write XX% of the code at MetaCorp by 20YY." How much truth is there to these, and what implications might this have?

🧵

06.05.2025 01:27 | 👍 0    🔁 0    💬 1    📌 0

Non-native speakers sometimes have a unique advantage in language-based humor stemming from their unfamiliarity with idiomatic expressions. I saw an "assembly of god" on the road and thought to myself, "wait, they have a factory to build gods here?"

28.03.2025 21:50 | 👍 0    🔁 0    💬 0    📌 0
Preview
AI is the New Rocket Science | Peng Qi
AI science of today has astonishing similarities to rocket science in its prime days, if one pays close attention to history. What are some of these, and what can the history of rocket science tell…

Is #AI the new #RocketScience? In my new blog post, I explore the similarities and connections between the two seemingly distant relatives, and reflect on what today's AI scientists can learn from their rocket cousins, plus what makes AI science unique:

13.03.2025 18:39 | 👍 0    🔁 1    💬 0    📌 0

my manual work of 10+ minutes into a 30s breeze. How can we enable 99% of the population to build this tooling for themselves?
#Reflection #Automation

12.03.2025 18:39 | 👍 0    🔁 0    💬 0    📌 0

#Reflection The need for tooling appears everywhere. While oftentimes unmet, it can significantly scale and improve the productivity of many people when fulfilled, especially in the digital world. Kudos to whoever wrote the AC checklist notebook for ACL 2025 Senior Area Chairs, which reduced ...

12.03.2025 18:39 | 👍 0    🔁 0    💬 1    📌 0


I was worried at first that the story required too much context from the MT original to resonate. It turned out to be a self-contained story with James' unique journey as he navigates racism and slavery, and the voice acting brings the characters/personas to life, esp. if you are not that familiar with AAVE.

28.02.2025 19:39 | 👍 1    🔁 0    💬 0    📌 0

When people ask me about good intro audiobooks, I've always recommended The Martian, which is one of my personal sci-fi favorites, accompanied by a voice performance that truly brought the story to life. Now I'll be adding James (by Percival Everett, performed by Dominic Hoffman) to my personal rec

28.02.2025 19:39 | 👍 2    🔁 0    💬 2    📌 0

Top 6 jobs LinkedIn recommended for me:
* VP of AI at a unicorn public company
* Applied AI Engineer II at a F50 company
* Applied Scientist II at a F10 co.
* Research Intern at a F50 co.
* Senior Principal Applied Scientist at a F500 co.
* Senior Director of Applied Science at a F500 co.
Who am I?!

21.02.2025 22:59 | 👍 0    🔁 0    💬 0    📌 0

@qi2peng2 is following 20 prominent accounts