How do we prove that #AI can't do #maths?
Real Mathematics (yes, "real" is a pun here):
a+b+c = (a+b)+c = a+(b+c)
AI Mathematics (well, floating point maths, really):
>>> 0.1+0.2+0.3
0.6000000000000001
>>> 0.1+(0.2+0.3)
0.6
QED.
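For anyone who wants to poke at the joke: a minimal sketch, assuming Python 3 with standard IEEE 754 doubles. The variable names are illustrative only. It reproduces the two groupings from the post, then shows that exact rationals keep associativity and that `math.fsum` rounds only once at the end.

```python
from fractions import Fraction
import math

# Floating point: each + rounds to the nearest double, so grouping matters.
left = (0.1 + 0.2) + 0.3    # 0.6000000000000001, as in the post
right = 0.1 + (0.2 + 0.3)   # 0.6, as in the post
print(left == right)        # False

# Exact rationals: no rounding, so associativity holds.
a, b, c = Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
print((a + b) + c == a + (b + c))   # True

# math.fsum tracks the lost low-order bits and rounds once at the end.
print(math.fsum([0.1, 0.2, 0.3]))   # 0.6
```

So the "proof" is really about rounding after every addition, not about AI.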
@qi2peng2.bsky.social
Multimodal Agents Research @ Orby AI. Ex-AWS AI, JD AI. PhD from @stanfordnlp.bsky.social, UG Tsinghua U. He/him. Opinions my own.
This project was joint work with my Amazon colleagues (led by Yumo Xu), and it's great to see it finally published. Hope this helps motivate more careful eval work in the near future!
#AI #agent #evaluation #RAG #NLP
b) as builders, we evaluate the technology soberly and help users navigate these risks in product design.
Want to learn more? Check out
Our paper: arxiv.org/pdf/2506.01829
Open-source code: github.com/amazon-scien...
Why should you care? As businesses / individuals leverage AI more and more to speed up research and decision-making, it is important that, a) as users, we examine the tools we are using to understand their limitations and avoid pitfalls with significant potential downsides, and
16.07.2025 00:40
With a new, carefully annotated dataset and an automated evaluation metric we designed, we find that although LLMs are reasonably good at citing accurate sources most of the time, SOTA LLMs still cite incorrectly 5-28% of the time, and miss citations anywhere from 16% to an alarming 95% of the time.
16.07.2025 00:40
"CiteEval: Principle-Driven Citation Evaluation for Source Attribution", we propose a framework to systematically study citation accuracy by considering previously neglected contexts such as user-provided information and LLMs' parametric knowledge.
16.07.2025 00:40
As AI deep research agents become an essential part of many people's day-to-day work, it is more essential than ever before that we can trust what they produce.
When these agents cite sources they claim the report is based on, how much can we actually trust them? In our new #ACL2025 paper, ...
AI agents research, and share a skeleton of a research proposal if I ever find the time / need to write one from my perspective today: qipeng.me/blog/stop-us...
02.07.2025 18:39
a good problem in its historical context, some of my own research attempts at solving this problem that I believe are on the critical path to autonomous agents, and what's changed today to make the dataset less relevant in its original form. I also reflect on the possible paths forward for ...
02.07.2025 18:39
Seven years ago, I co-led a paper called HotpotQA that has motivated and facilitated many #AI #Agents research works since. Today, I'm asking that you stop using HotpotQA blindly for agents research in 2025 and beyond.
In my new blog post, I revisit the brief history of HotpotQA, why it defined ...
The longer employers don't acknowledge and embrace this discrepancy, the faster they lose the top candidates they spent enormous efforts to hire and retain, leaving the organization in self-fulfilling mediocrity.
22.05.2025 15:01
but not perfectly aligned with those of the employer. They ask: Will I get the opportunity to build a career beyond what is immediately required of me? Will I learn and grow, be part of a great team and culture? Will I make a name for myself while doing great work? Will I remain competitive?
3/
We are never settling for a candidate that does exactly the thing that needs to be done right now, since that thing itself can change before you know it.
But too often employers and managers forget that highly motivated and capable candidates also hold expectations parallel to these, ...
2/
When making great #hiring decisions, we often look for growth potential in a candidate. Will they rise to the occasion when unforeseen challenges arise? Will they grow in the role, and lift up others in the team? Will they still be able to contribute if business direction changes?
1/
While many aspects of our work (especially in the digital world) can be amenable to #AI #automation, it is also through automation that we rediscover, again and again, the true meaning of our work and our unique humanness.
#MondayReflection /fin
this coding phase alone, and I ended up delivering something slightly better than I would've done without it.
As with any technological evolution, tools themselves never fully replace the humans doing the work, but greatly enhance the ones that embrace them and adapt to working with them. 6/
But, of course, this has its implications. I did save a lot of time looking up programming resources on things I have a vague understanding of and wasn't very familiar with, and didn't have to type all those characters. By my estimate, the AI assistant did save me 50-80% of the effort of 5/
06.05.2025 01:27
(especially to the general public) the fact that the act of putting code down is typically the *least* mentally effortful part of the work. It's as if saying "my 3D printer made 100% of my new shiny collections" -- true in the narrow sense of the printing effort, but it's missing the point. 4/
06.05.2025 01:27
how to fix things that aren't working (the AI helped a bit with this too at times), and how to keep things future-proof. In this regard, I still did >90% of the most important *work* in this project. Saying AI wrote 99% of the code, while factually correct in one particular sense (line count), obscures 3/
06.05.2025 01:27
AI with no manual edits from me, or that the AI assistant was the last to "touch" those lines of code.
What this 99% number glosses over is the amount of time my colleague and I spent in numerous offline discussions, the times where I had to stop and think about what to ask the AI to code next, 2/
In one of my recent projects, AI code assistants actually DID write 99% of my code, and the project was reasonably complex, starting from scratch. Does this mean I'm obsolete now? Here's the catch: when I say AI wrote 99% of the code, I am counting roughly how many lines were directly generated by 1/
06.05.2025 01:27
#AI wrote 99% of my code, now what?
Big tech executives and business analysts are racing to share eye-catching statements like "AI will write XX% of the code at MetaCorp by 20YY." How much truth is there to these, and what implications might this have?
🧵
Non-native speakers sometimes have a unique advantage in language-based humor, stemming from their unfamiliarity with idiomatic expressions. I saw an "assembly of god" on the road and thought to myself, "wait, they have a factory to build gods here?"
28.03.2025 21:50
Is #AI the new #RocketScience? In my new blog post, I explore the similarities and connections between the two seemingly distant relatives, and reflect on what today's AI scientists can learn from their rocket cousins, plus what makes AI science unique:
13.03.2025 18:39
my manual work of at least 10+ minutes into a 30s breeze. How can we enable 99% of the population to build this tooling for themselves?
#Reflection #Automation
#Reflection The need for tooling appears everywhere. While oftentimes unmet, it can significantly scale and improve the productivity of many people when fulfilled, especially in the digital world. Kudos to whoever wrote the AC checklist notebook for ACL 2025 Senior Area Chairs, which reduced ...
12.03.2025 18:39
I was worried at first that the story required too much context from the MT original to resonate. It turned out to be a self-contained story following James' unique journey as he navigates racism and slavery, and the voice acting brings the characters/personas to life, especially if you are not that familiar with AAVE.
When people ask me about good intro audiobooks, I've always recommended The Martian, which is one of my personal sci-fi favorites accompanied with a voice performance that truly brought the story to life. Now I'll be adding James (by Percival Everett, performed by Dominic Hoffman) to my personal rec
28.02.2025 19:39
Top 6 jobs LinkedIn recommended for me:
* VP of AI at a unicorn public company
* Applied AI Engineer II at a F50 company
* Applied Scientist II at a F10 co.
* Research Intern at a F50 co.
* Senior Principal Applied Scientist at a F500 co.
* Senior Director of Applied Science at a F500 co.
Who am I?!