[Weekend Read] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents - arxiv.org/pdf/2506.11763 The best deep research AI agents have yet to cross the 50% mark on comprehensiveness and depth, and they still exhibit a 20% citation hallucination rate.
#AI #Research #search
07.09.2025 20:51
[Weekend Read] How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? arxiv.org/abs/2502.14502 TL;DR: Don't use LoRA to add knowledge to LLMs.
Full research note and previous weekend read papers: notes.elie.net/Papers+revie...
#AI #LLM #LoRA
23.08.2025 14:17
[Weekend Read] Subliminal Learning: Language models transmit behavioral traits via hidden signals in data arxiv.org/abs/2507.14805 Surprisingly, during knowledge distillation, student models unconsciously acquire teacher-model characteristics even when training on unrelated data.
#AI #LLM #RLM
16.08.2025 16:23
Happy to announce that we open sourced LMEval, a large model evaluation framework purpose-built to accurately and efficiently compare how models from various providers perform across benchmark datasets opensource.googleblog.com/2...
#AI #LLM #OSS
27.05.2025 20:00
The Leaderboard Illusion: arxiv.org/abs/2504.20879 Looks at some of the shortcomings of Chatbot Arena, which has emerged as the go-to leaderboard for ranking the most capable #AI models. Recently, Meta exploited some of these shortcomings to game #Llama 4 Behemoth results - sherwood.news/tech/meta-scr...
24.05.2025 09:05
Key insights from the Phare benchmark results include that popularity on benchmarks like LMArena doesn't guarantee factual reliability, and that the more confidently a user phrases a query, the less willing models are to refute controversial claims (sycophancy) - www.giskard.ai/knowledge/go...
01.05.2025 02:55
[Weekend Read] Exploring LLM Reasoning Through Controlled Prompt Variations - arxiv.org/abs/2504.02111 Shows how critical it is to have only relevant data in the model context. Accurately filtering out data is very difficult, and simply relying on vector search is NOT the answer.
#AI #RAG
27.04.2025 00:22
[Weekend Read] Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification - arxiv.org/pdf/2502.01839 If you are interested in understanding how scaling computation at generation time helps improve model performance, this is the paper to read.
06.04.2025 07:43
[Weekend Read] Measuring AI Ability to Complete Long Tasks - arxiv.org/pdf/2503.14499 The length of tasks models can complete roughly doubles every 7 months. That's encouraging, but I am unsure how well this holds at the 99% success rate needed to trust agents.
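A quick back-of-the-envelope version of the claim (my notation, not the paper's): if $H_0$ is the task horizon models handle today, then $H(t) \approx H_0 \cdot 2^{t/7}$ with $t$ in months, i.e. about a 4x longer horizon every 14 months.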
#AI #LLM
22.03.2025 16:30
Accelerating Large-Scale Test Migration with LLMs - medium.com/airbnb-engineeri... Airbnb migrated nearly 3.5K Enzyme test files to React Testing Library in just 6 weeks using automation and LLMs, cutting migration time by about 90% (6 weeks vs. an estimated 1.5 years).
#AI #airbnb
21.03.2025 11:19
[Weekend Read] AI Search Has A Citation Problem - www.cjr.org/tow_center/we-c... TL;DR: Building an agent is one thing; getting to the level of reliability where an agent can be trusted is a totally different ball game. Evaluating reliability is critical to real progress.
#AI #Agent #research
15.03.2025 16:23
#Gemma 3 is here! Our new open models are incredibly efficient - the largest 27B model runs on just one H100 GPU. You'd need at least 10x the compute to get similar performance from other models.
#AI #LLM
13.03.2025 03:06
[Weekend Read] Reasoning Language Models: A Blueprint - arxiv.org/abs/2501.11223 All you need to know about how thinking/reasoning models are trained and evaluated.
#AI #RLM #LLM
08.03.2025 12:17
[Weekend Read] The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers - www.microsoft.com/en-us/res... The more confident people are in #GenAI, the less critically they think.
#Research #AI #education
16.02.2025 18:47
[Weekend Read] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training - arxiv.org/pdf/2501.17161 - Reinforcement Learning (RL) helps models generalize, while SFT helps stabilize training.
#AI #LLM #research #RL
02.02.2025 01:06
[Weekend Read] Humanity's Last Exam - static.scale.com/uploads/65... New large-scale knowledge benchmark where the best models barely reach 9%, with DeepSeek-R1 outperforming everyone.
#benchmark #research #AI #LLM #deepseek #openai #anthropic #gemini
25.01.2025 23:49
[Tool Tuesday] LLM, the best CLI utility for interacting with large models - https://github.com/simonw/llm This comprehensive tool supports images, local/remote models, and shell workflows. You can for example type: cat mycode.py | llm -s "Explain this code"
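A few more typical usage patterns (a quick sketch; see the repo for the full command set; mycode.py is a placeholder file):
  pip install llm                               # install the CLI
  llm "Five fun names for a pet pelican"        # one-off prompt to the default model
  llm models                                    # list available local/remote models
  cat mycode.py | llm -s "Explain this code"    # pipe a file in with a system prompt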
#LLM #tool #AI
07.01.2025 21:00
[Weekend Read] Human Creativity in the Age of LLMs - https://arxiv.org/abs/2410.03703 - Worryingly, this study shows that #AI might boost short-term creativity at the expense of long-term creativity. Figuring out how to leverage #LLM without degrading long-term human capabilities is a very pressing issue.
01.12.2024 02:25
[Weekend Read] NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review - https://arxiv.org/pdf/2210.00379 - NeRF models allow you to synthesize and render a 3D scene from any direction based on 2D images. This paper is a good summary of this must-know type of model.
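For reference, the core mechanism is the volume-rendering integral from the original NeRF paper: the color along a camera ray $r(t) = o + td$ is $C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt$, where $\sigma$ is the learned density, $c$ the view-dependent color, and $T(t) = \exp(-\int_{t_n}^{t} \sigma(r(s))\,ds)$ the accumulated transmittance.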
#AI #research #3D
03.11.2024 19:47
[Weekend Read] Scaling Retrieval-Based Language Models with a Trillion-Token Datastore - https://arxiv.org/abs/2407.12854 Good baseline experiment that puts the RAG lift at about 8% on knowledge tasks, with smaller models benefiting more from it.
#LLM #Llama #research #AI #RAG
06.10.2024 15:21
[Tool Tuesday] RagFlow: Open-source #RAG retrieval system - https://github.com/infiniflow/ragflow
It has a nice UI, is easy to deploy, and implements a lot of novel techniques such as GraphRAG. A great out-of-the-box system if you want a chatbot that leverages specialized data.
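A minimal deployment sketch, assuming the Docker Compose setup shipped in the repo's docker/ folder (exact file names may differ between versions):
  git clone https://github.com/infiniflow/ragflow.git
  cd ragflow/docker
  docker compose up -d    # starts the RagFlow server and its dependencies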
#LLM #AI #OSS
27.08.2024 22:28
[Weekend Read] Fairness Definitions in Language Models Explained - https://arxiv.org/pdf/2407.18454 This paper does a great job of simply explaining the key ideas behind evaluating #LLM #fairness and provides key references. A very useful read if you are interested in the topic.
#AI #Research
24.08.2024 19:01
[Weekend Read] Adversaries Can Misuse Combinations of Safe Models - https://arxiv.org/abs/2406.14595?utm_source=bluesky&utm_campaign=elie Studies how to combine frontier models with weaker ones to complete dangerous tasks.
#LLM #AI #Research #Cybersecurity
13.07.2024 16:43
[Friday Fun] The Vanity Fair No. 41 playing card deck is the first transformation deck manufactured by The United States Playing Card Company, in 1895 https://etteilla.org/en/deck/63/vanity-fair-no-41
#playingcards #art #history
22.06.2024 05:42
[Weekend Read] A Careful Examination of Large Language Model Performance on Grade School Arithmetic: http://arxiv.org/abs/2405.00332 By creating a new math benchmark (GSM1K) from scratch, the authors show that many models' training data is likely polluted with benchmark data.
#AI #LLM #GPT
01.06.2024 22:54
[Weekend Read] Better & Faster Large Language Models via Multi-token Prediction - arxiv.org/abs/2404.19737 Quite an interesting technique to increase LLM performance and speed: predict multiple tokens instead of just the next one.
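Roughly (my paraphrase of the paper's objective, not its exact notation): instead of the usual next-token loss, the model gets $n$ output heads on a shared trunk and is trained with $L = -\sum_t \sum_{i=1}^{n} \log P_\theta(x_{t+i} \mid x_{\le t})$; at inference the extra heads can be dropped, or used for self-speculative decoding to speed up generation.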
#AI #Research #LLM
12.05.2024 03:18
[Weekend Read] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training - arxiv.org/abs/2403.09611 This paper studies the factors affecting multimodal LLM performance, including what I think is the best analysis of how data mixture greatly affects performance.
#AI #Research #LLM #VLM
13.04.2024 20:59