
Elie

@ebursztein.bsky.social

3 Followers  |  0 Following  |  44 Posts  |  Joined: 06.02.2024

Latest posts by ebursztein.bsky.social on Bluesky


[Weekend Read] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents - arxiv.org/pdf/2506.11763 The best deep research AI agents have yet to cross the 50% mark in terms of comprehensiveness and depth. They also exhibit a 20% citation hallucination rate.
#AI #Research #search

07.09.2025 20:51 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence - lnkd.in/gnBWstn7 Early data on the impact of #AI on #employment suggest that entry-level white-collar jobs, including software engineering, marketing, and support, are the most affected.

01.09.2025 04:25 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? arxiv.org/abs/2502.14502 TL;DR: Don't use LoRA to add knowledge to LLMs.

Full research note and previous weekend read papers: notes.elie.net/Papers+revie...

#AI #LLM #LoRA

23.08.2025 14:17 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Subliminal Learning: Language models transmit behavioral traits via hidden signals in data - arxiv.org/abs/2507.14805 Surprisingly, during knowledge distillation student models acquire teacher-model characteristics even when training on unrelated data.

#AI #LLM #RLM

16.08.2025 16:23 - 👍 0    🔁 1    💬 0    📌 0

Happy to announce that we open-sourced LMEval, a large model evaluation framework purpose-built to accurately and efficiently compare how models from various providers perform across benchmark datasets - opensource.googleblog.com/2...

#AI #LLM #OSS

27.05.2025 20:00 - 👍 0    🔁 0    💬 0    📌 0

The leaderboard illusion: arxiv.org/abs/2504.20879 Looks at some of the shortcomings of the Chatbot Arena, which has emerged as the go-to leaderboard for ranking the most capable #AI models. Recently Meta abused some of these shortcomings to game #Llama 4 Behemoth results - sherwood.news/tech/meta-scr...

24.05.2025 09:05 - 👍 0    🔁 0    💬 0    📌 0

Key insights from the Phare Benchmark results include that popularity on leaderboards like LMArena doesn't guarantee factual reliability, and that the more confidently a user phrases a query, the less willing models are to refute controversial claims (sycophancy) - www.giskard.ai/knowledge/go...

01.05.2025 02:55 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Exploring LLM Reasoning Through Controlled Prompt Variations - arxiv.org/abs/2504.02111 Shows how critical it is to have only relevant data in the model context. Accurately filtering out irrelevant data is very difficult, and simply relying on vector search is NOT the answer.
#AI #RAG

27.04.2025 00:22 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] RealHarm: A Collection of Real-World Language Model Application Failures arxiv.org/abs/2504.10277 By looking at real-world examples of AI failures, this paper highlights the disconnect between what safety filters block and what actually goes wrong in practice.

#safety #research #ai

19.04.2025 16:36 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification - arxiv.org/pdf/2502.01839 If you are interested in understanding how scaling computation at generation time helps improve model performance, this is the paper to read.

06.04.2025 07:43 - 👍 1    🔁 0    💬 0    📌 0

[Weekend Read] Measuring AI Ability to Complete Long Tasks - arxiv.org/pdf/2503.14499 The length of tasks models can complete roughly doubles every 7 months. That's encouraging; however, I am unsure how well this holds at the 99% success rate needed to trust agents.

#AI #LLM

22.03.2025 16:30 - 👍 0    🔁 0    💬 0    📌 0

Accelerating Large-Scale Test Migration with LLMs - medium.com/airbnb-engineeri... Airbnb leveraged LLMs to migrate nearly 3.5K Enzyme test files to React Testing Library in 6 weeks instead of an estimated 1.5 years, cutting migration time by about 90%.

#AI #airbnb

21.03.2025 11:19 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] AI Search Has A Citation Problem - www.cjr.org/tow_center/we-c... TL;DR: Building an agent is one thing; getting to the level of reliability where an agent can be trusted is a totally different ball game. Evaluating reliability is critical to true progress.

#AI #Agent #research

15.03.2025 16:23 - 👍 0    🔁 0    💬 0    📌 0

#Gemma 3 is here! Our new open models are incredibly efficient - the largest 27B model runs on just one H100 GPU. You'd need at least 10x the compute to get similar performance from other models 👇

#AI #LLM

13.03.2025 03:06 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Reasoning Language Models: A Blueprint - arxiv.org/abs/2501.11223 All you need to know about how thinking/reasoning models are trained and evaluated.

#AI #RLM #LLM

08.03.2025 12:17 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers - www.microsoft.com/en-us/res... The more confident people are in #GenAI, the less they think critically.
#Research #AI #education

16.02.2025 18:47 - 👍 1    🔁 0    💬 0    📌 0

[Weekend Read] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training - arxiv.org/pdf/2501.17161 - Using Reinforcement Learning (RL) helps generalization while SFT helps stabilize training.

#AI #LLM #research #RL

02.02.2025 01:06 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Humanity's Last Exam - static.scale.com/uploads/65... New large-scale knowledge benchmark where the best models barely reach 9%, with DeepSeek-R1 outperforming everyone.

#benchmark #research #AI #LLM #deepseek #openai #anthropic #gemini

25.01.2025 23:49 - 👍 2    🔁 0    💬 0    📌 0

[Tool Tuesday] LLM, the best CLI utility for interacting with large models - https://github.com/simonw/llm This comprehensive tool supports images, local/remote models, and shell workflows. You can for example type: cat mycode.py | llm -s "Explain this code"
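
A few more example invocations, as a rough sketch of typical usage (the model name and mycode.py below are placeholders, and availability depends on which plugins and API keys you have set up; see the project's README for the current commands):

pip install llm                                       # install the CLI
llm keys set openai                                   # store an API key for a hosted provider
llm "Five fun names for a pet pelican"                # one-off prompt using the default model
llm -m gpt-4o-mini "Summarize what a LoRA adapter is" # pick a specific model with -m
cat mycode.py | llm -s "Explain this code"            # pipe a local file in with a system prompt
llm logs                                              # review previously logged prompts and responses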

#LLM #tool #AI

07.01.2025 21:00 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Human Creativity in the Age of LLMs - https://arxiv.org/abs/2410.03703 - Worryingly, this study shows that #AI might boost short-term creativity at the expense of long-term creativity. Figuring out how to leverage #LLM without degrading human long-term capabilities is a very pressing issue.

01.12.2024 02:25 - 👍 0    🔁 1    💬 2    📌 0

[Weekend Read] NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review - https://arxiv.org/pdf/2210.00379 NeRF models make it possible to synthesize and render a 3D scene from any viewpoint based on 2D images. This paper is a good summary of this must-know type of model.

#AI #research #3D

03.11.2024 19:47 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Scaling Retrieval-Based Language Models with a Trillion-Token Datastore - https://arxiv.org/abs/2407.12854 Good baseline experiment that puts the RAG lift at about 8% on knowledge tasks, with smaller models benefiting more from it.

#LLM #Llama #research #AI #RAG

06.10.2024 15:21 - 👍 0    🔁 0    💬 0    📌 0

[Tool Tuesday] RagFlow: Open-source #RAG retrieval system - https://github.com/infiniflow/ragflow
It has a nice UI, is easy to deploy, and implements a lot of novel techniques such as GraphRAG. Great out-of-the-box system if you want a chatbot that leverages specialized data; see the deployment sketch below.
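
For reference, a rough sketch of the Docker-based quick start described in the project's README (compose file names, ports, and hardware requirements evolve, so double-check the repo before running):

git clone https://github.com/infiniflow/ragflow.git   # grab the source and compose files
cd ragflow/docker
docker compose -f docker-compose.yml up -d             # start the RagFlow stack in the background
# once the containers are up, open the web UI in a browser (port set in the compose/.env files)
# and configure your model API keys before building a knowledge base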

#LLM #AI #OSS

27.08.2024 22:28 - 👍 1    🔁 0    💬 0    📌 0

[Weekend Read] Fairness Definitions in Language Models Explained - https://arxiv.org/pdf/2407.18454 This paper does a great job of simply explaining the key ideas behind evaluating #LLM #fairness and provides key references. Very useful read if you are interested in the topic.

#AI #Research

24.08.2024 19:01 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Adversaries Can Misuse Combinations of Safe Models - https://arxiv.org/abs/2406.14595?utm_source=bluesky&utm_campaign=elie Studies how adversaries can combine frontier models with weaker ones to complete dangerous tasks.

#LLM #AI #Research #Cybersecurity

13.07.2024 16:43 - 👍 0    🔁 0    💬 0    📌 0

[Friday Fun] The Vanity Fair No. 41 playing card deck is the first transformation deck manufactured by The United States Playing Card Company, in 1895 - https://etteilla.org/en/deck/63/vanity-fair-no-41

#playingcards #art #history

22.06.2024 05:42 - 👍 0    🔁 0    💬 0    📌 0

Interesting write-up on how relying on an LLM to write code can be very costly - https://web.archive.org/web/20240610032818/https://asim.bearblog.dev/how-a-single-chatgpt-mistake-cost-us-10000/

#LLM #AI

13.06.2024 21:00 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] A Careful Examination of Large Language Model Performance on Grade School Arithmetic: http://arxiv.org/abs/2405.00332 By creating a new math benchmark (GSM1K) from scratch, the authors show that many models' training data is likely polluted with benchmark data.

#AI #LLM #GPT

01.06.2024 22:54 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] Better & Faster Large Language Models via Multi-token Prediction - arxiv.org/abs/2404.19737 Quite an interesting technique to increase LLM performance and speed: predict multiple tokens instead of just the next one.
#AI #Research #LLM

12.05.2024 03:18 - 👍 0    🔁 0    💬 0    📌 0

[Weekend Read] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training - arxiv.org/abs/2403.09611 This paper studies the factors affecting multimodal LLM performance, including what I think is the best analysis of how the data mixture greatly affects performance.

#AI #Research #LLM #VLM

13.04.2024 20:59 - 👍 0    🔁 0    💬 0    📌 0