EvoLM: In Search of Lost Language Model Training Dynamics
Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, ...
Dive in: arxiv.org/abs/2506.16029
Blog Post: zhentingqi.github.io/internal/pro...
Thread: x.com/_hanlin_zhan...
Work by Zhenting Qi and the team: Fan Nie, Alexandre Alahi, @jameszou.bsky.social, Himabindu Lakkaraju, Yilun Du, Eric Xing, and @shamkakade.bsky.social
02.07.2025 20:05
- Open-source everything: models, data, training, and evaluation pipeline
- Maintain the EvoLM model family with clear data provenance
- Support the community in extending this foundation for future LLM research
02.07.2025 20:05
We seek to:
- Build a fully transparent and reproducible model suite for studying LM training
- Quantify how each training phase contributes to upstream cloze-task performance and downstream generative-task performance, in both in-domain and out-of-domain settings
02.07.2025 20:05
Introducing EvoLM, a model suite with 100+ decoder-only LMs (1B/4B) trained from scratch across four training stages (a minimal pipeline sketch follows this post):
- Pre-training
- Continued Pre-Training (CPT)
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning (RL)
02.07.2025 20:05
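The post above lists the four stages each EvoLM model passes through. As a rough illustration of that pipeline and of the provenance the suite is meant to expose, here is a minimal sketch; the stage names and model sizes follow the post, while the evaluation stub and the loop structure are hypothetical placeholders rather than EvoLM's actual configuration:

```python
# Minimal sketch of a four-stage training pipeline with explicit provenance.
# Stage names and sizes come from the post; everything else is illustrative.
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    size: str                                   # "1B" or "4B", per the post
    stages: list = field(default_factory=list)  # provenance: stages applied so far

    def apply(self, stage: str) -> "Checkpoint":
        # each stage produces a new checkpoint that remembers its full lineage
        return Checkpoint(self.size, self.stages + [stage])

def evaluate(ckpt: Checkpoint) -> dict:
    # Placeholder: the real suite scores upstream (cloze) and downstream
    # (generative) tasks, both in-domain and out-of-domain, at every stage.
    return {"upstream_id": None, "upstream_ood": None,
            "downstream_id": None, "downstream_ood": None}

if __name__ == "__main__":
    for size in ("1B", "4B"):
        ckpt = Checkpoint(size)
        for stage in ("pretrain", "cpt", "sft", "rl"):
            ckpt = ckpt.apply(stage)
            print(size, " -> ".join(ckpt.stages), evaluate(ckpt))
```

The point of carrying the full stage lineage on every checkpoint is that the effect of any single design choice can be read off by comparing checkpoints that differ in exactly one stage.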
New work [JSKZ25] w/ Jikai, Vasilis, and @shamkakade.bsky.social.
We introduce new formulations and tools for evaluating LM capabilities, which help explain observed post-training behaviors of Qwen-series models.
More details:
- hanlin-zhang.com/causal-capab...
- x.com/_hanlin_zhan...
18.06.2025 18:02
Eliminating Position Bias of Language Models: A Mechanistic Approach
Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpecte...
[3/4] LMs can suffer from position bias: they favor content based on where it appears in the context, which can hurt reasoning and evaluation.
We introduce PINE, a training-free method that eliminates position bias via bidirectional attention and reordering documents by their attention scores.
(arxiv.org/abs/2407.01100)
23.04.2025 01:35
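PINE itself operates inside the transformer, making inter-document attention bidirectional and reassigning positions by attention score. As a conceptual stand-in for the reordering step only, the sketch below scores each retrieved document by the attention mass a query places on it (plain dot-product attention over toy embeddings, not PINE's actual mechanism) and reorders documents accordingly:

```python
# Conceptual stand-in for "reorder documents by attention score".
# Not PINE's implementation: real PINE works inside the model's attention layers.
import numpy as np

def reorder_docs_by_attention(query_vecs, doc_vecs):
    """query_vecs: (q_len, d) array; doc_vecs: list of (doc_len_i, d) arrays."""
    keys = np.concatenate(doc_vecs, axis=0)                      # all doc tokens stacked
    logits = query_vecs @ keys.T / np.sqrt(query_vecs.shape[-1])
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)                     # softmax over all doc tokens
    # attention mass each query token puts on each document, averaged over query tokens
    bounds = np.cumsum([0] + [d.shape[0] for d in doc_vecs])
    scores = [attn[:, bounds[i]:bounds[i + 1]].sum(axis=-1).mean()
              for i in range(len(doc_vecs))]
    order = np.argsort(scores)[::-1]   # most-attended document gets the earliest position
    return [doc_vecs[i] for i in order], order

rng = np.random.default_rng(0)
docs = [rng.normal(size=(5, 16)) for _ in range(3)]
query = rng.normal(size=(4, 16))
_, order = reorder_docs_by_attention(query, docs)
print("document order by attention score:", order)
```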
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights...
[2/4] Can LLMs self-improve by verifying their own outputs? This paper says yes, with a twist: the key is the Generation-Verification Gap (GV-Gap), a measure that scales log-linearly with pretraining FLOPs.
Oral @yus167.bsky.social 6A: Sat 26 Apr 4:18-4:30.
(arxiv.org/abs/2412.02674)
23.04.2025 01:35
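One simple way to operationalize a generation-verification gap, assuming we have several sampled generations per problem, each with a ground-truth correctness label and a self-assigned verifier score; the paper's exact estimator may differ, so treat this as an illustration of the quantity being compared:

```python
# Toy estimator of a generation-verification gap (illustrative, not the paper's).
import random

def gv_gap(problems):
    """problems: list of problems, each a list of (is_correct, verifier_score) samples."""
    gen_acc = ver_acc = 0.0
    for samples in problems:
        # generation accuracy: expected correctness of a randomly drawn sample
        gen_acc += sum(c for c, _ in samples) / len(samples)
        # verification accuracy: correctness of the sample the verifier ranks highest
        ver_acc += max(samples, key=lambda s: s[1])[0]
    n = len(problems)
    return ver_acc / n - gen_acc / n   # positive gap: verification buys real improvement

random.seed(0)
toy = [[(random.random() < 0.4, random.random()) for _ in range(8)] for _ in range(200)]
print(f"GV-gap on toy data (uninformative verifier, so near zero): {gv_gap(toy):.3f}")
```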
How the von Neumann bottleneck is impeding AI computing
The von Neumann architecture, which separates compute and memory, is perfect for conventional computing. But it creates a data traffic jam for AI.
[1/4] Modern large-scale LM training is limited not just by compute but by data movement, a classic von Neumann bottleneck (research.ibm.com/blog/why-von...).
Scaling the batch size reduces the number of optimization steps, but only up to a point: the Critical Batch Size (CBS).
23.04.2025 01:35
Highlights from #ICLR2025: a brief thread
23.04.2025 01:35
I want to reshare @brandfonbrener.bsky.social's @NeurIPSConf 2024 paper on CoLoR-Filter: a simple yet powerful method for selecting high-quality data for language model pre-training!
With @hlzhang109.bsky.social @schwarzjn.bsky.social @shamkakade.bsky.social
05.04.2025 12:04
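As I read the abstract, CoLoR-Filter scores each candidate sequence by how much an auxiliary model fine-tuned on downstream data reduces its loss relative to a "prior" model, and keeps the top-scoring fraction. The sketch below encodes that selection rule with stubbed-in loss functions; it is not the authors' implementation:

```python
# Hedged sketch of a conditional-loss-reduction selection rule in the spirit of
# CoLoR-Filter. The two loss callables stand in for a prior LM and an LM
# fine-tuned on downstream data; real scores would be per-token NLLs.
import heapq

def color_filter(candidates, prior_loss, conditional_loss, keep_fraction=0.25):
    """Keep the candidates whose loss drops most under the conditional model."""
    scored = [(prior_loss(x) - conditional_loss(x), x) for x in candidates]
    k = max(1, int(len(candidates) * keep_fraction))
    return [x for _, x in heapq.nlargest(k, scored, key=lambda t: t[0])]

# Toy usage with stand-in loss functions (real ones would come from two LMs).
prior = lambda x: len(x) * 0.9
conditional = lambda x: len(x) * (0.5 if "math" in x else 0.95)
pool = ["a math proof", "celebrity gossip", "another math derivation", "sports recap"]
print(color_filter(pool, prior, conditional, keep_fraction=0.5))
```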
(1/n) How can we speed up the serial runtime of long pre-training runs? Enter the Critical Batch Size (CBS): the tipping point where the gains of data parallelism are balanced by diminishing efficiency. Doubling the batch size halves the number of optimization steps until we hit the CBS, beyond which returns diminish.
22.11.2024 20:19
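The "doubling the batch size halves the steps, until the CBS" behavior can be illustrated with a standard functional form from the large-batch-training literature, steps(B) = S_min * (1 + B_crit / B); this is not necessarily the parameterization used in the paper, and the S_MIN / B_CRIT values below are made-up numbers:

```python
# Illustrative steps-vs-batch-size tradeoff; constants are hypothetical.
S_MIN = 10_000      # hypothetical floor on serial optimization steps
B_CRIT = 2**21      # hypothetical critical batch size, in tokens

def steps_to_target(batch_tokens: int) -> float:
    # steps needed to reach a fixed loss target at a given batch size
    return S_MIN * (1 + B_CRIT / batch_tokens)

prev = None
for exp in range(15, 25):
    b = 2**exp
    s = steps_to_target(b)
    speedup = prev / s if prev else float("nan")
    print(f"batch={b:>10,} tok  steps={s:>9,.0f}  speedup from doubling={speedup:4.2f}x")
    prev = s
```

Below B_crit the speedup from each doubling stays close to 2x; above it, the speedup decays toward 1x and extra data parallelism is mostly wasted.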
LLM self-improvement has critical implications for synthetic data, post-training, and test-time inference. To understand LLMs' true capacity for self-improvement, we run large-scale experiments across multiple LLM families, tasks, and mechanisms. Here is what we found: (1/9)
06.12.2024 18:02
https://miguelhernan.org/
Using health data to learn what works.
Making #causalinference less casual.
Director, @causalab.bsky.social
Professor, @hsph.harvard.edu
Methods Editor, Annals of Internal Medicine @annalsofim.bsky.social
My opinions only here.
RS DeepMind
Past:
R Midjourney 1y; DPhil AIMS Uni of Oxford 4.5y
RE DeepMind 1y; SWE Google 3y; TUM
@nwspk
PhD student with the Harvard ML Foundations group.
ML/AI researcher & former stats professor turned LLM research engineer. Author of "Build a Large Language Model From Scratch" (https://amzn.to/4fqvn0D). Blogging about AI research at magazine.sebastianraschka.com.
Google Chief Scientist, Gemini Lead. Opinions stated here are my own, not those of Google. Gemini, TensorFlow, MapReduce, Bigtable, Spanner, ML things, ...
Distinguished Scientist at Google. Computational Imaging, Machine Learning, and Vision. Posts are personal opinions. May change or disappear over time.
http://milanfar.org
Associate Professor at Princeton
Machine Learning Researcher
Researcher and CIFAR Fellow, working on the intersection of machine learning and neuroscience in MontrΓ©al at @mcgill.ca and @mila-quebec.bsky.social.
AI professor at Caltech. General Chair ICLR 2025.
http://www.yisongyue.com
AI safety at Anthropic, on leave from a faculty job at NYU.
Views not employers'.
I think you should join Giving What We Can.
cims.nyu.edu/~sbowman
I work at Sakana AI (@sakanaai.bsky.social)
https://sakana.ai/careers
a mediocre combination of a mediocre AI scientist, a mediocre physicist, a mediocre chemist, a mediocre manager and a mediocre professor.
see more at https://kyunghyuncho.me/
source: https://arxiv.org/rss/stat.ML
maintainer: @tmaehara.bsky.social
Computer science, math, machine learning, (differential) privacy
Researcher at Google DeepMind
Kiwi in California
http://stein.ke/
Professor and Head of Machine Learning Department at Carnegie Mellon. Board member OpenAI. Chief Technical Advisor Gray Swan AI. Chief Expert Bosch Research.
Mathematician at UCLA. My primary social media account is https://mathstodon.xyz/@tao . I also have a blog at https://terrytao.wordpress.com/ and a home page at https://www.math.ucla.edu/~tao/
Professor, Stanford University, Statistics and Mathematics. Opinions are my own.
Professor at Penn, Amazon Scholar at AWS. Interested in machine learning, uncertainty quantification, game theory, privacy, fairness, and most of the intersections therein