
bourbaki7

@bourbaki7.bsky.social

Insurance quant, posting AI papers

83 Followers  |  144 Following  |  87 Posts  |  Joined: 26.12.2024

Latest posts by bourbaki7.bsky.social on Bluesky


Video models are zero-shot learners and reasoners The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation...

arxiv.org/abs/2509.20328

26.09.2025 05:53 | 👍 1  🔁 0  💬 0  📌 0

LIMI: Less is More for Agency We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement...

arxiv.org/abs/2509.17567

25.09.2025 03:37 | 👍 2  🔁 0  💬 0  📌 0

Pre-training under infinite compute Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that exi...

arxiv.org/abs/2509.14786

21.09.2025 22:57 | 👍 2  🔁 0  💬 0  📌 0

The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an a...

arxiv.org/abs/2509.10167

21.09.2025 03:20 | 👍 0  🔁 0  💬 0  📌 0

Causal Attention with Lookahead Keys In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism ...

arxiv.org/abs/2509.07301

21.09.2025 03:18 | 👍 1  🔁 0  💬 0  📌 0

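For context on what CASTLE changes: in standard causal attention each token's Q, K, and V are computed once, and token t attends only to positions at or before t. A minimal NumPy sketch of that baseline (the plain mechanism the paper modifies, not its lookahead-key variant):

```python
# Minimal single-head causal attention baseline (NumPy sketch).
# This is the standard mechanism CASTLE modifies, not CASTLE itself.
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """x: (T, d_model); Wq/Wk/Wv: (d_model, d_head). Returns (T, d_head)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # static Q, K, V per token
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T) attention logits
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)  # token t sees only positions <= t
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d, dh = 5, 16, 8
out = causal_attention(rng.normal(size=(T, d)),
                       rng.normal(size=(d, dh)),
                       rng.normal(size=(d, dh)))
print(out.shape)  # (5, 8)
```
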
Cautious Optimizers: Improving Training with One Line of Code AdamW has been the default optimizer for transformer pretraining. For many years, our community searched for faster and more stable optimizers with only constrained positive outcomes. In this work, we...

Not a new paper, but I hadn't seen it til now

arxiv.org/abs/2411.16085

26.07.2025 05:47 | 👍 0  🔁 0  💬 0  📌 0

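The "one line" here is, as I read the abstract, a mask that drops any coordinate of the optimizer update whose sign disagrees with the current gradient, plus a rescaling to preserve the average step size. A hedged sketch of that idea applied to a generic update (my paraphrase, not the authors' code; the exact rescaling may differ):

```python
# Hedged sketch of the "cautious" masking idea: zero out update coordinates
# whose sign disagrees with the current gradient, then rescale. Paraphrased
# from the abstract; details may differ from the paper's implementation.
import torch

def cautious_step(param, update, grad, lr=1e-3, eps=1e-8):
    mask = (update * grad > 0).to(update.dtype)        # keep only "aligned" coords
    mask = mask * (mask.numel() / (mask.sum() + eps))  # rescale so mean mask value is ~1
    param.data.add_(update * mask, alpha=-lr)

p = torch.nn.Parameter(torch.randn(4))
g = torch.randn(4)            # stand-in for p.grad
u = g + 0.1 * torch.randn(4)  # stand-in for an Adam-style update direction
cautious_step(p, u, g)
```
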
Fast and Simplex: 2-Simplicial Attention in Triton Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count toge...

arxiv.org/abs/2507.02754

26.07.2025 05:25 | 👍 0  🔁 0  💬 0  📌 0

Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large lang...

arxiv.org/abs/2507.10613

26.07.2025 05:22 | 👍 3  🔁 2  💬 0  📌 0

Training Transformers with Enforced Lipschitz Constants Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and ove...

arxiv.org/abs/2507.13338

23.07.2025 21:16 | 👍 0  🔁 0  💬 0  📌 0

Scaling Laws for Optimal Data Mixtures Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard appro...

arxiv.org/abs/2507.09404

19.07.2025 23:30 | 👍 0  🔁 0  💬 0  📌 0

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model y...

arxiv.org/abs/2507.06261

14.07.2025 03:24 | 👍 1  🔁 0  💬 0  📌 1

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful archite...

arxiv.org/abs/2507.07955

13.07.2025 16:49 | 👍 0  🔁 0  💬 0  📌 0

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient s...

arxiv.org/abs/2506.19697

13.07.2025 07:59 | 👍 0  🔁 0  💬 0  📌 0

Thought Anchors: Which LLM Reasoning Steps Matter? Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each gene...

arxiv.org/abs/2506.19143

13.07.2025 03:01 | 👍 0  🔁 0  💬 0  📌 0

Hardware-Efficient Attention for Fast Decoding LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decodi...

arxiv.org/abs/2505.21487

10.07.2025 21:36 | 👍 0  🔁 0  💬 0  📌 0

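The bottleneck in that abstract is easy to put rough numbers on: every decoded token has to stream the whole KV cache out of HBM. A back-of-envelope sketch; the model shape and bandwidth below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope: bytes of KV cache read from HBM per decoded token.
# Model shape and bandwidth are illustrative assumptions, not from the paper.
n_layers   = 32
n_kv_heads = 8          # grouped-query attention
head_dim   = 128
seq_len    = 32_768     # current context length
batch      = 16
bytes_per  = 2          # fp16/bf16 cache

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per
print(f"KV cache streamed per token: {kv_bytes / 1e9:.1f} GB")
# At ~3 TB/s of HBM bandwidth, this read alone lower-bounds per-token latency:
print(f"Lower bound on per-token latency: {kv_bytes / 3e12 * 1e3:.2f} ms")
```
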
any4: Learned 4-bit Numeric Representation for LLMs We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. a...

www.arxiv.org/abs/2507.04610

09.07.2025 04:41 | 👍 0  🔁 0  💬 0  📌 0

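For comparison with any4's learned codes, the conventional baseline is uniform round-to-nearest int4 with a per-group scale and zero point. A sketch of that generic baseline (my assumption about the comparison point, not any4 itself):

```python
# Generic round-to-nearest group-wise int4 weight quantization baseline.
# NOT any4's learned representation -- just the conventional uniform 4-bit
# scheme such papers typically compare against (my assumption).
import torch

def quantize_int4_groupwise(w, group_size=128):
    """w: weight matrix; numel must be divisible by group_size."""
    wg = w.reshape(-1, group_size)                     # flatten into groups
    wmin = wg.min(dim=1, keepdim=True).values
    wmax = wg.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / 15.0       # 16 levels: 0..15
    zero = (-wmin / scale).round()
    q = ((wg / scale) + zero).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_int4_groupwise(q, scale, zero, shape):
    return ((q.float() - zero) * scale).reshape(shape)

w = torch.randn(256, 256)
q, s, z = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, z, w.shape)
print((w - w_hat).abs().max())   # reconstruction error bounded by ~scale/2
```
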
Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, a...

arxiv.org/abs/2507.02119

08.07.2025 18:05 | 👍 0  🔁 0  💬 0  📌 0

Characterization and Mitigation of Training Instabilities in Microscaling Formats Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware ac...

arxiv.org/abs/2506.20752

02.07.2025 20:58 | 👍 0  🔁 0  💬 0  📌 0

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoEs is defined as $N_E \sum_...

arxiv.org/abs/2501.11873

30.06.2025 06:26 | 👍 0  🔁 0  💬 0  📌 0

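The truncated formula is, in its common Switch-Transformer-style form, LBL = N_E * sum_i f_i * p_i, with f_i the fraction of tokens routed to expert i and p_i the mean gate probability on expert i; whether the paper's exact variant matches this is my assumption. A sketch:

```python
# Common form of the MoE load-balancing loss (Switch-Transformer-style):
#   LBL = N_E * sum_i f_i * p_i
# f_i: fraction of routing slots assigned to expert i (from top-k selection),
# p_i: gate probability on expert i averaged over tokens.
# Whether this matches the paper's exact variant is an assumption on my part.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top_k=2):
    """router_logits: (num_tokens, num_experts)."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                      # gate probabilities
    topk_idx = probs.topk(top_k, dim=-1).indices                  # routed experts
    routed = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # (tokens, experts)
    f = routed.mean(dim=0) / top_k    # fraction of routing slots per expert
    p = probs.mean(dim=0)             # mean gate prob per expert
    return num_experts * (f * p).sum()

logits = torch.randn(1024, 8)
print(load_balancing_loss(logits))   # ~1.0 when routing is roughly balanced
```
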
A Statistical Physics of Language Model Reasoning Transformer LMs show emergent reasoning that resists mechanistic understanding. We offer a statistical physics framework for continuous-time chain-of-thought reasoning dynamics. We model sentence-leve...

arxiv.org/abs/2506.04374

29.06.2025 03:26 | 👍 0  🔁 0  💬 0  📌 0

Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have be...

arxiv.org/abs/2506.14913

28.06.2025 03:36 | 👍 1  🔁 0  💬 0  📌 0

OpenThoughts: Data Recipes for Reasoning Models Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-th...

arxiv.org/abs/2506.04178

28.06.2025 00:20 | 👍 0  🔁 0  💬 0  📌 0

Essential-Web v1.0: 24T tokens of organized web data Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We ...

arxiv.org/abs/2506.14111

19.06.2025 02:43 | 👍 0  🔁 0  💬 0  📌 0

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concern...

arxiv.org/abs/2506.05209

19.06.2025 02:41 | 👍 0  🔁 0  💬 0  📌 0

Scaling Laws of Motion Forecasting and Planning -- A Technical Report We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a 500 ...

arxiv.org/abs/2506.08228

15.06.2025 00:56 | 👍 0  🔁 0  💬 0  📌 0

New Insights for Scaling Laws in Autonomous Driving Many recent AI breakthroughs have followed a common pattern: bigger models, trained on more data, with more compute, often deliver extraordinary gains. Waymo's latest study explores whether this trend...

waymo.com/blog/2025/06...

15.06.2025 00:52 | 👍 0  🔁 0  💬 0  📌 0

Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to t...

arxiv.org/abs/2506.08300

12.06.2025 05:37 | 👍 0  🔁 0  💬 0  📌 0

Text-to-LoRA: Instant Transformer Adaption While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repea...

arxiv.org/abs/2506.06105

12.06.2025 05:31 | 👍 0  🔁 0  💬 0  📌 0

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes...

machinelearning.apple.com/research/ill...

08.06.2025 03:59 | 👍 0  🔁 0  💬 0  📌 0

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and...

arxiv.org/abs/2502.19002

06.06.2025 22:45 | 👍 0  🔁 0  💬 0  📌 0
