This thread is a bit long, but I thought it'd be interesting to share just one of the mundane parts of the deep learning stack that break and have to be rethought as models and training scale.
08.06.2025 00:07 · @pbontrager.bsky.social

What goes into saving checkpoints is not something many people think about, but as models get bigger it becomes a real challenge. The biggest open models now have checkpoints of over 700 GB, and consolidating one into a single checkpoint can take tens of minutes every time.
08.06.2025 00:07
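
For a rough sense of scale (the numbers below are illustrative assumptions, not taken from the thread): one of the largest open models has roughly 405B parameters, and bf16 weights cost 2 bytes each, so the weights alone land around 800 GB.

```python
# Back-of-envelope scale check (assumed numbers, not from the thread).
params = 405e9                  # parameter count of a very large open model
bytes_per_param = 2             # bf16
ckpt_gb = params * bytes_per_param / 1e9
write_gb_per_s = 1.0            # assumed effective consolidation bandwidth
print(f"{ckpt_gb:.0f} GB, ~{ckpt_gb / write_gb_per_s / 60:.0f} min to consolidate")
```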

Distributed Checkpoint (DCP) solves this by having every GPU save its own shard of the checkpoint asynchronously, so you can save a checkpoint in less than a second. But this creates a new problem: the next time you want to use the model, you might have a different number of GPUs.
08.06.2025 00:07
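
A minimal sketch of what that per-GPU asynchronous save can look like, assuming a recent PyTorch where torch.distributed.checkpoint exposes async_save; the checkpoint path is just a placeholder:

```python
import torch.distributed.checkpoint as dcp

def save_checkpoint(model, step):
    # Every rank contributes only the shards it already holds;
    # nothing is gathered onto rank 0 or copied to a single GPU.
    state_dict = {"model": model.state_dict()}
    # async_save stages the tensors and writes them in the background,
    # so the training loop can continue almost immediately.
    future = dcp.async_save(state_dict, checkpoint_id=f"checkpoints/step_{step}")
    return future  # call future.result() before starting the next save
```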

On startup, DCP has to map your old GPU layout to your new one so each GPU knows which files to read from and reads only the data it needs. But there's one last problem: when you're ready to take your model to another tool (serving, eval, etc.), that tool expects safetensors checkpoints.
08.06.2025 00:07
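
Resuming on a different number of GPUs looks roughly like the pattern in the PyTorch DCP docs: build the state dict for the new sharding and let dcp.load fill it in. A sketch, with the path as a placeholder:

```python
import torch.distributed.checkpoint as dcp

def load_checkpoint(model, path):
    # The freshly sharded state dict tells DCP which slices this rank owns now;
    # dcp.load reads the checkpoint metadata and fills those slices in place,
    # pulling each rank's bytes from whichever files contain them.
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id=path)
    model.load_state_dict(state_dict["model"])
```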

Safetensors are great for hosting checkpoints and make no assumptions about whether your model is distributed, because they store full unsharded parameters. To work natively with safetensors, DCP needs to tell each GPU the exact slice of data to read without loading the full parameter.
08.06.2025 00:07
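
Safetensors makes this kind of partial read cheap because the file header records each tensor's byte offsets. A sketch using safe_open and get_slice; the file and tensor names are invented for illustration:

```python
from safetensors import safe_open

with safe_open("model-00001-of-00030.safetensors", framework="pt", device="cpu") as f:
    weight = f.get_slice("model.layers.0.mlp.up_proj.weight")
    rows, cols = weight.get_shape()
    # Read only the rows this rank owns; the rest of the tensor never leaves disk.
    local_shard = weight[rows // 2 :, :]
```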

To save, you need to let each GPU write its own partial safetensors file, because communication is slow, and then line up the memory blocks and merge them into one file.
08.06.2025 00:07
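
A toy version of that consolidation step, assuming purely for illustration that each rank wrote a row-shard of every tensor to rank_{i}.safetensors; a real implementation follows the sharding metadata and aligns byte ranges rather than loading everything into memory:

```python
import torch
from safetensors.torch import load_file, save_file

def consolidate(rank_files, out_path):
    shards = [load_file(p) for p in rank_files]            # one dict of tensors per rank
    merged = {
        name: torch.cat([s[name] for s in shards], dim=0)  # line the row blocks back up
        for name in shards[0]
    }
    save_file(merged, out_path)

consolidate([f"rank_{i}.safetensors" for i in range(8)], "model.safetensors")
```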
pytorch.org/blog/hugging...

I'm enjoying it while it lasts before everything fully homogenizes again
26.02.2025 02:04

We've built a simulated driving agent that we trained on 1.6 billion km of driving with no human data.
It is SOTA on every planning benchmark we tried.
In self-play, it goes 20 years between collisions.

Aren't these two paradoxes functionally the same? en.m.wikipedia.org/wiki/Braess%...
27.01.2025 08:54

Original post here: x.com/jjitsev/stat...
25.01.2025 18:07

In the Alice In Wonderland (github.com/LAION-AI/AIW) reasoning and generalization benchmark, DeepSeek R1 appears to perform much more like o1-mini than o1-preview. (Plot from laion-ai)
25.01.2025 17:25

What are the best benchmarks for reasoning models?
20.01.2025 10:32

Can we just study LLM activations/behavior because it's interesting and it can tell us things about language and AI, without imbuing artificial importance or meaning on top of it?
14.01.2025 14:05

Haha, that wasn't lost on me. Facebook's still going strong, but it's a different site and different users from when I was in HS.
13.01.2025 21:14

If you can choose who follows you, that sounds more like "friends" from the old Facebook days.
13.01.2025 20:53

I found out about Warp because I was on jury duty with one of their devs. It's been great compared to the Mac's default terminal.
07.01.2025 23:58

How do you add these?
07.01.2025 16:10

Maybe let's go the other direction and include blog posts in CVs too.
07.01.2025 15:30

That would imply that we solved self-driving (image recognition) and search (language understanding), among other things.
07.01.2025 02:33

This could be a good case for mixed models. The model parsing the text could likely be smaller, or be fairly cheap like DeepSeek.
04.01.2025 21:45

Thankfully, in a small startup you only have to sell an idea to a couple of people and you can get going.
03.01.2025 20:34

One startup I joined had a model getting 95% on benchmarks but terrible in practice. Spent the first 6 months developing new benchmarks instead of a new model.
03.01.2025 01:30

I always set out to propose a new idea and end up proposing a new benchmark instead.
03.01.2025 00:31

What if humanity knows X and wants to understand Z? If a computer can give us Y so that we can understand Z, that would be useful for science. Though I'd say we still wouldn't know Y ourselves yet.
03.01.2025 00:26

Imagine if under the hood o1 is just calling "write better code" over and over again
03.01.2025 00:14

I posted about this recently. Benchmarks show what models can't do, not what they can do.
02.01.2025 23:49

Plagiarize other people's research
01.01.2025 00:54

Imagine being an editor for an LLM, so much work with low confidence that you'll have something interesting in the end.
30.12.2024 22:33

I remember a lot of focus being on the loss function. My impression was that we thought we had models that would work well if only we had a good perceptual loss to train them with. In comes the GAN.
30.12.2024 00:59

Base models are closer, but they're still affected by the company's decisions on which data to filter out, and more indirectly by what data is given free hosting on the internet.
28.12.2024 20:52