Yep that's right! A very common use-case is for "document masking" (i.e. variable length sequences), and that requires recomputing the mask on every iteration (which isn't "free", but is on the order of microseconds to milliseconds and not seconds).
               
            
            
06.02.2025 22:22 · 3 likes · 0 reposts · 0 replies · 0 quotes

            What does "there" mean in this case :)
               
            
            
02.01.2025 13:13 · 1 like · 0 reposts · 1 reply · 0 quotes

YouTube video by Jane Street: Building Machine Learning Systems for a Trillion Trillion Floating Point Operations

            @chhillee.bsky.social's talk at Jane Street is now up!
youtu.be/139UPjoq7Kw?...
               
            
            
11.12.2024 12:19 · 30 likes · 7 reposts · 0 replies · 1 quote

I'll count it!
               
            
            
03.12.2024 08:22 · 2 likes · 0 reposts · 0 replies · 0 quotes

GitHub - Smith42/astroPT: Transformer for galaxy images (and general astronomy)

Getting different attention masks working for AstroPT (a proto-foundation model for astronomy github.com/Smith42/astr...), so much nicer to do it with Flex Attention vs custom CUDA kernels -- thank you for releasing it to the world 🫡
               
            
            
02.12.2024 09:30 · 4 likes · 1 repost · 0 replies · 0 quotes

            Kinda interesting to me that the books I obsessively read as an elementary schooler are still some of the most popular series today.
               
            
            
01.12.2024 23:51 · 3 likes · 0 reposts · 1 reply · 0 quotes

            
                        
            
        
    
    
I think torch-xla is definitely usable if you don't want to train anything particularly weird or use unusual parallelism schemes. See this tweet from Saining Xie's lab on evaluating torch-xla vs. Jax for their use case: x.com/tongpetersb/...
               
            
            
01.12.2024 08:35 · 1 like · 0 reposts · 1 reply · 0 quotes

The other nice part about TPUs is that Google gives out many more of them for free compared to GPUs. Arguably this just reflects how much people want to use them, but I think it's been a great boon for the academic labs willing to go through the effort.
               
            
            
01.12.2024 02:24 · 0 likes · 0 reposts · 2 replies · 0 quotes

            I judge social networks by how many FlexAttention users I can find on each one, and by that metric, Bluesky is doing pretty good!
               
            
            
01.12.2024 02:21 · 50 likes · 1 repost · 1 reply · 0 quotes

            ! What were you using it for?
               
            
            
01.12.2024 01:49 · 1 like · 0 reposts · 1 reply · 0 quotes

            A lot of PyTorch is about dealing with this stuff nowadays!
               
            
            
01.12.2024 01:47 · 3 likes · 0 reposts · 0 replies · 0 quotes

            Out of curiosity, what kind of shapes are you typically looking at?
               
            
            
01.12.2024 01:46 · 1 like · 0 reposts · 1 reply · 0 quotes

            Are they actually using FlexAttention here? I didn't see it in the repo
               
            
            
01.12.2024 01:44 · 0 likes · 0 reposts · 1 reply · 0 quotes

            First thought: Seems kinda "FlexAttention-y": https://bsky.app/profile/sungkim.bsky.social/post/3lbjbfmyqts27 
Second thought: oh cool, they're already using FlexAttention! 
it's a nice usage of the `or_masks` and `and_masks` API - I think they do (causal & sliding_window) | (register_mask)
               
            
            
23.11.2024 01:55 · 9 likes · 0 reposts · 0 replies · 0 quotes

- Google Chief Scientist, Gemini Lead. Opinions stated here are my own, not those of Google. Gemini, TensorFlow, MapReduce, Bigtable, Spanner, ML things, ...
- Reproducible bugs are candies 🍬
- HPC, BLAS, I make things FAST✨ Standing on the shoulders of giants. TLDR; 💩posting. Haver of opinions that are all my own. I mainly do #HPC #BLAS #AI #RVV and #clusters. Proud French Canadian, you'll hear about it. (I help with HPC.social)
- AI x storytelling. AI Engineering: https://amazon.com/dp/1098166302 · Designing ML Systems: http://amazon.com/dp/1098107969 · @chipro
- Interpretable Deep Networks. http://baulab.info/ @davidbau
- Differentiable Programming & Scientific Machine Learning
- Mathematician at UCLA. My primary social media account is https://mathstodon.xyz/@tao . I also have a blog at https://terrytao.wordpress.com/ and a home page at https://www.math.ucla.edu/~tao/
- Code, AI, and 3D printing. Opinions are my own, not my computer's...for now. Co-creator of DALL-E 2. Researcher @openai.
- PhD @Stanford @HazyResearch in AI Systems, Incoming Assistant Professor @Caltech CMS
- I work at Sakana AI 🐟🐠🐡 · @sakanaai.bsky.social · https://sakana.ai/careers
- ML Engineer at Anlatan (NovelAI). co-author of HDiT (Hourglass Diffusion Transformers). works on diffusion models and LLMs. (studying Japanese)
- building the future. research at midjourney, deepmind. slinging ai hot takes 🔥 at artfintel.com
- RS at GDM, Science Team. Prev: Google Brain
- Delver at contraptions.venkateshrao.com