
Xiuying Wei

@xiuyingwei.bsky.social

PhD student @the Caglar Gulcehre Lab for AI Research (CLAIRE) @EPFL. Efficiency, foundation models. https://wimh966.github.io/

10 Followers  |  34 Following  |  10 Posts  |  Joined: 12.07.2025

Latest posts by xiuyingwei.bsky.social on Bluesky

This is a collaboration with Anunay Yadav, @razvan-pascanu.bsky.social, and @caglarai.bsky.social. Many thanks to them! 🙏 Reach out if you’re interested in the paper or in connecting! 🧵9/9

12.07.2025 10:04 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
GitHub - CLAIRE-Labo/RAT: Official code for RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling (http://arxiv.org/abs/2507.04416)

For more discussion and results, please see our preprint: arxiv.org/abs/2507.04416
Code: github.com/CLAIRE-Labo/...
Website: claire-labo.github.io/RAT/ 🧡8/9

12.07.2025 10:03 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We also explored other aspects of RAT, including parameter allocation, positional encodings (especially NoPE for length generalization), and retrieval ability on the RULER benchmark. 🧵7/9

12.07.2025 10:02 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Accuracy: We trained 1.3B models and evaluated them on six short-context reasoning tasks, 14 long-context tasks from LongBench, and four SFT tasks. By interleaving RAT’s efficient long-range modeling with strong local interactions, we achieve top throughput and accuracy. 🧵6/9

12.07.2025 10:02 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Efficiency: Compared to attention, RAT reduces FLOPs and the KV cache by a factor of the chunk size L, enabling much faster training and generation. 🧵5/9

12.07.2025 10:01 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
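To make the factor-of-L saving concrete, here is a back-of-the-envelope KV-cache comparison. All numbers below (sequence length, width, layer count, precision) are hypothetical example values rather than figures from the paper, and the sketch assumes RAT only caches one compressed key/value summary per chunk.

```python
# Illustrative KV-cache comparison; all sizes below are made-up example values.
T = 16384            # sequence length (tokens)
L = 16               # RAT chunk size
d_model = 2048       # model width
n_layers = 24        # number of layers
bytes_per_elem = 2   # fp16 / bf16

# Standard attention caches one key and one value vector per token, per layer.
kv_attention = 2 * T * d_model * n_layers * bytes_per_elem
# Assumed RAT cache: one compressed key/value summary per chunk, per layer.
kv_rat = 2 * (T // L) * d_model * n_layers * bytes_per_elem

print(f"attention KV cache: {kv_attention / 2**30:.2f} GiB")
print(f"RAT KV cache:       {kv_rat / 2**30:.2f} GiB (~{kv_attention // kv_rat}x smaller)")
```

With these example settings the cache shrinks from about 3.0 GiB to about 0.19 GiB, i.e. by the chunk size L = 16.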

We then show RAT’s strong efficiency and accuracy results below. We also explore a hybrid model that interleaves RAT and local-attention layers, where the two complement each other effectively. 🧵4/9

12.07.2025 10:01 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
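The interleaving mentioned above can be expressed as a simple alternating stack. The sketch below only shows the pattern; the layer factories are hypothetical placeholders, not the released implementation.

```python
# Minimal sketch of an interleaved (hybrid) stack: even layers use a global RAT
# layer, odd layers local attention. Layer factories are hypothetical placeholders.
import torch.nn as nn


class HybridStackSketch(nn.Module):
    def __init__(self, n_layers: int, make_rat_layer, make_local_attn_layer):
        super().__init__()
        self.layers = nn.ModuleList(
            [make_rat_layer() if i % 2 == 0 else make_local_attn_layer()
             for i in range(n_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection around each mixing layer
        return x
```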
Post image

In detail, gated recurrence first updates the keys/values within each chunk. Softmax attention then lets each query attend to the final keys/values of all past chunks plus the current one. RAT is easy to implement: no custom CUDA/Triton kernels, just PyTorch higher-order ops such as FlexAttention. 🧵3/9

12.07.2025 10:01 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
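For readers who want to see the mechanism in code, below is a minimal single-head PyTorch sketch of the layer described in the post above: a gated recurrence builds running key/value summaries inside each chunk, and softmax attention then lets each query attend to the final summaries of all past chunks plus its own chunk’s running summary at the current position, which keeps the layer causal. The gating form, projection names, and the explicit Python loop are simplifying assumptions made for clarity (multiple heads, normalization, and residuals are omitted); this is not the authors’ implementation, which builds on PyTorch higher-order ops such as FlexAttention.

```python
# A minimal, illustrative sketch of a RAT-style layer (not the authors' code).
# Assumptions: single head, batch-first tensors, sequence length divisible by
# the chunk size; the gating and projections are simplified guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RATLayerSketch(nn.Module):
    def __init__(self, d_model: int, chunk_size: int):
        super().__init__()
        self.L = chunk_size
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)  # element-wise forget gate
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        L = self.L
        assert T % L == 0, "sketch assumes T divisible by the chunk size"
        C = T // L

        q = self.q_proj(x)
        k = self.k_proj(x).view(B, C, L, D)
        v = self.v_proj(x).view(B, C, L, D)
        g = torch.sigmoid(self.gate(x)).view(B, C, L, D)

        # 1) Gated recurrence inside each chunk: running summaries of K and V.
        k_state = torch.zeros(B, C, D, device=x.device, dtype=x.dtype)
        v_state = torch.zeros(B, C, D, device=x.device, dtype=x.dtype)
        k_rec, v_rec = [], []
        for t in range(L):
            k_state = g[:, :, t] * k_state + (1 - g[:, :, t]) * k[:, :, t]
            v_state = g[:, :, t] * v_state + (1 - g[:, :, t]) * v[:, :, t]
            k_rec.append(k_state)
            v_rec.append(v_state)
        k_rec = torch.stack(k_rec, dim=2)  # (B, C, L, D) running summaries
        v_rec = torch.stack(v_rec, dim=2)

        # 2) Softmax attention over chunk-level representations: a query at
        #    position t of chunk c sees the final state of every chunk m < c
        #    and the running state at position t of its own chunk (causal).
        k_final = k_rec[:, :, -1]  # (B, C, D): one compressed key per chunk
        v_final = v_rec[:, :, -1]
        q = q.view(B, C, L, D)

        scores_past = torch.einsum("bcld,bmd->bclm", q, k_final) / D ** 0.5
        chunk_ids = torch.arange(C, device=x.device)
        allowed = chunk_ids[None, :] < chunk_ids[:, None]  # (C, C): m < c
        scores_past = scores_past.masked_fill(~allowed[None, :, None, :], float("-inf"))

        scores_self = torch.einsum("bcld,bcld->bcl", q, k_rec)[..., None] / D ** 0.5
        attn = F.softmax(torch.cat([scores_past, scores_self], dim=-1), dim=-1)

        out = torch.einsum("bclm,bmd->bcld", attn[..., :C], v_final)
        out = out + attn[..., C:] * v_rec
        return self.out_proj(out.reshape(B, T, D))
```

In this sketch, chunk_size=1 makes every token its own chunk (attention-like granularity), while a chunk size equal to the sequence length leaves a single chunk handled purely by the recurrence, mirroring the interpolation described in the thread.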
Post image

RAT splits long sequences into chunks. Inside each chunk, recurrence models local dependencies; softmax attention then operates on compressed chunk-level representations. By adjusting the chunk size L, RAT interpolates between attention (L=1) and recurrence (L=T). 🧵2/9

12.07.2025 10:00 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
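A tiny shape-only illustration of that chunked view; the batch size, sequence length, width, and chunk sizes below are arbitrary example values, not settings from the paper.

```python
# Shape-only demo of chunking: a (B, T, D) sequence viewed as T//L chunks of size L.
import torch

B, T, D = 2, 16, 8
x = torch.randn(B, T, D)

for L in (1, 4, T):                   # L=1: every token is its own chunk (attention-like)
    chunks = x.view(B, T // L, L, D)  # L=T: one chunk spans the sequence (pure recurrence)
    print(f"L={L:>2} -> chunks shape {tuple(chunks.shape)}")
```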

We started from the observation that overusing attention on short-range context wastes its potential: local patterns can be captured much more efficiently with lightweight recurrence. This motivates a new layer, RAT, which bridges the speed of RNNs and the global token access of softmax attention. 🧵1/9

12.07.2025 10:00 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

⚡️🧠 Excited to share our recent work on long-context efficiency! We propose a new layer called RAT: fast and lightweight like RNNs, yet powerful like Attention. 🐭✨ This is joint work with Anunay Yadav, @razvan-pascanu.bsky.social, and @caglarai.bsky.social!

12.07.2025 09:59 β€” πŸ‘ 7    πŸ” 3    πŸ’¬ 1    πŸ“Œ 1
