
Simone Scardapane

@sscardapane.bsky.social

I fall in love with a new #machinelearning topic every month 🙄 Assoc. Prof. Sapienza (Rome) | Author: Alice in a differentiable wonderland (https://www.sscardapane.it/alice-book/)

482 Followers  |  35 Following  |  42 Posts  |  Joined: 20.10.2023

Latest posts by sscardapane.bsky.social on Bluesky

Thanks a lot to all my amazing co-authors @alessiodevoto.bsky.social @sscardapane.bsky.social @yuzhaouoe.bsky.social @neuralnoise.com Eric de la Clergerie @bensagot.bsky.social

And a special thanks to @edoardo-ponti.bsky.social for the academic visit that made this work possible!

06.03.2025 16:02 — 👍 2    🔁 1    💬 1    📌 0

Will present this at #CVPR ✈️ See you in Nashville 🇺🇸!

Kudos to the team 👏
Antonio A. Gargiulo, @mariasofiab.bsky.social, @sscardapane.bsky.social, Fabrizio Silvestri, Emanuele Rodolà.

11.03.2025 08:02 — 👍 5    🔁 2    💬 0    📌 0

Please share it within your circles! edin.ac/3DDQK1o

13.03.2025 11:59 — 👍 14    🔁 9    💬 0    📌 1

🚀 New Paper Alert! 🚀

We introduce Q-Filters, a training-free method for efficient KV Cache compression!

It is compatible with FlashAttention and can compress the KV cache on the fly during generation, which is particularly useful for reasoning models ⚡

TLDR: we make Streaming-LLM smarter using the geometry of attention
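
For intuition, here is a rough numpy sketch of the core idea as I read it (my own illustration, not the released code; function and variable names are mine): estimate a per-head "query direction" offline via SVD, score keys by their projection onto it, and keep only the top-scoring KV pairs plus a recent window, StreamingLLM-style.

```python
import numpy as np

def q_filter_direction(Q):
    """Estimate the dominant query direction for one head via SVD.
    Q: (n_samples, head_dim) query vectors collected offline."""
    _, _, Vt = np.linalg.svd(Q - Q.mean(0), full_matrices=False)
    v = Vt[0]                                 # principal right-singular vector
    # orient it so that, on average, queries project positively onto it
    return v if (Q @ v).mean() >= 0 else -v

def compress_kv(K, V, v, budget, recent=32):
    """Keep the `budget` keys with the highest projection onto v,
    always protecting the `recent` most recent positions."""
    scores = K @ v                            # (seq_len,)
    scores[-recent:] = np.inf                 # never evict the recent window
    keep = np.argsort(-scores)[:budget]
    keep.sort()                               # preserve temporal order
    return K[keep], V[keep], keep

# toy usage
rng = np.random.default_rng(0)
Q_off = rng.normal(size=(1024, 64))           # calibration queries
K, V = rng.normal(size=(500, 64)), rng.normal(size=(500, 64))
v = q_filter_direction(Q_off)
K_c, V_c, kept = compress_kv(K, V, v, budget=128)
print(K_c.shape, kept[:5])
```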

06.03.2025 16:02 — 👍 21    🔁 7    💬 1    📌 1

Q-Filters is very efficient, which allows streaming compression at virtually no latency cost, just like Streaming-LLM...

...but it is also much better at retaining relevant KV pairs than fast alternatives (and can even beat slower algorithms such as SnapKV)

06.03.2025 13:40 — 👍 1    🔁 1    💬 1    📌 0
Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces — LessWrong, by Matthew A. Clarke, Hardik Bhatnagar and Joseph Bloom

*Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces*
by @maclarke.bsky.social et al.

Studies the co-occurrence of SAE features and how they can be understood as composite / ambiguous concepts.

www.lesswrong.com/posts/WNoqEi...
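
As a toy illustration of what "co-occurrence of SAE features" can mean computationally (my sketch, not the authors' exact metric): binarize the latent activations over a dataset and build a Jaccard-style co-occurrence matrix.

```python
import numpy as np

def sae_cooccurrence(acts, thresh=0.0):
    """acts: (n_tokens, n_latents) SAE activations.
    Returns a Jaccard-style co-occurrence matrix between latents."""
    on = (acts > thresh).astype(float)          # firing mask
    joint = on.T @ on                           # |A and B| for every latent pair
    counts = on.sum(0)                          # |A|
    union = counts[:, None] + counts[None, :] - joint
    return np.divide(joint, union, out=np.zeros_like(joint), where=union > 0)

rng = np.random.default_rng(0)
acts = rng.random((10_000, 32)) * (rng.random((10_000, 32)) < 0.05)  # sparse toy activations
C = sae_cooccurrence(acts)
print(C.shape, C.diagonal()[:3])                # diagonal is 1 for latents that ever fire
```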

27.02.2025 08:33 — 👍 3    🔁 0    💬 0    📌 0
Weighted Skip Connections are Not Harmful for Deep Nets — Give Gates a Chance

*Weighted Skip Connections are Not Harmful for Deep Nets*
by @rupspace.bsky.social

Cool blog post "in defense" of weighted variants of ResNets (aka HighwayNets) - as a follow-up to a previous post by @giffmana.ai.

rupeshks.cc/blog/skip.html

18.02.2025 09:49 — 👍 8    🔁 1    💬 0    📌 0

*CAT: Content-Adaptive Image Tokenization*
by @junhongshen1.bsky.social @lukezettlemoyer.bsky.social et al.

They use an LLM to predict a "complexity score" for each image, which in turn determines the size of its VAE latent representation.

arxiv.org/abs/2501.03120

17.02.2025 14:57 — 👍 0    🔁 0    💬 0    📌 0

*Accurate predictions on small data with a tabular foundation model*
by Noah Hollmann et al.

A transformer for tabular data that takes an entire training set as input and provides predictions - trained on millions of synthetic datasets.

www.nature.com/articles/s41...
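
Minimal usage sketch, assuming the `tabpfn` package and its sklearn-style TabPFNClassifier interface (check the current docs); the point is that "fitting" is essentially just conditioning the pre-trained transformer on your training set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier   # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()               # pre-trained prior-fitted transformer
clf.fit(X_tr, y_tr)                    # essentially stores the data as in-context examples
print(accuracy_score(y_te, clf.predict(X_te)))
```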

14.02.2025 15:52 — 👍 1    🔁 1    💬 0    📌 0

*Insights on Galaxy Evolution from Interpretable Sparse Feature Networks*
by @jwuphysics.bsky.social

Integrates a sparse dictionary step on the last layer of a CNN to obtain a set of interpretable features on multiple astronomical prediction tasks.

arxiv.org/abs/2501.00089
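
A toy PyTorch sketch of the general recipe (my illustration, not the paper's code): an overcomplete dictionary layer with top-k sparsity in front of the task head, so each prediction decomposes into a handful of nameable features.

```python
import torch, torch.nn as nn

class SparseFeatureHead(nn.Module):
    """Overcomplete dictionary layer + top-k sparsity before the task head."""
    def __init__(self, in_dim, dict_size, n_outputs, k=8):
        super().__init__()
        self.encode = nn.Linear(in_dim, dict_size)
        self.head = nn.Linear(dict_size, n_outputs)
        self.k = k

    def forward(self, h):
        z = torch.relu(self.encode(h))                        # non-negative codes
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.head(sparse), sparse                      # prediction + features

h = torch.randn(4, 512)                     # e.g. pooled CNN features
model = SparseFeatureHead(512, dict_size=2048, n_outputs=3)
y, codes = model(h)
print(y.shape, (codes != 0).sum(-1))        # 3 outputs, at most 8 active features each
```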

13.02.2025 13:53 — 👍 3    🔁 0    💬 0    📌 0

*Round and Round We Go! What makes Rotary Positional Encodings useful?*

by @petar-v.bsky.social et al.

They show RoPE has distinct behavior at different rotation frequencies: high frequencies capture position, low frequencies capture semantics.

arxiv.org/abs/2410.06205
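
Minimal numpy RoPE (my sketch) making the frequency split tangible: the first 2-d pair rotates quickly with position, the last one barely moves between adjacent positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary embedding for one head; x: (head_dim,) with even head_dim."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # high frequencies first, low last
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # rotate each 2-d pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
print(rope(q, 0)[:2], rope(q, 1)[:2])     # high-frequency pair: large change per position
print(rope(q, 0)[-2:], rope(q, 1)[-2:])   # low-frequency pair: almost unchanged
```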

10.02.2025 11:23 — 👍 6    🔁 1    💬 0    📌 0

*Cautious Optimizers: Improving Training with One Line of Code*
by Liang et al.

Adding a simple masking operation to momentum-based optimizers can significantly speed up training.

arxiv.org/abs/2411.16085
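
The masking trick in a few lines (my sketch of the idea, applied here to hand-rolled momentum SGD): zero the coordinates where the optimizer's update disagrees in sign with the current gradient, then rescale.

```python
import torch

def cautious(update, grad, eps=1e-8):
    """Mask out coordinates where the update and the gradient disagree in sign,
    rescaling so the average update magnitude is preserved."""
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / (mask.sum() + eps))

# toy momentum step for a single parameter tensor
param, grad = torch.randn(10), torch.randn(10)
momentum, lr, beta = torch.zeros(10), 0.1, 0.9
momentum = beta * momentum + grad
param -= lr * cautious(momentum, grad)    # instead of `param -= lr * momentum`
```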

03.02.2025 12:30 — 👍 2    🔁 1    💬 0    📌 0

*Byte Latent Transformer: Patches Scale Better Than Tokens*
by @artidoro.bsky.social et al.

Trains a small encoder to dynamically aggregate bytes into latent patches, which are fed to a standard autoregressive model. Nice direction!

arxiv.org/abs/2412.09871
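
Toy sketch (mine) of the dynamic patching idea: a small byte-level model supplies next-byte distributions, and a new patch starts wherever its entropy spikes; the random probabilities below just stand in for that model.

```python
import numpy as np

def entropy_patches(byte_seq, entropy, threshold):
    """Start a new patch wherever the next-byte entropy exceeds the threshold."""
    patches, current = [], [byte_seq[0]]
    for b, h in zip(byte_seq[1:], entropy[1:]):
        if h > threshold:                     # boundary: the byte LM is uncertain here
            patches.append(bytes(current))
            current = []
        current.append(b)
    patches.append(bytes(current))
    return patches

rng = np.random.default_rng(0)
text = b"the cat sat on the mat"
probs = rng.dirichlet(np.ones(256) * 0.2, size=len(text))   # stand-in for a real byte LM
ent = -(probs * np.log2(probs + 1e-12)).sum(-1)              # next-byte entropy (bits)
print(entropy_patches(list(text), ent, threshold=np.median(ent)))
```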

31.01.2025 09:56 — 👍 4    🔁 0    💬 0    📌 0

*Understanding Gradient Descent through the Training Jacobian*
by @norabelrose.bsky.social @eleutherai.bsky.social

Analyzes training through the spectrum of the "training Jacobian" (the Jacobian of trained weights w.r.t. initial weights), identifying a large inactive subspace.

arxiv.org/abs/2412.07003
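
Tiny toy version of the object being studied (my illustration, far from the paper's setup): run a few full-batch GD steps on a linear model, treat the final weights as a function of the initial ones, and inspect the Jacobian's spectrum.

```python
import torch

X = torch.randn(32, 2)
y = X @ torch.tensor([1.0, -2.0]) + 0.1 * torch.randn(32)

def train(theta0, steps=50, lr=0.05):
    theta = theta0
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(X)
        theta = theta - lr * grad              # functional update keeps the graph
    return theta

theta0 = torch.zeros(2)
J = torch.autograd.functional.jacobian(train, theta0)   # d(theta_T) / d(theta_0)
# singular values near 0: the initialization is forgotten along that direction;
# near 1: training barely moves it
print(torch.linalg.svdvals(J))
```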

28.01.2025 11:47 — 👍 5    🔁 0    💬 0    📌 0

*Mixture of A Million Experts*
by Xu Owen He

Scales a MoE architecture up to millions of experts by implementing a fast retrieval method in the router, inspired by recent MoE scaling laws.

arxiv.org/abs/2407.04153
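
A simplified sketch (mine) of product-key retrieval, the kind of fast router the paper builds on: expert indices live on the Cartesian product of two small sub-key tables, so a top-k over roughly N experts only costs two top-k's over roughly sqrt(N) sub-keys.

```python
import numpy as np

def product_key_topk(q, K1, K2, k=4):
    """Return flat indices of the top-k experts out of len(K1) * len(K2)."""
    d = q.shape[0] // 2
    s1, s2 = K1 @ q[:d], K2 @ q[d:]           # scores against the two sub-key tables
    i1 = np.argsort(-s1)[:k]
    i2 = np.argsort(-s2)[:k]
    # re-rank only the k*k candidate combinations
    cand = [(s1[i] + s2[j], i * len(K2) + j) for i in i1 for j in i2]
    cand.sort(reverse=True)
    return [idx for _, idx in cand[:k]]

rng = np.random.default_rng(0)
q = rng.normal(size=16)
K1, K2 = rng.normal(size=(1000, 8)), rng.normal(size=(1000, 8))   # 10^6 experts total
print(product_key_topk(q, K1, K2))
```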

27.01.2025 14:00 — 👍 2    🔁 0    💬 0    📌 0

*Restructuring Vector Quantization with the Rotation Trick*
by Fifty et al.

Rewrites the "closest codebook" assignment in vector quantization as a rotation and rescaling of the encoder output, so that gradients back-propagate through it more informatively than with the straight-through estimator.

arxiv.org/abs/2410.06424

23.01.2025 11:31 — 👍 6    🔁 1    💬 2    📌 0

*On the Surprising Effectiveness of Attention Transfer for Vision Transformers*
by Li et al.

Shows that distilling attention patterns in ViTs is competitive with standard fine-tuning.

arxiv.org/abs/2411.09702
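
The core objective in a sketch (mine; the paper studies several variants, and a plain MSE between post-softmax attention maps is used here purely for illustration):

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attn, teacher_attn):
    """Match the student's attention maps to the (frozen) teacher's.
    Shapes: [batch, heads, queries, keys], post-softmax."""
    return F.mse_loss(student_attn, teacher_attn.detach())

# toy shapes: 2 images, 12 heads, 197 tokens (ViT-B/16 with a CLS token)
s = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
t = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
print(attention_transfer_loss(s, t))
```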

21.01.2025 11:35 — 👍 9    🔁 0    💬 0    📌 0

*The Super Weight in Large Language Models*
by Yu et al.

Identifies individual weights in LLMs whose removal destroys generation quality. Traces their mechanism through the network and proposes quantization techniques that account for them.

arxiv.org/abs/2411.07191
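
A toy helper (mine) for the basic experiment: zero out a single scalar weight and see what breaks. The parameter name and coordinates for a real LLM come from the paper's tracing procedure, not from this sketch.

```python
import torch

def ablate_weight(model, param_name, index):
    """Zero one scalar weight in place and return an undo closure."""
    p = dict(model.named_parameters())[param_name]
    old = p.data[index].clone()
    p.data[index] = 0.0
    return lambda: p.data.__setitem__(index, old)

# demo on a small module; for an LLM you would pass the layer/coordinates the
# paper identifies and compare perplexity before and after the ablation
net = torch.nn.Linear(4, 4)
restore = ablate_weight(net, "weight", (0, 0))
print(net.weight[0, 0])    # now zero
restore()
print(net.weight[0, 0])    # original value restored
```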

17.01.2025 10:55 — 👍 2    🔁 0    💬 0    📌 0

*The Surprising Effectiveness of Test-Time Training for Abstract Reasoning*
by @ekinakyurek.bsky.social et al.

Shows that test-time training (fine-tuning at inference time) strongly improves performance on the ARC dataset.

arxiv.org/abs/2411.07279

16.01.2025 10:50 — 👍 3    🔁 0    💬 0    📌 0
A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion Model compression is essential in the deployment of large Computer Vision models on embedded devices. However, static optimization techniques (e.g. pruning, quantization, etc.) neglect the fact that d...

Our paper "A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion" is out as a preprint!

By myself, @sscardapane.bsky.social, @rgring.bsky.social and @lanalpa.bsky.social

📄 arxiv.org/abs/2501.07451

15.01.2025 16:58 — 👍 7    🔁 5    💬 1    📌 0

*Large Concept Models*
by Barrault et al.

Builds an autoregressive model in a "concept" space by wrapping the LLM in a pre-trained sentence embedder (also works with diffusion models).

arxiv.org/abs/2412.08821

15.01.2025 16:09 — 👍 6    🔁 0    💬 0    📌 0

"Task Singular Vectors: Reducing Task Interference in Model Merging" by Antonio Andrea Gargiulo, @crisostomi.bsky.social, @mariasofiab.bsky.social, @sscardapane.bsky.social, Fabrizio Silvestri, Emanuele Rodolà

Paper: arxiv.org/abs/2412.00081
Code: github.com/AntoAndGar/t...

#machinelearning
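
Rough sketch of the idea in the title (mine, not the released code; the actual method also explicitly reduces interference between the tasks' singular vectors): take each per-layer task vector as a matrix, keep only its top singular components, and add the compressed deltas back to the base model.

```python
import torch

def tsv_merge(base_sd, finetuned_sds, rank=16, alpha=1.0):
    """Merge fine-tuned models into the base via low-rank (SVD) task vectors."""
    merged = {k: v.clone() for k, v in base_sd.items()}
    for name, w0 in base_sd.items():
        if w0.ndim != 2:                       # only 2-D weights get the SVD treatment
            continue
        for sd in finetuned_sds:
            delta = sd[name] - w0              # per-layer task vector
            U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
            merged[name] += alpha * (U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank])
    return merged

# toy usage with random "models"
base = {"layer.weight": torch.randn(64, 64)}
tasks = [{"layer.weight": base["layer.weight"] + 0.1 * torch.randn(64, 64)} for _ in range(3)]
print(tsv_merge(base, tasks)["layer.weight"].shape)
```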

15.01.2025 09:36 — 👍 4    🔁 2    💬 0    📌 0

*Adaptive Length Image Tokenization via Recurrent Allocation*
by @phillipisola.bsky.social et al.

An encoder to compress an image into a sequence of 1D tokens whose length can dynamically vary depending on the specific image.

arxiv.org/abs/2411.02393

14.01.2025 11:11 — 👍 2    🔁 0    💬 0    📌 0

*Deep Learning Through A Telescoping Lens*
by @alanjeffares.bsky.social @aliciacurth.bsky.social

Shows that tracking 1st-order approximations to the training dynamics provides insights into many phenomena (e.g., double descent, grokking).

arxiv.org/abs/2411.00247
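
Toy illustration (mine) of the first-order bookkeeping behind the lens: predict how one SGD step moves a probe prediction via grad_theta f(x) · delta_theta and compare with the actual change; summing such terms step by step is what "telescopes" over a whole training run.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(5, 1)
x_train, y_train = torch.randn(32, 5), torch.randn(32, 1)
x_probe, lr = torch.randn(1, 5), 0.05

f_before = model(x_probe).squeeze()
grads_f = torch.autograd.grad(f_before, list(model.parameters()))   # d f(x_probe) / d theta

loss = torch.nn.functional.mse_loss(model(x_train), y_train)
grads_L = torch.autograd.grad(loss, list(model.parameters()))       # d L / d theta
with torch.no_grad():
    predicted = sum((gf * (-lr * gL)).sum() for gf, gL in zip(grads_f, grads_L))
    for p, gL in zip(model.parameters(), grads_L):
        p -= lr * gL                                                 # the actual SGD step
actual = model(x_probe).squeeze() - f_before
print(float(predicted), float(actual))   # close, because the step is small
```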

14.01.2025 10:43 — 👍 10    🔁 1    💬 0    📌 0

*MoE Graph Transformers for Interpretable Particle Collision Detection*
by @alessiodevoto.bsky.social @sgiagu.bsky.social et al.

We propose a MoE graph transformer for particle collision analysis, with many nice interpretability insights (e.g., expert specialization).

arxiv.org/abs/2501.03432

10.01.2025 14:12 — 👍 12    🔁 4    💬 0    📌 0

*A Meticulous Guide to Advances in Deep Learning Efficiency over the Years* by Alex Zhang

Part deep learning history, part overview of the vast landscape of "efficiency" in DL (hardware, compilers, architecture, ...). Fantastic post!

alexzhang13.github.io/blog/2024/ef...

09.01.2025 14:15 — 👍 10    🔁 2    💬 1    📌 1
GitHub - DTU-PAS/awesome-dynn-for-cv: Awesome collection of DyNN papers for Computer Vision and Sensor Fusion applications ✨

First little project of the year: an awesome collection of papers on Dynamic Neural Networks for Computer Vision and Sensor Fusion!

Each paper comes with a brief summary and code link.

👉 github.com/DTU-PAS/awes...

08.01.2025 09:16 — 👍 9    🔁 2    💬 2    📌 2
Task Singular Vectors: Reducing Task Interference in Model Merging
Task Arithmetic has emerged as a simple yet effective method to merge models without additional training. However, by treating entire networks as flat parameter vectors, it overlooks key structural in...

Don't miss out on these insights and more — check out the paper!

📄 Preprint → arxiv.org/abs/2412.00081

💻 Code → github.com/AntoAndGar/t...

Joint work w/ Antonio A. Gargiulo, @mariasofiab.bsky.social, @sscardapane.bsky.social, Fabrizio Silvestri, Emanuele Rodolà.

(6/6)

08.01.2025 19:00 — 👍 2    🔁 2    💬 0    📌 0

*Modular Duality in Deep Learning*

Develops a theory of "modular duality" for designing principled optimizers that respect the "type semantics" of each layer.

arxiv.org/abs/2410.21265

03.01.2025 14:42 — 👍 1    🔁 0    💬 0    📌 0

*Understanding Visual Feature Reliance through the Lens of Complexity*
by @thomasfel.bsky.social @louisbethune.bsky.social @lampinen.bsky.social

Wonderful work! They rank features' complexity with a variant of mutual information, before analyzing their dynamics.

arxiv.org/abs/2407.06076

28.12.2024 17:22 — 👍 21    🔁 2    💬 0    📌 0
