11/11: Future work includes scaling to more variant concepts, and moving from obvious attributes like key and tempo to higher-level notions that could be interesting for retrieval. More to come!
30.12.2024 17:30
10/11: Not only that, LOEV++ lets users control which attribute matters most for retrieval! By searching in the time-variant space, we get better results for tempo-based retrieval. Users can search for similar songs by specifying what kind of similarity they want.
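A small sketch of what attribute-controlled retrieval could look like on top of such disentangled embeddings; the slice boundaries and the helper itself are hypothetical, not an API from the paper.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, catalog_embs, space: slice, k: int = 5):
    """Rank catalog items by cosine similarity inside one chosen sub-space."""
    q = F.normalize(query_emb[space], dim=0)          # query restricted to the sub-space
    c = F.normalize(catalog_embs[:, space], dim=1)    # catalog restricted likewise
    return torch.topk(c @ q, k).indices               # indices of the k closest songs

# e.g., with a 384-dim embedding split into three 128-dim parts (hypothetical layout):
# TIME_VARIANT = slice(256, 384)   # search here for tempo-aware similarity
# top_songs = retrieve(query, catalog, TIME_VARIANT)
```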
30.12.2024 17:29
9/11: To take it one step further, we propose LOEV++. Instead of splitting the network at the projection heads, we split it earlier, creating individual latent spaces that contain the augmentation information in a disentangled way. We show that this further improves performance.
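One way to picture the LOEV++ split, assuming the embedding is simply sliced into equal dedicated sub-spaces before the heads (the split strategy and sizes are assumptions for illustration, not the paper's configuration):

```python
import torch
import torch.nn as nn

class LOEVPlusPlus(nn.Module):
    """Encoder output split into general / pitch / time sub-spaces before projection."""

    def __init__(self, encoder: nn.Module, emb_dim: int = 384, proj_dim: int = 128):
        super().__init__()
        assert emb_dim % 3 == 0, "illustrative equal three-way split"
        self.encoder, self.part = encoder, emb_dim // 3
        self.heads = nn.ModuleList([nn.Linear(self.part, proj_dim) for _ in range(3)])

    def forward(self, x):
        h = self.encoder(x)                          # (batch, emb_dim)
        parts = torch.split(h, self.part, dim=1)     # invariant / pitch / time slices
        return [head(p) for head, p in zip(self.heads, parts)]
```

Because each attribute lives in its own slice of the embedding, it can be queried or ignored directly at retrieval time.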
30.12.2024 17:28
8/11: We do this by simply tracking which augmentations are applied and modifying the contrastive targets accordingly. We show through downstream probing that this forces the encoder *not* to discard key and tempo information, while keeping potent representations for general tasks.
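A hedged sketch of what this bookkeeping could look like for one variant head, assuming a pair simply stops being a positive whenever the tracked augmentation was applied (so the shifted view only shows up as a negative); the exact target scheme in the paper may differ.

```python
import torch
import torch.nn.functional as F

def variant_head_loss(z1, z2, augmented, temperature=0.1):
    """One variant head's loss with augmentation-aware targets.

    z1, z2    : (N, d) projections of the two views of each sample.
    augmented : (N,) bool, True if the tracked augmentation (e.g. pitch shift)
                was applied to this pair, so the views must not be positives
                in this head.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                        # (N, N) similarities
    targets = torch.arange(z1.shape[0], device=z1.device)   # default positive: own pair
    keep = ~augmented                                       # rows whose positive survives
    return F.cross_entropy(logits[keep], targets[keep])
```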
30.12.2024 17:28
7/11: Our approach is simple: we keep an all-invariant projection head, but add two more heads, a pitch-variant and a time-variant one. Each head has its own contrastive objective: in the pitch-variant head, views that have been augmented with pitch shifting are treated as *negatives*.
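A rough sketch of what such a three-head setup could look like; the MLP heads, sizes, and names below are placeholder assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class LOEVModel(nn.Module):
    """Shared encoder feeding one all-invariant head plus two variant heads."""

    def __init__(self, encoder: nn.Module, emb_dim: int = 512, proj_dim: int = 128):
        super().__init__()
        self.encoder = encoder

        def make_head():
            return nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, proj_dim))

        self.invariant_head = make_head()      # standard contrastive objective
        self.pitch_variant_head = make_head()  # pitch-shifted views become negatives
        self.time_variant_head = make_head()   # time-stretched views become negatives

    def forward(self, x):
        h = self.encoder(x)
        return (self.invariant_head(h),
                self.pitch_variant_head(h),
                self.time_variant_head(h))
```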
30.12.2024 17:27
6/11: So there is a tradeoff, coming from the applied augmentations, between general and task-specific performance. LOEV aims to fix this. We focus on two augmentations, Time Stretching (TS) and Pitch Shifting (PS), which are explicitly related to the musical notions of Tempo and Key.
30.12.2024 17:26
5/11: In music, this can be catastrophic. Take a song in the key of A major: apply a pitch shifting augmentation and you end up with two versions in two different keys! A contrastive model will still map them to the same spot in latent space, which can cause the key space to collapse completely.
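For concreteness, the key arithmetic behind this example (a toy helper, not code from the paper):

```python
# pitch classes in semitone order, starting from A
PITCH_CLASSES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def shifted_key(key: str, semitones: int) -> str:
    """Key a track lands in after pitch shifting by the given number of semitones."""
    return PITCH_CLASSES[(PITCH_CLASSES.index(key) + semitones) % 12]

print(shifted_key("A", 3))  # -> "C": A major shifted up 3 semitones becomes C major
```

An invariance-trained model would still embed the A-major and C-major versions at the same point, so a key probe on those embeddings has nothing left to work with.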
30.12.2024 17:25
4/11: It has been shown that stronger augmentations generally lead to better performance on downstream tasks. But what happens when a downstream task needs representations that are *variant* to a certain transformation?
30.12.2024 17:23
3/11: In doing so, contrastive models effectively learn invariances: by learning to map augmented data points to the same spot in the latent space, they learn to be *invariant* to the augmentations.
30.12.2024 17:23
2/11: Unimodal contrastive learning uses augmentations to produce different views of samples. The model then learns to pull views of the same sample together in the latent space and push views of different samples apart. This lets models internalize semantic information without supervision.
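For readers who want to see what such an objective looks like, here is a minimal sketch of a SimCLR-style NT-Xent loss; the function name, batch layout, and temperature are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def ntxent_loss(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss over two augmented views per sample.

    z1, z2: (N, dim) projections of the two views of each of N samples.
    The two views of a sample are pulled together; every other view in
    the batch acts as a negative and is pushed away.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)             # (2N, dim) all views
    sim = z @ z.T / temperature                # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))          # a view is not its own positive
    n = z1.shape[0]
    # the positive for view i is the other view of the same sample
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)
```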
30.12.2024 17:22
1/11: In this work, we propose a simple way to mitigate the loss of information caused by learned invariances in contrastive learning for music. This information loss can be catastrophic for downstream tasks, and LOEV is a very cheap fix for it!
30.12.2024 17:22
Happy to announce that our new work "Leave One EquiVariant: Alleviating invariance-related information loss in contrastive music representations" (arxiv.org/pdf/2412.18955) was accepted for #ICASSP2025! Very excited to go to India in April.
🧵 below:
30.12.2024 17:21