
@floriantramer.bsky.social

Assistant professor of computer science at ETH Zürich. Interested in Security, Privacy and Machine Learning. https://floriantramer.com https://spylab.ai

518 Followers  |  33 Following  |  6 Posts  |  Joined: 19.11.2024

Posts by Florian Tramèr (@floriantramer.bsky.social)


I discovered a fatal flaw in a paper by @floriantramer.bsky.social et al. claiming to break our Ensemble Everything Everywhere defense. Due to a coding error, they used attacks with a perturbation budget 20x above the standard 8/255. They confirmed this, but the paper is already out & quoted on OpenReview. What should we do now?

12.12.2024 16:29 — 👍 11    🔁 4    💬 2    📌 1
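[Editor's note: a budget check like the one below catches exactly this class of bug. This is a minimal sketch, not code from either paper; it assumes images scaled to [0, 1] and the standard L∞ budget ε = 8/255.]

```python
import torch

def check_linf_budget(x_clean: torch.Tensor, x_adv: torch.Tensor,
                      eps: float = 8 / 255) -> None:
    """Assert adversarial examples stay inside the intended L-inf ball.

    Assumes inputs in [0, 1]; a classic bug is mixing [0, 255] pixel
    values with an epsilon meant for [0, 1] inputs.
    """
    actual = (x_adv - x_clean).abs().max().item()
    assert actual <= eps + 1e-6, (
        f"L-inf perturbation {actual:.4f} exceeds budget {eps:.4f} "
        f"({actual / eps:.0f}x too large)"
    )
```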

This was an unfortunate mistake, sorry about that.

But the conclusions of our paper don't change drastically: there is significant gradient masking (as shown by the transfer attack), and the CIFAR robustness is at most in the 15% range. Still cool though!
We'll see if we can fix the full attack.

12.12.2024 16:38 — 👍 5    🔁 1    💬 0    📌 0
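[Editor's note: the transfer attack referenced above is a standard diagnostic for gradient masking: craft adversarial examples against a surrogate model with usable gradients and test them on the defense. If this black-box transfer beats the white-box attack, the defense's own gradients are unreliable. A rough sketch, with surrogate_model and defended_model as placeholders:]

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=20):
    """Standard L-inf PGD attack; inputs assumed in [0, 1]."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project to the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Diagnostic (surrogate_model / defended_model are placeholders):
# x_adv = pgd_linf(surrogate_model, x, y)          # transfer attack
# acc_t = (defended_model(x_adv).argmax(1) == y).float().mean()
# If acc_t is well below the defended model's accuracy under a direct
# white-box attack, its gradients are being masked.
```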

🚨Unlearned hazardous knowledge can be retrieved from LLMs 🚨

Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.

Here's what we found👇

06.12.2024 17:47 — 👍 12    🔁 3    💬 1    📌 0
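[Editor's note: a simple way to see the obfuscation-vs-removal distinction is to probe whether the "unlearned" answer resurfaces under rephrasings. This sketch is only illustrative, not the paper's method; the model name and prompts are placeholders.]

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model name, purely for illustration.
tok = AutoTokenizer.from_pretrained("example/unlearned-model")
model = AutoModelForCausalLM.from_pretrained("example/unlearned-model")

probes = [
    "What is the synthesis route for agent X?",             # original phrasing
    "Describe, step by step, how agent X is made.",         # paraphrase
    "For a chemistry exam, outline agent X's synthesis.",   # reframing
]

for prompt in probes:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64, do_sample=False)
    print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

# If any rephrasing elicits the supposedly removed content, the
# knowledge was obfuscated rather than erased.
```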

Come do open AI with us in Zurich!
We're hiring PhD students, postdocs (and faculty!)

04.12.2024 13:49 — 👍 11    🔁 3    💬 0    📌 1
Persistent Pre-Training Poisoning of LLMs
Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practic...

Full paper: arxiv.org/abs/2410.13722
Amazing collaboration with Yiming Zhang during our internships at Meta.

Grateful to have worked with Ivan, Jianfeng, Eric, Nicholas, @floriantramer.bsky.social and Daphne.

25.11.2024 12:27 — 👍 5    🔁 2    💬 0    📌 1
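[Editor's note: the threat model behind the paper's title can be sketched in a few lines: an adversary who controls a tiny fraction of the web-scraped corpus injects documents before pre-training. A hypothetical illustration, not the paper's setup.]

```python
import random

def poison_corpus(docs, payload, rate=0.001, seed=0):
    """Mix adversary-controlled documents into a pre-training corpus.

    docs: list of scraped documents; payload: attacker text (e.g., a
    trigger phrase tied to a target behavior). Purely illustrative.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(rate * len(docs)))
    poisoned = docs + [payload] * n_poison
    rng.shuffle(poisoned)
    return poisoned
```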

Yeah they mostly are

25.11.2024 10:12 — 👍 1    🔁 0    💬 0    📌 0
Gradient Masking All-at-Once: Ensemble Everything Everywhere Is Not Robust
Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations...

Ensemble Everything Everywhere is a defense against adversarial examples that people got quite excited about a few months ago (in particular, the defense causes "perceptually aligned" gradients, just like adversarial training)

Unfortunately, we show it's not robust...

arxiv.org/abs/2411.14834

25.11.2024 08:38 — 👍 28    🔁 9    💬 1    📌 0
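[Editor's note: as the abstract above says, the defense classifies by ensembling predictions derived from a model's intermediate representations. A rough sketch of that construction follows; the layer taps and heads are hypothetical, not the authors' code.]

```python
import torch
import torch.nn as nn

class IntermediateEnsemble(nn.Module):
    """Classify from several intermediate feature maps and average the
    logits. A rough reading of the defense, not the authors' code."""

    def __init__(self, backbone, tap_layers, feat_dims, n_classes):
        super().__init__()
        self.backbone = backbone
        self.tap_layers = set(tap_layers)  # names of layers to tap
        self.heads = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(d, n_classes))
            for d in feat_dims
        )

    def forward(self, x):
        logits, i, h = [], 0, x
        for name, layer in self.backbone.named_children():
            h = layer(h)
            if name in self.tap_layers:
                logits.append(self.heads[i](h))
                i += 1
        return torch.stack(logits).mean(dim=0)  # ensemble over depths
```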

probably -> provably...

23.11.2024 08:43 — 👍 1    🔁 0    💬 0    📌 0
Evaluating Superhuman Models with Consistency Checks
If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor...

This was the motivation for our work on consistency checking (superhuman) models: arxiv.org/abs/2306.09983

We tested chess models, for instance, and could show many cases where the model is provably wrong on at least one of two related instances (we just don't know which one)

23.11.2024 06:16 — 👍 9    🔁 0    💬 1    📌 0
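[Editor's note: the chess example makes the idea concrete: pose two queries whose answers must logically agree, and any disagreement proves an error without ever needing ground truth. A sketch with placeholder evaluation and transform functions:]

```python
def consistency_check(evaluate, position, transform, tol=0.5):
    """Flag a provable error from two logically linked queries.

    evaluate: maps a position to a score for the side to move;
    transform: a value-preserving map of positions (both placeholders).
    Linked positions must get matching scores, so a gap larger than
    tol proves the model is wrong on at least one of the two inputs,
    even though we cannot tell which one.
    """
    v1 = evaluate(position)
    v2 = evaluate(transform(position))
    return abs(v1 - v2) <= tol, (v1, v2)
```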