Joint work with the amazing Jeffrey Zhang (equal contribution) and with other wonderful people: D. Diaz, K. Ravishankar, W. Daspit, A. Klivans, C. Daskalakis, Q. Liu.
08.07.2025 04:05
Read the full paper for more results, including motif scaffolding:
biorxiv.org/content/10.1...
Code, Models, and Data (full release very soon):
github.com/jozhang97/am...
Our framework builds on recent innovations in training diffusion models from corrupted data.
AF corruption is unstructured, not explicitly modeled, and varies with protein size and topology. Yet our framework still handles it.
Our model does not simply memorize the dataset.
We also improve novelty, generating more unique structures.
This comes from training on more datapoints: unlike prior work, we do not filter out low-pLDDT AF structures.
We build on Genie2: scaling it to 17M params, changing the dataset, and training on longer proteins already yields gains.
Our framework further boosts performance, leading to the best model for short and long protein generation.
Handling noise properly matters more than architecture.
Beyond algorithmic advances, we re-clustered AFDB since we found significant structural duplication across evolutionarily distant clusters.
This redundancy causes an overrepresentation of common motifs.
We fix it by tuning FoldSeek to explicitly focus on structural topology.
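The de-duplication idea can be sketched as follows. This is a hypothetical illustration (the actual pipeline uses FoldSeek; the scores below are made up): merge cluster representatives whose structural similarity, e.g. TM-score, exceeds a threshold, so that near-identical folds from evolutionarily distant clusters collapse into one.

```python
# Hypothetical sketch, not the paper's pipeline: collapse structurally
# redundant clusters via union-find over pairwise TM-scores.
tm_score = {  # made-up symmetric TM-scores between cluster representatives
    ("c1", "c2"): 0.92,  # same fold, different sequence families
    ("c1", "c3"): 0.30,
    ("c2", "c3"): 0.28,
}

def merge_redundant(clusters, scores, threshold=0.8):
    # Union-find: clusters whose representatives score >= threshold are merged.
    parent = {c: c for c in clusters}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for (a, b), s in scores.items():
        if s >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for c in clusters:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())

print(merge_redundant(["c1", "c2", "c3"], tm_score))  # c1 and c2 merge
```

In the real pipeline this thresholding happens inside FoldSeek's clustering, tuned so the comparison emphasizes topology over sequence-level similarity.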
The results are quite strong.
Ambient Protein Diffusion substantially outperforms previous baselines in short and long protein generation.
For short proteins, we dominate the Pareto frontier between designability and diversity, using a ~13x smaller model than previous SOTA.
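The Pareto-frontier claim has a simple operational meaning. A minimal sketch with illustrative numbers (not the paper's measurements): a model sits on the frontier if no other model beats it on both designability and diversity at once.

```python
# Illustrative sketch: models as (designability, diversity) points.
# These numbers are made up for the example.
models = {
    "A": (0.95, 0.40),
    "B": (0.90, 0.55),
    "C": (0.85, 0.50),  # dominated by B on both axes
    "D": (0.70, 0.70),
}

def pareto_frontier(points):
    # Keep a model unless some other model is >= on both metrics
    # and strictly > on at least one.
    front = {}
    for name, (des, div) in points.items():
        dominated = any(
            (d2 >= des and v2 >= div) and (d2 > des or v2 > div)
            for n2, (d2, v2) in points.items() if n2 != name
        )
        if not dominated:
            front[name] = (des, div)
    return front

print(sorted(pareto_frontier(models)))  # -> ['A', 'B', 'D']
```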
Ambient Protein Diffusion treats low pLDDT AF structures as low-quality data.
Instead of filtering them out (as done in prior work), we use them for a subset of the diffusion times.
Enough noise "erases" the AF mistakes, and we can still learn from those structures.
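The idea can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the pLDDT cutoff of 80, the time threshold, and the simple linear interpolant are all assumptions made for the sketch. Low-quality structures contribute to the loss only at noise levels high enough to drown out AlphaFold's mistakes.

```python
import numpy as np

def ambient_loss(model, x0, plddt, rng, t_min_low_quality=0.5):
    # Sketch of the idea (names/thresholds assumed): low-pLDDT structures
    # are only used at high diffusion times t, where noise masks AF errors.
    n = x0.shape[0]
    t = rng.random(n)                       # diffusion time in [0, 1)
    usable = (plddt >= 80.0) | (t >= t_min_low_quality)
    noise = rng.standard_normal(x0.shape)
    sigma = t.reshape(-1, *([1] * (x0.ndim - 1)))
    xt = (1 - sigma) * x0 + sigma * noise   # simple linear interpolant
    pred = model(xt, t)                     # model predicts the clean signal
    per_sample = ((pred - x0) ** 2).reshape(n, -1).mean(axis=1)
    return (per_sample * usable).mean()     # masked MSE over usable samples

# Demo with an identity "model", just to show the call shape.
rng = np.random.default_rng(0)
demo_model = lambda xt, t: xt
loss = ambient_loss(demo_model, rng.standard_normal((4, 3)),
                    np.array([90.0, 50.0, 85.0, 30.0]), rng)
```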
Obtaining large structure datasets experimentally is impossible.
SOTA protein structure models are trained on AFDB (214M AlphaFold predicted structures) subsets.
AF accuracy drops with increasing protein length and complexity, making it hard to generate such proteins.
Announcing Ambient Protein Diffusion, a state-of-the-art 17M-params generative model for protein structures.
Diversity improves by 91% and designability by 26% over the previous 200M SOTA model for long proteins.
The trick? Treat low pLDDT AlphaFold predictions as low-quality data.
Figure 1
Figure 2
Figure 3
Figure 4
Ambient Proteins: Training Diffusion Models on Low Quality Structures [new]
Trains diffusion models using low-confidence AlphaFold2 structures by adapting the diffusion objective based on corruption level.
Thanks a lot for your interest and for your post!
Let me know if you have any questions or thoughts!
New episode in this line of work from @giannisdaras.bsky.social et al. on training diffusion models with mostly bad/low-quality/corrupted data (+few high-quality samples). This time for proteins!
- Ambient diffusion Omni: arxiv.org/pdf/2506.10038
- Ambient Proteins: www.biorxiv.org/content/10.1...
Performance of the best known Reasoning models on various Benchmarks. OpenThinker-32B matches the current state of the art.
We are releasing OpenThinker-32B, the best 32B reasoning model with open data. We match or outperform Deepseek-R1-32B (a closed data model) in reasoning benchmarks. Congrats to Negin and the whole Open Thoughts team.
github.com/open-thought...
o3 can't multiply beyond a few digits...
But I think multiplication, addition, maze solving, and easy-to-hard generalization are actually solvable on standard transformers...
with recursive self-improvement.
Below is the accuracy of a tiny model teaching itself how to add and multiply.
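A toy illustration of the easy-to-hard idea (not the actual experiment, which uses transformers): a "model" that has only memorized single-digit addition facts can bootstrap itself to arbitrary-length addition by composing its own earlier outputs, i.e. self-generated supervision for harder problems.

```python
import random

# Toy sketch: the "model" is a table of single-digit addition facts (the
# easy skill); multi-digit answers are built by reusing its own outputs.
facts = {(a, b): a + b for a in range(10) for b in range(10)}

def model_add(x, y):
    # Grade-school addition: each digit step calls the learned facts table,
    # so harder problems are solved by composing easier ones.
    digits, carry = [], 0
    while x or y or carry:
        d = facts[(x % 10, y % 10)] + carry
        digits.append(d % 10)
        carry = d // 10
        x, y = x // 10, y // 10
    return int("".join(map(str, reversed(digits)))) if digits else 0

# "Self-improvement round": generate harder (6-digit) problems and check
# the composed answers against ground truth.
random.seed(0)
tests = [(random.randrange(10**6), random.randrange(10**6)) for _ in range(1000)]
acc = sum(model_add(x, y) == x + y for x, y in tests) / len(tests)
print(f"accuracy on 6-digit addition: {acc:.2f}")
```

The hard part in the transformer setting is getting the model to discover and trust this compositional procedure on its own, which is what the recursive self-improvement loop is for.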
Well deserved!
20.12.2024 14:20
Did Gauss invent the Gaussian?
- Laplace wrote down the integral first in 1783
- Gauss then described it in 1809 in the context of least-squares fitting of astronomical measurements
- Pearson & Fisher framed it as the "normal" density only in 1910
- Best part: Gauss gave Laplace credit!
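The integral in question, which Laplace evaluated first, and the density later named after Gauss:

```latex
\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi},
\qquad
\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty}
  e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = 1 .
```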
“There is nothing quite so useless as doing with great efficiency something that should not be done at all.”
04.12.2024 07:12
“To write well is to think clearly.”
27.11.2024 07:19
Please reply to your ICLR rebuttals!
24.11.2024 15:49