@jeffruffolo.bsky.social
Protein Design / ML @ Profluent Bio | Molecular Biophysics PhD @ Johns Hopkins
Learn more, use our models, or work directly with us!
Blog: www.profluent.bio/showcase/pro...
Github: github.com/Profluent-AI...
Platform access: docs.google.com/forms/d/1Pdk...
We're incredibly optimistic about the opportunities to solve important, hard problems in protein design by scaling up our models and data. We've already ~10x'd our data scale since training ProGen3, so this really is just the beginning.
Not only do we see compelling benchmark performance, but these aligned capabilities also extend to generative settings, which is what really matters for design: with just a bit of data, we can steer the models to generate the high-fitness sequences we want.
Coming back to fitness prediction, we wanted to see if this greater understanding of protein sequence space translated to stronger predictive power. We turned to alignment, where we use a bit of experimental data to tilt the model towards properties we care about, like stability.
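For intuition, here is a minimal sketch of one simple way such alignment could look in code: supervised fine-tuning on only the top-scoring sequences from a small experimental dataset. The model identifier, the dummy measurements, and the recipe itself are placeholders for illustration, not the procedure from the paper.

```python
# Minimal sketch: "tilting" a generative protein LM towards a measured property
# by fine-tuning on the best sequences from a small wet-lab dataset.
# The checkpoint name and data below are placeholders, not the ProGen3 release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "profluent/progen3-small"  # placeholder identifier
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# (sequence, measured stability) pairs -- dummy values for illustration
measurements = [
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 2.1),
    ("MKVAYLAKQRQISFVKNHFSRQLEERLGLIEVQ", 0.4),
    ("MSTLYIPQQRGISLVKSHFARQAEERLGWIEVN", -1.3),
    ("MKTAYIAKQRQISFVKSHFARQLEERLGLIEVN", 1.7),
]

# keep only the most stable half as the "preferred" training set
measurements.sort(key=lambda x: x[1], reverse=True)
preferred = [seq for seq, _ in measurements[: len(measurements) // 2]]

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
for seq in preferred:
    batch = tok(seq, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal LM loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```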
We think this is the beginning of a new, more meaningful way of understanding what it means to scale protein language models, going beyond ranking of mutations or predicting structural contacts. This will be incredibly useful in shaping how we apply models like ProGen3.
This extends even to proteins that had low (or no) homology to anything in the models' training data, where we still see comparable rates of protein expression, including for proteins with very low AlphaFold2 pLDDT.
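A quick aside on the pLDDT numbers mentioned here: AlphaFold2 writes per-residue pLDDT into the B-factor column of its predicted PDB files, so a mean confidence is easy to extract. A small sketch (the file name is hypothetical):

```python
# Sketch: computing mean pLDDT for a predicted structure.
# AlphaFold2 stores per-residue pLDDT in the B-factor column of its PDB output,
# so averaging B-factors over CA atoms recovers the per-protein confidence.
from Bio.PDB import PDBParser  # pip install biopython

def mean_plddt(pdb_path: str) -> float:
    structure = PDBParser(QUIET=True).get_structure("pred", pdb_path)
    plddts = [atom.get_bfactor() for atom in structure.get_atoms() if atom.get_id() == "CA"]
    return sum(plddts) / len(plddts)

print(mean_plddt("generated_protein_af2.pdb"))  # hypothetical file name
```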
To put this to the test, we experimentally measured the viability (expression) of hundreds of proteins in the lab and found that this added diversity is real. Generated proteins are as viable as natural proteins, and larger models come up with more and more of them.
So what should we be evaluating? Generative models like ProGen3 are fundamentally trained to generate proteins, so we just let the models generate! We found that as models scale, they not only generate higher-quality sequences but also produce considerably more diversity.
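As a rough illustration of what "more diversity" means operationally: sample a batch of sequences and look at their mean pairwise similarity. The sketch below uses difflib's ratio as a cheap stand-in for alignment-based identity, and dummy strings in place of real model samples; a real analysis would use proper sequence identity or clustering.

```python
# Sketch: quantifying diversity of a generated batch as mean pairwise similarity
# (lower = more diverse). difflib is a crude proxy for true sequence identity;
# the sequences are dummy placeholders for model samples.
from itertools import combinations
from difflib import SequenceMatcher

generated = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKVAYLAKQRQISFVKNHFSRQLEERLGLIEVQ",
    "MSTLYIPQQRGISLVKSHFARQAEERLGWIEVN",
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

pairs = list(combinations(generated, 2))
mean_sim = sum(similarity(a, b) for a, b in pairs) / len(pairs)
print(f"mean pairwise similarity: {mean_sim:.2f}")  # lower means a more diverse batch
```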
But why do all of this? What does scaling get us? ProteinGym is a nice benchmark for measuring zero-shot fitness prediction, but even three years ago (ProGen2) we found that this wasn't the best proxy for evaluating scaling, and we still find that to be the case.
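For context, zero-shot fitness prediction on ProteinGym-style benchmarks is usually done by scoring each variant's log-likelihood under the model relative to wild type. A rough sketch of that protocol, with a placeholder model identifier and dummy sequences:

```python
# Sketch: zero-shot variant scoring as a log-likelihood ratio vs. wild type.
# Model name and sequences are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "profluent/progen3-small"  # placeholder identifier
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def log_likelihood(seq: str) -> float:
    batch = tok(seq, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    # out.loss is the mean token cross-entropy; rescale to an approximate
    # summed log-probability so sequences of different lengths are comparable
    return -out.loss.item() * batch["input_ids"].shape[1]

wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # dummy sequence
mutant = wild_type[:10] + "A" + wild_type[11:]   # single substitution at position 11
zero_shot_score = log_likelihood(mutant) - log_likelihood(wild_type)
print(zero_shot_score)  # higher means the model "prefers" the mutant
```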
We fit scaling laws to choose optimal training configurations, which allowed us to scale up to 46B parameters, where we continue to see signs of generalization on diverse proteins far from the training data.
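For readers unfamiliar with the mechanics, a scaling law is typically a saturating power law fit to (model size, validation loss) points. A minimal sketch with made-up placeholder numbers, not values from the paper:

```python
# Sketch: fitting a scaling law L(N) = a * N**(-alpha) + c to
# (parameter count, validation loss) pairs. All numbers are placeholders.
import numpy as np
from scipy.optimize import curve_fit

params = np.array([112e6, 339e6, 1.1e9, 3e9, 46e9])  # model sizes (placeholders)
losses = np.array([2.31, 2.18, 2.05, 1.97, 1.86])    # validation losses (placeholders)

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, losses, p0=[10.0, 0.1, 1.5], maxfev=10000)
print(f"L(N) ~ {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")
```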
ProGen3 is a family of mixture-of-experts (MoE) models ranging from 112M to 46B parameters, capable of full-sequence generation as well as infilling. For practical protein design problems, these new capabilities open up a lot of possibilities.
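To make infilling concrete, here is a sketch of a fill-in-the-middle style prompt where the flanking regions are fixed and the model proposes the middle span. The sentinel tokens, model identifier, and prompt layout are assumptions for illustration, not the actual ProGen3 interface.

```python
# Sketch: fill-in-the-middle style infilling with a causal protein LM,
# assuming the model was trained with prefix/suffix/middle sentinel tokens.
# Token strings, model name, and sequences are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "profluent/progen3-small"  # placeholder identifier
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

n_term = "MKTAYIAKQRQISFVKSHFSRQ"   # fixed N-terminal region
c_term = "LEERLGLIEVQAPILSRVGDG"    # fixed C-terminal region
prompt = f"<prefix>{n_term}<suffix>{c_term}<middle>"  # assumed FIM-style layout

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, do_sample=True, temperature=0.8, max_new_tokens=60)
middle = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(n_term + middle + c_term)  # assembled full-length design
```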
What does pushing the boundaries of model capacity and data scale do for generative protein language models? I'm super excited to share our latest work at Profluent Bio, where we begin to explore and test some of our hypotheses!
www.biorxiv.org/content/10.1...
I'll be in Vancouver for NeurIPS December 13-16; reach out if you're interested in protein language models, genome editor / antibody design, or any of the other cool stuff we're doing at Profluent!
We note the 86% identity to S. cristatus Cas9 in the paper, but importantly found that OpenCRISPR-1 shared <80% identity with anything in our database of patented sequences.
Hey, happy to provide more info here! We did find other active generated sequences that were further from natural proteins (including S. cristatus), but for OpenCRISPR-1 we wanted a sequence that'd be viable as a drop-in for SpCas9 (high activity, NGG PAM), which ended up bringing us closer to a few naturals.
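For readers curious how an identity screen like the one above works in practice, here is a sketch of comparing a generated sequence against a reference set by global pairwise identity. The sequences and tiny "database" are dummy placeholders, and a real screen over a large patent database would use a dedicated tool (e.g. MMseqs2 or BLAST) rather than pairwise Python loops.

```python
# Sketch: screening a generated sequence against reference sequences by
# global pairwise percent identity. All sequences are dummy placeholders.
from Bio import Align
from Bio.Align import substitution_matrices  # pip install biopython

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5

def percent_identity(a: str, b: str) -> float:
    aln = aligner.align(a, b)[0]
    a_row, b_row = str(aln[0]), str(aln[1])  # gapped, aligned rows
    matches = sum(x == y for x, y in zip(a_row, b_row) if x != "-" and y != "-")
    return 100.0 * matches / max(len(a), len(b))  # identity over the longer sequence

database = {"ref1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
            "ref2": "MSTLYIPQQRGISLVKSHFARQAEERLGWIEVN"}
candidate = "MKVAYLAKQRQISFVKNHFSRQLEERLGLIEVQ"

best = max(percent_identity(candidate, seq) for seq in database.values())
print(f"max identity to database: {best:.1f}%")  # e.g. flag candidates >= 80%
```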