
Anthony Gitter

@anthonygitter.bsky.social

Computational biologist; Associate Prof. at University of Wisconsin-Madison; Jeanne M. Rowe Chair at Morgridge Institute

68 Followers  |  29 Following  |  21 Posts  |  Joined: 03.04.2025

Latest posts by anthonygitter.bsky.social on Bluesky

[Link preview] "Biophysics-based protein language models for protein engineering - Nature Methods": Mutational effect transfer learning (METL) is a protein language model framework that unites machine learning and biophysical modeling. Transformer-based neural networks are pretrained on biophysical simulation data to capture fundamental relationships between protein sequence, structure and energetics.

AI + physics for protein engineering 🚀
Our collaboration with @anthonygitter.bsky.social is out in Nature Methods! We use synthetic data from molecular modeling to pretrain protein language models. Congrats to Sam Gelman and the team!
🔗 www.nature.com/articles/s41...

01.10.2025 19:07 — 👍 4    🔁 1    💬 0    📌 0

Does anyone know whether there's a functioning API to ESMFold?

(api.esmatlas.com/foldSequence... gives me Service Temporarily Unavailable)
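For reference, a minimal sketch of how one might hit that endpoint from Python (the URL pattern is the one from the ESMFold docs; the test sequence and output filename are placeholders):

```python
import requests

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder test protein

# POST the raw sequence; on success the response body is a PDB file.
resp = requests.post(
    "https://api.esmatlas.com/foldSequence/v1/pdb/",
    data=sequence,
    timeout=60,
)
if resp.ok:
    with open("prediction.pdb", "w") as fh:
        fh.write(resp.text)
else:
    # The failure mode reported above, e.g. 503 Service Temporarily Unavailable.
    print(resp.status_code, resp.reason)
```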

30.09.2025 14:11 — 👍 3    🔁 1    💬 2    📌 0
[Link preview] "GitHub - gitter-lab/metl": Mutational Effect Transfer Learning (METL) framework for pretraining and finetuning biophysics-informed protein language models.

The main GitHub repo github.com/gitter-lab/m... links to the extensive resources for running Rosetta simulations at scale to generate new training data, training METL models, running our models, and accessing our datasets. 8/
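For orientation, a hedged sketch of loading and querying a pretrained METL model; the loader name, model identifier, and encoder call below follow the pattern in the metl-pretrained README as I recall it, and all of them are assumptions to verify against the repo:

```python
# Assumed API based on the gitter-lab/metl-pretrained README; the function
# names and the model identifier are assumptions, so check the repo first.
import torch
import metl

model, data_encoder = metl.get_from_uuid(uuid="YoQkzoLD")  # identifier assumed
model.eval()

wt = "MSKGEELFTGVVPILVELDG"  # placeholder wild-type fragment, not real GFP
variants = ["E3K,G12A"]      # mutations relative to the wild-type sequence

encoded = data_encoder.encode_variants(wt, variants)  # call name assumed
with torch.no_grad():
    scores = model(torch.tensor(encoded))
```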

11.09.2025 17:00 — 👍 0    🔁 0    💬 0    📌 0
Fig. 6: Low-N GFP design.

We can use METL for low-N protein design. We trained METL on Rosetta simulations of GFP biophysical attributes and only 64 experimental examples of GFP brightness. It designed fluorescent variants with 5 and 10 mutations, including some whose mutations fall entirely outside those seen in training. 7/
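As a toy illustration of the low-N setting (not the paper's actual optimization procedure), one could propose random 5-mutation variants and keep those a finetuned model scores as brightest; `score_variant` is a hypothetical stand-in for a METL model finetuned on the 64 brightness measurements:

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(0)

def propose(wt: str, n_mut: int = 5) -> str:
    # Mutate n_mut randomly chosen positions to random different residues.
    seq = list(wt)
    for pos in rng.sample(range(len(seq)), n_mut):
        seq[pos] = rng.choice([a for a in AAS if a != seq[pos]])
    return "".join(seq)

def design(wt: str, score_variant, n_candidates: int = 10_000, top_k: int = 10):
    # Rank random candidates by predicted brightness and keep the top k.
    candidates = {propose(wt) for _ in range(n_candidates)}
    return sorted(candidates, key=score_variant, reverse=True)[:top_k]
```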

11.09.2025 17:00 — 👍 0    🔁 0    💬 1    📌 0
Fig. 5: Function-specific simulations improve METL pretraining for GB1.

A powerful aspect of pretraining on biophysical simulations is that the simulations can be customized to match the protein function and experimental assay. Our expanded simulations of the GB1-IgG complex with Rosetta InterfaceAnalyzer improve METL predictions of GB1 binding. 6/
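For readers who want to try something similar, a sketch of interface scoring with PyRosetta's InterfaceAnalyzerMover; the input structure and chain IDs are placeholders, and the paper's Rosetta protocol may differ in detail:

```python
import pyrosetta
from pyrosetta.rosetta.protocols.analysis import InterfaceAnalyzerMover

pyrosetta.init()
pose = pyrosetta.pose_from_pdb("gb1_igg_complex.pdb")  # placeholder input file

# Analyze the interface between chains A and B (chain IDs assumed).
iam = InterfaceAnalyzerMover("A_B")
iam.apply(pose)
print("dG_separated:", iam.get_interface_dG())  # binding-related energy term
```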

11.09.2025 17:00 — 👍 0    🔁 0    💬 1    📌 0
Fig. 3: Comparative performance across extrapolation tasks.

We also benchmark METL on four types of difficult extrapolation. For instance, positional extrapolation provides training data from some sequence positions and tests predictions at different sequence positions. Linear regression completely fails in this setting. 5/
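Concretely, a positional extrapolation split can be built by holding out whole sequence positions so that test variants only mutate positions never seen in training; a sketch assuming variant strings like "E3K,G102A":

```python
import random

def positional_split(variants, seq_len, test_frac=0.2, seed=0):
    # Hold out a random subset of positions; test variants touch only those.
    rng = random.Random(seed)
    test_pos = set(rng.sample(range(seq_len), int(test_frac * seq_len)))

    def positions(variant):
        # "E3K,G102A" -> {2, 101} (1-based notation to 0-based indices)
        return {int(m[1:-1]) - 1 for m in variant.split(",")}

    train = [v for v in variants if positions(v).isdisjoint(test_pos)]
    test = [v for v in variants if positions(v) <= test_pos]
    return train, test
```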

11.09.2025 17:00 — 👍 0    🔁 0    💬 1    📌 0
Fig. 2: Comparative performance of Linear, Rosetta total score, EVE, RaSP, Linear-EVE, ESM-2, ProteinNPT, METL-Global and METL-Local across different training set sizes.

We compare these approaches on deep mutational scanning datasets with increasing training set sizes. Biophysical pretraining helps METL generalize well with small training sets. However, augmented linear regression with EVE scores is great on some of these assays. 4/
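For intuition, an augmented linear baseline in the spirit of Linear-EVE just appends the EVE score as one extra feature column; a self-contained toy version with synthetic data, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 200, 100
one_hot = rng.integers(0, 2, size=(n, d)).astype(float)  # mutation indicators
eve = rng.normal(size=n)                                  # per-variant EVE-like score
y = one_hot @ rng.normal(size=d) + 0.5 * eve              # toy fitness labels

X = np.hstack([one_hot, eve[:, None]])  # augment one-hot features with the score
model = Ridge(alpha=1.0).fit(X[:150], y[:150])
print(model.score(X[150:], y[150:]))    # R^2 on held-out toy variants
```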

11.09.2025 17:00 — 👍 0    🔁 0    💬 1    📌 0

METL models pretrained on Rosetta biophysical attributes learn different protein representations than general protein language models like ESM-2 or protein family-specific models like EVE. These new representations are valuable for machine learning-guided protein engineering. 3/

11.09.2025 17:00 — 👍 1    🔁 0    💬 1    📌 0

Most protein language models train on natural protein sequence data and use the underlying evolutionary signals to score sequence variants. Instead, METL trains on @rosettacommons.bsky.social data, learning from simulated biophysical attributes of the sequence variants we select. 2/

11.09.2025 17:00 — 👍 0    🔁 0    💬 1    📌 0
[Link preview] "Biophysics-based protein language models for protein engineering - Nature Methods": Mutational effect transfer learning (METL) is a protein language model framework that unites machine learning and biophysical modeling. Transformer-based neural networks are pretrained on biophysical ...

The journal version of "Biophysics-based protein language models for protein engineering" with @philromero.bsky.social is live! Mutational Effect Transfer Learning (METL) is a protein language model trained on biophysical simulations that we use for protein engineering. 1/

doi.org/10.1038/s415...

11.09.2025 17:00 — 👍 13    🔁 2    💬 1    📌 0
[Link preview] "Chemical Language Model Linker: Blending Text and Molecules with Modular Adapters": The development of large language models and multimodal models has enabled the appealing idea of generating novel molecules from text descriptions. Generative modeling would shift the paradigm from relying on large-scale chemical screening to find molecules with desired properties to directly generating those molecules. However, multimodal models combining text and molecules are often trained from scratch, without leveraging existing high-quality pretrained models. Training from scratch consumes more computational resources and prohibits model scaling. In contrast, we propose a lightweight adapter-based strategy named Chemical Language Model Linker (ChemLML). ChemLML blends the two single domain models and obtains conditional molecular generation from text descriptions while still operating in the specialized embedding spaces of the molecular domain. ChemLML can tailor diverse pretrained text models for molecule generation by training relatively few adapter parameters. We find that the choice of molecular representation used within ChemLML, SMILES versus SELFIES, has a strong influence on conditional molecular generation performance. SMILES is often preferable despite not guaranteeing valid molecules. We raise issues in using the entire PubChem data set of molecules and their associated descriptions for evaluating molecule generation and provide a filtered version of the data set as a generation test set. To demonstrate how ChemLML could be used in practice, we generate candidate protein inhibitors and use docking to assess their quality and also generate candidate membrane permeable molecules.

The journal version of our paper 'Chemical Language Model Linker: Blending Text and Molecules with Modular Adapters' is out doi.org/10.1021/acs....

ChemLML is a method for text-based conditional molecule generation that uses pretrained text models like SciBERT, Galactica, or T5.
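The adapter idea in one screenful: a small trainable projection maps frozen text-encoder embeddings into the molecule decoder's conditioning space. Dimensions and module names below are illustrative, not ChemLML's actual code:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Trainable bridge between a frozen text model and a frozen molecule model."""
    def __init__(self, text_dim=768, mol_dim=256, hidden=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.GELU(), nn.Linear(hidden, mol_dim)
        )

    def forward(self, text_emb):
        return self.proj(text_emb)

adapter = Adapter()
text_emb = torch.randn(4, 768)  # stand-in for frozen SciBERT embeddings
mol_cond = adapter(text_emb)    # conditioning vectors for the molecule decoder
print(mol_cond.shape)           # torch.Size([4, 256])
```

Only the adapter's parameters would be trained, which is what keeps the approach lightweight relative to training a multimodal model from scratch.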

22.08.2025 13:36 — 👍 0    🔁 0    💬 0    📌 0

🚨New paper 🚨

Can protein language models help us fight viral outbreaks? Not yet. Here's why 🧵👇
1/12

17.08.2025 03:42 — 👍 42    🔁 19    💬 3    📌 0
[Link preview] "Assay2Mol: large language model-based drug design using BioAssay context": Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against...

Paper: arxiv.org/abs/2507.12574
GitHub: github.com/gitter-lab/A...
Datasets: doi.org/10.5281/zeno...

7/

18.07.2025 15:13 — 👍 1    🔁 0    💬 0    📌 0
Distribution of the top 10 docking scores from molecules with high- and low-relevance BioAssays as context for different proteins.

There are many more results and controls in the paper. Here's how the best (most negative) docking scores change when we use relevant assays, irrelevant assays, or no assays as context for generation with GPT-4o. In the majority of cases, but not all, relevant context helps. 6/

18.07.2025 15:13 — 👍 1    🔁 0    💬 1    📌 0

This generally has the desired effects across multiple LLMs and queried protein targets, with the caveat that our core results are based on AutoDock Vina scores. Assessing generated molecules with docking is admittedly frustrating. 5/
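For context, that docking-based assessment boils down to calls like the following; the file names and box coordinates are placeholders for a prepared target:

```python
import subprocess

# Dock one generated molecule against a prepared receptor with AutoDock Vina.
subprocess.run(
    [
        "vina",
        "--receptor", "target.pdbqt",
        "--ligand", "candidate.pdbqt",
        "--center_x", "10.0", "--center_y", "12.5", "--center_z", "-3.0",
        "--size_x", "20", "--size_y", "20", "--size_z", "20",
        "--out", "docked.pdbqt",
    ],
    check=True,
)
# The best (most negative) binding affinity is printed in Vina's stdout table.
```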

18.07.2025 15:13 — 👍 1    🔁 0    💬 1    📌 0

We embed the BioAssay data into a vector database, retrieve initial candidate assays, and do further LLM-based filtering and summarization. We select some active and inactive molecules from the BioAssay data table. This is all used for in-context learning and molecule generation. 4/
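The retrieval step reduces to nearest neighbors in embedding space; a minimal sketch with toy embeddings (any sentence encoder could produce the real ones):

```python
import numpy as np

def top_k_assays(query_emb, assay_embs, k=5):
    # Cosine similarity between the query and every pre-embedded assay.
    q = query_emb / np.linalg.norm(query_emb)
    A = assay_embs / np.linalg.norm(assay_embs, axis=1, keepdims=True)
    sims = A @ q
    return np.argsort(sims)[::-1][:k]  # indices of the k most similar assays

rng = np.random.default_rng(1)
assay_embs = rng.normal(size=(1000, 384))  # toy pre-embedded descriptions
print(top_k_assays(rng.normal(size=384), assay_embs))
```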

18.07.2025 15:13 — 👍 1    🔁 0    💬 1    📌 0
[Link preview] "Data mining of PubChem bioassay records reveals diverse OXPHOS inhibitory chemotypes as potential therapeutic agents against ovarian cancer - Journal of Cheminformatics": Focused screening on target-prioritized compound sets can be an efficient alternative to high throughput screening (HTS). For most biomolecular targets, compound prioritization models depend on prior ...

A proof-of-concept study from our collaborators showed that mining this PubChem data successfully identified new candidates for a target phenotype, oxidative phosphorylation: doi.org/10.1186/s133....

We wanted to generalize that for any new query and assess the effectiveness. 3/

18.07.2025 15:13 — 👍 1    🔁 0    💬 1    📌 0
SSB-PriA antibiotic resistant target AlphaScreen

PubChem BioAssays can contain a lot of information about why and how an assay was run. Here's an example from our collaborators. pubchem.ncbi.nlm.nih.gov/bioassay/127...

There are now 1.7M PubChem BioAssays ranging in scale from a few tested molecules to high-throughput screens. 2/
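BioAssay records like that one can be pulled programmatically through PubChem's PUG REST API; the AID below is a placeholder (the link above is truncated, so the real one isn't shown), and the JSON key path is my reading of PUG REST's layout:

```python
import requests

aid = 1259350  # placeholder assay ID
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/description/JSON"
record = requests.get(url, timeout=30).json()

# Key path assumed from PUG REST's JSON layout; adjust if the schema differs.
print(record["PC_AssayContainer"][0]["assay"]["descr"]["name"])
```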

18.07.2025 15:13 — 👍 1    🔁 0    💬 1    📌 0
The Assay2Mol workflow. A chemist provides a target description, which is used to retrieve BioAssays from the pre-embedded vector database. After filtering for relevance, the BioAssays are summarized by an LLM. The BioAssay ID is then used to retrieve experimental tables. The final molecule generation prompt is formed by combining the description, summarization, and selected test molecules with associated test outcomes, enabling the LLM to generate relevant active molecules.

Our Assay2Mol preprint introduces using PubChem chemical screening data as context when generating molecules with large language models. It uses assay descriptions and protocols to find relevant assays, then supplies that text plus active/inactive molecules as context for generation. 1/

18.07.2025 15:13 — 👍 1    🔁 1    💬 1    📌 0

Nobody is commenting on this little nugget from Fig 1?

18.07.2025 14:55 — 👍 34    🔁 5    💬 3    📌 1
[Link preview] "MLCB": The 20th Machine Learning in Computational Biology (MLCB) meeting will be a two-day hybrid conference, September 10-11, 9am-5pm ET, with the in-person component at the New York Genome Center, NYC. Reg...

That's what MLCB became after it was rejected as a NeurIPS workshop: www.mlcb.org

Maybe a potential partner?

07.07.2025 18:29 — 👍 2    🔁 0    💬 0    📌 0

Some complexes can be huge, not that that is what you'd use this model for. The mammalian nuclear pore complex has ~800 nucleoporins and a molecular weight of ~100 MDa. doi.org/10.1016/j.tc...

30.05.2025 14:22 — 👍 3    🔁 0    💬 1    📌 0

Isn't PKZILLA-1 the new champ? 45k amino acids www.uniprot.org/uniprotkb/A0...

30.05.2025 14:06 — 👍 2    🔁 0    💬 1    📌 0
[Link preview] "New methods are revolutionizing biology: an interview with Martin Steinegger": Martin Steinegger, who is the only non-DeepMind-affiliated author of the AlphaFold2 Nature paper, offers unique insights and personal reflections.

Happy to share this interview with Weijie Zhao from NSR at #OxfordUniversityPress. It covers questions I'm often asked: why I chose Korea, AlphaFold2, my unconventional journey into academia, and research insights. Thanks again for the fun conversation.
📄 academic.oup.com/nsr/article/...

19.05.2025 12:08 — 👍 75    🔁 17    💬 0    📌 1
[Link preview] "PDB101: Learn: Other Resources: Commemorating 75 Years of Discovery and Innovation at the NSF": Download images celebrating NSF and PDB milestones.

To honor the 75th anniversary of @NSF, RCSB PDB Intern Xinyi Christine Zhang created posters to celebrate the science made possible by the NSF and RCSB PDB.
Explore these images and learn how protein research is changing our world. #NSFfunded #NSF75
pdb101.rcsb.org/lear...

08.05.2025 16:18 — 👍 16    🔁 10    💬 1    📌 1

I am assuming that responsibility extends far beyond Bluesky and that he also agrees to co-sign a personal loan I am applying for and walk the new puppy I adopted.

04.04.2025 14:38 — 👍 1    🔁 0    💬 1    📌 0

My first post is a niche and personal shout out to @michaelhoffman.bsky.social, the person who asked me most often if I am on Bluesky yet.

03.04.2025 23:23 — 👍 22    🔁 2    💬 4    📌 1
