Gonzalo Benegas

@gonzalobenegas.bsky.social

Comp Bio Postdoc @ UC Berkeley https://gonzalobenegas.github.io/

262 Followers  |  806 Following  |  17 Posts  |  Joined: 26.09.2023  |  2.3381

Latest posts by gonzalobenegas.bsky.social on Bluesky

We are excited to share GPN-Star, a cost-effective, biologically grounded genomic language modeling framework that achieves state-of-the-art performance across a wide range of variant effect prediction tasks relevant to human genetics.
www.biorxiv.org/content/10.1...
(1/n)

22.09.2025 05:29 | 👍 174    🔁 90    💬 4    📌 5

I am thrilled to announce that in January 2026 I will be starting my own lab at NYU Biology! Soon enough I will be recruiting postdocs and students! Please reach out if you are interested with a CV and description of your research interests, or if you know of people who could be interested! 🧬🗽 🦊

25.06.2025 20:10 | 👍 82    🔁 19    💬 7    📌 0

How can one efficiently simulate phylodynamics for populations with billions of individuals, as is typical in many applications, e.g., viral evolution and cancer genomics? In this work with M. Celentano, @wsdewitt.github.io , & S. Prillo, we provide a solution. doi.org/10.1073/pnas...
1/n

23.05.2025 21:02 | 👍 37    🔁 15    💬 1    📌 1

Thrilled to see my digital art on the cover of Trends Genet. The two binary strings represent reverse-complementary DNA sequences (00=A, 01=C, 10=G, 11=T) and the connecting rectangles represent “embeddings” learned by DNA language models. Pls check out our article as well: doi.org/10.1016/j.ti...

07.04.2025 15:01 | 👍 69    🔁 13    💬 0    📌 1
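The encoding in the cover-art post above (00=A, 01=C, 10=G, 11=T) has a handy property: complementing a base flips both of its bits. The following is a minimal Python sketch that assumes nothing beyond that mapping; the function names are illustrative and are not taken from the article or from any library.

```python
# Encoding stated in the post: 00=A, 01=C, 10=G, 11=T.
ENCODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in ENCODE.items()}


def encode(seq: str) -> str:
    """Map a DNA string to its binary string, two bits per base."""
    return "".join(ENCODE[base] for base in seq.upper())


def decode(bits: str) -> str:
    """Inverse of encode: read the binary string two bits at a time."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def reverse_complement_bits(bits: str) -> str:
    """Reverse-complement directly on the binary string.

    Under this encoding, complementing a base flips both of its bits
    (A=00 <-> T=11, C=01 <-> G=10), so we flip every bit and then
    reverse the order of the two-bit pairs (not of the single bits).
    """
    flipped = "".join("1" if b == "0" else "0" for b in bits)
    pairs = [flipped[i:i + 2] for i in range(0, len(flipped), 2)]
    return "".join(reversed(pairs))


bits = encode("AAC")                             # "000001"
revcomp = decode(reverse_complement_bits(bits))  # "GTT"
```

Reversing whole pairs matters: reversing the raw bit string would turn 10 (G) into 01 (C) and corrupt the sequence.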

In our updated TraitGym preprint (w/ @gonzalobenegas.bsky.social & Gökcen Eraslan), we evaluate Evo 2 on regulatory variants associated with human traits. We see marked performance gains with scale on Mendelian traits, although still a bit behind alignment-based methods.
doi.org/10.1101/2025...
1/n

04.03.2025 19:54 | 👍 32    🔁 13    💬 1    📌 2

Thank you for contributing to bioicons! Sorry I forgot to add to acknowledgements, I will in the final version!

15.02.2025 19:29 | 👍 1    🔁 0    💬 0    📌 0

Thank you Remi!

14.02.2025 18:34 | 👍 0    🔁 0    💬 0    📌 0

songlab/TraitGym · Datasets at Hugging Face We’re on a journey to advance and democratize artificial intelligence through open source and open science.

TraitGym is available on HuggingFace, including a Colab notebook to eval a model in a few minutes:
huggingface.co/datasets/son...

13.02.2025 20:57 | 👍 0    🔁 0    💬 1    📌 0

Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics Machine learning holds immense promise in biology, particularly for the challenging task of identifying causal variants for Mendelian and complex traits. Two primary approaches have emerged for this t...

Check out the paper for more details, including stratification by consequence, trait, and eQTL:
www.biorxiv.org/content/10.1...

13.02.2025 20:57 | 👍 1    🔁 0    💬 1    📌 0

Scaling is probably part of the solution, but data curation might be the major bottleneck. The vast majority of bases in mammalian genomes lack evolutionary constraint, which is precisely the signal leveraged by self-supervision.

13.02.2025 20:57 | 👍 0    🔁 0    💬 1    📌 0

Alignment-free DNA language models are not yet competitive. The best among them, our GPN-Promoter and SpeciesLM from @gagneurlab.bsky.social, are not the largest in number of parameters or context. Their key feature is having been trained only on functional regions of the genome.

13.02.2025 20:57 | 👍 1    🔁 0    💬 1    📌 0

Conservation-aware CADD and GPN-MSA do better on Mendelian trait variants, expected to be under strong purifying selection. On complex trait variants, especially for non-disease traits, functional-genomics models Enformer and Borzoi tend to do better. However, ensembling helps:

13.02.2025 20:57 | 👍 1    🔁 0    💬 1    📌 0

We evaluate models zero-shot (unsupervised) and with linear probing (logistic regression on top of extracted features):

13.02.2025 20:57 | 👍 0    🔁 0    💬 1    📌 0
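To make the two evaluation modes in the post above concrete, here is a minimal sketch. Zero-shot uses a model-derived score as is, with no labels; linear probing fits a logistic regression on frozen, pre-extracted features. Everything below is an assumption for illustration: the toy feature vectors, scores, labels, function names, threshold, and the plain gradient-descent fit are hypothetical, and this is not the TraitGym code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def zero_shot_predict(scores, threshold=0.0):
    """Zero-shot: threshold a model-derived score directly; no labels used."""
    return [1 if s > threshold else 0 for s in scores]


def fit_linear_probe(features, labels, lr=0.1, epochs=500):
    """Linear probing: logistic regression on frozen features.

    Fitted here with full-batch gradient descent on the mean log loss.
    """
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_predict(weights, bias, features):
    """Classify each feature vector with the fitted probe."""
    return [
        1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) > 0.5 else 0
        for x in features
    ]


# Hypothetical toy data: 4 variants, 2 features each, label 1 = causal.
features = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]]
labels = [0, 0, 1, 1]
zero_shot_scores = [-1.2, 0.3, 0.8, 1.5]  # made-up per-variant scores

weights, bias = fit_linear_probe(features, labels)
probe_preds = probe_predict(weights, bias, features)
zero_shot_preds = zero_shot_predict(zero_shot_scores)
```

In practice the probe would be fitted and evaluated on separate variants; the sketch predicts on its own training data only to stay short.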

We evaluate a wide range of models with up to 7B parameters and 500K context size. Do these numbers matter? 🤔

13.02.2025 20:57 | 👍 0    🔁 0    💬 1    📌 0

We collect putative causal variants from OMIM and UKBB with carefully matched controls.

13.02.2025 20:57 | 👍 0    🔁 0    💬 1    📌 0

Can DNA sequence models predict mutations affecting human traits?

We introduce TraitGym, a curated benchmark of causal regulatory variants for 113 Mendelian & 83 complex traits, and evaluate functional genomics and DNA language models. Joint work w/ Gökcen Eraslan and @yun-s-song.bsky.social 🧵👇

13.02.2025 20:57 | 👍 28    🔁 15    💬 1    📌 2

Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics https://www.biorxiv.org/content/10.1101/2025.02.11.637758v1

13.02.2025 07:33 | 👍 8    🔁 1    💬 0    📌 1

I still believe in alignment-free gLMs with better data curation and loss functions. I've been seeing advances, but it's still tough.

02.02.2025 19:36 | 👍 1    🔁 0    💬 0    📌 0

*The exception is alignment-based gLMs, which do improve (non-trivially) over conservation scores.

02.02.2025 19:36 | 👍 0    🔁 0    💬 1    📌 0

A simple bar is: do you surpass conservation scores in identifying functional mutations? This bar was easily passed by pLMs and plant gLMs but not yet by human gLMs* even after 5 years.

02.02.2025 19:35 | 👍 2    🔁 0    💬 1    📌 0

Thank you Jo!

12.01.2025 19:34 | 👍 0    🔁 0    💬 0    📌 0

A previously reported bottleneck in human ancestry 900 kya is likely a statistical artifact Hu et al. (Science, 2023) recently inferred a severe ancient bottleneck around 900 thousand years (kya) ago in African ancestry but found no similar eviden

Our work, which shows statistical issues with the previous claim of a severe ancient bottleneck in the ancestry of African populations, has been selected as a Featured article in Genetics.

doi.org/10.1093/gene...

08.01.2025 20:23 | 👍 15    🔁 4    💬 0    📌 0

Coincidentally, another article from my lab on DNA language models got published on the same day as GPN-MSA. It's freely available for 50 days from this link:

authors.elsevier.com/a/1kNCscQbJB...
Genomic language models: opportunities and challenges

Please share with your colleagues.

03.01.2025 02:29 | 👍 10    🔁 2    💬 1    📌 0

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants - Nature Biotechnology A language model predicts the effects of genetic variants in the human genome.

Happy New Year! Our GPN-MSA paper is finally published, under a slightly different title from the preprint. Please check it out and share it with your colleagues:

doi.org/10.1038/s415...

1/4

02.01.2025 20:24 | 👍 16    🔁 7    💬 1    📌 1

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants - Nature Biotechnology A language model predicts the effects of genetic variants in the human genome.

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants - @yun-s-song.bsky.social go.nature.com/4gWppWg

02.01.2025 16:18 | 👍 31    🔁 13    💬 0    📌 0

Thanks! Do you think this could explain part of the gap between task 4 and 5? Could profile prediction help generalization to variants (of the count head)?

11.12.2024 18:37 | 👍 0    🔁 0    💬 1    📌 0

Really appreciate your paper. Could you clarify whether ChromBPNet and DNALMs are trained on the same data? From the paper and code it might seem like only ChromBPNet is given profile-level labels.

11.12.2024 03:09 | 👍 0    🔁 0    💬 1    📌 0

Ultrafast classical phylogenetic method beats large protein... Amino acid substitution rate matrices are fundamental to statistical phylogenetics and evolutionary biology. Estimating them typically requires reconstructed trees for massive amounts of aligned...

Large protein language models can learn complex epistatic interactions, but how much does that help with predicting variant effects? In this NeurIPS article, we show that classical independent-sites phylogenetic models can outperform pLMs on this task.
1/7
openreview.net/forum?id=H7m...

16.11.2024 20:41 | 👍 91    🔁 44    💬 2    📌 2