We also made some improvements with the genomic language model Evo 2, but in this case the interpretation was less clear. See the preprint for more details. Code for using LFB will be made available shortly. 10/10
26.05.2025 17:30
This provides evidence that better fitness estimation can be achieved at negligible computational cost by bridging the gap between likelihood and fitness at inference time. 9/n
26.05.2025 17:30
We show a scatterplot of ROC-AUCs for each gene, calculated by separating benign- and pathogenic-labelled variants using either the usual or the LFB fitness estimate.
This trend held across DMS assay types and mutational depth, and also on prediction of clinical variants. 8/n
26.05.2025 17:30
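A per-gene ROC-AUC like the one described in this post can be sketched in plain Python. This is an illustration, not the evaluation code behind the figure: it assumes a higher fitness score means "more benign", and the `(gene, label, score)` record layout and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def roc_auc(benign: List[float], pathogenic: List[float]) -> float:
    """Chance that a random benign variant outscores a random pathogenic one.

    Ties count as half a win, which matches the usual ROC-AUC definition.
    """
    if not benign or not pathogenic:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for b in benign:
        for p in pathogenic:
            if b > p:
                wins += 1.0
            elif b == p:
                wins += 0.5
    return wins / (len(benign) * len(pathogenic))


def per_gene_auc(records: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Group (gene, label, score) records by gene and compute one AUC per gene.

    Labels must be "benign" or "pathogenic" (an assumption of this sketch).
    Genes missing either class are skipped, since AUC is undefined for them.
    """
    by_gene = defaultdict(lambda: {"benign": [], "pathogenic": []})
    for gene, label, score in records:
        by_gene[gene][label].append(score)
    return {
        gene: roc_auc(groups["benign"], groups["pathogenic"])
        for gene, groups in by_gene.items()
        if groups["benign"] and groups["pathogenic"]
    }
```

Running `per_gene_auc` once with the usual scores and once with LFB scores gives the two coordinates of each point in such a scatterplot.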
We show a plot of Model Size vs Mean Spearman Correlation across the DMS datasets from ProteinGym for the ESM-2 and ProGen2 model families, both with and without LFB estimation.
On ProteinGym, LFB provided significant improvements across model classes and sizes, and we saw that larger, better-fit models provided better predictions in general.
proteingym.org 7/n
26.05.2025 17:30
We found that, under an Ornstein-Uhlenbeck model of evolution, our LFB estimate should have lower variance than the standard estimate, because it marginalises out the effect of drift. 6/n
26.05.2025 17:30
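For intuition only, here is a toy calculation of why averaging can lower variance. It is not the Ornstein-Uhlenbeck derivation from the preprint: it assumes each related sequence yields the true effect plus a noise term of variance `sigma2`, and that every pair of noise terms shares the same correlation `rho`. The function name and parameters are hypothetical.

```python
def variance_of_mean(sigma2: float, n: int, rho: float = 0.0) -> float:
    """Variance of the average of n equally correlated noisy estimates.

    Each estimate has variance sigma2 and every pair has correlation rho,
    so Var(mean) = sigma2 * (1/n + (1 - 1/n) * rho).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch only handles 0 <= rho <= 1")
    return sigma2 * (1.0 / n + (1.0 - 1.0 / n) * rho)
```

With `rho = 0` the variance falls as `sigma2 / n`; with `rho = 1` averaging does not help at all. In this toy picture, the gain from averaging shrinks as the noise shared between the related sequences grows.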
We show a schematic of the LFB estimate: by averaging predictions for a variant applied to other related sequences, we produce a score which should be closer to the true change in fitness.
We tried a simple strategy: averaging predictions over sequences under similar selective pressures, to reduce the impact of unwanted non-fitness-related correlations. We call this likelihood fitness bridging (LFB). 5/n
26.05.2025 17:30
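The averaging idea in this post can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: `log_lik` stands in for any model's sequence log-likelihood, the background sequences are assumed to be pre-aligned, gap-free and the same length as the wild type, and all function names and the toy scorer are hypothetical.

```python
from statistics import fmean
from typing import Callable, Sequence


def apply_variant(seq: str, pos: int, mut: str) -> str:
    """Return seq with the residue at 0-based index pos replaced by mut."""
    return seq[:pos] + mut + seq[pos + 1:]


def standard_score(log_lik: Callable[[str], float], seq: str, pos: int, mut: str) -> float:
    """Usual zero-shot estimate: log-likelihood change in a single background."""
    return log_lik(apply_variant(seq, pos, mut)) - log_lik(seq)


def lfb_score(
    log_lik: Callable[[str], float],
    backgrounds: Sequence[str],
    pos: int,
    mut: str,
) -> float:
    """Average the same variant's score over several related backgrounds."""
    if not backgrounds:
        raise ValueError("need at least one background sequence")
    return fmean([standard_score(log_lik, bg, pos, mut) for bg in backgrounds])


def toy_log_lik(seq: str) -> float:
    """Hypothetical stand-in for a model: penalise each change between neighbours."""
    return -0.5 * sum(1 for a, b in zip(seq, seq[1:]) if a != b)


# The single-background score and the averaged score for the same variant.
single = standard_score(toy_log_lik, "AAAA", 1, "B")
averaged = lfb_score(toy_log_lik, ["AAAA", "AABA", "ABBA"], 1, "B")
```

Because the only model calls are extra forward passes over the background sequences, no retraining is involved in a scheme like this.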
We wondered whether we might be able to improve predictions from existing models without any further training. 4/n
26.05.2025 17:30
Non-identifiability and the Blessings of Misspecification in Models...
Misspecification is a blessing, not a curse, when estimating protein fitness from evolutionary sequence data using generative models.
Weinstein et al. show that better-fit sequence models can perform worse at fitness estimation due to phylogenetic structure:
openreview.net/forum?id=CwG...
And in practice we are seeing that pLMs don't improve with lower perplexities:
openreview.net/forum?id=UvP... www.biorxiv.org/content/10.1... 3/n
26.05.2025 17:30
Have We Hit the Scaling Wall for Protein Language Models?
Beyond Scaling: What Truly Works in Protein Fitness Prediction
Protein language models are showing promise in variant effect prediction - but there's emerging evidence that likelihood-based zero-shot fitness estimation is breaking down. See this excellent summary from @pascalnotin.bsky.social: pascalnotin.substack.com/p/have-we-hi... 2/n
26.05.2025 17:30
@cwjpugh.bsky.social at #VariantEffect25
22.05.2025 10:29
Three BioML starter packs now!
Pack 1: go.bsky.app/2VWBcCd
Pack 2: go.bsky.app/Bw84Hmc
Pack 3: go.bsky.app/NAKYUok
DM if you want to be included (or nominate people who should be!)
03.12.2024 03:27
Thanks Charlie for opening the PhD Symposium! Many thanks to everyone involved in its organisation. #CRGPhDSymp2024
28.11.2024 09:10
Assistant Professor at Erasmus Medical Center Rotterdam researching genetics & gene regulation in the context of disease
Team Leader & Independent Fellow at the @crg.eu
At the Repetitive DNA Biology (REPBIO) Lab, we leverage the latest technologies to decode nucleotide sequences for investigating how repetitive DNA shapes genome function and contributes to disease.
The goal of the Canadian BioGenome Project is to produce high-quality reference genomes for all Canadian species
Sequencing Canada's Biodiversity
Learn more: https://linktr.ee/canadianbiogenome
bioinformatics phd student at UCLA
The Earth BioGenome Project (EBP), a moonshot for biology, aims to sequence, catalog, and characterize the genomes of all of Earth's eukaryotic biodiversity over a period of ten years.
Harvard PhD Candidate SSQB
@DeboraMarksLab
Turtles all the way down
https://www.courtneyshearer.com/
Math Professor - IFCE
PhD Bioinformatics - UFMG
Computational biologist and mechanistic interpretability researcher. For more find me at: https://mclarke1991.github.io & https://www.linkedin.com/in/matthew-alan-clarke/
Postdoc @ Debbie Marks Lab, Harvard | Prev. PhD @ MIT EECS || ML for Proteins + Viruses
PhDing at the Sanger Institute, i'm evolving every day
Computational biologist. Geriatric Millennial. Professor, University of Cambridge. Director of Data Sciences, Baker Heart & Diabetes Institute. British | Australian | American.
www.inouyelab.org | Cambridge, UK
postdoc with Doc Edge at USC. interested in pop gen, stat gen, ELSI. she/they.
roshnipatel.github.io
Comp chem Ph.D @ Zhang lab NYU
ai/ml molecules and proteins | allostery | stem outreach | knicks and brooklyn
Biologist who navigates the oceans of diversity through space-time
Protein evolution, metagenomics, AI/ML/DL
Website https://miangoaren.github.io/
Scientist working on computational genomics, looking into mutational heterogeneity and somatic evolution. Maths, Stats, ML, Bioinfo. Staff scientist, BBGLab, IRB Barcelona.
Postdoc @ FCEN UBA w/ @diegulise.bsky.social
& moving soon to @crg.eu w/ M. Dias and @jonnyfrazer.bsky.social
> Biological Physics | Proteins | Comp Bio | ML
https://scholar.google.com/citations?user=n55NtEsAAAAJ&hl=en
Husband & dad / Director of SafeSpot Overdose Hotline/ Paramedic from This American Life Ep809 The Call / Adjunct Assistant Clinical Professor Boston University School of Public Health / VIEWS MINE!
http://stephen-murray.com
Trying to figure out the tumor microenvironment from single cell epigenomics as the world ends. Instructor at Icahn school of medicine.
Protein dynamic, Multi conformation, Language model, Computational biology | Postdoc @Columbia | PhD 2023 & Bachelor 2020 @PKU1898
http://chaohou.netlify.app
Postdoc at the Linderstrøm-Lang Center for Protein Science