Will this recipe work for other organisms? We think it depends on genome size and the proportion of nucleotides under selection, which drive the value of the self-supervised stage and the training data scale. An exciting question for future work!
13.10.2025 23:06
This was a massive effort, driven by the incredible work of Calico intern Kuan-Hao Chao (@kuanhaochao.bsky.social). Huge thanks to him, Majed Mohamed Magzoub, and Johannes Linder!
13.10.2025 23:06
My take: While MPRAs are powerful, they lose vital genomic context like local chromatin and post-transcriptional regulation. For modeling complex gene regulation in vivo, models trained on endogenous sequences are essential.
13.10.2025 23:06
Each wins on its “home field”:
* MPRA-trained models excel at predicting MPRA data, including variant sequences.
* Shorkie excels at predicting expression from promoters in their natural genomic context and eQTLs.
13.10.2025 23:06
How does Shorkie compare to models trained on massively parallel reporter assays (MPRAs)?
13.10.2025 23:06
This translates to variant effect prediction, where Shorkie accurately predicts the impact of cis-eQTLs and outperforms alternative models at classifying influential regulatory variants.
13.10.2025 23:06
Shorkie also captures dynamic regulatory changes. Using new time-course RNA-seq data from TF inductions, we showed Shorkie can track how the importance of specific TF motifs changes over time.
13.10.2025 23:06
This pre-training strategy makes a huge difference. Shorkie substantially outperforms the same model trained from scratch, boosting gene-level expression prediction from a Pearson's R of 0.74 to 0.88.
13.10.2025 23:06
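For anyone who wants the metric spelled out: Pearson's R measures the linear correlation between predicted and measured expression across genes. Below is a minimal sketch in plain Python. The helper name `pearson_r` and the toy values are assumptions for illustration, not code or data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up per-gene values (NOT from the paper).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r = pearson_r(measured, predicted)
```

A value near 1 means the predictions rank and scale genes almost linearly with the measurements.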
But which genomes work best? We trained on different phylogenetic levels, from close S. cerevisiae strains to the fungal kingdom. The Saccharomycetales order was the sweet spot, providing the right balance of diversity and conserved regulatory grammar for the model to learn from.
13.10.2025 23:06
Our hypothesis: jumpstart supervised learning with self-supervision. Before predicting chromatin and expression, we first asked our model to predict masked-out nucleotides across many related genomes, so it learns conserved elements like genes and their promoters.
13.10.2025 23:06
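To make the masked-nucleotide idea concrete, here is a minimal sketch of how one training example could be prepared: hide a random subset of positions and keep the original bases as prediction targets. The mask symbol, the 15% mask rate, and the function name `mask_nucleotides` are all assumptions for illustration; the thread does not specify how Shorkie encodes or samples masks.

```python
import random

MASK = "N"  # hypothetical mask symbol, not taken from the Shorkie code


def mask_nucleotides(seq, mask_rate=0.15, seed=0):
    """Return (masked sequence, {position: original base}) for a DNA string.

    mask_rate is an assumed value chosen for illustration only.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    # Sample distinct positions to hide from the model.
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the base the model should recover
        masked[pos] = MASK
    return "".join(masked), targets


masked_seq, targets = mask_nucleotides("ACGTACGTACGTACGTACGT")
```

A model would then be trained to predict each entry of `targets` from `masked_seq`, which rewards learning sequence patterns that are conserved across related genomes.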
However, yeast's small genome provides limited data, making it tough for deep learning models to learn complex regulatory rules from scratch.
13.10.2025 23:06
At Calico, we've been studying S. cerevisiae for years to understand replicative aging. Along the way, we've generated rich datasets to probe its regulatory networks, which helped make this work possible.
13.10.2025 23:06
AI in Molecular Biology | Keystone Symposia
Join us at the Keystone Symposia on AI in Molecular Biology, September 2025, in Santa Fe, with field leaders!
The poster abstract deadline for the @keystonesymposia.bsky.social AI in Molecular Biology meeting in Santa Fe is coming up on August 21st, so get your submissions in!
www.keystonesymposia.org/conferences/...
04.08.2025 11:50
borzoi-paper/extensions/prime at main · calico/borzoi-paper
Analyses related to the Borzoi paper.
We've done some experiments, but the metrics aren't conclusive, so choose your own adventure! We've released these models open source, open weight for all to use. github.com/calico/borzo...
23.07.2025 16:22
We hypothesized that training with cell-type-specific and 3' data might make these models particularly effective for transfer to datasets with similar properties.
23.07.2025 16:22
Hence the name Borzoi Prime, to emphasize their 3' expertise!
23.07.2025 16:22
Indeed, he discovered the new models better predict alternative polyadenylation and QTL variants that affect where transcripts get cleaved and polyadenylated. This key regulatory layer influences cell type-specific protein production.
23.07.2025 16:22
Drawing on his expertise and interest in isoform regulation, Johannes hypothesized that single-cell RNA-seq's 3' sequencing protocols might reveal additional capabilities in these models.
23.07.2025 16:22
Using single-cell eQTL studies, he evaluated the cell-type-specific variant effect predictions and found good concordance.
23.07.2025 16:22
As cell-type-specific applications emerged, Johannes Linder took a fresh look.
23.07.2025 16:22
We trained these models in early 2023 (which is why they're algorithmically similar to the originals), but initial metrics were underwhelming, so we shelved them.
23.07.2025 16:22
Side note: want your amazing data included in future training runs of open source, open weight models? Make and release BigWig tracks!
23.07.2025 16:22
We curated several cell atlas collections to produce pseudobulk coverage tracks. Thank you to the CZI Tabula projects and the BICCN Brain Cell Atlas for making this possible!
23.07.2025 16:22
A limitation of the first Borzoi training run was the absence of cell-type-specific RNA-seq tracks; most are heterogeneous bulk samples.
23.07.2025 16:22
Alongside the manuscript and analysis, we released Borzoi predictions for 19.5 million common and low-frequency UK Biobank variants. Code for scoring additional variants with Borzoi is available here: github.com/calico/baske...
21.07.2025 14:50
Moving forward, we suspect there are further improvements available. The Borzoi predictions cover most body tissues, but they aren't yet zoomed into specific cell types. Alternative nonlinear heritability models may usurp S-LDSC for fitting variant priors.
21.07.2025 14:50
Generally, we found that Borzoi predictions improve fine-mapping clarity and gene prioritization. We're using Sniff to better analyze aging-related trait GWAS at Calico.
21.07.2025 14:50