Peter Koo

@pkoo562.bsky.social

AI4Science researcher. Associate Professor @CSHL. My lab advances AI for genomics and healthcare! http://koo-lab.github.io

3,117 Followers  |  1,266 Following  |  112 Posts  |  Joined: 04.12.2023  |  2.3708

Latest posts by pkoo562.bsky.social on Bluesky

Congratulations to John Clarke, Michel Devoret and John Martinis on receiving the 2025 Nobel Prize in Physics!
www.nobelprize.org/prizes/physi...

I have fond memories of my time in the Clarke lab, where I did my Honors Thesis on ultra low-field MRI w/ SQUIDs as an undergrad at UC Berkeley!

07.10.2025 14:16 - 👍 5    🔁 0    💬 0    📌 0

Check out a Research Highlight on our work in @naturemethods by Lin Tang!

www.nature.com/articles/s41...

19.09.2025 16:36 - 👍 7    🔁 0    💬 0    📌 0

Richard Bonneau giving the last keynote on navigating the complexity of drug discovery and their lab-in-the-loop for molecule design! #MLCB

11.09.2025 17:40 - 👍 2    🔁 0    💬 0    📌 0

First talk: a (surprise) keynote by Jacob Schreiber from UMass Medical on fruit-themed AI tools for understanding and designing regulatory DNA

11.09.2025 13:44 - 👍 3    🔁 0    💬 0    📌 0

2025 MLCB day 2 is starting now!

Streaming live now!
m.youtube.com/watch?v=PxlXNb...

11.09.2025 13:42 - 👍 3    🔁 0    💬 1    📌 0

Now Barbara Engelhardt giving a keynote on characterizing behaviors of modified T cells in live cell imaging data using machine learning!

10.09.2025 17:58 - 👍 4    🔁 0    💬 0    📌 0

Next talk by Courtney Shearer, who is talking about genomic language models for zero-shot promoter indel effects!

10.09.2025 15:15 - 👍 1    🔁 0    💬 1    📌 0

Next talk by Alan Murphy and Masayuki (Moon) Nagai (from my lab!), who are talking about how naive fine-tuning of genomic DNNs leads to catastrophic forgetting and who propose *iterative causal refinement* to move learned associations toward a causal understanding of cis-regulatory biology!

10.09.2025 14:53 - 👍 0    🔁 0    💬 1    📌 0

Next talk by Johannes Linder at Calico. Talking about expanding genomic seq2fun DNNs with RBP binding and RNA processing data to consider post-transcriptional regulation.

10.09.2025 14:38 - 👍 1    🔁 0    💬 1    📌 0

Some technical delays but we are all set!

First talk by Alexis Battle! @alexisbattle.bsky.social

10.09.2025 13:52 - 👍 5    🔁 0    💬 1    📌 0
Machine Learning in Computational Biology 2025 YouTube video by Machine Learning in Computational Biology

Here is the YouTube live link:

www.youtube.com/live/19I7xTh...

Starts at 9:30a!

10.09.2025 13:05 - 👍 4    🔁 0    💬 0    📌 0
MLCB - Schedule The in-person component will be held at the New York Genome Center, 101 6th Ave, New York, NY 10013. All times below are Eastern Time.

2025 Machine Learning in Computational Biology (#MLCB) meeting starts TODAY (9/10) at 9:30a (ET) at the NY Genome Center in NYC!

We have a great lineup of keynotes, contributed talks, and posters today and tomorrow

Schedule: mlcb.org/schedule

Join for free via livestream: m.youtube.com/@mlcbconf

10.09.2025 11:42 - 👍 13    🔁 7    💬 1    📌 3

Here's another unpublished result:

We compared probing strategies to assess how informative the pretrained representations are, benchmarking Evo 2 vs D3 on Drosophila enhancer activity measured via STARR-seq.

Again, D3 outperforms Evo 2 (and one-hot) across all probing methods!

16.07.2025 12:17 - 👍 2    🔁 0    💬 0    📌 0

But, when we trained D3 (score-entropy discrete diffusion for regulatory DNA) in an unsupervised manner on the genomic sequences, probing the representations of D3 was comparable to supervised SOTA (even with a basic CNN)! (100M parameters vs 40B parameters)

16.07.2025 12:17 - 👍 2    🔁 0    💬 1    📌 0

*Easter egg alert* NOT in the published paper. We also benchmarked Evo 2, and while it did better than other gLMs (consistent with the idea that scale can improve gLMs), it still falls short of a basic CNN trained using one-hot sequences and far short of supervised SOTA

16.07.2025 12:16 - 👍 26    🔁 5    💬 1    📌 0

Also, my perspective is coming from gLMs applied to human genomes. I think they have a lot of potential for small compact genomes that don't have as layered regulation as higher-order eukaryotes.

16.07.2025 12:15 - 👍 2    🔁 0    💬 0    📌 0

gLMs show promise in learning structure in the genome, but we need to rethink how we either tokenize the genome (and no, byte pair encoding isn't the answer either) or come up with a better masking strategy for the non-coding genome that is different from other regions (e.g., coding).

16.07.2025 12:15 - 👍 3    🔁 0    💬 1    📌 0

Tokenizing nucleotides/k-mers and treating each token equally is like injecting lots of random words between every word in a sentence and hoping that an LLM will learn the structure of the English language.

16.07.2025 12:14 - 👍 1    🔁 0    💬 1    📌 0

It's unclear whether standard NLP-based objectives (MLM or CLM) will bring us to the promised land.

Unlike proteins, which have conservation at the sequence and covariation levels, the non-coding genome is conserved at the functional level -- lots of drift and uninformative positions!

16.07.2025 12:14 - 👍 2    🔁 0    💬 1    📌 0

There are many great applications for gLMs -- I'm not just a hater. The central dogma (or whatever is being sold) is not one of them.

In terms of non-coding genome regulation (outside of splice sites) in humans, there is a huge uphill battle.

16.07.2025 12:13 - 👍 2    🔁 0    💬 1    📌 0

The constant propagation of pointless gLM benchmarks in the ML field (that are disconnected from how biologists will use them) is what is giving gLMs unwarranted hype. The field must rally around useful applications of gLMs.

16.07.2025 12:13 - 👍 2    🔁 0    💬 1    📌 0

Our benchmark is far from complete! It shows how current gLMs struggle in zero-shot capabilities for cell-type specific regulation. Think about all the differential regulation across cell types being projected onto a single genome -- this is hard to learn w/o functional data!

16.07.2025 12:13 - 👍 1    🔁 0    💬 0    📌 0

This went through 3 rounds of review at another journal, but 1 reviewer was adamant that this type of benchmark might be harmful to the burgeoning gLM field, which currently only benchmarks relative performance on (nearly) useless benchmarks in the non-coding regions. It was rejected!

16.07.2025 12:12 - 👍 1    🔁 0    💬 2    📌 0
Evaluating the representational power of pre-trained DNA language models for regulatory genomics - Genome Biology Background The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of ...

Our work on "Evaluating the representational power of pre-trained DNA language models for regulatory genomics" led by @AmberZqt with help from @NiraliSomia & @stevenyuyy is finally published in Genome Biology! Check it out!

genomebiology.biomedcentral.com/articles/10....

16.07.2025 12:12 - 👍 11    🔁 4    💬 2    📌 2

One thing that really bothers me with the new "virtual cell" terminology is that it is currently largely focused on a very narrow definition of models that can predict effects of trans perturbations (gene dosage, drugs etc) on gene expression. 1/

28.06.2025 10:38 - 👍 105    🔁 30    💬 1    📌 0

Excited to launch our AlphaGenome API goo.gle/3ZPUeFX along with the preprint goo.gle/45AkUyc describing and evaluating our latest DNA sequence model powering the API. Looking forward to seeing how scientists use it! @googledeepmind

25.06.2025 14:29 - 👍 219    🔁 82    💬 5    📌 10

This is a really exciting leap forward for genomic sequence-to-activity gene regulation models. It is a genuine improvement over pretty much all SOTA models spanning a wide range of regulatory, transcriptional and post-transcriptional processes. 1/

25.06.2025 16:18 - 👍 72    🔁 20    💬 2    📌 2

Congrats @avsecz.bsky.social! Looking forward to exploring what it has learned! 🧬

25.06.2025 17:41 - 👍 5    🔁 0    💬 0    📌 0

@pkoo562.bsky.social Peter Koo at #AIxBio

23.06.2025 11:07 - 👍 10    🔁 3    💬 1    📌 0

@pkoo562 is following 19 prominent accounts