Right. We have done something similar in our previous work (cinema-ot), where we validated causal inferences using synthetic data for which we know the ground truth.
19.04.2025 14:00 · @vandijklab.bsky.social
Right, and I do believe this is possible based on other experiments we have done where we translate between biological language and natural language. Your proposed experiment may be more specific, and I'm interested in trying it.
19.04.2025 13:58
Zero-shot is possible but obviously much harder, and it also very much depends on the specific system.
19.04.2025 13:54
We have focused on fine-tuning on one immune cell cytokine stimulation dataset and on (bulk) L1000. In both cases we show generalization by leaving out conditions (e.g., cytokine combinations).
19.04.2025 13:53
And the reasoning here is that if they improve, then that shows that our model generates meaningful data? That's interesting. It's a convenient way of validating without doing experiments, I guess.
19.04.2025 13:50
I see. We haven't done this specific experiment where we compare well-studied vs. poorly studied genes. It's an interesting idea; we will look into it. I would expect that genes/cell types/tissues with a lot of training data, both expression and metadata, generalize better.
19.04.2025 13:40
Yes. We showed that natural language pretraining, versus training on cell sentences from scratch, significantly boosts performance.
In addition, in the spatial reasoning task (Fig. 6) we did an ablation where we trained with and without metadata. Training with metadata performed significantly better.
Finally, asking a model to generate a "cell sentence" (e.g. for perturbation response prediction) is novel by design, since no LLM has encountered that representation in its training data.
18.04.2025 17:32
Second, several test sets, such as Dataset Interpretation on held-out studies, use scRNA-seq datasets published after each model's pretraining cutoff, giving us strong assurance that those examples weren't seen during training.
18.04.2025 17:32
We took several steps to ensure robust evaluation. First, we tested both open- and closed-source LLMs (GPT-4o, Gemini, LLaMA-3) on our benchmarks and found they perform poorly out of the box, indicating minimal overlap with pretraining corpora.
18.04.2025 17:32
For this paper, we chose a prompt structure that helps the model learn perturbations effectively, but initial tests suggest the model handles prompt variations well as long as the data formatting is consistent, so we don't expect prompt engineering to be a major issue.
18.04.2025 17:19
We'll formally test prompt robustness in future work, but from experience with earlier Cell2Sentence models, we've found minimal performance loss when using new or varied prompts. In general, we always train on a wide variety of prompts to avoid overfitting.
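As a rough illustration of "varied prompt wording over a consistent cell-sentence format" (the templates below are made up for illustration, not the actual C2S-Scale training prompts):

```python
# Illustrative sketch: vary the natural-language instruction while keeping the
# cell-sentence data format fixed. These templates are hypothetical.
import random

TEMPLATES = [
    "Cell sentence: {cell}\nPredict the cell sentence after {perturbation}.",
    "Here is a cell: {cell}. What does it look like after {perturbation}?",
    "Apply {perturbation} to this cell and generate the result: {cell}",
]

def build_prompt(ranked_genes, perturbation):
    """Join ranked genes into a space-separated cell sentence (consistent data
    format) and wrap it in a randomly chosen instruction template (varied wording)."""
    cell_sentence = " ".join(ranked_genes)
    return random.choice(TEMPLATES).format(cell=cell_sentence, perturbation=perturbation)

print(build_prompt(["MALAT1", "CD3D", "IL7R", "CCR7"], "IFN-gamma stimulation"))
```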
18.04.2025 17:19
Thank you!
18.04.2025 17:13
This is still an open challenge: we don't yet have confidence estimation built into the model, but we do evaluate C2S-Scale in out-of-distribution regimes. For example:
- In perturbation prediction, we test on unseen cell type-drug combinations and combinatorial perturbations.
- For dataset interpretation, we evaluate on scRNA-seq studies published after the model was pretrained.
Performance drops in these settings let us estimate generalization gaps, but we're also interested in developing confidence measures in future work.
So performance likely reflects both mechanistic pattern recognition and domain transfer from literature and metadata. Our training corpus was intentionally multimodal to support this integration, letting the model ground textual knowledge in expression-level representations.
18.04.2025 17:10
Great question, it might be a combination of both. For tasks like scQA, the model must (i) interpret gene expression patterns from cell sentences (e.g., identify marker genes or activation signatures), and (ii) relate those to biological concepts learned from the textual domain.
18.04.2025 17:10
Many downstream tasks (e.g., scQA) require the model to reason jointly over cell sentences and biological text/metadata. We also explored this in our spatial reasoning ablation studies, where interleaving training with gene interaction data improved accuracy over training with expression alone.
18.04.2025 17:09
C2S-Scale interleaves gene expression (as "cell sentences") with biological text during training to enable reasoning across both modalities. This multimodal integration is a key difference from expression-only models and is important for complex tasks.
18.04.2025 17:09
We thank our amazing team at Yale, Google Research, and Google DeepMind.
18.04.2025 14:13
Dive into the details:
Preprint: biorxiv.org/content/10.1...
Google AI Blog: research.google/blog/teachin...
Code/Models: huggingface.co/collections/... github.com/vandijklab/c...
What's next for C2S-Scale?
• True Multimodality: integrating proteomics, epigenomics, and imaging data
• Deeper Biology: modeling cell interactions, dynamics, and development
• Enhanced Trust: improving interpretability and reliability
• Community Tools: building shared benchmarks and platforms
Let's build together! We're open-sourcing C2S-Scale to empower the community.
Models up to 1B parameters are already available on HF, and models up to 27B parameters will be released in the next few weeks!
huggingface.co/collections/... github.com/vandijklab/c...
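A minimal way to try a released checkpoint with Hugging Face transformers (the model ID below is a placeholder; substitute an actual model from the linked collection):

```python
# Minimal sketch, assuming a causal-LM checkpoint from the linked HF collection.
# "vandijklab/C2S-Scale-example" is a placeholder ID, not a real repository name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vandijklab/C2S-Scale-example"  # placeholder; pick a model from the collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A cell sentence is a space-separated list of gene names, highest expression first.
prompt = "Cell sentence: MALAT1 B2M TMSB4X ACTB CD3D IL7R. What cell type is this?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```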
Beyond standard training, we used Reinforcement Learning (RL) to fine-tune C2S-Scale.
Using GRPO + biological rewards, we specifically improved:
• Perturbation prediction accuracy
• Biological Q&A relevance
Aligning LLMs with biological goals!
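As a rough sketch of what a biological reward for GRPO-style fine-tuning can look like (this overlap-based reward is an illustration, not necessarily the exact reward used for C2S-Scale):

```python
# Illustrative "biological reward" for perturbation response prediction: score a
# generated cell sentence by how well its top-ranked genes match the measured
# post-perturbation ranking. Hypothetical example, not the paper's exact reward.
def perturbation_reward(generated_sentence: str, reference_sentence: str, k: int = 100) -> float:
    """Fraction of the top-k reference genes recovered in the top-k of the
    generated cell sentence (0.0 = no overlap, 1.0 = perfect recovery)."""
    gen_top = set(generated_sentence.split()[:k])
    ref_top = reference_sentence.split()[:k]
    if not ref_top:
        return 0.0
    return len(gen_top & set(ref_top)) / len(ref_top)

# A scalar reward like this can be plugged into a GRPO trainer, which samples a
# group of completions per prompt and reinforces the higher-reward ones.
print(perturbation_reward("MALAT1 ACTB CD3D IL7R", "MALAT1 CD3D ACTB CCR7", k=4))
```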
Size matters! We observed clear scaling laws: as model size increased from 410M to 27B parameters, performance consistently improved across tasks.
This confirms that LLMs learn better biological representations at scale using the C2S approach. It even works with efficient LoRA tuning!
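For context, parameter-efficient LoRA tuning of a causal LM generally looks like this sketch with the PEFT library (the base model ID, target modules, and hyperparameters below are generic placeholders, not C2S-Scale's settings):

```python
# Generic LoRA setup with PEFT; all values here are placeholders for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("base-model-placeholder")  # placeholder ID
lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```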
And it works! C2S-Scale achieves SOTA performance, surpassing specialized single-cell models AND general LLMs:
• Cell type annotation
• Predicting perturbation responses
• Generating dataset summaries from cells
• Inferring spatial relationships
• Answering complex biological questions
To truly "teach" biology to LLMs, we built a massive corpus: over 1 billion tokens!
This wasn't just cell sentences; it included:
• 50M+ cell profiles (human/mouse)
• Annotations & Metadata
• Biological Text (abstracts, etc.)
Result? One model, many tasks!
We enable LLMs to "read" biology via Cell2Sentence (C2S): ranking genes by expression turns each cell's profile into text.
This lets us leverage massive pre-trained models, unifying transcriptomic data with biological text (annotations, papers) for richer understanding.
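To make the transformation concrete, here is a minimal sketch of the ranking step (gene names and expression values are made up; normalization and truncation details of the full pipeline are omitted):

```python
# Minimal sketch of the Cell2Sentence idea: sort one cell's genes by expression
# (highest first) and keep the gene names as a space-separated "cell sentence".
import numpy as np

genes = np.array(["CD3D", "MALAT1", "IL7R", "ACTB", "CCR7"])
counts = np.array([12.0, 85.0, 7.0, 40.0, 3.0])   # expression values for one cell

order = np.argsort(-counts)                        # indices in descending expression order
cell_sentence = " ".join(genes[order])
print(cell_sentence)                               # MALAT1 ACTB CD3D IL7R CCR7
```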
What if LLMs could "read" & "write" biology?
Introducing C2S-Scale, a Yale and Google collab: we scaled LLMs (up to 27B!) to analyze & generate single-cell data.
Blog: research.google/blog/teachin...
Preprint: biorxiv.org/content/10.1...
Huge thanks to the team: Zhikai Wu, Shiyang Zhang, Sizhuang He, Sifan Wang, Min Zhu, Anran Jiao, Lu Lu! Let us know what you think! #OperatorLearning #LLM #AI4Science
13.02.2025 19:23