Great to see our paper presenting recall, a framework that calibrates clustering to account for the impact of data "double-dipping" in single-cell studies, out today in AJHG! Congratulations, @alandenadel.bsky.social and co-authors!
12.03.2025 19:18
Thank you to all my collaborators for their contributions and thoughtful feedback.
Madeline Hughes
Akshaya Thoutam
@anaygupta.bsky.social
Andrew Navia
@nfusi.bsky.social
Srivatsan Raghavan
Peter Winter
@avapamini.bsky.social
@lcrawford.bsky.social
I welcome any comments!
18.12.2024 18:48
Our results highlight the need for a more nuanced approach, balancing dataset size and diversity with careful attention to model architectures and model benchmarking.
18.12.2024 18:48
Our findings underscore the importance of prioritizing data quality and content over sheer size. Developers of scFMs and large databases should prioritize these factors rather than simply scaling up, which we have shown is unlikely to meaningfully improve performance.
18.12.2024 18:48
While neural scaling laws observed in other domains suggest that increasing dataset size leads to better performance, our findings show that, past a learning saturation point, simply increasing pre-training dataset size doesn't necessarily improve performance on downstream tasks.
18.12.2024 18:48
These results suggest that further scaling up of pre-training datasets from tens of millions of cells to hundreds of millions or even billions of cells without appropriately modeling the non-sequential nature of single-cell data may not yield tangible returns.
18.12.2024 18:48
Our work addresses a critical consideration in training large-scale models: the size and diversity of the pre-training corpus.
18.12.2024 18:48
(B) Heatmap visualizing the learning saturation point for the clonal hematopoiesis, intestine-on-chip, periodontitis, and placental infection datasets for each of scVI, SSL, and Geneformer, across each downsampling strategy and when evaluated in the zero-shot regime. Each sub-panel corresponds to the model architecture, the x-axis corresponds to the dataset evaluated, and the y-axis corresponds to the downsampling strategy used to pre-train each model.
(C) Heatmap visualizing the learning saturation point for the clonal hematopoiesis, intestine-on-chip, periodontitis, and placental infection datasets for each of scVI, SSL, and Geneformer, across each downsampling strategy and when evaluated in the fine-tuning regime. Each sub-panel corresponds to the model architecture, the x-axis corresponds to the dataset evaluated, and the y-axis corresponds to the downsampling strategy used to pre-train each model.
The learning saturation points were always 25% or less when evaluating the models on zero-shot classification and were always 10% or less when evaluating the models on fine-tuned classification. We also observed similar results for zero-shot batch integration.
18.12.2024 18:48
Schematic of analysis to find the learning saturation point. For each family of models (i.e., a downsampling strategy paired with a model) a "saturation threshold" of 95% of the maximum performance was computed, and the minimum pre-training dataset size that produced a model surpassing that threshold was identified. This dataset size was denoted the "learning saturation point" and is considered the point at which model performance saturated as a function of pre-training dataset size.
To assess the extent to which this plateauing generalized across datasets and tasks, we identified the "learning saturation point" for each model. This is the minimum pre-training dataset size for which a model surpassed 95% of the maximum performance observed.
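As a concrete illustration, here is a minimal Python sketch of this computation, assuming performance scores have already been collected at each pre-training dataset size; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def learning_saturation_point(sizes, scores, threshold=0.95):
    """Return the smallest pre-training dataset size whose score reaches
    `threshold` (default 95%) of the maximum observed score.

    sizes  : pre-training dataset sizes (e.g., fractions of the full corpus)
    scores : downstream performance at each size (e.g., micro F1)
    """
    sizes = np.asarray(sizes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    cutoff = threshold * scores.max()      # the "saturation threshold"
    qualifying = sizes[scores >= cutoff]   # all sizes meeting the threshold
    return qualifying.min()                # the learning saturation point

# Toy example: performance plateaus early as dataset size grows.
sizes  = [0.01, 0.10, 0.25, 0.50, 0.75, 1.00]
scores = [0.62, 0.81, 0.83, 0.84, 0.84, 0.85]
print(learning_saturation_point(sizes, scores))  # -> 0.1
```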
18.12.2024 18:48
Figure 2. Zero-shot and fine-tuned performance on classifying cells from a clonal hematopoiesis dataset plateaus at a small fraction of the total data available for pre-training.
(A) Line plots showing zero-shot classification performance for each model's embeddings, as evaluated by the micro F1 Score. For each model, the different colors correspond to the downsampling strategy used to generate the data used for pre-training. The dotted line shows the performance of using the highly variable genes as an embedding; the dashed line shows the performance of using principal component projections as an embedding.
(B) Line plots showing classification performance for each model after fine-tuning, as evaluated by the micro F1 Score. For each model, the different colors correspond to the downsampling strategy used to generate the data used for pre-training. The dotted line shows the performance of training a regularized logistic regression classifier using the highly variable genes as input features.
Model performance at cell type classification (both zero-shot and fine-tuned) tended to plateau at a small fraction of the total pre-training dataset size on a clonal hematopoiesis evaluation dataset, regardless of pre-training dataset diversity.
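For readers unfamiliar with the setup, below is a minimal scikit-learn sketch of this kind of evaluation, using toy stand-ins for the frozen embeddings; the kNN zero-shot protocol and all inputs here are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for frozen scFM embeddings and cell type labels.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 32))    # embedding matrix (cells x latent dims)
y = rng.integers(0, 5, size=1000)  # cell type labels

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, random_state=0)

# Zero-shot-style evaluation: a kNN classifier on frozen embeddings
# (one common protocol; assumed here, not taken from the paper).
knn = KNeighborsClassifier(n_neighbors=15).fit(Z_tr, y_tr)
print("kNN micro F1:", f1_score(y_te, knn.predict(Z_te), average="micro"))

# Baseline analogous to the figure's dotted line: regularized logistic
# regression on input features (here Z stands in for the HVG matrix).
clf = LogisticRegression(C=1.0, max_iter=1000).fit(Z_tr, y_tr)
print("LogReg micro F1:", f1_score(y_te, clf.predict(Z_te), average="micro"))
```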
18.12.2024 18:48
Supplemental Figure 1. Diversity of datasets used for pre-training as evaluated by intrinsic and extrinsic metrics. The Shannon index, Gini-Simpson index, and Vendi Score are shown for each of the downsampled pre-training datasets. Cell type re-weighting and geometric sketching show increased diversity relative to the randomly downsampled datasets. Cell type re-weighting (which re-weights based on cell type metadata) has the highest Shannon index and Gini-Simpson index (which both measure the diversity of cell type metadata). Geometric sketching (which samples evenly across transcriptional space) has the highest Vendi Score (which measures the diversity of the transcriptional data directly).
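A minimal sketch of these three diversity metrics, assuming cell type labels and an expression matrix as inputs; the cosine kernel used for the Vendi Score is one common choice and an assumption here, not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def shannon_index(labels):
    """Shannon diversity H = -sum_i p_i log p_i over cell type proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def gini_simpson_index(labels):
    """Gini-Simpson diversity 1 - sum_i p_i^2 over cell type proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def vendi_score(X):
    """Vendi Score: exp(entropy of eigenvalues of K/n), where K is a
    similarity kernel over the data (cosine kernel assumed here)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = Xn @ Xn.T
    lam = np.linalg.eigvalsh(K / K.shape[0])
    lam = lam[lam > 1e-12]                  # drop numerical noise
    return float(np.exp(-np.sum(lam * np.log(lam))))

labels = ["T cell"] * 800 + ["B cell"] * 150 + ["NK cell"] * 50
print(shannon_index(labels), gini_simpson_index(labels))
print(vendi_score(rng.normal(size=(200, 50))))  # toy expression matrix
```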
The three downsampling schemes were (1) random downsampling, (2) cell type re-weighting, and (3) geometric sketching. Scheme (1) preserves the diversity of the full corpus, while (2) and (3) increase diversity relative to it. Datasets were generated at 1%, 10%, 25%, 50%, and 75% of the total.
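A minimal Python sketch of the first two schemes, with geometric sketching delegated to the geosketch package; the index-based sampling and all names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_downsample(n_cells, frac):
    """(1) Random downsampling: a uniform sample, preserving the
    cell type composition of the full corpus in expectation."""
    return rng.choice(n_cells, size=int(frac * n_cells), replace=False)

def celltype_reweighted_downsample(cell_types, frac):
    """(2) Cell type re-weighting: sample cells inversely to their
    cell type frequency, flattening the label distribution and
    increasing metadata diversity."""
    cell_types = np.asarray(cell_types)
    _, inverse, counts = np.unique(
        cell_types, return_inverse=True, return_counts=True
    )
    weights = 1.0 / counts[inverse]
    weights /= weights.sum()
    k = int(frac * len(cell_types))
    return rng.choice(len(cell_types), size=k, replace=False, p=weights)

# (3) Geometric sketching samples evenly across transcriptional space;
# the geosketch package provides this (PCA input assumed here):
#   from geosketch import gs
#   idx = gs(X_pca, int(frac * n_cells), replace=False)
```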
18.12.2024 18:48
We assessed three model architectures pre-trained to perform as single-cell foundation models (scFMs) in the context of single-cell RNA-seq: scVI, SSL, and Geneformer. We pre-trained these models on subsets of the scTab corpus using three different downsampling schemes.
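As a rough illustration of one such pre-training run, here is a minimal scvi-tools sketch for the scVI architecture; the file path, batch key, and hyperparameters are placeholder assumptions, not the paper's configuration.

```python
import scanpy as sc
import scvi

# Load a downsampled scTab subset (path is a placeholder); raw counts
# are assumed in adata.X, as scVI models count data directly.
adata = sc.read_h5ad("sctab_subset.h5ad")
sc.pp.highly_variable_genes(
    adata, n_top_genes=2000, flavor="seurat_v3", subset=True
)

# Register the AnnData and pre-train; batch_key is an assumed column.
scvi.model.SCVI.setup_anndata(adata, batch_key="assay")
model = scvi.model.SCVI(adata, n_latent=32)
model.train(max_epochs=50)

# Frozen embeddings for downstream (zero-shot) evaluation.
Z = model.get_latent_representation()
```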
18.12.2024 18:48
Figure 1. Strategy to assess the effects of pre-training dataset size and diversity on scFM performance. (A) Schematic of the downsampling approaches, sizes of downsampled pre-training datasets, and data splitting strategy. (B) An example of how evaluation performance might a priori be expected to behave as a function of pre-training dataset size and diversity.
Current methods in the field are trained on atlases ranging from 1 to 100 million cells. In our newest preprint, we show that these same approaches tend to plateau in performance with pre-training datasets that are only a fraction of that size.
18.12.2024 18:48