David B. Blumenthal

@dbblumenthal.bsky.social

Professor for Biomedical Network Science at FAU Erlangen-Nürnberg (https://bionets.tf.fau.de). Opinions are my own.

52 Followers  |  56 Following  |  2 Posts  |  Joined: 16.12.2023

Latest posts by dbblumenthal.bsky.social on Bluesky

This week I presented DataSAIL at the #ISMBECCB2025 conference in #Liverpool.
It has been an amazing experience to meet many people working on information leakage, and to get great ideas for extending DataSAIL.

Nicely supervised by & in collaboration with @dbblumenthal.bsky.social @ok55991.bsky.social

24.07.2025 16:03 — 👍 5    🔁 4    💬 0    📌 0

Very happy to see DataSAIL published in @naturecomms.bsky.social. Give it a try if you want to test whether your ML models generalize to OOD scenarios. Great collaboration between @uni-saarland.de and @fau.de :-)

09.04.2025 08:27 — 👍 2    🔁 1    💬 0    📌 0
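For context, the core idea behind leakage-aware splits like DataSAIL's can be sketched in a few lines: cluster similar entities first, then assign whole clusters to one side of the split so near-duplicates never straddle train and test. The sketch below is illustrative only; cluster_split and the cluster assignments are made up, and DataSAIL itself formulates splitting as a constrained optimization problem rather than this greedy heuristic.

```python
from collections import defaultdict

def cluster_split(cluster_of, train_frac=0.8):
    """Assign whole clusters to train or test so that similar items
    (same cluster) never end up on both sides of the split."""
    clusters = defaultdict(list)
    for item, cid in cluster_of.items():
        clusters[cid].append(item)
    train, test = [], []
    # Greedily fill the training set with the largest clusters first.
    for members in sorted(clusters.values(), key=len, reverse=True):
        target = train if len(train) < train_frac * len(cluster_of) else test
        target.extend(members)
    return train, test

# Made-up cluster IDs, e.g. from MMseqs2 clustering of protein sequences.
train, test = cluster_split({"P1": 0, "P2": 0, "P3": 1, "P4": 2}, train_frac=0.75)
print(train, test)  # e.g. ['P1', 'P2', 'P3'] ['P4']
```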
A graphic depicting people sitting at desks and working on computers in the foreground. In the background, there are more people standing around in an abstract space. The space itself looks futuristic.

Are you ready to challenge yourself and compete with the brightest minds in AI and computer science? 🧠 Join the FAU AI Innovation Challenge 2025! Compete for 10.000€ in prizes in categories ranging from game AI to cybersecurity and more.

👉 Learn more: go.fau.de/1beg-

27.01.2025 12:34 — 👍 4    🔁 2    💬 0    📌 0

Deep learning models for sequence-based PPI prediction still fail to yield reliable predictions in challenging scenarios. Great work led by the amazing Timo Reim and @judith-bernett.bsky.social

27.01.2025 11:50 — 👍 3    🔁 0    💬 0    📌 0
Graphical summary of the analyses done in the publication displayed on six panels a-f. (a) We computed ESM-2 embeddings of different sizes for the proteins of our data-leakage-free PPI dataset. The per-token embeddings have variable sizes depending on the protein length, while the per-protein embeddings have a fixed size by applying dimension-wise averaging. (b) We tested two models operating on the per-protein embeddings: a baseline random forest classifier and adaptations of the previously published Richoux model. Five models operated on the per-token embeddings: a 2d-baseline, the 2d-Selfattention and 2d-Crossattention models (which expanded the 2d-baseline through a Transformer encoder), and adaptations of the published models D-SCRIPT and TUnA. (c) Hyperparameter tuning gave us insight into the influence of each tunable parameter on the classification performance. (d) No model surpassed an accuracy of 0.65. The more advanced models had similar accuracies, leading us to believe that the information content of the ESM-2 embedding has more influence than the model architecture. Per-token models did not consistently outperform per-protein models. (e) We applied various modifications to test their influence: different embedding sizes, inserting a Transformer encoder into different positions, adding spectral normalization after the linear layers, self- vs. cross-attention, and removing the padding. (f) Finally, we compared the implicitly predicted distance maps of the 2d-baseline, 2d-Selfattention, 2d-Crossattention, and D-SCRIPT-ESM-2 to real distance maps computed from PDB structures.

🧬🖥️ Proud to share our latest update on PPI predictions – "Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65" doi.org/10.1101/2025... by T. Reim, published with @itisalist.bsky.social @dbblumenthal.bsky.social, A. Hartebrodt, and me. What did we do? 1/15 🧵

27.01.2025 10:58 — 👍 10    🔁 3    💬 1    📌 2
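The per-protein embeddings mentioned in the figure are obtained by dimension-wise averaging of per-token ESM-2 embeddings. A minimal sketch with the public fair-esm package; the model size and example sequence here are arbitrary choices for illustration, not necessarily what the paper used:

```python
import torch
import esm  # pip install fair-esm

# Load a small ESM-2 model (the paper compared several sizes).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])
per_token = out["representations"][6]  # shape: (batch, seq_len + 2, dim)

# Dimension-wise averaging over the real residues (skipping the BOS/EOS
# tokens) yields a fixed-size per-protein embedding, as in the figure.
seq_len = len(data[0][1])
per_protein = per_token[0, 1 : seq_len + 1].mean(dim=0)  # shape: (dim,)
print(per_protein.shape)  # torch.Size([320]) for the 8M model
```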

Incredibly happy to finally see our manuscript "Emergence of power-law distributions in protein-protein interaction networks through study bias" published in @elife.bsky.social. doi.org/10.7554/eLif... It's been a long but fun journey with @dbblumenthal.bsky.social and @martinschaefer.bsky.social

14.12.2024 19:47 — 👍 16    🔁 9    💬 1    📌 1
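As a rough illustration of the kind of question the paper asks, here is how one might test a network's degree distribution for power-law behavior with the powerlaw package. The synthetic graph and this simple test are stand-ins, not the paper's actual analysis pipeline:

```python
import networkx as nx
import powerlaw  # pip install powerlaw

G = nx.barabasi_albert_graph(5000, 3)  # stand-in for a PPI network
degrees = [d for _, d in G.degree() if d > 0]

# Fit a discrete power law to the degree sequence.
fit = powerlaw.Fit(degrees, discrete=True)
print(f"alpha = {fit.power_law.alpha:.2f}, xmin = {fit.power_law.xmin}")

# Likelihood-ratio test: positive R favors the power law over an exponential.
R, p = fit.distribution_compare("power_law", "exponential")
print(f"R = {R:.2f}, p = {p:.3f}")
```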
Overview figure summarizing the major analyses of the paper. In the setting employed by published methods, performances are strongly inflated regardless of model and dataset. When the positive training dataset is randomly rewired, there is only a slight decrease in performance. When the models are evaluated on unseen proteins that are dissimilar to the training proteins, performance becomes random.

🧬💻Transferring my paper posts here, starting off with "Cracking the black box of deep sequence-based protein-protein interaction prediction" doi.org/10.1093/bib/..., which I published together with @itisalist.bsky.social and @dbblumenthal.bsky.social. So what was it about? 1/13 🧵

28.11.2024 09:10 — 👍 14    🔁 4    💬 3    📌 0
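The rewiring control described in the figure can be approximated with a degree-preserving edge swap. A hypothetical sketch below: the random graph is a stand-in for the positive PPI network, and the paper's exact rewiring protocol may differ:

```python
import networkx as nx

G = nx.gnm_random_graph(200, 600, seed=42)  # stand-in positive PPI network
G_rewired = G.copy()

# double_edge_swap shuffles interaction partners while keeping every
# protein's degree intact.
nx.double_edge_swap(G_rewired, nswap=5 * G.number_of_edges(), max_tries=10**5)

assert sorted(d for _, d in G.degree()) == sorted(d for _, d in G_rewired.degree())
# If a model still scores well on the rewired positives, it is likely
# exploiting node degree (study bias) rather than true interaction signal.
```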
Overview figure of the seven guiding questions addressed in the manuscript regarding (1) Study bias, (2) Label distribution, (3) Data shift, (4) Inter-sample similarities, (5) Unavailable features, (6) Features as illegitimate surrogates, (7) Pretraining.

🧬🖥️ Transferring my papers, Pt. 2: "Guiding questions to avoid data leakage in biological machine learning applications" doi.org/10.1038/s41592-024-02362-y, which I published with @itisalist.bsky.social @dbblumenthal.bsky.social @romanjoeres.bsky.social, D. Grimm, F. Haselbeck, and O.V. Kalinina. 🧵1/20

02.12.2024 11:53 — 👍 6    🔁 2    💬 1    📌 0

Reminder: #computational_biology position available in my team:

- large-scale #proteomics #metabolomics data analysis
- #rstats , #python, #machinelearning
- great living in the ❤️ of the 🇮🇹 Alps ⛰️

👉 info and apply: bit.ly/3wztFsN

01.03.2024 07:53 — 👍 2    🔁 2    💬 0    📌 1
