Apache Spark is one of the most frustrating pieces of software I’ve ever used. Why does something tailored for big data fail so easily when we try to scale up to … big data?!
06.02.2026 19:56 — 👍 0 🔁 0 💬 0 📌 0
@phydev.bsky.social
AI skeptic. South American. https://phydev.github.io
A typical university strategist:
"Our unique selling point is that we are ranked 345 in the world. Nobody else can say that!"
«The AI group averaged 50% on the quiz, compared to 67% in the hand-coding group. The largest gap in scores was on debugging, suggesting that the ability to understand when code is incorrect and why it fails may be a particular area of concern.»
www.anthropic.com/research/AI-...
I see, that makes sense. I’ll play a little and see what I can get from it. Thanks for the recommendations. 😊
30.01.2026 19:47 — 👍 1 🔁 0 💬 1 📌 0
How do you know that the recommendations are reasonable? I’m planning to run a half marathon, but I’m not an experienced runner, so I don’t know if I can trust a training plan from ChatGPT.
29.01.2026 22:38 — 👍 0 🔁 0 💬 1 📌 0
Interesting. Can you give an example?
29.01.2026 20:46 — 👍 0 🔁 0 💬 1 📌 0
Transformer-based LLMs are the most significant technology of the past decade. This is the first in a series of posts for the WHERE MACHINES THINK Substack, exploring Transformers/LLMs at various levels of abstraction, digging deeper with each post. wheremachinesthink.substack.com/p/a-primer-o...
25.01.2026 15:15 — 👍 8 🔁 2 💬 0 📌 0
Take a look at: HTTPS://ocbe-uio.github.io/trajpy
25.01.2026 09:57 — 👍 0 🔁 0 💬 0 📌 0
Just released a new version of trajpy with a new user interface. The previous GUI was built with tkinter, which has a rather dated look. This version got a full revamp as a web-based application built with NiceGUI.
To improve code maintainability, I moved the frontend to its own repository.
This is a great package. Is the AI guide a thing now in R packages?
24.01.2026 11:12 — 👍 0 🔁 0 💬 1 📌 0
This semester I taught Spatial Data Science with #rstats
Students analyzed areal, geostatistical & point pattern data, creating fantastic projects on disease mapping 🗺️ air pollution 🏭 crime 🚨 & species modeling 🐾
Book freely available:
👉 paulamoraga.com/book-spatial/
My 2025 highlights for AI research and code:
▪ Unpacking the AI scale narrative
▪ Tabular-learning research
- TabICL: table foundation model
- Retrieve merge predict: data lakes
▪ Better software
- Skrub: machine learning with tables
- Fundamentals in scikit-learn
gael-varoquaux.info/science/2025...
Nice picture! The best view of Paris, because it keeps the ugly Montparnasse building out. 😅
30.12.2025 22:21 — 👍 1 🔁 0 💬 0 📌 0
Experimental evidence that students are more likely to contest grades when they are delivered by an evaluator with a female-sounding name.
"These findings suggest that women in evaluative positions face disproportionate resistance when delivering negative assessments."
Our guidance regarding performance measures for medical AI models is finally out!
- Stop bashing AUROC, although it does not settle things
- Calibration and clinical utility are key
- Show risk distributions
- Classification statistics (e.g. F1) are improper
www.thelancet.com/journals/lan...
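To make the AUROC and calibration points above concrete, here is a minimal Python sketch. It is not taken from the linked guidance: the toy labels, the toy risks, and the helper names `auroc` and `brier_score` are all assumptions for illustration. It shows that a rank-preserving distortion of the predicted risks leaves AUROC untouched, while the Brier score, which is sensitive to miscalibration, gets worse.

```python
def auroc(y_true, y_prob):
    """Chance that a random positive gets a higher risk than a random negative (ties count 1/2)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (hypothetical): outcomes and predicted risks from some model.
y = [0, 0, 0, 1, 0, 1, 1, 1]
risk = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# A strictly increasing transform: the ranking is identical, but every risk is inflated.
inflated = [p ** 0.25 for p in risk]

print("AUROC original:", auroc(y, risk))
print("AUROC inflated:", auroc(y, inflated))
print("Brier original:", brier_score(y, risk))
print("Brier inflated:", brier_score(y, inflated))
```

Both sets of predictions rank patients identically, so AUROC cannot tell them apart, yet only the first set gives risks that could sensibly feed a clinical decision.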
Our didactic review on machine learning for causal inference, now open access:
• identifiability (theory of when the data can answer a causal question)
• machine-learning estimators
• study design (asking well-framed questions + loopholes, e.g. with timewise data)
www.annualreviews.org/content/jour...
🖊️AI for health: the impossible necessity of unbiased data
Is unbiased data important to build health AI? Yes!
Can there be unbiased data? No!
Building health AI on biased data discriminates
The notion of bias depends on the intended use:
gael-varoquaux.info/science/ai-f...
Side note: I attended a seminar this week about a new method called Adversarial Random Forest, which made me excited. The group that develops the method cares about statistical consistency and they have a paper under review on applying this generative method to imputation.
arxiv.org/abs/2205.09435
What could go wrong when we use random-forest-based imputation methods for classical inference?
With a simple simulation study, we show how random forest imputation can have catastrophic effects on classical inference with respect to bias and spurious correlations.
phydev.github.io/posts/ranger...
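As a rough illustration of how single imputation can distort classical inference, here is a minimal Python sketch. It is not the simulation from the linked post and it does not use random forests: it swaps in deterministic regression imputation as a simpler stand-in, and the sample size, true correlation, and missingness rate are assumptions. Because every imputed value sits exactly on the fitted line, the correlation in the completed data comes out larger than in the observed data.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return slope, my - slope * mx


rng = random.Random(0)      # fixed seed so the run is reproducible
n, rho = 5000, 0.3          # assumed sample size and true correlation

x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]

# About half of y goes missing completely at random.
missing = [rng.random() < 0.5 for _ in range(n)]
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Impute each missing y with its fitted value, adding no residual noise.
slope, intercept = fit_line(x_obs, y_obs)
y_completed = [slope * xi + intercept if m else yi for xi, yi, m in zip(x, y, missing)]

r_observed = pearson(x_obs, y_obs)
r_completed = pearson(x, y_completed)
print(f"observed cases only: r = {r_observed:.3f}")
print(f"after imputation:    r = {r_completed:.3f}")
```

Any test or confidence interval computed on the completed data as if it were fully observed would inherit this inflated association.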
Based on this #MICCAI2024 paper, we are currently preparing a new submission with a Bayesian approach to investigate the probability of false claims in medical imaging AI papers. The results are shocking… stay tuned⏰
Great collaboration with @gaelvaroquaux.bsky.social and O. Colliot
I wrote a short tutorial on how to run DeepSeek and other models locally with Ollama and Open WebUI: phydev.github.io/posts/deepse...
30.01.2025 10:44 — 👍 2 🔁 0 💬 1 📌 0
Wrangling string columns for machine learning: the new StringEncoder in @skrub-data.bsky.social gives such a good compute/prediction performance tradeoff.
It's mostly just a bunch of simple tricks, but with well-chosen defaults. This is what we aim for in skrub
skrub-data.org/stable/refer...
Hot off the press! 📣📣 In this tutorial we illustrate available multiple imputation approaches for handling longitudinal data, including when they are clustered within higher level clusters. A reproducible example with R and Stata code is provided! #OpenAccess
onlinelibrary.wiley.com/doi/10.1002/...
Happy to share the first paper of my PhD is published☺️!
If you use class imbalance corrections, you may find it interesting. Let me know what you think!
onlinelibrary.wiley.com/doi/10.1002/...
Many thanks to @maartenvsmeden.bsky.social, @benvancalster.bsky.social, Anne, Kim and Carl!!
I’ve been living in Norway for 4.5 years and I’m still in love with this place. Yesterday it snowed all day long, and this morning the sky is crystal clear with a beautiful yellow moon 🌙, in contrast with the white snow that paints everything. I wish I had my camera with me, a recurring thought here.
24.01.2025 06:49 — 👍 2 🔁 0 💬 0 📌 0
Great compilation! I’m so grateful that I found your and Riley’s research years ago. I hope this becomes common knowledge among data scientists outside academia/biostats too. Merry Christmas and a happy New Year! 😊
23.12.2024 10:55 — 👍 2 🔁 0 💬 1 📌 0
Let us start 2025 in a positive mood: here are 10 methods things researchers can worry *less* about in 2025
23.12.2024 10:36 — 👍 261 🔁 119 💬 15 📌 18
A key question to consider before submitting your paper on the development and validation of your new clinical prediction model is:
WHERE IS THE MODEL????
Thanks for the clarification!
14.11.2023 16:05 — 👍 1 🔁 0 💬 0 📌 0
Hi Richard. I also recall reading somewhere that your method was intended to be used with up to 30 predictor candidates. Maybe it was mentioned in an early version of pmsampsize? Recently I searched again to be sure and couldn't find this restriction anymore.
14.11.2023 10:38 — 👍 0 🔁 0 💬 1 📌 0