
Michael Oberst

@moberst.bsky.social

Assistant Prof. of CS at Johns Hopkins · Visiting Scientist at Abridge AI · Causality & Machine Learning in Healthcare · Prev: PhD at MIT, Postdoc at CMU

1,301 Followers  |  202 Following  |  31 Posts  |  Joined: 04.10.2023

Posts by Michael Oberst (@moberst.bsky.social)

I hesitate to give actual feedback lol, but maybe make the pause button a fixed size? When I pause and then try to go back, the pause button becomes larger to accommodate more text, and then I accidentally hit it while pressing where the β€œgo back” button used to be.

02.01.2026 12:38 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I think it’s fair to say there are also good examples of both goals (measuring β€œtrue” vs β€œrelative” performance) being pursued in ML, especially when the benchmark is ostensibly tied to a β€œreal” application, and meant to demonstrate some real-world utility.

28.12.2025 19:09 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
AI for radiographic COVID-19 detection selects shortcuts over signal - Nature Machine Intelligence The urgency of the developing COVID-19 epidemic has led to a large number of novel diagnostic approaches, many of which use machine learning. DeGrave and colleagues use explainable AI techniques to an...

Another example: Excitement around diagnosing COVID from chest X-rays, very easy on public datasets, but much harder in practice. In public datasets, β€œpositives” came from a certain set of hospitals, and β€œnegatives” from others, so many shortcuts existed. See eg www.nature.com/articles/s42...

28.12.2025 19:09 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
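The shortcut failure mode described above can be sketched with a toy simulation (all numbers hypothetical, not taken from the paper): when "positive" scans come from one set of hospitals and "negatives" from another, a classifier that only picks up site-specific artifacts looks accurate on the confounded public split but collapses at deployment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical "public dataset": COVID-positive scans come from hospital B
# and negatives from hospital A, so site identity is almost perfectly
# predictive of the label.
hospital = rng.integers(0, 2, n)        # 0 = hospital A, 1 = hospital B
covid = hospital.copy()                 # label confounded with site
covid[rng.random(n) < 0.05] ^= 1        # 5% label noise

# A "shortcut" classifier that only detects site-specific artifacts
# (laterality markers, preprocessing style), i.e., predicts from `hospital`.
pred = hospital
acc_confounded = (pred == covid).mean()  # looks great on the public split

# Deployment: both classes now come from both hospitals, so the shortcut
# carries no information about the label.
hospital_deploy = rng.integers(0, 2, n)
covid_deploy = rng.integers(0, 2, n)
acc_deploy = (hospital_deploy == covid_deploy).mean()

print(f"confounded test accuracy: {acc_confounded:.2f}")  # ≈ 0.95
print(f"deployment accuracy:      {acc_deploy:.2f}")      # ≈ 0.50
```

The point is not the specific numbers, but that accuracy on the confounded split says nothing about the deployed setting when the spurious feature and the label are decoupled.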

More broadly, I loved the post, thanks for writing and sharing it! My critique is more in the β€œminor revisions” category than anything else :)

28.12.2025 18:43 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The implicit hope being that relative gains on a benchmark (the β€œlocal” evaluation, as you put it) translate into relative gains on β€œreal” tasks. Of course, that’s not always how it works out, as you point out.

28.12.2025 18:43 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

I think that’s a fair point! FWIW I loved the β€œbuild-and-test” vs β€œdescribe and defend” distinction, and as you put it, β€œwhich approach performs better under the same evaluation?” is often the build-and-test question, where you care about relative performance, not so much absolute accuracy.

28.12.2025 18:43 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

It’s non-obvious to me that β€œcorrigenda for years of ImageNet papers” was required, given the finding that the essential conclusions (does model A improve over model B) were shown to hold up!

27.12.2025 11:58 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

So this part of the piece feels off:

> Consider ImageNet: when Recht et al. (2019) built fresh test sets and found nontrivial accuracy drops…the typical response was not to issue corrigenda for years of ImageNet papers. Instead, the field continued to iterate on the next yardstick

27.12.2025 11:58 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Enjoyed the post and have encouraged folks to read it, but IMO it misrepresents @beenwrekt.bsky.social’s β€œDo ImageNet Classifiers Generalize to ImageNet?” The surprising finding of that paper wasn’t the (absolute) accuracy drop, but the fact that ranking of models was essentially unchanged.

27.12.2025 11:58 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 1
Antibiotic Resistance Microbiology Dataset Mass General Brigham (ARMD-MGB) v1.0.0 ARMD-MGB contains detailed microbiology and clinical metadata for >225,000 patients and >970,000 cultures collected over 10 years

Today I am very proud to announce the release of the Antibiotic Resistance Microbiology Dataset - Mass General Brigham (ARMD-MGB; physionet.org/content/armd...), as part of an NIH-funded collaboration led by Jonathan Chen at Stanford. (1/6)

05.12.2025 07:39 β€” πŸ‘ 30    πŸ” 8    πŸ’¬ 1    πŸ“Œ 0
Data Science and AI Institute announces 22 new faculty - Johns Hopkins Data Science and AI Institute The Johns Hopkins Data Science and AI Institute welcomes 22 new faculty members. These newly appointed faculty members join more than 150 Data Science and AI Institute faculty members across…

More broadly, JHU is an increasingly exciting place to do research in AI and ML, with huge investments in faculty, students, and compute. Just last year we hired 22 (!) new faculty in Data Science and AI! ai.jhu.edu/news/data-sc...

04.12.2025 16:03 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

For more information about me and my group, see my website, which also has information on applying to the CS PhD program (www.michaelkoberst.com)

04.12.2025 16:03 β€” πŸ‘ 0    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Photo of Johns Hopkins University

Come join my group at Johns Hopkins!

I'm recruiting CS PhD students for Fall'26 (deadline: Dec 15) who are interested in safe/reliable AI in healthcare. See my website (link in reply) for more info.

I'm also headed to #NeurIPS, and happy to chat with prospective students!

04.12.2025 16:03 β€” πŸ‘ 2    πŸ” 2    πŸ’¬ 1    πŸ“Œ 0

For more details, see the paper / poster!

And if you're at UAI, check out the talk and poster today! Jacob (not on social media) and I are around at UAI, so reach out if you're interested in chatting more!

Paper: arxiv.org/abs/2502.09467
Poster: www.michaelkoberst.com/assets/paper...

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

These findings are also relevant for the design of new trials!

For instance, deploying *multiple models* in a trial has two benefits: (1) it allows us to construct tighter bounds for new models, and (2) it allows us to test whether these assumptions hold in practice.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We make some other mild assumptions, which can be falsified using existing RCT data. For instance, if two models have the *same* output on a given patient, then we assume outcomes are at least as good under the model with higher performance.

23.07.2025 14:09 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
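One way an assumption like this can be checked (a toy sketch of the general idea, not the paper's actual procedure; all variable names and numbers are hypothetical): in a trial that randomized patients between two models, restrict to patients on whom both models produce the same output, and compare mean outcomes across arms. A significantly worse outcome under the higher-performing model would falsify the "at least as good" assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000

# Toy RCT: patients randomized to model 1 or model 2
# (model 2 is the hypothetical higher-performing model).
arm = rng.integers(1, 3, n)

# Each model's binary output on each patient; they agree on ~70% of patients.
out1 = rng.integers(0, 2, n)
agree = rng.random(n) < 0.7
out2 = np.where(agree, out1, 1 - out1)
shown = np.where(arm == 1, out1, out2)

# Simulated outcomes: here model 2 is genuinely no worse when outputs agree,
# so the check should pass.
y = 0.5 + 0.1 * shown + 0.05 * (arm == 2) + rng.normal(0, 0.1, n)

# Falsification check: among patients where the models agree, compare mean
# outcomes across arms within each shared output level.
diffs = []
for a in (0, 1):
    mask = agree & (out1 == a)
    diff = y[mask & (arm == 2)].mean() - y[mask & (arm == 1)].mean()
    diffs.append(diff)
    print(f"shared output={a}: mean(arm 2) - mean(arm 1) = {diff:+.3f}")
# A significantly negative diff would falsify the assumption that outcomes
# are at least as good under the higher-performing model.
```

Because arm assignment is randomized, this within-subgroup contrast is a valid comparison; in practice one would pair it with a significance test rather than eyeballing the sign.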

To capture these challenges, we assume that model impact is mediated by both the output of the model (A), and the performance characteristics (M).

This formalism allows us to start reasoning about the impact of new models with different outputs and performance characteristics.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The second challenge is trust: Impact depends on the actions of human decision-makers, and those decision-makers may treat two models differently based on their performance characteristics (e.g., if a model produces a lot of false alarms, clinicians may ignore the outputs).

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We tackle two non-standard challenges that arise in this setting, *coverage* and *trust*.

The first challenge is coverage: If the new model is very different from previous models, it may produce outputs (for specific types of inputs) that were never observed in the trial.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
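The paper's actual construction is more involved, but the coverage issue can be illustrated with a simple worst/best-case (Manski-style) bound (a hedged toy sketch; the data, output levels, and outcome range here are all hypothetical): for outputs observed in the trial we plug in the trial estimate, and for outputs the trial never produced we substitute the extremes of the bounded outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy trial data: the deployed model produced outputs a ∈ {0, 1}, and we
# observed a bounded outcome y ∈ [0, 1] for each patient.
a_trial = rng.integers(0, 2, 1000)
y_trial = np.where(a_trial == 1,
                   rng.random(1000) * 0.5 + 0.5,   # better outcomes when a=1
                   rng.random(1000) * 0.5)

# Mean outcome for each output level observed in the trial.
mu = {a: y_trial[a_trial == a].mean() for a in (0, 1)}

# A *new* model produces a third output level (a = 2) that never appeared
# in the trial — the "coverage" problem.
a_new = rng.integers(0, 3, 1000)

# Worst/best-case bounds on the mean outcome under the new model:
# covered outputs use the trial estimate; uncovered outputs take the
# extremes of the outcome range [0, 1].
lo = np.mean([mu.get(a, 0.0) for a in a_new])   # uncovered → worst case 0
hi = np.mean([mu.get(a, 1.0) for a in a_new])   # uncovered → best case 1

print(f"bounds on mean outcome under new model: [{lo:.2f}, {hi:.2f}]")
```

Note the width of the bound equals the fraction of patients on whom the new model is uncovered, which is why observing more diverse model outputs in the trial tightens the bounds.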

We develop a method for placing bounds on the impact of a *new* ML model, by re-using data from an RCT that did not include the model.

These bounds require some mild assumptions, but those assumptions can be tested in practice using RCT data that includes multiple models.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Randomized trials (RCTs) help evaluate if deploying AI/ML systems actually improves outcomes (e.g., survival rates in a healthcare context).

But AI/ML systems can change: Do we need a new RCT every time we update the model? Not necessarily, as we show in our UAI paper! arxiv.org/abs/2502.09467

23.07.2025 14:09 β€” πŸ‘ 5    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0

Hard to have a graded quiz, but still useful as an ungraded β€œself-assessment” (which I’ve seen) to set expectations for what kind of prereqs are expected. In some courses, you might expect those who would be scared off to drop the course later in any case, esp if drop deadline is pretty late.

30.12.2024 14:05 β€” πŸ‘ 7    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

From skimming the paper it seems more like the takeaway is: β€œif you binarize, you are estimating *something* that has a specific causal interpretation but it’s a weird thing (diff of two very specific treatment policies) you might not actually care about except in some special cases”

26.12.2024 13:39 β€” πŸ‘ 9    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I’d nominate @monicaagrawal.bsky.social

12.12.2024 13:13 β€” πŸ‘ 6    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

@matt-levine.bsky.social has a great explanation in his Money Stuff newsletter (which I also highly recommend in general)

12.12.2024 07:55 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

In this conversation I have been endorsed as "twee" and "not a crank".

BTW, I'm on the job market this year. If you are interested in hiring an economist in macro/metrics/computational/ML with such stellar endorsements, please get in touch!

09.11.2024 15:12 β€” πŸ‘ 32    πŸ” 6    πŸ’¬ 0    πŸ“Œ 1

An example of some recent work (my first last-author paper!) on rigorous re-evaluation of popular approaches to adapt LLMs and VLMs to the medical domain
bsky.app/profile/zach...

27.11.2024 16:03 β€” πŸ‘ 7    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Joining the Group: Computer Science, Statistics, Causality, and Healthcare

Application link: www.cs.jhu.edu/academic-pro...

More information: www.michaelkoberst.com/joining

27.11.2024 15:58 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Photo of Johns Hopkins Campus

I'm recruiting PhD students for Fall 2025! CS PhD Deadline: Dec. 15th.

I work on safe/reliable ML and causal inference, motivated by healthcare applications.

Beyond myself, Johns Hopkins has a rich community of folks doing similar work. Come join us!

27.11.2024 15:58 β€” πŸ‘ 19    πŸ” 7    πŸ’¬ 1    πŸ“Œ 0
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pret...

Medically adapted foundation models (think Med-*) turn out to be more hot air than hot stuff. Correcting for fatal flaws in evaluation, the current crop are no better on balance than generic foundation models, even on the very tasks for which benefits are claimed.
arxiv.org/abs/2411.04118

26.11.2024 18:12 β€” πŸ‘ 259    πŸ” 57    πŸ’¬ 8    πŸ“Œ 13