
Michael Oberst

@moberst.bsky.social

Assistant Prof. of CS at Johns Hopkins · Visiting Scientist at Abridge AI · Causality & Machine Learning in Healthcare · Prev: PhD at MIT, Postdoc at CMU

1,301 Followers  |  202 Following  |  31 Posts  |  Joined: 04.10.2023

Posts by Michael Oberst (@moberst.bsky.social)

I hesitate to give actual feedback lol, but maybe make the pause button a fixed size? When I pause and then try to go back, the pause button becomes larger to accommodate more text, and then I accidentally hit it while pressing where the β€œgo back” button used to be.

02.01.2026 12:38 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I think it’s fair to say there are also good examples of both goals (measuring β€œtrue” vs β€œrelative” performance) being pursued in ML, especially when the benchmark is ostensibly tied to a β€œreal” application, and meant to demonstrate some real-world utility.

28.12.2025 19:09 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
AI for radiographic COVID-19 detection selects shortcuts over signal - Nature Machine Intelligence The urgency of the developing COVID-19 epidemic has led to a large number of novel diagnostic approaches, many of which use machine learning. DeGrave and colleagues use explainable AI techniques to an...

Another example: Excitement around diagnosing COVID from chest X-rays, very easy on public datasets, but much harder in practice. In public datasets, β€œpositives” came from a certain set of hospitals, and β€œnegatives” from others, so many shortcuts existed. See eg www.nature.com/articles/s42...

28.12.2025 19:09 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
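The shortcut failure mode described above can be sketched with a toy simulation (all numbers hypothetical, not taken from the paper): when "positive" scans come from one set of hospitals and "negatives" from another, a classifier that only picks up site-specific artifacts looks accurate on the confounded public split but collapses at deployment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical "public dataset": COVID-positive scans come from hospital B
# and negatives from hospital A, so site identity is almost perfectly
# predictive of the label.
hospital = rng.integers(0, 2, n)        # 0 = hospital A, 1 = hospital B
covid = hospital.copy()                 # label confounded with site
covid[rng.random(n) < 0.05] ^= 1        # 5% label noise

# A "shortcut" classifier that only detects site-specific artifacts
# (laterality markers, preprocessing style), i.e., predicts from `hospital`.
pred = hospital
acc_confounded = (pred == covid).mean()  # looks great on the public split

# Deployment: both classes now come from both hospitals, so the shortcut
# carries no information about the label.
hospital_deploy = rng.integers(0, 2, n)
covid_deploy = rng.integers(0, 2, n)
acc_deploy = (hospital_deploy == covid_deploy).mean()

print(f"confounded test accuracy: {acc_confounded:.2f}")  # ≈ 0.95
print(f"deployment accuracy:      {acc_deploy:.2f}")      # ≈ 0.50
```

The point is not the specific numbers, but that accuracy on the confounded split says nothing about the deployed setting when the spurious feature and the label are decoupled.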

More broadly, I loved the post, thanks for writing and sharing it! My critique is more in the β€œminor revisions” category than anything else :)

28.12.2025 18:43 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The implicit hope being that relative gains on a benchmark (the β€œlocal” evaluation, as you put it) translate into relative gains on β€œreal” tasks. Of course, that’s not always how it works out, as you point out.

28.12.2025 18:43 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

I think that’s a fair point! FWIW I loved the β€œbuild-and-test” vs β€œdescribe and defend” distinction, and as you put it, β€œwhich approach performs better under the same evaluation?” is often the build-and-test question, where you care about relative performance, not so much absolute accuracy.

28.12.2025 18:43 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

It’s non-obvious to me that β€œcorrigenda for years of ImageNet papers” was required, given the finding that the essential conclusions (does model A improve over model B) were shown to hold up!

27.12.2025 11:58 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

So this part of the piece feels off:

> Consider ImageNet: when Recht et al. (2019) built fresh test sets and found nontrivial accuracy drops…the typical response was not to issue corrigenda for years of ImageNet papers. Instead, the field continued to iterate on the next yardstick

27.12.2025 11:58 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Enjoyed the post and have encouraged folks to read it, but IMO it misrepresents @beenwrekt.bsky.social’s β€œDo ImageNet Classifiers Generalize to ImageNet?” The surprising finding of that paper wasn’t the (absolute) accuracy drop, but the fact that ranking of models was essentially unchanged.

27.12.2025 11:58 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 1
Antibiotic Resistance Microbiology Dataset Mass General Brigham (ARMD-MGB) v1.0.0 ARMD-MGB contains detailed microbiology and clinical metadata for >225,000 patients and >970,000 cultures collected over 10 years

Today I am very proud to announce the release of the Antibiotic Resistance Microbiology Dataset - Mass General Brigham (ARMD-MGB; physionet.org/content/armd...), as part of an NIH-funded collaboration led by Jonathan Chen at Stanford. (1/6)

05.12.2025 07:39 β€” πŸ‘ 30    πŸ” 8    πŸ’¬ 1    πŸ“Œ 0
Data Science and AI Institute announces 22 new faculty - Johns Hopkins Data Science and AI Institute The Johns Hopkins Data Science and AI Institute welcomes 22 new faculty members. These newly appointed faculty members join more than 150 Data Science and AI Institute faculty members across…

More broadly, JHU is an increasingly exciting place to do research in AI and ML, with huge investments in faculty, students, and compute. Just last year we hired 22 (!) new faculty in Data Science and AI! ai.jhu.edu/news/data-sc...

04.12.2025 16:03 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

For more information about me and my group, see my website, which also has information on applying to the CS PhD program (www.michaelkoberst.com)

04.12.2025 16:03 β€” πŸ‘ 0    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Photo of Johns Hopkins University

Come join my group at Johns Hopkins!

I'm recruiting CS PhD students for Fall'26 (deadline: Dec 15) who are interested in safe/reliable AI in healthcare. See my website (link in reply) for more info.

I'm also headed to #NeurIPS, and happy to chat with prospective students!

04.12.2025 16:03 β€” πŸ‘ 2    πŸ” 2    πŸ’¬ 1    πŸ“Œ 0

For more details, see the paper / poster!

And if you're at UAI, check out the talk and poster today! Jacob (not on social media) and I are around at UAI, so reach out if you're interested in chatting more!

Paper: arxiv.org/abs/2502.09467
Poster: www.michaelkoberst.com/assets/paper...

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

These findings are also relevant for the design of new trials!

For instance, deploying *multiple models* in a trial has two benefits: (1) it allows us to construct tighter bounds for new models, and (2) it allows us to test whether these assumptions hold in practice.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We make some other mild assumptions, which can be falsified using existing RCT data. For instance, if two models have the *same* output on a given patient, then we assume outcomes are at least as good under the model with higher performance.

23.07.2025 14:09 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
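One way an assumption like this can be checked (a toy sketch of the general idea, not the paper's actual procedure; all variable names and numbers are hypothetical): in a trial that randomized patients between two models, restrict to patients on whom both models produce the same output, and compare mean outcomes across arms. A significantly worse outcome under the higher-performing model would falsify the "at least as good" assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000

# Toy RCT: patients randomized to model 1 or model 2
# (model 2 is the hypothetical higher-performing model).
arm = rng.integers(1, 3, n)

# Each model's binary output on each patient; they agree on ~70% of patients.
out1 = rng.integers(0, 2, n)
agree = rng.random(n) < 0.7
out2 = np.where(agree, out1, 1 - out1)
shown = np.where(arm == 1, out1, out2)

# Simulated outcomes: here model 2 is genuinely no worse when outputs agree,
# so the check should pass.
y = 0.5 + 0.1 * shown + 0.05 * (arm == 2) + rng.normal(0, 0.1, n)

# Falsification check: among patients where the models agree, compare mean
# outcomes across arms within each shared output level.
diffs = []
for a in (0, 1):
    mask = agree & (out1 == a)
    diff = y[mask & (arm == 2)].mean() - y[mask & (arm == 1)].mean()
    diffs.append(diff)
    print(f"shared output={a}: mean(arm 2) - mean(arm 1) = {diff:+.3f}")
# A significantly negative diff would falsify the assumption that outcomes
# are at least as good under the higher-performing model.
```

Because arm assignment is randomized, this within-subgroup contrast is a valid comparison; in practice one would pair it with a significance test rather than eyeballing the sign.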

To capture these challenges, we assume that model impact is mediated by both the output of the model (A), and the performance characteristics (M).

This formalism allows us to start reasoning about the impact of new models with different outputs and performance characteristics.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The second challenge is trust: Impact depends on the actions of human decision-makers, and those decision-makers may treat two models differently based on their performance characteristics (e.g., if a model produces a lot of false alarms, clinicians may ignore the outputs).

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We tackle two non-standard challenges that arise in this setting, *coverage* and *trust*.

The first challenge is coverage: If the new model is very different from previous models, it may produce outputs (for specific types of inputs) that were never observed in the trial.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
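The paper's actual construction is more involved, but the coverage issue can be illustrated with a simple worst/best-case (Manski-style) bound (a hedged toy sketch; the data, output levels, and outcome range here are all hypothetical): for outputs observed in the trial we plug in the trial estimate, and for outputs the trial never produced we substitute the extremes of the bounded outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy trial data: the deployed model produced outputs a ∈ {0, 1}, and we
# observed a bounded outcome y ∈ [0, 1] for each patient.
a_trial = rng.integers(0, 2, 1000)
y_trial = np.where(a_trial == 1,
                   rng.random(1000) * 0.5 + 0.5,   # better outcomes when a=1
                   rng.random(1000) * 0.5)

# Mean outcome for each output level observed in the trial.
mu = {a: y_trial[a_trial == a].mean() for a in (0, 1)}

# A *new* model produces a third output level (a = 2) that never appeared
# in the trial — the "coverage" problem.
a_new = rng.integers(0, 3, 1000)

# Worst/best-case bounds on the mean outcome under the new model:
# covered outputs use the trial estimate; uncovered outputs take the
# extremes of the outcome range [0, 1].
lo = np.mean([mu.get(a, 0.0) for a in a_new])   # uncovered → worst case 0
hi = np.mean([mu.get(a, 1.0) for a in a_new])   # uncovered → best case 1

print(f"bounds on mean outcome under new model: [{lo:.2f}, {hi:.2f}]")
```

Note the width of the bound equals the fraction of patients on whom the new model is uncovered, which is why observing more diverse model outputs in the trial tightens the bounds.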

We develop a method for placing bounds on the impact of a *new* ML model, by re-using data from an RCT that did not include the model.

These bounds require some mild assumptions, but those assumptions can be tested in practice using RCT data that includes multiple models.

23.07.2025 14:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Randomized trials (RCTs) help evaluate if deploying AI/ML systems actually improves outcomes (e.g., survival rates in a healthcare context).

But AI/ML systems can change: Do we need a new RCT every time we update the model? Not necessarily, as we show in our UAI paper! arxiv.org/abs/2502.09467

23.07.2025 14:09 β€” πŸ‘ 5    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0

Hard to have a graded quiz, but still useful as an ungraded β€œself-assessment” (which I’ve seen) to set expectations for what kind of prereqs are expected. In some courses, you might expect those who would be scared off to drop the course later in any case, esp if drop deadline is pretty late.

30.12.2024 14:05 β€” πŸ‘ 7    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

From skimming the paper it seems more like the takeaway is: β€œif you binarize, you are estimating *something* that has a specific causal interpretation but it’s a weird thing (diff of two very specific treatment policies) you might not actually care about except in some special cases”

26.12.2024 13:39 β€” πŸ‘ 9    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I’d nominate @monicaagrawal.bsky.social

12.12.2024 13:13 β€” πŸ‘ 6    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

@matt-levine.bsky.social has a great explanation in his Money Stuff newsletter (which I also highly recommend in general)

12.12.2024 07:55 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

In this conversation I have been endorsed as "twee" and "not a crank".

BTW, I'm on the job market this year. If you are interested in hiring an economist in macro/metrics/computational/ML with such stellar endorsements, please get in touch!

09.11.2024 15:12 β€” πŸ‘ 32    πŸ” 6    πŸ’¬ 0    πŸ“Œ 1

An example of some recent work (my first last-author paper!) on rigorous re-evaluation of popular approaches to adapt LLMs and VLMs to the medical domain
bsky.app/profile/zach...

27.11.2024 16:03 β€” πŸ‘ 7    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Joining the Group: Computer Science, Statistics, Causality, and Healthcare

Application link: www.cs.jhu.edu/academic-pro...

More information: www.michaelkoberst.com/joining

27.11.2024 15:58 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Photo of Johns Hopkins Campus

I'm recruiting PhD students for Fall 2025! CS PhD Deadline: Dec. 15th.

I work on safe/reliable ML and causal inference, motivated by healthcare applications.

Beyond myself, Johns Hopkins has a rich community of folks doing similar work. Come join us!

27.11.2024 15:58 β€” πŸ‘ 19    πŸ” 7    πŸ’¬ 1    πŸ“Œ 0
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pret...

Medically adapted foundation models (think Med-*) turn out to be more hot air than hot stuff. Correcting for fatal flaws in evaluation, the current crop are no better on balance than generic foundation models, even on the very tasks for which benefits are claimed.
arxiv.org/abs/2411.04118

26.11.2024 18:12 β€” πŸ‘ 259    πŸ” 57    πŸ’¬ 8    πŸ“Œ 13