
Ehud Karavani

@ehudk.bsky.social

Research Staff Member at IBM Research. Causal Inference πŸ”΄β†’πŸŸ β†πŸŸ‘. Machine Learning πŸ€–πŸŽ“. Data Communication πŸ“ˆ. Healthcare βš•οΈ. Creator of π™²πšŠπšžπšœπšŠπš•πš•πš’πš‹: https://github.com/IBM/causallib Website: https://ehud.co

280 Followers  |  236 Following  |  509 Posts  |  Joined: 25.08.2023

Posts by Ehud Karavani (@ehudk.bsky.social)


STAT saw and raised

07.03.2026 19:31 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

always use a Mondrian theme in your ggplots
@statsepi.bsky.social

07.03.2026 13:32 β€” πŸ‘ 9    πŸ” 2    πŸ’¬ 0    πŸ“Œ 0
Note of reflection (March 5, 2020) 
This model was conceived in 2010, now more than 10 years ago, and not very long after Git itself came into being. In those 10 years, git-flow (the branching model laid out in this article) has become hugely popular in many a software team to the point where people have started treating it like a standard of sorts β€” but unfortunately also as a dogma or panacea.

During those 10 years, Git itself has taken the world by a storm, and the most popular type of software that is being developed with Git is shifting more towards web apps β€” at least in my filter bubble. Web apps are typically continuously delivered, not rolled back, and you don't have to support multiple versions of the software running in the wild.

This is not the class of software that I had in mind when I wrote the blog post 10 years ago. If your team is doing continuous delivery of software, I would suggest to adopt a much simpler workflow (like GitHub flow) instead of trying to shoehorn git-flow into your team.

If, however, you are building software that is explicitly versioned, or if you need to support multiple versions of your software in the wild, then git-flow may still be as good of a fit to your team as it has been to people in the last 10 years. In that case, please read on.

To conclude, always remember that panaceas don't exist. Consider your own context. Don't be hating. Decide for yourself.


To his credit, he states as much at the very beginning of the post. The type of software this approach aims at is rarely seen nowadays, and (as someone who once thought 𝘵𝘩𝘪𝘴 𝘪𝘴 𝘵𝘩𝘦 𝘸𝘢𝘺) it should be judged accordingly. No need to be too harsh on it.

18.02.2026 07:13 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸ“£ NEW! I’ve just released the BIGGEST and perhaps most creative project I’ve ever worked on!

β€œSearching for Birds” searchingforbirds.visualcinnamon.com 🐀

A project, an article, an exploration that dives into the data that connects humans with birds, by looking at how we search for birds.

12.02.2026 10:02 β€” πŸ‘ 476    πŸ” 176    πŸ’¬ 25    πŸ“Œ 49

of course, this meme is magnetic 🧲 🧭

06.02.2026 13:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Robocalypse: The Revival of the Mechanical Turk

04.02.2026 09:05 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

it's a good view, don't get me wrong, but I also find it a bit limiting - because you can benefit from using random effects to pool estimates / shrink effects of finite factors, too, especially in imbalanced settings

04.02.2026 08:03 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

that whammy bar on the guitar is excellent detail πŸ‘Œ

04.02.2026 07:04 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

funny, I wasn't aware he had other fame aside from his spanning trees fame

30.01.2026 11:46 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Flexible use of a multi-purpose tool by a cow
Osuna-Mascaró and Auersperg report flexible, multipurpose tool use in a cow, expanding the known range of mammalian tool users and underscoring overlooked cognitive capacities in livestock.

If you haven't watched the video abstract about Veronika the tool-using cow, you really should, especially the message from her owner at the end! It cleanses the timeline a bit.
πŸ§ͺ

www.cell.com/current-biol...

28.01.2026 20:33 β€” πŸ‘ 55    πŸ” 15    πŸ’¬ 0    πŸ“Œ 2

Quarto's YAML autocomplete was my sole inspiration for coding a JSON Schema for an internal package we developed that has a YAML/Hydra interface.
It was a fun Pydantic exercise, but I wouldn't recommend it to a friend...

28.01.2026 19:13 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
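Not the package in question, but a minimal sketch of the idea (assuming Pydantic v2): define the config as a model and dump its JSON Schema, which an editor's YAML plugin can then use for autocomplete and validation. All names below are made up for illustration.

```python
# Sketch (assumes Pydantic v2): a made-up config model whose JSON Schema
# can drive YAML autocomplete in an editor.
import json

from pydantic import BaseModel, Field


class TrainConfig(BaseModel):
    """Hypothetical YAML-facing config."""

    learning_rate: float = Field(0.01, gt=0, description="step size")
    epochs: int = 10
    optimizer: str = "adam"


# Write this schema to disk and point the YAML language server at it
# to get completion and validation for the corresponding YAML file.
schema = TrainConfig.model_json_schema()
print(json.dumps(schema, indent=2))
```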

*They don't necessarily have to write all of the code, but thoughtful use of LLMs to contemplate different approaches, rather than blindly accepting code, shows care, a more detail-oriented mentality, and critical thinking, which are broader traits of interest in an era where everyone can now code.

28.01.2026 04:32 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

This. "Why" questions are the key.
Someone writing their own code* will have in-depth considerations of the choices made, because they struggled with deciding between different implementations. But delegating coding to an LLM makes them only a reader of the code, not so different from you reviewing it.

28.01.2026 04:32 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Always happy when you chime in. Interesting idea. Do you have any examples?
I read "error bars" in a frequentist sense, so you'll only have a mean and two interval edges (rather than a continuous density throughout the interval?), and then dithering of three colors would seem too irregular?

23.01.2026 15:14 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

If you can map your error bars to a 0-1 confidence there are ways to use color transparency and/or hex size (softening the edges) to make more uncertain hexes less prominent

23.01.2026 07:37 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
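A minimal sketch of the transparency idea with made-up data (plain scatter points rather than hexes, and assuming Matplotlib >= 3.4 for per-point alpha): rescale each point's error-bar width to a 0-1 confidence and use it as alpha, so less certain points fade out.

```python
# Sketch: fade less-certain points by mapping error-bar width to alpha.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 200))
half_width = rng.uniform(0.1, 1.0, size=200)  # made-up error-bar half-widths

# Rescale widths to a 0-1 confidence: narrowest bar -> 1, widest -> 0.
conf = 1 - (half_width - half_width.min()) / np.ptp(half_width)

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=conf)  # per-point alpha (Matplotlib >= 3.4)
fig.savefig("uncertain_scatter.png")
```

The same 0-1 confidence could instead drive marker size (or hex size, as suggested above) when overlapping transparencies get muddy.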

happy b-day wikipedia
wikipedia25.org/en

22.01.2026 08:02 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Both are "AI" in general (at least colloquially), but they differ in how we create them, validate them, and deploy them to interact with the public

20.01.2026 10:16 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I think this is a result of the general public now conflating AI Gen 1 (e.g., deep learning methods for computer vision in radiology or self-driving cars) with this newer GenAI era (natural language interfaces to everything).

20.01.2026 10:16 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 2    πŸ“Œ 0

But my hunch is that he uses a fixed true θ, while you sample it from a prior for each trial (and using the same prior for data generation 𝘢𝘯𝘥 analysis is probably what makes it so well calibrated).
But I guess I now better understand your comment about him mixing Bayesian and frequentist ideas.

18.01.2026 09:14 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Thanks Frank. Reading an earlier chapter of yours, I think I found the agreement with John: the proportion of wrong conclusions among trials stopped early. The difference is in the magnitude of the phenomenon - John's is much higher (~20% for a 95% HDI).

18.01.2026 09:14 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

the related section can be found here, in case you think it's something in the details: nyu-cdsc.github.io/learningr/as...

17.01.2026 14:36 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I don't think he avoids the posterior mean - the stopping criterion I mentioned above is when the HDI is beyond some ROPE threshold (3rd column in the figure), and the HDI is calculated from the posterior distribution.
I also think calibration is bad when falsely rejecting the null (2nd row, 3rd column)

17.01.2026 14:33 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

so do you believe this bias is a result of Kruschke using a uniform prior in this example instead of something more regularizing?

16.01.2026 14:56 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
The fourth panels of Figures 13.4 and 13.5 show the 95% HDIs at every flip, assuming a uniform prior. The y-axis is θ, and at each N the 95% HDI of the posterior, p(θ|z, N), is plotted as a vertical line segment. The dashed lines indicate the limits of a ROPE from 0.45 to 0.55, which is an arbitrary but reasonable choice for illustrating the behavior of the decision rule. You can see in Figure 13.4 that the HDI eventually falls within the ROPE, thereby correctly accepting the null value for practical purposes


That's a fair point, and Kruschke does seem to focus on the observed proportion of heads ("z/N"). But to the best of my understanding the HDI 𝘪𝘴 calculated from the posterior distribution, so whether the HDI is outside the ROPE is a posterior-based measure, not an observed one.

15.01.2026 21:15 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

is that really the case that simply changing the ROPE's bounds/thresholds (from MCID to nontrivial efficacy) debiases the decision?

15.01.2026 21:08 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Definitions

Delta: true treatment effect being estimated
delta: minimum clinically important Delta
gamma: minimum detectable Delta (threshold for non-trivial treatment effect, e.g. 0.3delta)
SI: similarity interval for Delta, e.g., -0.5delta, 0.5delta
N: average sample size, used for initial resource planning. This is an estimate of the ultimate sample size based on assuming Delta=delta. 
N is computed to achieve a Bayesian power of 0.9 for efficacy, i.e., achieving a 0.9 probability at a fixed sample size that the probability of any efficacy exceeds 0.95 while the probability of non-trivial efficacy exceeds 0.85, i.e., Pr(Delta>0)>0.95 and Pr(Delta>gamma)>0.85.


Thanks Frank, it's the first time I'm hearing about non-trivial efficacy as a distinct concept rather than a different wording for MCID. From your article I see you're using it interchangeably with minimum detectable effect.
Anyhow, in the article it seems that NTE is just a scaled MCID.

15.01.2026 21:08 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 3    πŸ“Œ 0

There's a very good chance I've misunderstood or misinterpreted your writings or Kruschke's, but I would highly appreciate it if you could please try to clarify.

(it is a genuine question I've been having for a while now bsky.app/profile/ehud...)

14.01.2026 19:26 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Stopping when HDI is inside / outside the ROPE leads to some false positives when the null is true (wrong type 1 assertion), while correctly rejecting the null consistently when it is false (no wrong type 2 assertions). Meanwhile, using only the width of the HDI as the selection criterion leads to no wrong assertions (although you can stay undecided for very large Ns when the null is true, instead of accepting it).


The key point is this: If the sampling procedure, such as the stopping rule, biases the data in the sample then the estimation can be biased whether it’s Bayesian estimation or not. A stopping rule based on getting extreme values will automatically bias the sample toward extreme estimates, because once some extreme data appear by chance, sampling stops. A stopping rule based on precision will not bias the sample unless the measure of precision depends on the value of the parameter (which actually is the case here, just not very noticeably for parameter values that aren’t very extreme).


Frank, genuine question: any decision (e.g., "stop for inefficiency") can have false positives. It is unintuitive to me that selecting on the sign and magnitude of the effect leads to no selection bias. This assertion is also at odds with Kruschke's book (13.3.2), which shows ROPE-based criteria biasing the effect.

14.01.2026 19:26 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0
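To make the concern concrete, here is a toy simulation of optional stopping on a fair coin. It deliberately simplifies the thread: a normal-approximation 95% interval stands in for Kruschke's Bayesian HDI, the ROPE is [0.45, 0.55], and all numbers are arbitrary, so it illustrates the mechanism rather than either author's exact results.

```python
# Toy optional-stopping simulation: stop when a normal-approx 95% interval
# for theta is entirely outside ("reject") or inside ("accept") a ROPE.
# The interval stands in for an HDI; all numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(42)
ROPE = (0.45, 0.55)


def run_trial(theta=0.5, n_min=20, n_max=2000):
    flips = rng.random(n_max) < theta  # simulate a fair coin upfront
    for n in range(n_min, n_max + 1):  # peek after every flip
        p = flips[:n].mean()
        se = np.sqrt(p * (1 - p) / n) + 1e-9
        lo, hi = p - 1.96 * se, p + 1.96 * se
        if hi < ROPE[0] or lo > ROPE[1]:   # interval clears the ROPE
            return p, "reject"
        if ROPE[0] < lo and hi < ROPE[1]:  # interval inside the ROPE
            return p, "accept"
    return flips.mean(), "undecided"


results = [run_trial() for _ in range(500)]
rejections = [p for p, decision in results if decision == "reject"]
# Even with a true theta of 0.5, peeking at every flip produces some
# "rejections", and those trials stop on extreme estimates of theta.
```

By construction, every rejected trial stops with an estimate outside the ROPE, which is the selection bias in question: conditioning on the stopping decision skews the estimates among the stopped trials.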

I don't fully understand the ontological comment, tbh. But I do think counterfactuals can be defined without setting up (hypothetical) experiments, which I think you're alluding to? Imo, if you can solve it with parallel universes, you can solve it with counterfactuals; RCTs are for mortals.

05.01.2026 20:52 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I bet he did at least a few, but didn't need much to generalize because he used strong priors so experimental N could be kept low πŸ™ƒ

05.01.2026 20:12 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0