
@anish144.bsky.social

PhD researcher in Machine Learning at Imperial College. Visiting at University of Oxford. Interested in all things involving causality and Bayesian machine learning. Recently I have also been interested in scaling theory. https://anish144.github.io/

74 Followers  |  468 Following  |  30 Posts  |  Joined: 22.11.2024

Latest posts by anish144.bsky.social on Bluesky

We will be presenting this work at #ICML2025 and are happy to discuss it further.

🗓️: Tue 15 Jul 4:30 p.m. PDT
📍: East Exhibition Hall A-B #E-1912

Joint 1st author: @ruby-sedgwick.bsky.social.
With: Avinash Kori, Ben Glocker, @mvdw.bsky.social.

🧡14/14

10.07.2025 18:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

Finally, we note that the flexibility of our model comes at the cost of a more difficult optimisation. However, random restarts followed by choosing the model with the highest score (a common practice with GPs) reliably improve the structure recovery metrics.
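As a rough illustration of that restart strategy (not the paper's code; `fit_model` is a hypothetical helper that initialises, optimises, and scores one model):

```python
# Minimal sketch of random restarts: keep the fit with the best score.
# `fit_model(seed)` is assumed to return (fitted_model, score), where the
# score is the optimised ELBO / approximate marginal likelihood.
def best_of_restarts(fit_model, n_restarts=10):
    best_model, best_score = None, float("-inf")
    for seed in range(n_restarts):
        model, score = fit_model(seed=seed)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```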

🧡13/14

10.07.2025 18:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We also test our method on semi-synthetic data generated from the Syntren gene regulatory network simulator.

🧡12/14

10.07.2025 18:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

When data are generated from an identifiable model (ANM), our more flexible model performs as well as an ANM-restricted Bayesian model (CGP). Both Bayesian models again outperform the non-Bayesian approaches, even those that make the correct ANM assumption.

🧡11/14

10.07.2025 18:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

With a larger number of variables (50), where the discrete search blows up, and with complex data, our approach performs well. SDCD uses the same acyclicity regulariser but relies on maximum likelihood with neural networks, so the comparison highlights the advantage of the Bayesian approach.

🧡10/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We first test on data generated from our model itself, where discrete model selection is tractable (3 variables). Here, we show that while the discrete model (DGP-CDE) recovers the true structure reliably, our continuous approximation (CGP-CDE) results in higher error.

🧡9/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We enforce acyclicity of the learnt adjacency by adding a constraint to the optimisation; variational inference handles the remaining parameters.

The final objective returns the adjacency matrix of the causal structure that maximises the posterior.
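As an illustration of such a constrained objective (standard notation; the exact regulariser used in the paper may differ), the classic NOTEARS-style formulation is

$$
\max_{\phi, A}\; \mathrm{ELBO}(\phi, A)
\quad \text{s.t.} \quad h(A) = \operatorname{tr}\!\left(e^{A \circ A}\right) - d = 0,
$$

where $A$ is the adjacency, $d$ the number of variables, and $\circ$ the elementwise product; $h(A) = 0$ exactly when the graph encoded by $A$ is acyclic.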

🧡8/14

10.07.2025 18:06 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Therefore, we can construct an adjacency matrix from the kernel hyperparameters. This amounts to Automatic Relevance Determination: maximising the marginal likelihood uncovers the dependency structure among the variables. However, the learnt adjacency must be acyclic.
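As a rough sketch of this idea (illustrative only; the paper's parametrisation of the adjacency may differ), an ARD lengthscale that shrinks for input j in the GP modelling variable i signals a candidate edge j -> i:

```python
import numpy as np

# Illustrative only: turn per-input ARD lengthscales into a soft adjacency.
# lengthscales[i, j] is the lengthscale of input j in the GP modelling variable i;
# a small lengthscale means strong dependence, so relevance is its inverse.
def soft_adjacency(lengthscales: np.ndarray) -> np.ndarray:
    A = 1.0 / lengthscales        # large entry => candidate edge j -> i
    np.fill_diagonal(A, 0.0)      # no self-edges
    return A

lengthscales = np.array([[np.inf,  0.5, 20.0],
                         [30.0, np.inf,  0.8],
                         [25.0,  40.0, np.inf]])
print(soft_adjacency(lengthscales))
```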

🧡7/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Next, we construct a latent-variable Gaussian process model that can capture non-Gaussian densities, with each variable taking inputs according to a causal graph. To continuously parametrise the space of graphs, we note that the kernel hyperparameters control which inputs each variable depends on.

🧡6/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We first show that the guarantees of Bayesian model selection (BMS) hold in the multivariate case: 1) when the underlying model is identifiable, BMS identifies the true DAG; 2) for more flexible models, different graphs remain distinguishable.
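For context, Bayesian model selection scores each graph by its marginal likelihood (standard formulation, not a quotation from the paper):

$$
p(G \mid \mathcal{D}) \propto p(\mathcal{D} \mid G)\, p(G),
\qquad
p(\mathcal{D} \mid G) = \int p(\mathcal{D} \mid f, G)\, p(f \mid G)\, \mathrm{d}f,
$$

so, under a uniform prior over graphs, the DAG with the highest evidence $p(\mathcal{D} \mid G)$ is selected.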

🧡5/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

However, naive Bayesian model selection scales poorly because the number of DAGs grows super-exponentially with the number of variables.
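To see the growth concretely, Robinson's recurrence counts labelled DAGs (a standard result, included here only for illustration):

```python
from functools import lru_cache
from math import comb

# Number of labelled DAGs on n nodes (Robinson's recurrence, OEIS A003024):
# exhaustive Bayesian model selection over graphs quickly becomes intractable.
@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 7):
    print(n, num_dags(n))   # 1, 3, 25, 543, 29281, 3781503
```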

We propose a continuous Bayesian model selection approach that scales and allows for more flexible model assumptions.

🧡4/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

While current causal discovery methods impose unrealistic model restrictions to ensure identifiability, Bayesian models relax strict identifiability but allow for more realistic, causally motivated assumptions, yielding performance gains: arxiv.org/abs/2306.02931

🧡3/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Bayesian models encode soft restrictions in the form of priors. These priors also allow for encoding causal assumptions, chiefly that causal mechanisms do not inform each other. This is achieved simply by ensuring that the prior factorises over the mechanisms.
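Written out (standard notation rather than the paper's), the independent-mechanisms assumption corresponds to a prior that factorises over the per-variable mechanisms given the graph:

$$
p(f_1, \dots, f_d \mid G) = \prod_{i=1}^{d} p(f_i \mid G),
$$

where $f_i$ is the mechanism generating variable $x_i$ from its parents in $G$; under this prior, learning one mechanism carries no information about the others.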

🧡2/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

📢 New #ICML2025 paper: "Continuous Bayesian Model Selection for Multivariate Causal Discovery".

We propose a Bayesian causal model that allows for scalable causal discovery without restrictive model assumptions.

Paper: arxiv.org/abs/2411.10154

Code: github.com/Anish144/Con...

🧡1/14

10.07.2025 18:06 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 2
Preview: A Meta-Learning Approach to Bayesian Causal Discovery. Discovering a unique causal structure is difficult due to both inherent identifiability issues, and the consequences of finite data. As such, uncertainty over causal structures, such as those...

Excited to be presenting this work at #ICLR2025. Please do reach out if you are interested in a similar space!

🗓️: Hall 3 + Hall 2B #471
🕒: Fri 25 Apr, 3 p.m.
📜: openreview.net/forum?id=eeJ...

This was a great collaboration w/ @mashtro.bsky.social, James Requeima, @mvdw.bsky.social

19.04.2025 17:39 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

Why did that work? We are approximating the posterior of a causal model (from which training data are generated), which may differ from the true data-generating process. Making the causal model more flexible (a wider prior) and increasing the capacity of the neural process can both help. 14/15

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

What if we don't know the data distribution? Our approach here is to encode a "wide prior", training on mixtures of all possible models (that we can think of). We show that this approach leads to good performance on datasets whose generation process was unknown at training time. 13/15
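A toy sketch of this "wide prior" (the samplers below are hypothetical placeholders, not the generators used in the paper): each meta-training dataset is drawn from a randomly chosen model family.

```python
import random
import numpy as np

# Hypothetical example of training data drawn from a mixture of SCM families.
def sample_linear_gaussian(n):
    x = np.random.randn(n, 1)
    y = 2.0 * x + 0.1 * np.random.randn(n, 1)
    return np.hstack([x, y])                      # true graph: x -> y

def sample_nonlinear_heavy_tailed(n):
    x = np.random.standard_t(df=3, size=(n, 1))
    y = np.sin(x) + 0.1 * np.random.standard_t(df=3, size=(n, 1))
    return np.hstack([x, y])                      # true graph: x -> y

SCM_FAMILIES = [sample_linear_gaussian, sample_nonlinear_heavy_tailed]

def sample_training_dataset(n=200):
    return random.choice(SCM_FAMILIES)(n)         # mixture over model families
```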

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Next, we test with more nodes (20), denser graphs, and more complicated functions. Here, our model outperforms the other baselines. Notably, a single model trained on all the data (labelled BCNP All Data) does not lose performance on specific datasets. 12/15

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We first show that our model outputs reasonable posterior samples on a 2-node graph with a single edge, where the underlying model is not identifiable from the data. Here, the AVICI model, which does not correlate entries of the adjacency matrix, fails to output reasonable samples. 11/15

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We test against two baseline types: 1) posterior approximation via the marginal likelihood (DiBS, BayesDAG); 2) NP-like methods that find single structures and can be used to obtain posterior samples, but miss key properties of the posterior (AVICI, CSIvA). 10/15

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

The loss, which targets the KL divergence, simplifies to maximising the log probability of the true causal graph under our model. The final scheme: a model that efficiently outputs samples of causal structures approximating the true posterior, with just a forward pass! 9/15
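In standard neural-process notation (a sketch of the usual argument, not a quotation from the paper), the expected-KL objective reduces to a cross-entropy on the true graph:

$$
\mathbb{E}_{p(\mathcal{D})}\!\left[\mathrm{KL}\!\left(p(G \mid \mathcal{D}) \,\big\|\, q_\theta(G \mid \mathcal{D})\right)\right]
= -\,\mathbb{E}_{p(G, \mathcal{D})}\!\left[\log q_\theta(G \mid \mathcal{D})\right] + \text{const},
$$

so minimising it is equivalent to maximising the log probability of the true graph under $q_\theta$, using only joint samples $(G, \mathcal{D})$.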

19.04.2025 17:39 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Our decoder constructs DAGs from a lower-triangular matrix (A) and a permutation matrix (Q). A Gumbel-Sinkhorn distribution is parameterised, from which permutations (Q) are sampled. The representation is further processed to parameterise the lower-triangular matrix (A). 8/15
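A tiny illustration of why this parametrisation only produces DAGs (the Gumbel-Sinkhorn sampling of Q is omitted; the matrices here are hand-picked): conjugating a strictly lower-triangular adjacency by a permutation relabels the nodes but can never create a cycle.

```python
import numpy as np

A = np.array([[0, 0, 0],
              [1, 0, 0],
              [1, 1, 0]])    # strictly lower triangular: a DAG under the ordering 0, 1, 2
Q = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])    # a permutation matrix (node relabelling)

adj = Q @ A @ Q.T            # still a DAG adjacency, just with relabelled nodes
# A DAG adjacency is nilpotent, so all of its eigenvalues are zero.
print(adj)
print(np.allclose(np.linalg.eigvals(adj), 0))   # True
```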

19.04.2025 17:39 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We embed each node-sample pair and append a query vector of zeros along the sample axis. Our encoder alternates between attention over samples and attention over nodes to preserve equivariance. We then perform cross-attention with the query vector to encode permutation invariance over samples. 7/15
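A rough sketch of such alternating (axial) attention over a (batch, samples, nodes, dim) tensor; the layer sizes, normalisation, and the query/cross-attention step are simplified and are not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    # One block of attention over samples followed by attention over nodes.
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.sample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.node_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, n, d = x.shape
        # Attention over samples: fold nodes into the batch dimension.
        h = x.permute(0, 2, 1, 3).reshape(b * n, s, d)
        h = h + self.sample_attn(h, h, h, need_weights=False)[0]
        h = h.reshape(b, n, s, d).permute(0, 2, 1, 3)
        # Attention over nodes: fold samples into the batch dimension.
        g = h.reshape(b * s, n, d)
        g = g + self.node_attn(g, g, g, need_weights=False)[0]
        return g.reshape(b, s, n, d)

x = torch.randn(2, 100, 5, 64)                  # (batch, samples, nodes, dim)
print(AlternatingAttention()(x).shape)          # torch.Size([2, 100, 5, 64])
```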

19.04.2025 17:39 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

What does our model look like? We encode key properties of the posterior: 1) permutation invariance with respect to samples; 2) permutation equivariance with respect to nodes; 3) correlation between adjacency elements. We do this with a transformer encoder-decoder structure. 6/15

19.04.2025 17:39 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Our training objective reflects this: we minimise the KL between the true posterior and the neural process. The key property is that we only require samples of the data and the true causal graph. This data forms the "prior", which can be synthetic or come from real examples. 5/15
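A schematic of that training loop, with hypothetical stand-ins: `sample_task()` returns a (data, true_graph) pair from the prior, and the model `q_theta` exposes a `log_prob(graph, data)` method.

```python
import torch

def train(q_theta, sample_task, steps=10_000, lr=1e-4):
    # Maximise the log probability of the true graph under the neural process,
    # using only (data, graph) pairs sampled from the (synthetic or real) prior.
    opt = torch.optim.Adam(q_theta.parameters(), lr=lr)
    for _ in range(steps):
        data, true_graph = sample_task()
        loss = -q_theta.log_prob(true_graph, data)   # cross-entropy on the true graph
        opt.zero_grad()
        loss.backward()
        opt.step()
```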

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

So how do we solve this? We bypass these integrals by using the neural process paradigm: an NN that learns to directly map data to the target posterior. We sample data from the causal model of interest & train a network to reconstruct the underlying causal structure. 4/15

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Bayes infers a distribution over plausible causal structures instead of a single structure. However, it requires solving a complicated integral, or equivalently finding a posterior over functions. The space of DAGs also grows super-exponentially with the number of nodes. 3/15

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Causal discovery from observational data is a challenging task. Identifiability is only guaranteed under strong assumptions about model classes, which are often violated in practice. This issue is made worse by finite data. 2/15

19.04.2025 17:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Understanding causes is key to science. Finite observational data alone isn't enough. While Bayes offers a framework to deal with this, the calculations are often intractable. We introduce a method to accurately approximate the posterior over causal structures.

#ICLR2025 🧡1/15

19.04.2025 17:39 β€” πŸ‘ 15    πŸ” 2    πŸ’¬ 1    πŸ“Œ 1
Bayesian Deep Learning is Needed in the Age of Large-Scale AI [Paper Reflection]: In particular, we point out that there are many other metrics beyond accuracy, such as uncertainty calibration, which we have to take into account to ensure that better models also translate to better...

I wrote a little blog post for the Neptune.ai blog about our ICML position paper on Bayesian Deep Learning: neptune.ai/blog/bayesia...

13.03.2025 17:53 β€” πŸ‘ 36    πŸ” 10    πŸ’¬ 1    πŸ“Œ 2
