@laurentjacob.bsky.social
Researcher in statistics and machine learning for genomics https://laurent-jacob.github.io/
#TalentCNRS | Flora Jay, between synthetic genomes and evolutionary narratives, receives the CNRS bronze medal.
www.ins2i.cnrs.fr/fr/cnrsinfo/...
@lisnlab.bsky.social @cnrs-paris-saclay.bsky.social
Preprint alert!
Our new abundance index, REINDEER2, is out!
It's cheap to build and update, offers tunable abundance precision at the k-mer level, and delivers very high query throughput.
Short thread!
www.biorxiv.org/content/10.1...
github.com/Yohan-Hernan...
Registration is now open!
The €580 fee includes housing and all meals.
Registration will close on October 17th, or earlier if we reach 80 participants.
The 2026 Probabilistic Modeling in Genomics (ProbGen) meeting will be held at UC Berkeley, March 25-28, 2026. We have an amazing list of keynote speakers and session chairs:
probgen2026.github.io
Please help spread the news.
Thank you to @cnrs-rhoneauvergne.bsky.social and @astropierre.com for this interview about my work on AI for evolutionary genomics!
There is a nice example in @stephaneguindon.bsky.social's PhD thesis, p. 55:
theses.hal.science/tel-00843343...
The design matrix of the regression should be nPairs x nBranches, with a 1 at coordinate (i,j) if branch j belongs to the path between the leaves of pair i in the tree, and 0 otherwise.
I think one way to do this is the least squares method, which gives you the set of branch lengths on your given topology such that the sum of squared differences between your given distances and the distances on the tree is minimal.
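A minimal sketch of that setup, assuming a fixed 4-leaf unrooted topology ((A,B),(C,D)) and made-up pairwise distances (names and numbers are illustrative, not from the thread):

```python
# Least-squares branch lengths on a fixed 4-leaf unrooted topology ((A,B),(C,D)).
# The distances below are made up; the point is the shape of the design matrix.
import numpy as np
from itertools import combinations

leaves = ["A", "B", "C", "D"]
branches = ["A", "B", "C", "D", "internal"]      # four external branches + one internal
# For each pair of leaves, the branches on the path connecting them.
paths = {
    ("A", "B"): {"A", "B"},
    ("A", "C"): {"A", "internal", "C"},
    ("A", "D"): {"A", "internal", "D"},
    ("B", "C"): {"B", "internal", "C"},
    ("B", "D"): {"B", "internal", "D"},
    ("C", "D"): {"C", "D"},
}

pairs = list(combinations(leaves, 2))            # nPairs = 6
X = np.zeros((len(pairs), len(branches)))        # nPairs x nBranches design matrix
for i, pair in enumerate(pairs):
    for j, branch in enumerate(branches):
        X[i, j] = branch in paths[pair]          # 1 iff branch j is on path i

d = np.array([0.30, 0.55, 0.60, 0.50, 0.58, 0.25])   # given pairwise distances
b, *_ = np.linalg.lstsq(X, d, rcond=None)             # minimizes ||X b - d||^2
print(dict(zip(branches, b.round(3))))
```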
Phyloformer is finally published in MBE!
academic.oup.com/mbe/advance-...
The thread below provides a summary of our neural network for likelihood-free phylogenetic reconstruction.
People having breakfast in front of the Alps at the Centre Paul Langevin.
Come hear about the latest advances in the field and discuss your own work at Centre Paul Langevin in beautiful Aussois.
A headshot of Dr Burak Yelmen.
Burak Yelmen from the University of Tartu will give a keynote presentation on "A perspective on generative neural networks in genomics with applications in synthetic data generation".
A headshot of Dr Claudia Solís-Lemus.
Claudia Solís-Lemus from the University of Wisconsin-Madison will give a keynote presentation on "The good, the bad and the ugly of deep learning in phylogenetic inference".
A headshot of Dr Anne-Florence Bitbol.
Anne-Florence Bitbol from EPFL will give a keynote presentation on "Coevolution-aware language models".
A legendary being holds a phylogenetic tree in the palm of their hand, with snowy mountains in the background.
The next LEGEND conference on machine learning for evolutionary genomics will be in Aussois (French Alps) between December 8th and 12th.
Mark your calendars and make sure your best work is ready next September when the call for abstracts opens.
legend2025.sciencesconf.org
Excited to share our latest work, MUSET, a new tool for creating abundance unitig matrices from sequencing data. It was published yesterday in Oxford Bioinformatics if you want to have a look:
academic.oup.com/bioinformati...
Let's break it down:
My book is (at last) out, just in time for Christmas!
A blog post to celebrate and present it: francisbach.com/my-book-is-o...
Ok, I tried to create my own list of people working on developing statistical or machine learning models applied to omics data. I am sure I missed a lot of cool people. If you'd like to be added, let me know. #Stats #ML #Omics
go.bsky.app/73rcuJn
Hi Raphael, thanks for putting this together. I'll be happy to be on the list if you think it makes sense :)
A sketch summarizing the entire Phyloformer process.
All this work was done by Luca Nesterenko and @lblassel.bsky.social, assisted by P. Veber, Bastien Boussau, and myself.
The code and data are available at github.com/lucanest/Phy...
Please share if you find this interesting, and we welcome your feedback :)
A plot comparing the speed of all methods.
In all these experiments, and regardless of model complexity, Phyloformer run on a GPU was the fastest method.
It was about two orders of magnitude faster than IQTree, and even twice as fast as FastME.
A plot comparing the error of different methods under a more complex probabilistic model of sequence evolution. Phyloformer outperforms all other methods under all metrics.
We then trained Phyloformer under a more realistic model, accounting for co-evolution.
It outperformed all other methods, including IQTree/FastTree, on all metrics.
A plot stratifying the error into two terms: we perform very well for estimating evolutionary distances, less well for the topology.
More precisely, Phyloformer was very good at predicting distances, and also on the Kuhner-Felsenstein metric, which accounts for both topology and branch lengths.
Looking at the topology only (Robinson-Foulds metric), it performed less well than IQTree/FastTree, but better than FastME.
A plot comparing the error made by different methods of phylogenetic inference. The distance method FastME underperforms, our method is on par with likelihood methods.
We first trained Phyloformer to perform inference under LG, a common model under which likelihood computation is possible.
It performed much better than FastME (a distance method), and on par with maximum likelihood approaches (IQTree, FastTree).
A visual justification of permutation invariance: two sequence alignments that are identical up to a permutation must lead to the same phylogeny.
Phyloformer uses self-attention to progressively share information among and between sequences.
This choice makes our function invariant to the order of the input sequences (any order yields the same output phylogeny).
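As a rough, hedged illustration of this property (a toy set-up, not the actual Phyloformer architecture): plain self-attention over a set of sequence embeddings is permutation equivariant, so any symmetric readout of its output is invariant to the order of the input sequences.

```python
# Toy demonstration of permutation equivariance/invariance with self-attention.
# This is NOT the Phyloformer architecture, just the underlying principle.
import numpy as np

rng = np.random.default_rng(0)
n_seq, dim = 5, 8
X = rng.normal(size=(n_seq, dim))                # one embedding per sequence
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(X.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)            # softmax over the set of sequences
    return A @ V

perm = rng.permutation(n_seq)
out, out_perm = self_attention(X), self_attention(X[perm])
assert np.allclose(out[perm], out_perm)          # outputs are permuted the same way
assert np.allclose(out.mean(axis=0), out_perm.mean(axis=0))  # symmetric readout unchanged
```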
The inference process of Phyloformer. We use the trained network to estimate evolutionary distances from related sequences, and pass these estimates to FastME to build a phylogeny.
Once trained, Phyloformer provides estimates of all evolutionary distances given the sequences.
But each of these distance estimates is informed by the entire set of sequences, not just the corresponding pair!
We then pass them to FastME, a distance method, to obtain a tree.
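A hedged sketch of that hand-off (not the actual Phyloformer code): predicted distances can be written in PHYLIP square format, the input format accepted by distance programs such as FastME. The taxon names and values below are made up.

```python
# Hypothetical hand-off to a distance method: write a predicted distance matrix
# in PHYLIP square format, which programs such as FastME accept as input.
import numpy as np

taxa = ["seqA", "seqB", "seqC", "seqD"]          # made-up names
D = np.array([[0.00, 0.30, 0.55, 0.60],          # stand-in for predicted distances
              [0.30, 0.00, 0.50, 0.58],
              [0.55, 0.50, 0.00, 0.25],
              [0.60, 0.58, 0.25, 0.00]])

with open("distances.phy", "w") as f:
    f.write(f"{len(taxa)}\n")
    for name, row in zip(taxa, D):
        f.write(f"{name:<10}" + " ".join(f"{x:.5f}" for x in row) + "\n")
# distances.phy can then be given to FastME (see its documentation for the exact
# command line) to build the tree.
```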
The process for training Phyloformer. We sample trees and sequences evolved along these trees from the model under which we want to do inference. We use these examples to train a network that predicts the parameters (evolutionary distances, equivalent to the tree) from an observation (aligned related sequences).
Phyloformer is a learnable function. Its input is a set of sequences, its output is their phylogeny, represented by evolutionary distances between all pairs of sequences.
We optimize this function on a large number of (phylogeny, sequences) pairs sampled from the probabilistic model.
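A drastically simplified, hypothetical stand-in for that training idea (a toy regression, nothing like the real architecture or simulator): sample (sequence pair, distance) examples under Jukes-Cantor and fit a small network mapping a summary of the pair back to the distance.

```python
# A drastically simplified stand-in for the training idea (not Phyloformer):
# sample (sequence pair, distance) examples from a Jukes-Cantor simulator and
# train a tiny network to map a summary of the pair back to the distance.
import numpy as np
import torch
from torch import nn

rng = np.random.default_rng(0)
L = 500                                            # sites per simulated pair

def sample_pair():
    d = rng.uniform(0.01, 1.0)                     # true evolutionary distance
    p_sub = 0.75 * (1.0 - np.exp(-4.0 * d / 3.0))  # JC69 prob. that a site differs
    anc = rng.integers(0, 4, L)
    other = np.where(rng.random(L) < p_sub, (anc + rng.integers(1, 4, L)) % 4, anc)
    return np.mean(anc != other), d                # (observed p-distance, target)

X, y = map(np.array, zip(*[sample_pair() for _ in range(5000)]))
X = torch.tensor(X, dtype=torch.float32).unsqueeze(1)
y = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(200):                               # plain MSE regression
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), y)
    loss.backward()
    opt.step()
print("final training MSE:", float(loss))
```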
A visual justification of simulation-based and likelihood-free inference: under some probabilistic models, computing likelihoods is hard but sampling data is easy.
This is where likelihood-free/simulation-based inference comes into play.
Sampling trees and sequences under a probabilistic model is possible under much more complex models, for which likelihood computations would be prohibitive.
It's an alternative way to access the model.
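A minimal, hedged example of "sampling is easy": simulating an alignment along a fixed toy tree under Jukes-Cantor takes a few lines, even when likelihood computation under a richer model would be painful. The topology and branch lengths below are invented.

```python
# Sampling is easy: simulate an alignment along a fixed toy tree under the
# Jukes-Cantor model. Topology ((A,B),(C,D)) and branch lengths are invented.
import numpy as np

rng = np.random.default_rng(1)
L = 200                                           # number of sites

def evolve(parent, t):
    """Each site substitutes with probability 3/4 (1 - exp(-4t/3)) under JC69."""
    p = 0.75 * (1.0 - np.exp(-4.0 * t / 3.0))
    child = parent.copy()
    hit = rng.random(L) < p
    child[hit] = (child[hit] + rng.integers(1, 4, hit.sum())) % 4   # pick another base
    return child

root = rng.integers(0, 4, L)                      # ancestral sequence, bases coded 0..3
ab, cd = evolve(root, 0.20), evolve(root, 0.05)   # internal branches
leaves = {"A": evolve(ab, 0.10), "B": evolve(ab, 0.10),
          "C": evolve(cd, 0.30), "D": evolve(cd, 0.30)}

bases = np.array(list("ACGT"))
for name, seq in leaves.items():
    print(name, "".join(bases[seq[:40]]), "...")
```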
A visual summary of maximum likelihood approaches for phylogenetic inference. We explore the space of phylogenetic trees to find the one making a given set of related sequences as likely as possible under a chosen probabilistic model of sequence evolution.
Maximum likelihood approaches, on the other hand, search for the tree under which all sequences jointly are most likely.
This makes them accurate but slow. It also restricts these approaches to simplistic models under which likelihood computations are fast enough.
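For contrast, a hedged sketch of the likelihood computation those methods rely on: Felsenstein's pruning algorithm under Jukes-Cantor on a fixed toy topology, with random placeholder data. Tree search repeats this kind of computation for many candidate trees.

```python
# Likelihood of a fixed toy tree under Jukes-Cantor via Felsenstein's pruning
# algorithm; the data are random placeholders.
import numpy as np

def jc69(t):
    """JC69 transition matrix: stay with prob 1/4 + 3/4 e^{-4t/3}, change otherwise."""
    e = np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), 0.25 * (1.0 - e)) + np.eye(4) * e

def conditional(node, seqs):
    """Per-site conditional likelihoods (n_sites x 4) of the subtree below `node`."""
    if isinstance(node, str):                     # leaf: one-hot of the observed base
        return np.eye(4)[seqs[node]]
    left, right, t_l, t_r = node
    Ll, Lr = conditional(left, seqs), conditional(right, seqs)
    return (Ll @ jc69(t_l).T) * (Lr @ jc69(t_r).T)

rng = np.random.default_rng(2)
seqs = {name: rng.integers(0, 4, 100) for name in "ABCD"}       # fake alignment
tree = (("A", "B", 0.1, 0.1), ("C", "D", 0.3, 0.3), 0.2, 0.05)  # ((A,B),(C,D))
site_lik = conditional(tree, seqs) @ np.full(4, 0.25)           # uniform root frequencies
print("log-likelihood:", np.log(site_lik).sum())
```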
A visual summary of so-called distance methods for phylogenetic inference. We start from an estimate of evolutionary distances between all pairs of sequences (sums of branch lengths between leaves in the true tree) and build a tree by hierarchical clustering.
Knowing the evolutionary distances (sum of branch lengths) between all pairs of sequences is enough to recover the tree, by hierarchical clustering.
Distance methods rely on this idea, with distances estimated from each pair of sequences taken separately. This makes them fast but inaccurate.
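A minimal illustration of the "tree from distances by clustering" idea, using UPGMA (average linkage) on a made-up distance matrix; real distance methods such as neighbor joining or FastME use more refined criteria, but the principle is the same.

```python
# Tree from distances by hierarchical clustering: UPGMA (average linkage) on a
# made-up distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

taxa = ["A", "B", "C", "D"]
D = np.array([[0.00, 0.30, 0.55, 0.60],
              [0.30, 0.00, 0.50, 0.58],
              [0.55, 0.50, 0.00, 0.25],
              [0.60, 0.58, 0.25, 0.00]])

Z = linkage(squareform(D), method="average")      # UPGMA-style agglomeration
root = to_tree(Z)

def newick(node):
    """Topology-only Newick string for the scipy cluster tree."""
    if node.is_leaf():
        return taxa[node.id]
    return f"({newick(node.left)},{newick(node.right)})"

print(newick(root) + ";")                         # e.g. ((A,B),(C,D));
```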
A diagram of phylogenetic inference: we build a tree summarizing how a given set of related sequences evolved from a common ancestor.
Phylogenetic trees describe how related sequences (at the leaves) evolved from a common ancestor. Internal nodes are successive ancestral sequences.
In probabilistic models, the length of a branch represents the expected number of substitutions per site between the sequences at its two ends.
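As a small worked example under the simple Jukes-Cantor model (an assumption for illustration, not something stated in the thread), the observed proportion of differing sites p and the expected number of substitutions per site d are linked by p = 3/4 (1 - exp(-4d/3)), which can be inverted:

```python
# Under Jukes-Cantor (a simplifying assumption), convert between the observed
# fraction of differing sites and the expected number of substitutions per site.
import numpy as np

def jc69_distance(p):
    """Expected substitutions per site given an observed p-distance (p < 0.75)."""
    return -0.75 * np.log(1.0 - 4.0 * p / 3.0)

def jc69_pdiff(d):
    """Expected fraction of differing sites after d substitutions per site."""
    return 0.75 * (1.0 - np.exp(-4.0 * d / 3.0))

print(round(jc69_distance(0.30), 3))   # ~0.383 substitutions per site
print(round(jc69_pdiff(0.383), 3))     # ~0.300, back to the observed fraction
```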