(@dgautheret) — Bluesky Profile

5 days ago

Excited to share this preprint that describes my latest work on using GPUs to accelerate processing of RNA-seq data.

The title says it all: "RNA-seq analysis in seconds using GPUs" now on biorxiv www.biorxiv.org/content/10.6... and github github.com/pachterlab/k...

Figure 1 shows the key result

182 86 6 8

2 weeks ago

Presentation of scientific work on De Bruijn Graphs applied to the processing of sequencing data in the context of biology. The picture was taken in the conference room of the University of Venice, where a screen displays a slide that introduces De Bruijn Graphs, with the speaker standing in front of it. Behind the screen is a large Renaissance painting that spans from the floor to the ceiling.

I had the opportunity to present nice results on the detection of biological events in De Bruijn Graphs at #DSB2026, as part of my PhD work on #Vizitig!

Thanks to the organizers and colleagues for this amazing and super-inspiring event (and @camillemrcht.bsky.social for the picture).

17 7 1 0

3 weeks ago

Beautiful caveat section!

1 0 1 0

1 month ago

For the importance of pesticides in cancer incidence, see this instead. Occupational exposures (asbestos, benzene) are in the blue bar on the right, and pesticides do not appear anywhere for lack of sufficient data.
www.nature.com/articles/s41...

2 0 0 0

1 month ago

👇 😨

0 0 0 0

1 month ago

PREPRINT ALERT

I heard you were craving more combinatorics, so here is some more for y'all!

5 4 0 1

1 month ago

For the relative importance of cancer risk factors, see this instead. The small light-blue area covers all occupational causes: asbestos, arsenic, etc. Pesticides do not appear anywhere for lack of sufficient data.
Source: Fink et al. Nature Medicine, 2026

0 0 0 1

1 month ago

More minimizer papers! 😆

3 2 1 0

4 months ago

Stay tuned: We are now running Metapuccino on SRA’s 1 million human transcriptomes.

2 1 0 0

4 months ago

This manuscript covers the full methodology and discusses the limits of NLP and LLMs for NGS metadata completion.

0 0 1 0

4 months ago

Usability was a top priority: Metapuccino runs on regular computers with open-source LLMs, but can also scale up on GPUs for large datasets. All it needs is a list of SRA IDs — no pre-processed tables required.

0 0 1 0

4 months ago

Fiona Hak developed a clever LLM training strategy using the hardest SRA cases — the fine-tuned model is available on Hugging Face.

0 0 1 0

4 months ago

Metapuccino fills and standardizes 19 key SRA metadata fields in human transcriptomics, using rule-based NLP and a large language model (LLM).

0 0 1 0

4 months ago

Even simple tasks, like selecting tumor vs. normal samples for a cancer type, require expert curation across multiple tables, protocols, and abstracts.

0 0 1 0

4 months ago

NCBI’s SRA is a fantastic resource for studying the human transcriptome. But its metadata is messy — over 70% of fields are empty, and information is often inconsistent.

0 0 1 0

4 months ago

Metappuccino: Large Language Model-driven Reconstruction of Sequence Read Archive Metadata for Cancer Research. Motivation: High-throughput RNA-sequencing has significantly advanced transcriptomic profiling in oncology. Millions of RNA-seq datasets have accumulated in public databases such as the Sequence Read ...

www.biorxiv.org/cgi/content/...

What’s behind Metapuccino? ☕️, by PhD student Fiona Hak, @camillemrcht.bsky.social and Melina Gallopin. A thread 👇

5 2 2 0

4 months ago

My algorithmic friends (@camillemrcht.bsky.social) doing LLM stuff : www.biorxiv.org/content/10.1...! And also, screaming last names in the author list ;P. Given my level of trust in Camille, though, perhaps it's time for me to engage more seriously with these models in research...

5 2 1 0

4 months ago

PostDoc position in bioinformatics and artificial intelligence. PDF available upon request.

Interested in #lncRNA and #ArtificialIntelligence?
As part of our recently founded French-Korean bilateral project DHARP, we are recruiting a postdoc in bioinformatics and artificial intelligence in our team at @ips2parissaclay.bsky.social.
Application deadline: 01/12/2025

3 1 0 0

4 months ago

PubMed is running on autopilot during shutdown, but key independent committee has been abolished www.bmj.com/content/391/... 🧪

7 9 0 2

4 months ago

Illustration of Burrows-Wheeler Transform and many auxiliary structures from the input string how$now$brown$cow$#

New tool "bwt-svg" for making illustrations of the BWT and the many auxiliary arrays and other structures related to it. Pyodide-based no-installation-necessary interface here: benlangmead.github.io/bwt-svg/. (H/t to @robert.bio for pointing me to pyodide!) Full repo: github.com/benlangmead/....
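
As a rough illustration of the structures such a figure depicts, the sketch below computes a suffix array and the BWT by naive sorting, then inverts the transform with the LF mapping. It is not taken from bwt-svg: the function names are hypothetical, and it assumes the text ends with a unique character that sorts before all others (here `#`, as in the example string, under plain ASCII order, which may differ from the ordering the tool uses).

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in sorted order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """BWT: the character preceding each suffix, in suffix-array order.

    Assumes text ends with a unique, smallest terminator character.
    For the suffix starting at 0, text[-1] wraps around to the terminator.
    """
    return "".join(text[i - 1] for i in suffix_array(text))


def inverse_bwt(last):
    """Rebuild the original text from its BWT by following the LF mapping."""
    # A stable sort pairs the j-th character of the first column with the
    # same occurrence of that character in the last column.
    order = sorted(range(len(last)), key=lambda i: last[i])
    first = sorted(last)
    row = order[0]  # the row that starts with the first character of the text
    out = []
    for _ in range(len(last)):
        out.append(first[row])
        row = order[row]
    return "".join(out)


if __name__ == "__main__":
    text = "how$now$brown$cow$#"
    transformed = bwt(text)
    print(transformed)
    print(inverse_bwt(transformed) == text)
```

Sorting whole suffixes is quadratic or worse in the text length, so this only suits short teaching examples like the one in the post.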

40 21 4 1

5 months ago

The MSc. Bioinformatics students of U. Paris-Saclay are organizing the Junior Conference on Computational Biology (JC2B) 2025: AI and predictive models in bioinformatics
November 13, 2025 - I2BC, CNRS, Gif-sur-Yvette, France
Register for free : bioi2.i2bc.paris-saclay.fr/jc2b/#regist...

2 2 0 0

7 months ago

🦠🧍‍♀️From bacterial to human immunity.

We report in @science.org the discovery of a human homolog of SIR2 antiphage proteins that participates in the TLR pathway of animal innate immunity.
Co-led with @enzopoirier.bsky.social by D. Bonhomme and @hugovaysset.bsky.social

www.science.org/doi/10.1126/...

262 122 9 11

7 months ago

Congratulations to Rayan Chikhi (Institut Pasteur), head of the “Sequence Bioinformatics” unit, for securing the ERC Proof of Concept 2025 for his project ENZYMINER! 👏

@rayan.chiki.bsky.social

#Bioinformatics

60 13 4 2

7 months ago

How to speed up peer review: make applicants mark one another ‘Distributed peer review’ of grants makes process more than twice as fast — and includes some cheat-prevention measures.

Looks good, doesn't it?
www.nature.com/articles/d41...

1 1 0 0

8 months ago

K2R: Tinted de Bruijn graphs implementation for efficient read extraction from sequencing datasets. Abstract. Summary: Biological sequence analysis often relies on reference genomes, but producing accurate assemblies remains a challenge. As a result, de nov

Paper Alert!

Our preprint on the K2R index, which efficiently associates k-mers with the reads containing them, is finally out!

A thread!
academic.oup.com/bioinformati...
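
To make the query concrete, here is a minimal sketch of a k-mer-to-reads lookup. It uses a plain Python dictionary as a stand-in and is not the K2R data structure (which, per the title, relies on tinted de Bruijn graphs); the function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map every k-mer to the set of read IDs (list positions) containing it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def reads_containing(index, kmer):
    """Sorted IDs of the reads that contain kmer (empty list if none)."""
    return sorted(index.get(kmer, ()))


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "TTTTTT"]
    index = build_kmer_index(reads, k=4)
    print(reads_containing(index, "GTAC"))
```

A dictionary like this stores every k-mer explicitly, so its memory use grows quickly with dataset size; it only shows what the query returns, not how to answer it efficiently.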

17 9 1 0

7 months ago

New ENCODE4 long-read RNA-seq transcripts track for hg38 and mm10. Triplets (e.g. [1,1,3]) indicate start site, exon combination, and stop site for each transcript. Enrichment scores show how these change across tissue and cell line samples.

Read more: genome.ucsc.edu/gold...

26 7 0 1

8 months ago

#JOBIM2025 Mathilde Girard ends the session with a simple but effective idea: reorder the reads before using an off-the-shelf compressor to improve compression gain

6 3 0 0

8 months ago

#JOBIM2025 @bdegardins.bsky.social presents his PhD work on Vizitig, a multi-sample graph exploration tool, with a focus on RNA. This afternoon we'll do a demo on pangenomes with the same tool

10 5 1 0

8 months ago

OReO: optimizing read order for practical compression. Abstract. Motivation: Recent advances in high-throughput and third-generation sequencing technologies have created significant challenges in storing and mana

Paper alert!
We present OReO, a tool that reorders long-read datasets so that they compress efficiently with ANY universal compressor such as gz, zstd, xz ...
TLDR: You can get state-of-the-art compression WITHOUT a dedicated compressor/decompressor!
academic.oup.com/bioinformati...
A thread!
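
The general idea of reordering before compressing can be sketched as follows. This is not OReO's algorithm, which the post does not describe: as a stand-in, reads are grouped by their lexicographically smallest k-mer (a simple minimizer), so that reads sharing sequence tend to end up next to each other before being passed to zlib. All names, the value of `k`, and the simulated data are assumptions, and the sketch ignores reverse complements and sequencing errors.

```python
import random
import zlib


def minimizer(read, k=8):
    """Lexicographically smallest k-mer of a read (read must have length >= k)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder(reads, k=8):
    """Stand-in ordering: sort reads by minimizer, ties broken by sequence."""
    return sorted(reads, key=lambda r: (minimizer(r, k), r))


def compressed_size(reads):
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated error-free reads drawn from a random toy genome.
    genome = "".join(rng.choice("ACGT") for _ in range(5000))
    starts = [rng.randrange(len(genome) - 200) for _ in range(2000)]
    reads = [genome[p:p + 200] for p in starts]
    print("original order :", compressed_size(reads))
    print("reordered      :", compressed_size(reorder(reads)))
```

Whether and by how much the reordered file shrinks depends on the data and the compressor; the output is still a plain list of reads, so any standard decompressor restores it.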

23 18 1 1

8 months ago

Preprint alert from the group 🚨 super fast grep-like sequence selection

6 5 0 0