
Paul Medvedev

@pashadag.bsky.social

Algorithmic Bioinformatics Researcher and Teacher. Posts about research results and educational/mentorship topics (for details, see http://bit.ly/380vX22).

1,812 Followers  |  156 Following  |  64 Posts  |  Joined: 07.09.2023  |  1.7622

Latest posts by pashadag.bsky.social on Bluesky

Complex genetic variation in nearly complete human genomes - Nature. Using sequencing and haplotype-resolved assembly of 65 diverse human genomes, complex regions including the major histocompatibility complex and centromeres are analysed.

Two papers in today's issue of @nature.com: 1) we assemble 65 genomes to near completion, including centromeres and the MHC. tinyurl.com/3huhax6w. 2) we sequence 1,019 genomes from the 1kGP with long reads, revealing SVs down to low allele frequencies tinyurl.com/wbx3we9x.

23.07.2025 15:12 - 👍 54    🔁 24    💬 1    📌 2
FAMSA2 enables accurate multiple sequence alignment at protein-universe scale. We introduce FAMSA2, an algorithm that produces high-accuracy multiple protein sequence alignments with unprecedented speed. Across structural, phylogenetic, and functional benchmarks, FAMSA2 matches ...

Interested in a tool that aligns millions of proteins in minutes with quality similar to or better than the state-of-the-art utilities? Please take a look at our FAMSA2 paper: www.biorxiv.org/content/10.1...
and GH repo: github.com/refresh-bio/...

19.07.2025 21:28 - 👍 46    🔁 28    💬 3    📌 0

Sassy: Searching Short DNA Strings in the 2020s https://www.biorxiv.org/content/10.1101/2025.07.22.666207v1

26.07.2025 18:46 - 👍 7    🔁 3    💬 0    📌 0

Congratulations to Rayan Chikhi (Institut Pasteur), head of the “Sequence Bioinformatics” unit, for securing the ERC Proof of Concept 2025 for his project ENZYMINER!

@rayan.chiki.bsky.social

#Bioinformatics

24.07.2025 15:10 - 👍 55    🔁 13    💬 4    📌 2

After Tim Hunt won the Nobel, he said, "We do science because we like discovering things about the world...and then boasting about what we found".

Any one individual can argue about their own motivation, but it would be naive to dispute that it's an accurate description of many people.

11.07.2025 13:44 - 👍 15    🔁 4    💬 2    📌 0
K2R: Tinted de Bruijn graphs implementation for efficient read extraction from sequencing datasets. Summary: Biological sequence analysis often relies on reference genomes, but producing accurate assemblies remains a challenge. As a result, de nov

Paper Alert!

Our preprint on the K2R index, which efficiently associates k-mers with the reads containing them, is finally out!

A thread!
academic.oup.com/bioinformati...

05.07.2025 09:30 - 👍 17    🔁 9    💬 1    📌 0
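
To make the idea in the post above concrete, here is a minimal sketch of associating k-mers with the reads that contain them. It assumes a plain in-memory dictionary as the index; the function names `build_kmer_index` and `reads_containing` are hypothetical, and this is not the tinted de Bruijn graph structure that K2R uses.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map each k-mer to the sorted list of ids of the reads containing it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return {kmer: sorted(ids) for kmer, ids in index.items()}

def reads_containing(index, kmer):
    """Return the ids of reads containing kmer (empty list if absent)."""
    return index.get(kmer, [])

reads = ["ACGTA", "CGTAC", "TTTTT"]
index = build_kmer_index(reads, 3)
print(reads_containing(index, "CGT"))
```

A dictionary like this stores every k-mer explicitly, so its memory use grows quickly on real datasets; it is only meant to show what such an index answers.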
OReO: optimizing read order for practical compression. Motivation: Recent advances in high-throughput and third-generation sequencing technologies have created significant challenges in storing and mana

Paper alert!
We present OReO, a tool that reorders long-read datasets so that they compress efficiently with ANY universal compressor like gz, zstd, xz ...
TLDR: You can get state of the art compression WITHOUT a dedicated compressor/decompressor!
academic.oup.com/bioinformati...
A thread!

03.07.2025 10:52 - 👍 23    🔁 18    💬 1    📌 1
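
As a toy illustration of why read order matters to a universal compressor, the sketch below compares zlib output sizes before and after reordering. The lexicographic sort is only a stand-in chosen for simplicity, since the post does not describe OReO's actual ordering strategy; the dataset is synthetic, and the helper names are hypothetical.

```python
import random
import zlib

def compressed_size(reads):
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))

def reorder(reads):
    """Sort reads and return (sorted_reads, permutation).

    permutation[i] is the original index of sorted_reads[i], so the
    original order can be restored after decompression.
    """
    permutation = sorted(range(len(reads)), key=lambda i: reads[i])
    return [reads[i] for i in permutation], permutation

def restore(sorted_reads, permutation):
    """Undo reorder() using the stored permutation."""
    original = [None] * len(sorted_reads)
    for pos, idx in enumerate(permutation):
        original[idx] = sorted_reads[pos]
    return original

# Synthetic dataset (an assumption for the demo): 200 random reads, each
# present twice, with the two copies about 100 KB apart in the original
# order, which is beyond zlib's 32 KiB match window.
rng = random.Random(0)
unique = ["".join(rng.choice("ACGT") for _ in range(500)) for _ in range(200)]
reads = unique + unique

sorted_reads, perm = reorder(reads)
print(compressed_size(reads), compressed_size(sorted_reads))
```

Sorting places identical reads next to each other, so the compressor can encode the second copy as a back-reference. The permutation has to be stored alongside the compressed data if the original order must be recovered.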

I worked with Thomas during a three-month research visit during his PhD, and it resulted in a paper in NAR. I highly recommend him. doi.org/10.1093/nar/...

02.07.2025 11:48 - 👍 9    🔁 8    💬 0    📌 0
Accelerating k-mer-based sequence filtering. The exponential growth of global sequencing data repositories presents both analytical challenges and opportunities. While k-mer-based indexing has improved scalability over traditional alignment fo...

Preprint alert!
We present K2Rmini, an ultra-fast, grep-like tool that extracts sequences of interest from FASTA/FASTQ files based on their k-mer content.
www.biorxiv.org/content/10.1...
A thread

02.07.2025 12:59 - 👍 37    🔁 19    💬 1    📌 0
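
The kind of filtering described in the post above can be sketched naively as follows. This is an illustration only, not K2Rmini's implementation: the selection rule (keep a record when at least `min_fraction` of its distinct k-mers occur in the patterns) and all function names are assumptions made for the example.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_records(fasta_text, patterns, k, min_fraction):
    """Keep records whose share of k-mers found in the patterns is >= min_fraction."""
    wanted = set()
    for pattern in patterns:
        wanted |= kmers(pattern, k)
    kept = []
    for header, seq in parse_fasta(fasta_text):
        seq_kmers = kmers(seq, k)
        if seq_kmers and len(seq_kmers & wanted) / len(seq_kmers) >= min_fraction:
            kept.append((header, seq))
    return kept

fasta = ">r1\nACGTACGT\n>r2\nTTTTTTTT\n"
print(filter_records(fasta, ["ACGTACG"], k=4, min_fraction=0.5))
```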

🖥️🧬 WABI '25 will have not only excellent keynotes but also an exciting program of papers. The titles and abstracts of all accepted WABI '25 papers are now available on the conference website (wabiconf.github.io/2025/talks/). I'm looking forward to seeing these talks!

25.06.2025 18:40 - 👍 9    🔁 3    💬 1    📌 0

5/n
The code and paper are available:
🔗 paper: tinyurl.com/4svc3xhu
🔗 code: github.com/medvedevgrou...

25.06.2025 13:19 - 👍 0    🔁 0    💬 0    📌 0

4/n

The traditional "repeat-oblivious" estimator can *overestimate mutation rates by an order of magnitude* on repetitive data. In contrast, the new estimator remains accurate across a broad range of rates and repetitive sequences (e.g. RBMY gene, α-satellite centromeres).

25.06.2025 13:19 - 👍 0    🔁 0    💬 1    📌 0

3/n

Capturing the full repeat structure in the estimator is pretty hard and possibly not even needed. Instead, we account for the most pertinent part of the repeat structure in the estimator and the rest of the structure is accounted for in the bias formula.

25.06.2025 13:19 - 👍 0    🔁 0    💬 1    📌 0

2/n

Tools such as Mash estimate the mutation rate via k-mer Jaccard similarity, assuming *non-repetitive* sequences. But in highly repetitive regions (e.g., α-satellite DNA), these estimates break down. We derive a novel estimator by relaxing the non-repetitive assumption.

25.06.2025 13:19 - 👍 0    🔁 0    💬 1    📌 0
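
For readers unfamiliar with the Jaccard-based approach mentioned in post 2/n, here is a minimal sketch of it. It assumes exact k-mer sets rather than MinHash sketches, and the function names are hypothetical. The formula D = -(1/k) ln(2J / (1 + J)) is the Mash distance. This is the repeat-oblivious baseline, not the new estimator from the preprint.

```python
import math

def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mash_distance(j, k):
    """Mash-style mutation rate estimate from a Jaccard similarity j."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)

a = "ACGTACGTGGCATTGCA"
b = "ACGTACGTGGCTTTGCA"
j = jaccard(kmer_set(a, 5), kmer_set(b, 5))
print(mash_distance(j, 5))
```

Note that `kmer_set` discards multiplicity, so a k-mer that occurs many times in a repetitive sequence is counted once.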

🧡1/n
Estimating mutation rates using k-mers is fastβ€”but what happens when repeats dominate the genome?

In a new preprint, Haonan Wu, Antonio Blanca, and myself propose a *repeat-aware* estimator that's accurate even in centromeres.

25.06.2025 13:19 - 👍 29    🔁 14    💬 1    📌 0
GitHub - COMBINE-lab/QCatch: Quality Control downstream of alevin-fry / simpleaf

🚀 We are thrilled to introduce QCatch, a fast, command-line QC reporting tool built for alevin-fry & simpleaf single-cell data! Led by @ygao61.bsky.social & in collaboration with Dongze He 🧬🖥️. The preprint 📖 is available at bit.ly/4neSznl. Read more below: 1/3

23.06.2025 12:27 - 👍 22    🔁 6    💬 1    📌 1

Preprint alert! 🦌
Our new abundance index, REINDEER2, is out!
It's cheap to build and update, offers tunable abundance precision at the k-mer level, and delivers very high query throughput.

Short thread!

www.biorxiv.org/content/10.1...

github.com/Yohan-Hernan...

19.06.2025 09:12 - 👍 22    🔁 13    💬 1    📌 2

Also: what are the bottlenecks in your data processing?
Specifically, I'm looking for reasonably well defined & understood and widely used methods that could use a fresh high-throughput implementation.
Stuff like sketching, maybe assembly, ...

Surely, many pipelines could be sped up 10x ;)

20.06.2025 15:14 - 👍 4    🔁 3    💬 2    📌 0

My thought is that you can't drop it because of that UNLESS you truly have no access to a compute node with more memory

16.06.2025 13:33 - 👍 1    🔁 0    💬 1    📌 0

5/4 This is a draft manuscript and we hope to receive feedback from the community. You can submit a GitHub issue using github.com/medvedevgrou... or email the authors privately

12.06.2025 11:28 - 👍 2    🔁 0    💬 0    📌 0

4/4 We further clarify common misconceptions, e.g. the confusion between uniformity and regularity, the discrepancy between the original SimHash for vectors and the folklore version commonly used for estimating similarities among sets.

12.06.2025 11:26 - 👍 1    🔁 0    💬 1    📌 0

3/4 We propose a categorization of hashing methods based on their properties, design goals, and application context.

12.06.2025 11:26 - 👍 0    🔁 0    💬 1    📌 0

2/4 We provide a comprehensive overview of hash functions used in genomics. Hashing is central to many genomic tasks, but we found no good treatment that describes the wide variety of hash functions employed in these applications.

12.06.2025 11:26 - 👍 0    🔁 0    💬 1    📌 0

1/4 Hash functions in genomic sequence analysis (tinyurl.com/4kk9ccmt) : a new survey written together with Ke Chen, Xiang Li, Qian Shi, and Mingfu Shao. Before submitting it, we are posting it online to get feedback from the community.

12.06.2025 11:26 - 👍 23    🔁 15    💬 1    📌 0

Hi Rob, thanks for clarifying. If I now understand you correctly, the severity of the criticism in your original post was due more to the lack of source code than the choice of license, right?

08.06.2025 11:00 - 👍 0    🔁 0    💬 1    📌 0

Congratulations!

08.06.2025 10:57 - 👍 1    🔁 0    💬 1    📌 0

I was trying to understand the license but couldn't spot the issue. Could you point to the specific problematic language?

05.06.2025 08:44 - 👍 0    🔁 0    💬 1    📌 0

Actually, if you open that "Source code" tarball, it doesn't contain any source code!

04.06.2025 17:39 - 👍 1    🔁 0    💬 1    📌 0

Slides from my talk (with @kamilsjaron.bsky.social) on a history of k-mers in bioinformatics: rayan.chikhi.name/pdf/2025-kme...

03.06.2025 09:25 - 👍 42    🔁 24    💬 1    📌 2

@pashadag is following 20 prominent accounts