Back online. Not sure if it is a bug in my code or a hiccup at the hosting service.
01.10.2025 14:29
@lh3lh3.bsky.social
Associate Professor DFCI & HMS
Did you know that ~60% of human SVs fall in ~1% of GRCh38? See our new preprint: arxiv.org/abs/2509.23057 and the companion blog post on how we started this project and longdust: lh3.github.io/2025/09/29/o.... Work with Alvin Qin.
30.09.2025 02:19
arXiv accepted our assembly review two years ago. It was written in MS Word, so it was PDF-only. As I remember, they didn't require TeX source at that time. Something might have changed internally.
29.09.2025 01:26
And learn what fully AI-generated websites look like. Avoid them, as they are more likely to be scams.
15.09.2025 12:18
Heads up: ignore samtools dot org, similarly minimap2 dot com and likely others. It's owned by a known phishing site and while the binaries they offer look valid currently (but note they may be serving us different binaries to others), that could change.
I.e., it's not us (the Samtools team)! Be warned.
New blog post: A quick look at Roche's SBX
lh3.github.io/2025/09/11/a...
Now preprinted at arxiv.org/abs/2509.07357
10.09.2025 02:10
minimap2.com is potentially a phishing site. Please don't use anything from that website.
github.com/lh3/minimap2...
Preprint out for myloasm, our new nanopore / HiFi metagenome assembler!
Nanopore's getting accurate, but
1. Can this lead to better metagenome assemblies?
2. How, algorithmically, to leverage them?
with co-author Max Marin @mgmarin.bsky.social, supervised by Heng Li @lh3lh3.bsky.social
1 / N
High-resolution metagenome assembly for modern long reads with myloasm https://www.biorxiv.org/content/10.1101/2025.09.05.674543v1
07.09.2025 04:47
Of course, thanks also to Andrea Guarracino and Andrew Carroll for their quick and careful review!
04.09.2025 17:09
"Received: July 4, 2025. Revised: August 7, 2025. Accepted: August 15, 2025" and published on September 4. This is a simple and straightforward paper, but the speedy editorial process is still impressive. It could have been even faster if I had responded to the initial editorial request more promptly.
04.09.2025 16:55
Now published in GigaScience with minor improvements: academic.oup.com/gigascience/...
* Download: zenodo.org/records/1490...
* More info: github.com/lh3/panmask
(Harvard STAT115): Introduction to Bioinformatics and Computational Biology by Shirley Liu.
liulab-dfci.github.io/bioinfo-com...
Timing and practical needs are key factors. BCF also has a proper spec and an okay library and it is based on bgzf. The library is more complex to use because VCF is more complex.
07.08.2025 13:04
CRAM also ticks some of the points, but the storage cost alone wins users over.
07.08.2025 13:01
BAM is more of a literal dump. BCF is not used often because 1) VCF is a small fraction of SAM, so performance is not as critical. 2) Tabix is good enough. 3) It is too complex to implement, as VCF was not designed with binary in mind. 4) It came too late. GATK started to support it in 2019. 5) The binary version changes.
07.08.2025 12:49
I think genomic file formats should be text-first and designed with binary representations in mind.
06.08.2025 22:23
In 2020, NCBI considered removing base quality from free-to-download SRA files. I responded and wrote two blog posts to argue against that. I don't know how much weight NCBI gave to everyone's responses, but they are keeping quality in most SRA files nowadays. grants.nih.gov/grants/guide...
31.07.2025 20:57
Longdust, a new tool to identify highly repetitive STRs, VNTRs, satellite DNA and other low-complexity regions (LCRs). Similar to SDUST but for long regions.
github.com/lh3/longdust
Preprint on "Finding easy regions for short-read variant calling from pangenome data": arxiv.org/abs/2507.03718
08.07.2025 02:14
Pangolin only supports SNPs and doesn't distinguish donor from acceptor, but mutating cTaAt has little effect there, too. I don't have AlphaGenome numbers. Overall, this comes back to @ewanbirney.bsky.social's question: Is BP essential to splicing? Do these models really see BP?
29.06.2025 15:32
chr3:143021319:cTaAt matches the yTnAy BP consensus at the 1-based coordinate. The Broad's SpliceAI server doesn't think chr3:143021320:TaA>CaG would lead to an acceptor loss, i.e., it's non-essential to splicing. SpliceAI thinks the donor might be more affected.
29.06.2025 15:32
Thanks for the explanation! This is a good example then. It will be interesting to see if Pangolin or SpliceAI can capture these with ISM. Another question is how often BPs are found across all acceptor sites.
26.06.2025 23:42
IMHO, it is important to understand what a model learns; otherwise, the model might just capture nuances irrelevant to biology. 4/
26.06.2025 02:54
Generally, my intuition is that large DNNs draw power from compositional differences between introns and exons. For example, if we see a transition from 40% GC to 60% GC, a GT in the middle is more likely to be a real donor. I don't know how much signal DNNs learn just from splice motifs. 3/3
26.06.2025 02:51
Fig 3b deletes 4bp right after a donor GT on the minus strand. This reduces the minisplice log-odds from 9 to -1. It is also a trivial case. I am more interested in cases where a SNP >20-50bp away from the splice site may affect splicing, with biological evidence. 2/3
26.06.2025 02:51
They speculate it is a branch point at one particular acceptor in Fig 3d. I am not sure if that is the real one. Similarly, the exon motif seems to be a guess as well. Also, Fig 3c changes the G in GT. This of course kills the splice site. 1/3
26.06.2025 02:51
Preprint on "Improving spliced alignment by modeling splice sites with deep learning". It describes minisplice for modeling splice signals. Minimap2 and miniprot now optionally use the predicted scores to improve spliced alignment.
arxiv.org/abs/2506.12986
Neng Huang developed longcallR for joint SNP calling and phasing from long RNA-seq reads, AND for identifying allele-specific splicing/junctions (ASJ). Although ASJs of statistical significance are rare, a large fraction involve unannotated junctions. In Rust!
30.05.2025 14:54