Does anyone else think they are seeing post-acceptance editorial changes at proof stage which are error-prone and probably due to adoption of AI?
02.12.2025 10:29
@mbhall88.bsky.social
Bioinformatics geek crafting Rust-y tools 🦀 for microbial genomes 🦠🧬. Trying to master Dad mode 👨‍🍼 See what I'm up to here: https://github.com/mbhall88
So nohuman now ships an unmasked HPRC.r2 DB by default, with optional dataset selection.
If you've used nohuman before, I highly recommend updating to v0.5.0 and re-downloading the new DB.
Repo: github.com/mbhall88/nohuman
Keep your metagenomes clean 🧹🧬
At the same time, I realised the Human Pangenome Reference Consortium had made a second release of genomes.
So I rebuilt release 1 without masking, and added a release 2 database with no masking. The improvement in detection accuracy was substantial:
🚨 Update to nohuman 🚨
While testing against the standard Kraken DB, I noticed Kraken was detecting far more human reads than nohuman. I realised Kraken masks low-complexity regions by default during DB construction and that setting had been left on in nohuman, leading to missing human reads.
The stars indicate the p-value level (described in the figure caption in the paper)
07.11.2025 19:43
True.
Thanks for the great questions and discussion
Correct. Yeah, I guess mash on a random subset should perform similarly. Haven't looked at that though.
07.11.2025 11:05
It's a decent sample size at 3000. But I guess more would always be better. I wanted to use RefSeq genomes which have long-read data to be as sure as possible about the true size
There are likely inherent biases, though, for the k-mer-based methods, based on error rates in the reads
- Overlaps are pairwise alignments with minimap2 (FFI)
- Thanks!
- See other thread where I have answered this
I just used mash v2.3. The supplement has an exploration of the best parameters to use for mash to estimate genome size. Mash was the fastest tool though.
07.11.2025 10:58
Thanks for appreciating the plots. I obsessed a lot over them. I created a repo for the colour palette too if you're interested in that: github.com/mbhall88/cud
07.11.2025 10:56
The bars are pairwise statistical comparisons. I only show the significant ones so as not to clutter the plot
07.11.2025 10:52
And lastly, a HUGE thank you to @lachlanjmc.bsky.social for a lot of the methodological heavy lifting when we were coming up with the idea
07.11.2025 03:21
Try LRGE here: github.com/mbhall88/lrge
(installable from wherever you get your podcasts)
You might remember the preprint from late last year... Reviews/publication were delayed while I was on parental leave. We extended validation to include H. sapiens, which led to smarter handling of contained overlaps in repetitive genomes. Big shout-out to Chenxi Zhou for leading that part
07.11.2025 03:18
However, the computational resource usage (runtime/memory) of LRGE was MUCH better than assembling
07.11.2025 03:18
We benchmarked >3,000 bacterial genomes and found that LRGE (our method) achieves significantly better accuracy than k-mer-based methods like Mash and GenomeScope and performs on par with full genome assembly (Raven)
07.11.2025 03:18
Our method for genome size estimation from long-read overlaps is now published 🥳
academic.oup.com/bioinformati...
New from @dgpratas.bsky.social et al. for analyzing multiple sequences in multi-FASTA format using alignment-free methodologies. Scalable to millions of sequences for pandemic research and more
AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data doi.org/10.1093/giga...
"Clarivate's decision rewards journals for continuing the unhelpful practice of keeping peer review information hidden and unintentionally presenting incomplete and inadequate studies as sound science and punishes those journals that are more transparent."
www.coalition-s.org/blog/how-the...
The DOI URL doesn't seem to be working for the preprint currently. You can find it here: www.biorxiv.org/content/10.1...
03.12.2024 04:02
8/ Try it out!
LRGE is open-source and ready to integrate into your workflows as a Rust library or CLI application. Whether you're on a high-performance cluster or a basic laptop, LRGE delivers fast and reliable genome size estimates. Get it here: github.com/mbhall88/lrge
7/ We validated LRGE on 3370 long-read bacterial datasets which have associated high-quality RefSeq assemblies 🦠. We also confirmed it generalises to eukaryotic organisms 🪰
03.12.2024 01:38
6/ And it's efficient! ⚡
LRGE uses significantly less CPU and memory than traditional approaches, making it ideal for both high-performance clusters and resource-limited environments.
5/ LRGE vs. the competition
LRGE delivers estimates as reliable as assembly-based methods and better than k-mer-based approaches.
Relative error (y-axis) measures the proportional difference between the estimated and true genome size.
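As a small illustration of the metric described in 5/, relative error can be computed as below. This is a sketch under assumptions: the function name `relative_error` is hypothetical, and whether the paper plots a signed, absolute, or percentage value is not stated here, so the signed fraction is used.

```python
def relative_error(estimate: float, truth: float) -> float:
    """Signed proportional difference between an estimate and the true size.

    Assumption: signed fraction (negative means underestimate). The paper's
    plots may use an absolute or percentage form instead.
    """
    if truth <= 0:
        raise ValueError("true genome size must be positive")
    return (estimate - truth) / truth


# A 5.5 Mbp estimate for a 5.0 Mbp genome is a 10% overestimate.
print(relative_error(5_500_000, 5_000_000))
```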
4/ LRGE also provides a confidence interval for the estimated genome size, offering users an expected range of variation.
03.12.2024 01:38
3/ Why choose LRGE?
* Outperforms traditional k-mer-based tools in accuracy and resource usage.
* Comparable in accuracy to quick assembly tools (like Raven) but much faster and with lower memory requirements.
* Built in Rust, with zero external dependencies.
2/ How does it work?
The basic idea is that if we knew the genome size, we could calculate the expected number of overlaps between each read and all other reads. We invert this relationship to estimate the genome size from the observed number of overlaps for each read.
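The inversion in 2/ can be sketched in a few lines of Python. This is a simplified illustration, not LRGE's actual implementation. It assumes reads start uniformly at random along the genome, so that read i overlaps read j with probability roughly (l_i + l_j - 2o) / G, where G is the genome size and o a minimum overlap length. The function name, the `min_overlap` parameter, and the use of the median over per-read estimates are all assumptions made for this example.

```python
from statistics import median


def estimate_genome_size(read_lens, overlap_counts, min_overlap=0):
    """Toy genome size estimate from per-read overlap counts.

    Assumed model (not taken from the LRGE source): the expected number of
    overlaps for read i is sum over j != i of (l_i + l_j - 2*o) / G.
    Solving for G with the observed count k_i gives one estimate per read.
    """
    n = len(read_lens)
    if n != len(overlap_counts):
        raise ValueError("need one overlap count per read")
    total_len = sum(read_lens)
    estimates = []
    for length, count in zip(read_lens, overlap_counts):
        if count == 0:
            # A read with no overlaps implies an unbounded estimate; skip it.
            continue
        # Sum over all other reads j of (l_i + l_j - 2*o).
        window = (n - 1) * (length - 2 * min_overlap) + (total_len - length)
        estimates.append(window / count)
    if not estimates:
        raise ValueError("no read has any overlaps")
    # Median across reads, to limit the influence of outlier reads.
    return median(estimates)


# 101 reads of 1,000 bp, each overlapping 20 others, gives a 10,000 bp estimate.
print(estimate_genome_size([1000] * 101, [20] * 101))
```

Taking the median rather than the mean is a design choice for this sketch: reads from repeats collect many spurious overlaps and would pull a mean downwards.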
1/ Accurate genome size estimation is crucial for genomics, yet many tools are optimised for short reads, leaving long-read datasets underserved. Enter LRGE: a lightweight, fast, and highly efficient tool specifically designed for long-read sequencing technologies.
03.12.2024 01:38
Excited to share my latest preprint with @lachlanjmc.bsky.social on @biorxivpreprint.bsky.social: "Genome size estimation from long read overlaps"!
Check it out here: doi.org/10.1101/2024...
And find the code here: github.com/mbhall88/lrge
🧵