Michael Hall

@mbhall88.bsky.social

Bioinformatics geek 🤓 crafting Rust-y tools 🦀 for microbial genomes 🦠 🧬. Trying to master Dad mode 👨‍🍼 See what I'm up to here: https://github.com/mbhall88

188 Followers  |  285 Following  |  31 Posts  |  Joined: 20.11.2024

Latest posts by mbhall88.bsky.social on Bluesky

Does anyone else think they are seeing post-acceptance editorial changes at proof stage which are error-prone and probably due to adoption of AI?

02.12.2025 10:29 | 👍 4    🔁 2    💬 4    📌 1
GitHub - mbhall88/nohuman: Remove human reads from a sequencing run. Contribute to mbhall88/nohuman development by creating an account on GitHub.

So nohuman now ships an unmasked HPRC.r2 DB by default, with optional dataset selection.

If you’ve used nohuman before, I highly recommend updating to v0.5.0 and re-downloading the new DB.

Repo: github.com/mbhall88/nohuman
Keep your metagenomes clean 🧹🧬

20.11.2025 06:50 | 👍 1    🔁 1    💬 0    📌 0

At the same time, I realised the Human Pangenome Reference Consortium had made a second release of genomes.
So I rebuilt release 1 without masking, and added a release 2 database with no masking. The improvement in detection accuracy was substantial:

20.11.2025 06:50 | 👍 0    🔁 0    💬 1    📌 0

🚨 Update to nohuman 🚨

While testing against the standard Kraken DB, I noticed Kraken was detecting far more human reads than nohuman. I realised Kraken masks low-complexity regions by default during DB construction and that setting had been left on in nohuman, leading to missing human reads.

20.11.2025 06:50 | 👍 2    🔁 1    💬 1    📌 0
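
A toy illustration of why masking can hide reads, following the update post above. This is only a sketch under loud assumptions: the homopolymer masker is a hypothetical stand-in for a real low-complexity masker, exact k-mer lookup stands in for Kraken's actual classification, and the sequences are made up.

```python
import re


def mask_homopolymers(seq: str, min_run: int = 6) -> str:
    """Replace every homopolymer run of at least `min_run` bases with Ns.

    Hypothetical stand-in for a real low-complexity masker.
    """
    pattern = r"(.)\1{" + str(min_run - 1) + r",}"
    return re.sub(pattern, lambda m: "N" * len(m.group()), seq)


def build_db(reference: str, k: int) -> set[str]:
    """Collect the reference k-mers, skipping any that contain a masked base."""
    all_kmers = (reference[i:i + k] for i in range(len(reference) - k + 1))
    return {kmer for kmer in all_kmers if "N" not in kmer}


def hit_fraction(read: str, db: set[str], k: int) -> float:
    """Fraction of the read's k-mers that are present in the database."""
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    return sum(kmer in db for kmer in read_kmers) / len(read_kmers)


K = 5
reference = "ACGTTGC" + "A" * 12 + "TGCATCGG"  # made-up sequence with a poly-A stretch
read = "A" * 10  # a read drawn from the low-complexity stretch

unmasked_db = build_db(reference, K)
masked_db = build_db(mask_homopolymers(reference), K)

print(hit_fraction(read, unmasked_db, K))  # 1.0: every k-mer of the read is found
print(hit_fraction(read, masked_db, K))    # 0.0: the masked DB never saw these k-mers
```

In this toy setup, a read that falls entirely inside a masked region has nothing to match against, which is the kind of miss the post describes.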

The stars indicate the p-value level (the description is in the figure caption in the paper)

07.11.2025 19:43 | 👍 1    🔁 0    💬 0    📌 0

True.
Thanks for the great questions and discussion

07.11.2025 11:07 | 👍 1    🔁 0    💬 1    📌 0

Correct. Yeah I guess mash on a random subset should perform similarly. Haven’t looked at that though.

07.11.2025 11:05 | 👍 1    🔁 0    💬 0    📌 0

It’s a decent sample size at 3000, but I guess more would always be better. I wanted to use RefSeq genomes that have long-read data, to be as sure as possible about the true size.
There are likely inherent biases though, based on error rates in the reads, for the k-mer-based methods.

07.11.2025 11:04 | 👍 0    🔁 0    💬 1    📌 0

- Overlaps are pairwise alignments with minimap2 (FFI)
- Thanks!
- See the other thread where I have answered this

07.11.2025 11:01 | 👍 2    🔁 0    💬 0    📌 0

I just used mash v2.3. The supplement has an exploration of the best parameters to use for mash to estimate genome size. Mash was the fastest tool though.

07.11.2025 10:58 | 👍 0    🔁 0    💬 0    📌 0
GitHub - mbhall88/cud: Color Universal Design colourblind-friendly python matplotlib palette

Thanks for appreciating the plots. I obsessed a lot over them. I created a repo for the colour palette too, if you’re interested in that: github.com/mbhall88/cud

07.11.2025 10:56 | 👍 1    🔁 0    💬 1    📌 0

The bars are pairwise statistical comparisons. I only show the significant ones so as not to clutter the plot.

07.11.2025 10:52 | 👍 1    🔁 0    💬 1    📌 0

And lastly, a HUGE thank you to @lachlanjmc.bsky.social for a lot of the methodological heavy lifting when we were coming up with the idea

07.11.2025 03:21 | 👍 0    🔁 0    💬 0    📌 0
GitHub - mbhall88/lrge: Genome size estimation from long read overlaps. Contribute to mbhall88/lrge development by creating an account on GitHub.

Try LRGE here: github.com/mbhall88/lrge
(installable from wherever you get your podcasts 😉)

07.11.2025 03:18 | 👍 3    🔁 0    💬 1    📌 0

You might remember the preprint from late last year... Reviews/Publication were delayed while I was on parental leave. We extended validation to include H. sapiens, which led to smarter handling of contained overlaps in repetitive genomes. Big shout-out to Chenxi Zhou for leading that part

07.11.2025 03:18 | 👍 0    🔁 0    💬 1    📌 0

However, the computational resource usage (runtime/memory) of LRGE was MUCH better than assembling

07.11.2025 03:18 | 👍 0    🔁 0    💬 1    📌 0

We benchmarked >3,000 bacterial genomes and found that LRGE (our method) achieves significantly better accuracy than k-mer-based methods like Mash and GenomeScope and performs on par with full genome assembly (Raven)

07.11.2025 03:18 | 👍 0    🔁 0    💬 2    📌 0
Genome size estimation from long read overlaps. Abstract. Motivation: Accurate genome size estimation is an important component of genomic analyses such as assembly and coverage calculation, though existing...

Our method for genome size estimation from long-read overlaps is now published 🥳
academic.oup.com/bioinformati...

07.11.2025 03:18 | 👍 37    🔁 16    💬 1    📌 1
AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data. Abstract. Background: Most viral genome sequences generated during the latest pandemic have presented new challenges for computational analysis. Analyzing mi...

New from @dgpratas.bsky.social et al. for analyzing multiple sequences in multi-FASTA format using alignment-free methodologies. Scalable to millions of sequences for pandemic research and more

AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data doi.org/10.1093/giga...

12.12.2024 10:28 | 👍 4    🔁 4    💬 0    📌 0
How the Web of Science takes a step back. The Web of Science, a major commercial indexing service of scientific journals operated by Clarivate, recently decided to remove eLife from its Science Citation Index Expanded (SCIE). eLife will on...

“Clarivate’s decision rewards journals for continuing the unhelpful practice of keeping peer review information hidden and unintentionally presenting incomplete and inadequate studies as sound science and punishes those journals that are more transparent.” 👏🙌

www.coalition-s.org/blog/how-the...

03.12.2024 09:49 | 👍 2    🔁 0    💬 0    📌 0

The DOI URL doesn't seem to be working for the preprint currently. You can find it here: www.biorxiv.org/content/10.1...

03.12.2024 04:02 | 👍 0    🔁 0    💬 0    📌 0
GitHub - mbhall88/lrge: Genome size estimation from long read overlaps. Contribute to mbhall88/lrge development by creating an account on GitHub.

8/ Try it out!
LRGE is open-source and ready to integrate into your workflows as a Rust library or CLI application. Whether you’re on a high-performance cluster or a basic laptop, LRGE delivers fast and reliable genome size estimates. Get it here: github.com/mbhall88/lrge

03.12.2024 01:38 | 👍 3    🔁 0    💬 0    📌 0

7/ We validated LRGE on 3370 long read bacterial datasets which have associated high-quality RefSeq assemblies 🦠. We also confirmed it generalises to eukaryote organisms 🪰🌱🍞

03.12.2024 01:38 | 👍 0    🔁 0    💬 1    📌 0

6/ And it’s efficient! ⚡
LRGE uses significantly less CPU and memory than traditional approaches, making it ideal for both high-performance clusters and resource-limited environments.

03.12.2024 01:38 | 👍 0    🔁 0    💬 1    📌 0

5/ LRGE vs. the competition 🔥
LRGE delivers estimates as reliable as assembly-based methods and better than k-mer-based approaches.
Relative error (y-axis) measures the proportional difference between the estimated and true genome size.

03.12.2024 01:38 | 👍 0    🔁 0    💬 1    📌 0
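
The relative error mentioned in post 5/ can be written out as a one-line function. A minimal sketch: the post does not say whether the figure plots the signed or the absolute value, so the signed form is assumed here, and the function name is made up for illustration.

```python
def relative_error(estimate: float, truth: float) -> float:
    """Signed proportional difference between an estimated and a true genome size.

    Positive values are overestimates, negative values are underestimates.
    Assumption: signed form; take abs() of the result for the unsigned version.
    """
    if truth <= 0:
        raise ValueError("true genome size must be positive")
    return (estimate - truth) / truth


# A 5.5 Mbp estimate for a 5 Mbp genome is a 10% overestimate.
print(relative_error(5_500_000, 5_000_000))
```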

4/ LRGE also provides a confidence interval for the estimated genome size, offering users an expected range of variation.

03.12.2024 01:38 | 👍 0    🔁 0    💬 1    📌 0

3/ Why choose LRGE?
* Outperforms traditional k-mer-based tools in accuracy and resource usage.
* Comparable in accuracy to quick assembly tools (like Raven) but much faster and with lower memory requirements.
* Built in Rust, with zero external dependencies. 💻

03.12.2024 01:38 | 👍 0    🔁 0    💬 1    📌 0

2/ How does it work?
The basic idea is that if we knew the genome size, we could calculate the expected number of overlaps between each read and all other reads. We invert this relationship to estimate the genome size from the observed number of overlaps for each read.

03.12.2024 01:38 | 👍 0    🔁 0    💬 1    📌 0
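
The inversion described in post 2/ can be sketched under a deliberately simplified model. This is not the formula from the paper or the LRGE code. The assumption is that reads land uniformly at random on a circular genome of size G, so two reads of lengths a and b overlap by at least t bases with probability of roughly (a + b - 2t) / G. Summing over the other n - 1 reads gives the expected overlap count for a read, and solving for G gives a per-read estimate. The function names, the `min_overlap` parameter, and the use of the median to combine per-read estimates are all choices made for this illustration.

```python
import math
from statistics import median


def per_read_estimates(read_lengths, overlap_counts, min_overlap=0):
    """Per-read genome size estimates under the simplified uniform-placement model.

    Expected overlaps for read i: (n - 1) * (len_i + mean_other_len - 2t) / G.
    Solving for G with the observed count gives the estimate for that read.
    """
    n = len(read_lengths)
    if n < 2 or n != len(overlap_counts):
        raise ValueError("need at least two reads and one overlap count per read")
    total_length = sum(read_lengths)
    estimates = []
    for length, count in zip(read_lengths, overlap_counts):
        mean_other = (total_length - length) / (n - 1)
        if count == 0:
            # No observed overlaps puts no upper bound on G in this model.
            estimates.append(math.inf)
        else:
            estimates.append((n - 1) * (length + mean_other - 2 * min_overlap) / count)
    return estimates


def estimate_genome_size(read_lengths, overlap_counts, min_overlap=0):
    """Combine the per-read estimates with the median (an illustrative choice)."""
    return median(per_read_estimates(read_lengths, overlap_counts, min_overlap))


# 101 reads of 10 kbp, each overlapping 2 other reads, is what a 1 Mbp genome
# would be expected to produce under this model: 100 * 20_000 / 1_000_000 = 2.
lengths = [10_000] * 101
counts = [2] * 101
print(estimate_genome_size(lengths, counts))  # 1000000.0
```

Using the median keeps a handful of reads with unusually many or zero overlaps from dominating the combined estimate.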

1/ Accurate genome size estimation is crucial for genomics, yet many tools are optimised for short reads, leaving long-read datasets underserved. Enter LRGE: a lightweight, fast, and highly efficient tool specifically designed for long-read sequencing technologies.

03.12.2024 01:38 | 👍 0    🔁 0    💬 1    📌 0

🌟 Excited to share my latest preprint with @lachlanjmc.bsky.social on @biorxivpreprint.bsky.social: "Genome size estimation from long read overlaps"! 🚀

Check it out here: doi.org/10.1101/2024...
And find the code here: github.com/mbhall88/lrge

🧵👇

03.12.2024 01:38 | 👍 29    🔁 14    💬 2    📌 1
