Burrows-Wheeler Indexing - YouTube
Videos on : (a) the Burrows-Wheeler Transform (BWT), (b) the FM Index, which uses the BWT to construct a full-text index, (c) Wheeler graphs, (d) r-index, an...
I've added 7 videos to my Burrows-Wheeler indexing playlist (www.youtube.com/playlist?lis...), rounding out the r-index series and adding a 5-part series on the move structure. Now 27 videos in that playlist. I aim to add videos on prefix-free parsing, PBWT, Wheeler languages/automata in the future.
07.10.2025 14:17
Yes I got the upload pings and almost posted the same here. A treasure trove of material.
07.10.2025 18:21
No harm in sticking with an old version :-) Results should be identical to 0.10.0 except for seqs containing IUPAC ambiguous bases, changing the classification of exactly one read in my IUPAC benchmark dataset.
07.10.2025 17:13
Release 0.11.0 · bede/deacon
Major release incorporating new features, fixes and performance optimisations. Includes many PRs from @RagnarGrootKoerkamp, taking advantage of new features in simd-minimizers, packed-seq and parase...
Deacon 0.11.0:
- Local server mode
- Ultra-careful handling of non-ACGT
- Faster indexing & index loading
- Denser index now stores k-mers not hashes
- xxHash & FxHash replaced with rapidhash::fast
- Bug fixes
Thanks @curiouscoding.nl (and others!) for contributions
github.com/bede/deacon/...
07.10.2025 17:00
Do you happen to have a pointer to a good open source dataset to look at? Naivel... | Hacker News
I had a quick look but was intimidated by the docs honestly. Nick from the Zstd team mentioned BAMs wrt a forthcoming post on use cases, so hopefully we'll soon have some diverse profiles to learn from. news.ycombinator.com/item?id=4549...
07.10.2025 15:13
Nick Terrell (Meta): "Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting."
07.10.2025 07:50
Quick Start - OpenZL
It doesn't compress e.g. FASTA out of the box currently. The bundled profiles are geared towards sensible fixed-length formats like Parquet. I imagine we'll see new profiles emerge quickly openzl.org/getting-star...
06.10.2025 22:01
It was really hard to resist spilling the beans about OpenZL on this recent HN p... | Hacker News
"It was really hard to resist spilling the beans about OpenZL on this recent HN post about compressing genomic sequence data"
news.ycombinator.com/item?id=4549...
06.10.2025 21:12
"OpenZL is our answer to the tension between the performance of format-specific compressors and the maintenance simplicity of a single executable binary."
engineering.fb.com/2025/10/06/d...
06.10.2025 20:58
Oh it's a pity if splitting into crates is punished with much worse compile times. Is this the case?
06.10.2025 10:50
Fun question! Doubt it would noticeably change the plot though. Gzip's 32KB window is a tiny fraction of genome length for anything larger than a virus. Would compress low complexity regions well but miss e.g. gene duplications.
06.10.2025 10:46
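A minimal sketch of the window effect described above, assuming Python's standard `zlib` module (DEFLATE with its default 32 KB window) as a stand-in for gzip and seeded random DNA as a stand-in for real sequence. The helper names `random_dna` and `duplicate_cost` are hypothetical. It compares how cheaply a repeated block compresses when the second copy starts inside the window versus well outside it.

```python
import random
import zlib

def random_dna(n, seed=0):
    """Return n pseudo-random bases as ASCII bytes (illustrative input only)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n)).encode("ascii")

def duplicate_cost(block_len):
    """Compressed size of block+block divided by compressed size of block alone."""
    block = random_dna(block_len)
    single = len(zlib.compress(block, 9))
    doubled = len(zlib.compress(block + block, 9))
    return doubled / single

# Second copy starts 10 kB back: within the 32 KB window, so it can be matched.
near_ratio = duplicate_cost(10_000)
# Second copy starts 100 kB back: outside the window, so it is compressed afresh.
far_ratio = duplicate_cost(100_000)
print(f"10 kB repeat: {near_ratio:.2f}x, 100 kB repeat: {far_ratio:.2f}x")
```

A ratio near 1 means the duplicate was almost free; a ratio near 2 means the compressor could not see the earlier copy at all.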
I can't bring myself to bet against HashSet given recent experience
02.10.2025 12:51
GitHub - RagnarGrootKoerkamp/simd-sketch: Compute bottom-s sketches and s-buckets sketches, using simd-minimizers crate.
Looking for people to test the latest version of simd-sketch.
It's now 2x as fast at sketching, and supports skipping over kmers containing N and other ambiguous bases (which is only ~35% slower).
'cargo install simd-sketch' is right there under your fingertips ;)
github.com/RagnarGrootK...
01.10.2025 14:38
Plot of (inverse) throughput of querying an FxHashSet<u32> of increasing size. 3 lines show the throughput when 1%, 50%, or 99% of queries is present in the set. The 1% and 50% lines show big spikes just before every power of 2, where they are up to 3x slower than in the best case.
FxHashSet::<u32>::contains throughput is wild!
- Up to 4x slowdown for negative queries due to probing.
- Positive queries are fast for small tables, but slow in RAM because they need 2 cache misses.
Lots of variance depending on the load factor, i.e. whether n is close to 87.5% of a power of 2.
28.09.2025 23:19
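A rough sketch of how the load factor swings with n, under the assumption (not confirmed in the post) that the table keeps a power-of-two bucket count and doubles once occupancy would exceed 7/8, which is the 87.5% figure above. The function names are hypothetical and a real hash table implementation may size itself differently.

```python
def assumed_buckets(n, max_load=7 / 8):
    """Smallest power-of-two bucket count that holds n items at <= max_load (assumed policy)."""
    buckets = 1
    while n > buckets * max_load:
        buckets *= 2
    return buckets

def load_factor(n):
    """Fraction of buckets occupied for n items under the assumed sizing policy."""
    return n / assumed_buckets(n)

# Just below and just above a resize threshold the load factor differs by almost 2x.
for n in (14, 15, 28, 29):
    print(n, assumed_buckets(n), round(load_factor(n), 3))
```

Under this model the table is fullest, and probe sequences for absent keys longest, just before each doubling, which would be consistent with the spikes described in the plot.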
Pleased to see this pre-printed, highlighting the completeness/accuracy of @nanoporetech.com long-read genome assembly for clinical Enterobacterales: www.biorxiv.org/content/10.1...
Thanks to colleagues @modmedmicro.bsky.social, @ukhsa.bsky.social, @genewiz.bsky.social and @oxfordbrc.bsky.social!
25.09.2025 08:48
Terrific new feature presented by @theo.io on @pathoplexus.org called SeqSets for generating DOIs for sequence subsets used in publications, which can then be tracked via CrossRef, allowing data generators to track impact! #IMMEMXiV
19.09.2025 11:10
Thanks! Had assumed there'd be a good reason…
15.09.2025 17:18
Oh I didn't know 2bit supported Ns through extra data blocks. I wonder how easily it can be parsed in parallel compared to Binseq. Perhaps @noamteyssier.bsky.social can comment?
15.09.2025 17:10
Some applications do need that 5th symbol to represent ambiguity / N, which I assume is why BAM uses 4bit encoding (IIRC). 2bit + bit mask(s) is another way to do it.
15.09.2025 16:37
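A minimal sketch of the "2bit + bit mask" idea mentioned above, not the layout of any particular file format. It assumes one Python integer holding 2 bits per base and a second integer holding 1 mask bit per position; the names `encode` and `decode` are hypothetical. Any non-ACGT symbol is flagged in the mask and comes back as N, so other IUPAC codes are not preserved.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode(seq):
    """Pack a sequence into (packed 2-bit codes, ambiguity mask, length)."""
    packed = 0
    mask = 0
    for i, base in enumerate(seq.upper()):
        if base in CODE:
            packed |= CODE[base] << (2 * i)
        else:
            # Ambiguous position: the 2-bit slot stays 00 and the mask bit marks it.
            mask |= 1 << i
    return packed, mask, len(seq)

def decode(packed, mask, length):
    """Rebuild the sequence, emitting N wherever the mask bit is set."""
    out = []
    for i in range(length):
        if (mask >> i) & 1:
            out.append("N")
        else:
            out.append(BASES[(packed >> (2 * i)) & 3])
    return "".join(out)

packed, mask, length = encode("ACGTNNACGT")
print(decode(packed, mask, length))
```

The mask costs one extra bit per base here, so 3 bits in total against 4 for a 4-bit encoding; a sparse list of N runs would be cheaper still when ambiguity is rare.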
Predictably the comments section is mostly horror at the state of FASTA format
15.09.2025 15:00
Front page of Hacker News
bsky.app/profile/bede...
15.09.2025 13:11
Thanks for sharing. Another related tip is to increase the window size for even higher CR bsky.app/profile/bede...
14.09.2025 12:57
Blogged about how zstd --long fills the gap between fast and slow-but-high-ratio genome compression methods log.bede.im/2025/09/12/z...
12.09.2025 15:07
Congratulations both!
11.09.2025 10:11
Efficient sequence alignment against millions of prokaryotic genomes with LexicMap - Nature Biotechnology
LexicMap uses a fixed set of probes to efficiently query gene sequences for fast and low-memory alignment.
Sometimes you meet absolutely incredible bioinfo-magicians.
It was a huge privilege when @shenwei356.bsky.social
joined our group for a year on an @embl.org sabbatical.
While here, he developed a new way of aligning to
millions of bacteria, called LexicMap 1/n
www.nature.com/articles/s41...
10.09.2025 09:12
Thanks, I particularly like your reordering-based methods requiring no specialist tools for decompression, which presumably achieve much higher CR. But yes, for a built-in option with small perf overhead, zstd --long seems v effective!
09.09.2025 12:27
In this case compressing with --long was only ~20% slower than default Zstandard while tripling the compression ratio.
09.09.2025 11:19
Zstandard's --long range mode works wonders for assemblies, but needs uninterrupted single line sequences.
*AllTheBacteria 661k, multiline fasta*
gzip (pigz): 751GB
zstandard --long: 641GB (30% original size)
*Single line fasta*
gzip (pigz): 700GB
zstandard --long: 232GB (10% original size)
09.09.2025 10:27
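Since the gain above depends on uninterrupted single-line sequences, here is a small sketch of unwrapping a multiline FASTA before compression. It is one possible approach and not necessarily how the figures above were produced; the function name `unwrap_fasta` and the example records are hypothetical.

```python
import io

def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto a single line."""
    seq_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:
            seq_parts.append(line)
    # Flush the final record.
    if seq_parts:
        yield "".join(seq_parts)

example = io.StringIO(">contig1\nACGT\nACGT\n>contig2\nTTTT\nGG\n")
single_line = list(unwrap_fasta(example))
print("\n".join(single_line))
```

The unwrapped output can then be written to a file or piped into `zstd --long`, so that long repeats are not broken up by newlines at fixed intervals.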
USDA withdraws a plan to limit salmonella levels in raw poultry
The Agriculture Department says it is withdrawing a plan to limit salmonella bacteria in poultry products.
"The Agriculture Department will not require poultry companies to limit salmonella bacteria in their products, halting a Biden Administration effort to prevent food poisoning from contaminated meat."
www.seattletimes.com/business/usd...
25.04.2025 01:32
Democratize CUDA optimizations @NVIDIA
CUDA Core Compute Libraries (CCCL)
Former lead of the Sparse Linear Algebra team | Opinions are my own.
scientist at UC Berkeley inventing advanced genomic technologies
lover of molecules, user of computers
https://scholar.google.com/citations?user=63ZRebIAAAAJ&hl=en
PhD student studying protein trafficking in the early secretory pathway. Big fan of LC-MS/MS, nanopore sequencing, structural biology & #nanobodies
https://fabianackle.ch/
PhD student at the NHGRI genomeinformatics.github.io & JHU Schatzlab
A diverse and collaborative community on the cutting edge of computing and technology within hopkinsengineer.bsky.social at the Johns Hopkins University.
cs.jhu.edu • Baltimore, MD
Evolutionary & conservation genomics, eDNA, parasites, disease ecology, marine mammal conservation. Seal nerd, co-chair IUCN Pinniped specialist group. University of Leeds, UK. https://www.goodmanlab.org
Evolutionary genomics of bacteria
A website for exploring the output of compilers. aka godbolt.org
Supports C, C++, Rust, Fortran, COBOL and many many more.
Support us at https://patreon.com/mattgodbolt
Looking for a PhD position!
Former Research Engineer @csh.ac.at
MSc in AI from @jku.at
Into AI, ALife, Biology, AI4Science , and more
T1D
post-doctoral fellow studying sub-viral RNAs
bioinformatics phd candidate @ ucsc
https://github.com/cademirch
SYSTEMS HACKERS SOLVE THE BEAR MENACE
virus evolution / phylogeny / epidemiology
Taiwanese. Previous @duke-nus @UoGlasgow. Current @twCDC.
https://yaotli.github.io
Zebra, Lisp programmer. Haskell and
Math lover. Bioinformatics and Computational Biology
github.com/carht
Assistant professor, WashU Pathology & Immunology
Author of FGSEA and other packages for omics data analysis
https://scholar.google.com/citations?user=fcH0gPgAAAAJ
https://github.com/assaron