Bede Constantinides

@bedec.bsky.social

Interested in infectious disease informatics. Research fellow at the University of Birmingham. Also cycling, photography, active travel. https://bede.im

900 Followers  |  1,385 Following  |  114 Posts  |  Joined: 24.09.2023  |  2.3154

Latest posts by bedec.bsky.social on Bluesky

Burrows-Wheeler Indexing - YouTube Videos on: (a) the Burrows-Wheeler Transform (BWT), (b) the FM Index, which uses the BWT to construct a full-text index, (c) Wheeler graphs, (d) r-index, an...

I've added 7 videos to my Burrows-Wheeler indexing playlist (www.youtube.com/playlist?lis...), rounding out the r-index series and adding a 5-part series on the move structure. Now 27 videos in that playlist. I aim to add videos on prefix-free parsing, PBWT, Wheeler languages/automata in the future.

07.10.2025 14:17 | 👍 39    🔁 12    💬 1    📌 1

Yes I got the upload pings and almost posted the same here. A treasure trove of material.

07.10.2025 18:21 | 👍 2    🔁 0    💬 0    📌 0

No harm in sticking with an old version :-) Results should be identical to 0.10.0 except for seqs containing IUPAC ambiguous bases, changing the classification of exactly one read in my IUPAC benchmark dataset.

07.10.2025 17:13 | 👍 2    🔁 0    💬 1    📌 0
Release 0.11.0 · bede/deacon Major release incorporating new features, fixes and performance optimisations. Includes many PRs from @RagnarGrootKoerkamp, taking advantage of new features in simd-minimizers, packed-seq and parase...

Deacon 0.11.0:
- Local server mode
- Ultra-careful handling of non-ACGT
- Faster indexing & index loading
- Denser index now stores k-mers not hashes
- xxHash & FxHash replaced with rapidhash::fast
- Bug fixes

Thanks @curiouscoding.nl (and others!) for contributions
github.com/bede/deacon/...

07.10.2025 17:00 | 👍 6    🔁 1    💬 1    📌 0
Do you happen to have a pointer to a good open source dataset to look at? Naivel... | Hacker News

I had a quick look but was intimidated by the docs honestly. Nick from the Zstd team mentioned BAMs wrt a forthcoming post on use cases, so hopefully we'll soon have some diverse profiles to learn from. news.ycombinator.com/item?id=4549...

07.10.2025 15:13 | 👍 2    🔁 0    💬 0    📌 0

Nick Terrell (Meta): "Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting."

07.10.2025 07:50 | 👍 1    🔁 0    💬 1    📌 0
Quick Start - OpenZL

It doesn't compress e.g. FASTA out of the box currently. The bundled profiles are geared towards sensible fixed-length formats like Parquet. I imagine we'll see new profiles emerge quickly openzl.org/getting-star...

06.10.2025 22:01 | 👍 1    🔁 0    💬 0    📌 0
It was really hard to resist spilling the beans about OpenZL on this recent HN p... | Hacker News

"It was really hard to resist spilling the beans about OpenZL on this recent HN post about compressing genomic sequence data"
news.ycombinator.com/item?id=4549...

06.10.2025 21:12 | 👍 3    🔁 0    💬 2    📌 0

"OpenZL is our answer to the tension between the performance of format-specific compressors and the maintenance simplicity of a single executable binary."
engineering.fb.com/2025/10/06/d...

06.10.2025 20:58 | 👍 13    🔁 5    💬 2    📌 0

Oh it's a pity if splitting into crates is punished with much worse compile times. Is this the case?

06.10.2025 10:50 | 👍 1    🔁 0    💬 1    📌 0

Fun question! Doubt it would noticeably change the plot though. Gzip's 32KB window is a tiny fraction of genome length for anything larger than a virus. Would compress low complexity regions well but miss e.g. gene duplications.

06.10.2025 10:46 | 👍 3    🔁 0    💬 0    📌 0
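
The window argument in the post above can be checked with a small sketch. It uses Python's `zlib` (DEFLATE, the algorithm behind gzip) on synthetic random sequence, which is an assumption standing in for real genomes; the block sizes are illustrative only.

```python
import random
import zlib


def deflate_size(data: bytes) -> int:
    """Size in bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


random.seed(0)
# Synthetic stand-ins for genomic sequence (not real data).
block = "".join(random.choice("ACGT") for _ in range(20_000)).encode()
spacer = "".join(random.choice("ACGT") for _ in range(100_000)).encode()

# Duplicate 20 KB apart: the first copy is still inside the 32 KB window.
near = block + block + spacer
# Duplicate 120 KB apart: the first copy has left the window.
far = block + spacer + block

near_size = deflate_size(near)
far_size = deflate_size(far)
print(near_size, far_size)
```

Both inputs have the same length and content; only the distance between the two copies differs, so any size gap comes from whether the earlier copy is still reachable.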

I can't bring myself to bet against HashSet given recent experience

02.10.2025 12:51 | 👍 1    🔁 0    💬 1    📌 0
GitHub - RagnarGrootKoerkamp/simd-sketch: Compute bottom-s sketches and s-buckets sketches, using simd-minimizers crate.

Looking for people to test the latest version of simd-sketch.

It's now 2x as fast at sketching, and supports skipping over kmers containing N and other ambiguous bases (which is only ~35% slower).

'cargo install simd-sketch' is right there under your fingertips ;)

github.com/RagnarGrootK...

01.10.2025 14:38 | 👍 12    🔁 4    💬 2    📌 0
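
To illustrate what "skipping over kmers containing N and other ambiguous bases" means, here is a minimal scalar sketch. It is not how simd-sketch does it (that crate is SIMD Rust); the function name `valid_kmers` is hypothetical.

```python
from typing import Iterator, Tuple


def valid_kmers(seq: str, k: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, kmer) for every k-mer made only of A, C, G and T."""
    seq = seq.upper()
    last_bad = -1  # index of the most recent non-ACGT base seen so far
    for i, base in enumerate(seq):
        if base not in "ACGT":
            last_bad = i
        start = i - k + 1
        # The window [start, i] is clean if it begins after the last bad base.
        if start > last_bad:
            yield start, seq[start : i + 1]


kmers = list(valid_kmers("ACGNACGT", 3))
print(kmers)
```

Tracking only the position of the last ambiguous base keeps the check at constant cost per base instead of rescanning each window.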
Plot of (inverse) throughput of querying an FxHashSet<u32> of increasing size. 3 lines show the throughput when 1%, 50%, or 99% of queries are present in the set. The 1% and 50% lines show big spikes just before every power of 2, where they are up to 3x slower than in the best case.

FxHashSet::<u32>::contains throughput is wild!

- Up to 4x slowdown for negative queries due to probing.
- Positive queries are fast for small tables, but slow in RAM because they need 2 cache misses.

Lots of variance depending on the load factor, ie whether n is close to 87.5% of a power of 2.

28.09.2025 23:19 | 👍 2    🔁 2    💬 0    📌 0
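
The load-factor remark above can be made concrete with a toy model. This sketch assumes a table whose bucket count is a power of two and which doubles once it would exceed 7/8 (87.5%) full. That is a simplification and not a description of the actual Rust implementation; `modelled_load_factor` is a hypothetical name.

```python
def modelled_load_factor(n: int) -> float:
    """Load factor after inserting n keys into a table that doubles at 7/8 full.

    Simplified model (assumption): bucket counts are powers of two and the
    table grows as soon as n would exceed 7/8 of the buckets.
    """
    buckets = 1
    while buckets * 7 // 8 < n:
        buckets *= 2
    return n / buckets


for n in (14_000_000, 14_680_064, 14_680_065):
    print(n, round(modelled_load_factor(n), 3))
```

In this model, n = 14,680,064 (7/8 of 2^24) sits at exactly 0.875, and one more key roughly halves the load factor, which is the kind of sawtooth that would make probing cost vary with n.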

Pleased to see this pre-printed, highlighting the completeness/accuracy of @nanoporetech.com long-read genome assembly for clinical Enterobacterales: www.biorxiv.org/content/10.1...

Thanks to colleagues @modmedmicro.bsky.social, @ukhsa.bsky.social, @genewiz.bsky.social and @oxfordbrc.bsky.social!

25.09.2025 08:48 | 👍 10    🔁 12    💬 1    📌 1

Terrific new feature presented by @theo.io on @pathoplexus.org called SeqSets, for generating DOIs for sequence subsets used in publications. These can then be tracked via CrossRef, allowing data generators to track impact! #IMMEMXiV

19.09.2025 11:10 | 👍 45    🔁 18    💬 2    📌 0

Thanks! Had assumed there'd be a good reason…

15.09.2025 17:18 | 👍 0    🔁 0    💬 0    📌 0

Oh I didn't know 2bit supported Ns through extra data blocks. I wonder how easily it can be parsed in parallel compared to Binseq – perhaps @noamteyssier.bsky.social can comment?

15.09.2025 17:10 | 👍 0    🔁 0    💬 1    📌 0

Some applications do need that 5th symbol to represent ambiguity / N, which I assume is why BAM uses 4bit encoding (IIRC). 2bit + bit mask(s) is another way to do it.

15.09.2025 16:37 | 👍 1    🔁 0    💬 2    📌 0
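
A minimal sketch of the "2bit + bit mask" idea from the post above: two bits per base, plus one mask bit per position marking ambiguity. The layout and the names `encode` and `decode` are assumptions for illustration and do not follow any specific file format; all ambiguous symbols are collapsed to N on decode.

```python
from typing import Tuple

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq: str) -> Tuple[int, int, int]:
    """Pack seq into (2-bit codes, ambiguity mask, length) as Python ints."""
    packed = 0
    mask = 0
    for i, base in enumerate(seq.upper()):
        code = CODES.get(base)
        if code is None:
            mask |= 1 << i  # flag position i as ambiguous
            code = 0        # placeholder bits, ignored on decode
        packed |= code << (2 * i)
    return packed, mask, len(seq)


def decode(packed: int, mask: int, length: int) -> str:
    """Rebuild the sequence, writing N wherever the mask bit is set."""
    out = []
    for i in range(length):
        if (mask >> i) & 1:
            out.append("N")
        else:
            out.append(BASES[(packed >> (2 * i)) & 3])
    return "".join(out)


packed, mask, n = encode("ACGTNNAC")
print(decode(packed, mask, n))
```

This costs three bits per base as written; real formats typically store the mask far more compactly (for example as runs), since ambiguous bases tend to be rare and clustered.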

Predictably the comments section is mostly horror at the state of FASTA format

15.09.2025 15:00 | 👍 11    🔁 0    💬 2    📌 0

Front page of Hacker News 🫨
bsky.app/profile/bede...

15.09.2025 13:11 | 👍 26    🔁 3    💬 4    📌 0

Thanks for sharing – another related tip is to increase the window size for even higher CR bsky.app/profile/bede...

14.09.2025 12:57 | 👍 2    🔁 0    💬 0    📌 0

Blogged about how zstd --long fills the gap between fast and slow-but-high-ratio genome compression methods log.bede.im/2025/09/12/z...

12.09.2025 15:07 | 👍 17    🔁 9    💬 0    📌 3

Congratulations both!

11.09.2025 10:11 | 👍 2    🔁 0    💬 1    📌 0
Anti-Islamic US biker gang members run security at deadly Gaza aid sites BBC identifies members of Infidels MC gang hired as armed security at US and Israel-backed aid sites.

First rate BBC journalism that somehow didn't make the cut for #r4today www.bbc.co.uk/news/article...
@andyverity.bsky.social

10.09.2025 14:20 | 👍 0    🔁 0    💬 0    📌 0
Efficient sequence alignment against millions of prokaryotic genomes with LexicMap - Nature Biotechnology LexicMap uses a fixed set of probes to efficiently query gene sequences for fast and low-memory alignment.

Sometimes you meet absolutely incredible bioinfo-magicians.
It was a huge privilege when @shenwei356.bsky.social
joined our group for a year on an @embl.org sabbatical.
While here, he developed a new way of aligning to
millions of bacteria, called LexicMap 1/n
www.nature.com/articles/s41...

10.09.2025 09:12 | 👍 189    🔁 98    💬 5    📌 4

Thanks, I particularly like your reordering-based methods requiring no specialist tools for decompression, which presumably achieve much higher CR. But yes, for a built-in option with small perf overhead, zstd --long seems v effective!

09.09.2025 12:27 | 👍 0    🔁 0    💬 0    📌 0

In this case compressing with --long was only ~20% slower than default Zstandard while tripling the compression ratio.

09.09.2025 11:19 | 👍 1    🔁 0    💬 1    📌 0

Zstandard's --long range mode works wonders for assemblies, but needs uninterrupted single line sequences.

*AllTheBacteria 661k, multiline fasta*
gzip (pigz): 751GB
zstandard --long: 641GB (30% original size)

*Single line fasta*
gzip (pigz): 700GB
zstandard --long: 232GB (10% original size)

09.09.2025 10:27 | 👍 36    🔁 12    💬 2    📌 3
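
Since the post above says long range mode needs uninterrupted single-line sequences, here is a minimal sketch of unwrapping a multiline FASTA before compression. The function name `unwrap_fasta` is hypothetical, and this is not the tooling used for the numbers above.

```python
import io
from typing import Iterable, Iterator, List


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line."""
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if chunks:  # flush the previous record's sequence
                yield "".join(chunks) + "\n"
                chunks = []
            yield line + "\n"
        elif line:      # sequence line; blank lines are dropped
            chunks.append(line)
    if chunks:
        yield "".join(chunks) + "\n"


wrapped = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nGG\nTT\n")
single_line = "".join(unwrap_fasta(wrapped))
print(single_line, end="")
```

The unwrapped output could then be piped to `zstd --long`. Removing the newline every few dozen bases is presumably what lets long repeated stretches line up as exact matches.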
USDA withdraws a plan to limit salmonella levels in raw poultry The Agriculture Department says it is withdrawing a plan to limit salmonella bacteria in poultry products.

"The Agriculture Department will not require poultry companies to limit salmonella bacteria in their products, halting a Biden Administration effort to prevent food poisoning from contaminated meat."

www.seattletimes.com/business/usd...

25.04.2025 01:32 | 👍 3835    🔁 2288    💬 487    📌 1555
