Jouni Sirén jltsiren - Bluesky Statics

Claude Code, standard seat in a team plan, and the default model (probably Sonnet 4.6).

19.02.2026 22:14 — 👍 0 🔁 0 💬 1 📌 0

I probably could have made it work by breaking the task into smaller subtasks. But at that point, it was faster and easier to implement it myself. 7/7

19.02.2026 21:38 — 👍 0 🔁 0 💬 0 📌 0

I tried to make Claude implement multithreading, but it failed repeatedly due to exceeding token limits in the response. Every time Claude thought it was ready to start writing code, it discovered yet another issue it had to deal with. 6/

19.02.2026 21:38 — 👍 0 🔁 0 💬 2 📌 0

With some additional prompting, Claude fixed the bug and created comprehensive tests. Then I had a Rust tool for sorting GAF files, except that Claude had not implemented multithreaded sorting. 5/

19.02.2026 21:38 — 👍 0 🔁 0 💬 1 📌 0

It was probably due to the differences between C++ and Rust move semantics. In C++, you can move temporary file objects from a vector. Claude tried to do the same in Rust with drain(), without updating the indexes into the vector. 4/

19.02.2026 21:38 — 👍 0 🔁 0 💬 1 📌 0
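A minimal sketch of this class of bug, not the actual conversion: `TempFile`, `take_with_drain`, and `take_in_place` are hypothetical names. `Vec::drain` removes the elements and shifts the rest left, so indexes computed before the call no longer point at the same files. Taking the value out of an `Option` slot leaves the vector layout intact, which is closer to moving from an element of a C++ vector.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a temporary file holding sorted records.
#[derive(Debug, PartialEq)]
struct TempFile {
    id: usize,
}

/// Moves the files in `range` out with `drain()`. Everything after the range
/// shifts left, so any index computed before this call is now stale.
fn take_with_drain(files: &mut Vec<TempFile>, range: Range<usize>) -> Vec<TempFile> {
    files.drain(range).collect()
}

/// Moves the files out slot by slot, leaving `None` behind. The vector keeps
/// its length, so indexes computed earlier still refer to the same slots.
fn take_in_place(files: &mut [Option<TempFile>], range: Range<usize>) -> Vec<TempFile> {
    files[range].iter_mut().filter_map(|slot| slot.take()).collect()
}
```

With six files, draining `0..2` leaves file 4 at index 2. A merge loop that then uses the precomputed range `2..4` skips files 2 and 3, which is one way a round of merges could lose about half of the records.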

Claude did the initial conversion quickly. There was an amusing bug: every round of merges would lose ~50% of the records. 3/

19.02.2026 21:38 — 👍 0 🔁 0 💬 1 📌 0

GitHub - jltsiren/gbz-base: Prototype for an immutable pangenome graph in SQLite

It's something I need to do eventually anyway, as I want to make GAF-base construction independent of vg. 2/

19.02.2026 21:38 — 👍 0 🔁 0 💬 1 📌 0

Inspired by this, I decided to see whether Claude could convert my GAF sorting code from C++ to Rust. It's basically external memory multi-way mergesort with integer keys, opaque values, special handling for header lines, and compressed temporary files. 1/

19.02.2026 21:38 — 👍 4 🔁 0 💬 1 📌 0
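For readers unfamiliar with the merge step, here is a minimal sketch of a multi-way merge with integer keys and opaque values. It is not the code from the thread: `Record` and `merge_runs` are hypothetical names, the sorted runs are in-memory vectors standing in for sorted temporary files, and header handling and compression are left out.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A record with an integer sort key and an opaque value.
type Record = (u64, Vec<u8>);

/// Merges sorted runs into one sorted output. Each run stands in for a sorted
/// temporary file; a real external-memory sort would stream them from disk.
fn merge_runs(runs: Vec<Vec<Record>>) -> Vec<Record> {
    let mut sources: Vec<_> = runs.into_iter().map(|run| run.into_iter()).collect();

    // Min-heap over the current head of each run. Ties on the key are broken
    // by the run index, which keeps the merge stable.
    let mut heap = BinaryHeap::new();
    for (run, source) in sources.iter_mut().enumerate() {
        if let Some((key, value)) = source.next() {
            heap.push(Reverse((key, run, value)));
        }
    }

    let mut merged = Vec::new();
    while let Some(Reverse((key, run, value))) = heap.pop() {
        merged.push((key, value));
        // Refill the heap from the run that just produced a record.
        if let Some((key, value)) = sources[run].next() {
            heap.push(Reverse((key, run, value)));
        }
    }
    merged
}
```

The heap holds at most one record per run, so memory use depends on the number of runs rather than on the total number of records.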

This will explode to >100 GiB in memory, as Giraffe needs direct access to uncompressed node sequences and their reverse complements.

10.02.2026 22:50 — 👍 0 🔁 0 💬 0 📌 0

By replacing bit packing with zstd for the sequences, GBZ v2 reduces the size of the full graphs significantly (e.g. from 27 GiB to 8 GiB for the CHM13-based release 2.1 graph).

10.02.2026 22:50 — 👍 1 🔁 0 💬 1 📌 0

Release vg 1.72.0 - Littlefoot · vgteam/vg Don't forget to mark the static binary executable: chmod +x vg Docker Image: quay.io/vgteam/vg:v1.72.0 Buildable Source Tarball: vg-v1.72.0.tar.gz Includes source for vg and all submodules. Use th...

Our latest vg release introduces GBZ v2 with better compression for sequences. I originally assumed that the total sequence length in a pangenome graph would be similar to the size of the genome. This does not hold in the full HPRC graphs due to unaligned centromeres.

10.02.2026 22:50 — 👍 7 🔁 3 💬 1 📌 0

Guarracino Lab | Pangenome Research We develop methods to build and analyze pangenomes, with applications in cancer and complex disease. Translational Genomics Research Institute, Phoenix, AZ.

Looking for a postdoc to build my new lab at TGen (Phoenix, AZ) focused on pangenome methods for cancer and complex disease. Full stack — from pangenome assembly and compression to association studies and somatic variant discovery. Reach out if interested! guarracinolab.github.io#join

06.02.2026 16:02 — 👍 11 🔁 9 💬 0 📌 1

HLi Lab - Vacancies Openings

I am looking for a postdoc to develop high-performance algorithms in computational genomics. Email or DM me if interested. For more information, see hlilab.github.io/vacancies. RTs appreciated!

14.01.2026 15:44 — 👍 43 🔁 64 💬 1 📌 0

Compact PAT trees

This should be the original: uwspace.uwaterloo.ca/items/ae3b39...

There is also a paper (invited talk?) called "Tables" by Munro that cites a manuscript in preparation by Clark and Munro, but I've not been able to determine if the manuscript was ever published.

06.01.2026 15:44 — 👍 2 🔁 0 💬 1 📌 0

Sampled Movi 2 is quite similar, but with a flat array.

21.12.2025 07:55 — 👍 1 🔁 0 💬 1 📌 0

Ropebwt3 can afford to store the ranks up to the start of a block more directly, because it assumes a small alphabet. But there is still some nontrivial overhead in decompressing the run-length encoded block up to the query position.

21.12.2025 07:55 — 👍 1 🔁 0 💬 1 📌 0
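A generic sketch of the idea, not Ropebwt3's actual layout: ranks are sampled at the start of each block, and a query scans the run-length encoded block up to the query position. `SIGMA`, `Block`, `build`, and `rank` are hypothetical names, and the alphabet size is an assumption.

```rust
/// Alphabet size; a small fixed alphabet is an assumption of this sketch.
const SIGMA: usize = 6;

/// A block of run-length encoded symbols with ranks sampled at its start.
struct Block {
    /// Position of the first symbol of the block in the full sequence.
    start: usize,
    /// `ranks[c]` is the number of occurrences of symbol `c` before the block.
    ranks: [usize; SIGMA],
    /// Runs stored as (symbol, length) pairs.
    runs: Vec<(u8, usize)>,
}

/// Groups runs into blocks of `runs_per_block` runs (must be nonzero).
fn build(runs: &[(u8, usize)], runs_per_block: usize) -> Vec<Block> {
    let mut blocks = Vec::new();
    let mut ranks = [0usize; SIGMA];
    let mut start = 0;
    for chunk in runs.chunks(runs_per_block) {
        blocks.push(Block { start, ranks, runs: chunk.to_vec() });
        for &(symbol, len) in chunk {
            ranks[symbol as usize] += len;
            start += len;
        }
    }
    blocks
}

/// Counts the occurrences of `symbol` in positions `0..pos`.
fn rank(blocks: &[Block], symbol: u8, pos: usize) -> usize {
    // Find the last block that starts at or before `pos`.
    let idx = blocks.partition_point(|block| block.start <= pos);
    if idx == 0 {
        return 0;
    }
    let block = &blocks[idx - 1];
    let mut result = block.ranks[symbol as usize];
    let mut offset = block.start;
    // This scan is the per-query overhead: decode runs up to `pos`.
    for &(run_symbol, len) in &block.runs {
        if offset >= pos {
            break;
        }
        if run_symbol == symbol {
            result += len.min(pos - offset);
        }
        offset += len;
    }
    result
}
```

Storing one counter per symbol in every block is only affordable when the alphabet is small, which matches the trade-off described in the post.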

Ropebwt3 is faster than the usual r-index implementations. They assume a byte alphabet and do rank+select in a sparse bitvector, rank in a wavelet tree, and select in another sparse bitvector for each rank operation in the BWT.

21.12.2025 07:55 — 👍 1 🔁 0 💬 1 📌 0

abseil / Performance Hints An open-source collection of core C++ library code

First time seeing this and it is really great! abseil.io/fast/hints.h...

20.12.2025 03:04 — 👍 27 🔁 5 💬 0 📌 0

Exact matches are the easy cases. Either you have a single hit with a high mapping quality, or you take a random hit with a low mapping quality. The hard case is when you get multiple overlapping seeds for a read, with many hits each, and you need to choose the hits you try to align.

11.11.2025 23:58 — 👍 1 🔁 0 💬 0 📌 0

My impression is that the throughput of a fast read aligner is usually ~1 Mbp / CPU-second. Most of the time is spent on reads with many potential mappings, as the aligner needs to explore them until it's confident it has found the best one and can estimate the mapping quality.

11.11.2025 23:11 — 👍 1 🔁 0 💬 1 📌 0

GitHub - jltsiren/pggname: Pangenome graph naming based on hashing in a canonical order

One intended use for the header lines is specifying which graphs can be used as a reference for the alignments. This will use stable graph names based on hashing a canonical GFA representation of the graph. The idea is similar to refget, but for graphs instead of sequences.

31.10.2025 03:27 — 👍 0 🔁 0 💬 0 📌 0

VG will soon start adding headers to the GAF files it generates. The specifics are still uncertain, but if you maintain a GAF parser, it may be a good idea to skip lines starting with "@". Here is a draft specification for the vg flavor of GAF.

31.10.2025 03:27 — 👍 1 🔁 1 💬 1 📌 0
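A minimal sketch of what skipping header lines could look like in a parser. `split_gaf_lines` is a hypothetical helper, and the only assumption taken from the post is that header lines start with "@"; the header contents are not interpreted.

```rust
/// Splits GAF text into header lines (starting with '@') and alignment lines.
/// Empty lines are dropped. Header contents are not parsed.
fn split_gaf_lines(text: &str) -> (Vec<&str>, Vec<&str>) {
    text.lines()
        .filter(|line| !line.is_empty())
        .partition(|line| line.starts_with('@'))
}
```

A parser that does not care about headers can ignore the first vector and process only the second.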

GitHub - mohsenzakeri/Movi: Fast, Cache-Efficient, and Scalable Queries on Pangenomes

1/6 Movi 2 is here: faster and more space-efficient for pangenome queries. Its fastest mode uses half the memory of Movi 1 while running ~30% faster. github.com/mohsenzakeri...

21.10.2025 20:00 — 👍 44 🔁 24 💬 1 📌 2

For the weekend crowd. I'm hiring a postdoc! If you're interested in algorithms, data structures and high-dimensional inference, and if you want to invent new methods for genomics and implement them in high-performance, robust and easy-to-use software, do I have a lab for you; ours!

11.10.2025 13:09 — 👍 15 🔁 5 💬 0 📌 0

🦒Long read giraffe is out!🦒
Mapping long reads to pangenome graphs is ~10x faster than with GraphAligner, with veeery slightly better mapping accuracy, short variant calling, and SV genotyping than GraphAligner or Minimap2

02.10.2025 06:28 — 👍 43 🔁 22 💬 1 📌 0

DSB 2026 Venice - February 18-19 Workshop Data Structures in Bioinformatics

We are glad to announce that the next workshop “Data Structures in Bioinformatics” (DSB 2026) will take place in Venice, Italy, on *February 18-19*, 2026. dsb-meeting.github.io/DSB2026/ Book the dates! #DSB26

01.09.2025 18:10 — 👍 14 🔁 8 💬 1 📌 0

So maybe we need some kind of stable identifiers (hashes?) for pangenome graphs. And then we need a way of storing graph / parent identifiers in GFA and alignment files. 7/7

28.08.2025 00:49 — 👍 0 🔁 0 💬 0 📌 0

We also need a way of specifying the correct reference for reconstructing the reads. That's not as easy with graphs as with linear sequences. For example, if you have aligned the reads to a subgraph (e.g. personalized graph), the supergraph (e.g. clipped graph) is also a valid reference. 6/n

28.08.2025 00:49 — 👍 0 🔁 0 💬 1 📌 0

While working on GAF-base, I realized that GAF is not the file format I want to use. GAF prioritizes numerical statistics, while the information needed for reconstructing the read and the alignment is optional. In archival and variant calling, it should be the opposite. 5/n

28.08.2025 00:49 — 👍 0 🔁 0 💬 1 📌 0

When used with GBZ-base, GAF-base allows extracting all reads overlapping with / contained in the subgraph. Queries with 10 kbp subgraphs are effectively instantaneous with short reads, while taking a second or two with long reads. 4/n

28.08.2025 00:49 — 👍 0 🔁 0 💬 1 📌 0

Posts by Jouni Sirén (@jltsiren.bsky.social)