Sassy2: Batch Searching of Short DNA Patterns https://www.biorxiv.org/content/10.64898/2026.03.10.710811v1
After a long review process, I'm excited that our paper is finally in print: www.cell.com/cell/fulltex...
TL;DR: We use CRISPR screens in iPSC-derived neurons to find a new tau E3 ligase and a relationship between oxidative stress, the proteasome, and tau proteolytic fragments.
More below.
Arc bioinformatics scientists @noamteyssier.bsky.social
and Alex Dobin have just released cyto, an ultra-high throughput processor specifically optimized for
@10xgenomics.bsky.social Flex single-cell data.
We are excited to make this resource open source: www.biorxiv.org/content/10.6...
cyto is free, open-source, and production-ready. Built in Rust for reliability at scale.
Currently supports 10x Flex GEX and CRISPR, with more modalities coming.
Try it out and let us know how it works for you!
github.com/ArcInstitute...
cyto is the first large-scale bioinformatics project built with BINSEQ. Switching to BINSEQ can achieve mapping rates of 50M reads per second and reduce your storage requirements by about 40%.
github.com/ArcInstitute...
We also show that we can reproduce the results of CellRanger at a fraction of the resource cost. Our concordance is above 99.85%, as measured via Spearman correlation on matched cell UMIs, and our lower-dimensional representations show perfect overlap with no method-specific clustering.
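For anyone who wants to run a similar sanity check on their own data, here is a minimal sketch of a Spearman comparison on matched cells. The barcodes and UMI counts are made up for illustration, and this is not the evaluation code from the paper; it only shows the idea of restricting to shared barcodes and correlating ranks.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-cell UMI totals from two pipelines (illustrative values only).
tool_a = {"AAAC": 1200, "AAAG": 850, "AACT": 430, "AAGT": 2210}
tool_b = {"AAAC": 1198, "AAAG": 851, "AACT": 431, "AAGT": 2205, "ACGT": 12}

# Only compare barcodes that both pipelines called as cells.
matched = sorted(set(tool_a) & set(tool_b))
rho = spearman([tool_a[b] for b in matched], [tool_b[b] for b in matched])
```

Because Spearman only looks at ranks, small differences in absolute UMI counts do not lower the score as long as the cell ordering is preserved.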
cyto was built from the ground up to be modular and to expose the individual modules to the user. Each step is highly optimized and can be run independently, which is perfect for production-scale workflows because it allows for better parallelization and resource allocation on smaller nodes.
Currently, the only tool that supports this data type is CellRanger. We show that cyto runs an order of magnitude faster (16x), uses less than half the memory, dramatically reduces CPU-hours (30x), and reduces total IO by more than 5x.
Today I'm happy to release cyto, a tool I've developed at @arcinstitute.org to dramatically increase our computational throughput with 10x-flex single-cell processing by more than 16X!
I've tried this at least 3 times haha. Honestly, I think the best way to do it is not really to port it but to drastically rework the way that it's written.
Maybe this will be the year we start to really question the foundational infrastructure of the field.
Obligatory BINSEQ mention here - keep a lookout the next couple weeks for an update!
github.com/arcInstitute...
It is the year 2026 - bioinformaticians are still trying to figure out the best way to handle fastq
To be one with the borrow checker one must first be willing to let go
We can make as many pipes as we have threads, each with a fixed record range and with a specified segment (R1 / R2). Then we can connect to each pipe on a reader and treat it as a normal FASTX file.
What's great about this is we can process *either* sequentially or in parallel *without* deadlocks!
This was a fun engineering problem but ultimately was not very difficult because of the way BINSEQ is designed in the first place!
Named pipes can be a headache because they require coordination between readers and writers, but because BINSEQ is random access the implementation is straightforward.
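To make the idea concrete, here is a rough sketch of the pattern in Python. It is not the bqtools implementation: the in-memory `RECORDS` list is a hypothetical stand-in for a random-access file, and the function names are made up. Each pipe gets a fixed record range and its own writer thread, and the reader treats every pipe as an ordinary FASTQ file. It needs a Unix-like system for `os.mkfifo`.

```python
import os
import tempfile
import threading

# Hypothetical in-memory stand-in for a random-access record store.
RECORDS = [(f"read{i}", "ACGT" * 3, "I" * 12) for i in range(10)]


def write_range(path, start, end):
    """Writer: stream records [start, end) into one named pipe as FASTQ."""
    # Opening a FIFO for writing blocks until a reader opens the same pipe.
    with open(path, "w") as pipe:
        for name, seq, qual in RECORDS[start:end]:
            pipe.write(f"@{name}\n{seq}\n+\n{qual}\n")


def make_pipes(workdir, n_pipes):
    """Create one FIFO per fixed record range and start its writer thread."""
    chunk = -(-len(RECORDS) // n_pipes)  # ceiling division
    paths, writers = [], []
    for i in range(n_pipes):
        path = os.path.join(workdir, f"part{i}.fq")
        os.mkfifo(path)
        start, end = i * chunk, min((i + 1) * chunk, len(RECORDS))
        thread = threading.Thread(target=write_range, args=(path, start, end))
        thread.start()
        paths.append(path)
        writers.append(thread)
    return paths, writers


def count_records(path):
    """Reader: treat the pipe like a normal FASTQ file (4 lines per record)."""
    with open(path) as fastq:
        return sum(1 for _ in fastq) // 4


with tempfile.TemporaryDirectory() as workdir:
    paths, writers = make_pipes(workdir, n_pipes=3)
    # Read the pipes one after another; one reader thread per pipe also works.
    counts = [count_records(p) for p in paths]
    for thread in writers:
        thread.join()
    total = sum(counts)
```

Since every writer knows its record range up front and never waits on another writer, the reader can drain the pipes in any order, one at a time or all at once, without the writers blocking each other.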
New feature in bqtools v0.4.14 that I'm stoked on!
One of the limiting factors to adopting BINSEQ is that it's new and not widely supported by existing tools.
`bqtools pipe` addresses this by transparently creating FASTX named pipes, which can be processed normally by existing tools.
If you ever need to fuzzy search some DNA, sassy is your tool.
Please spread the word; I think many people just outside my own circle could benefit from this :)
cc @rickbitloo.bsky.social
github.com/RagnarGrootK...
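For readers new to the term, "fuzzy search" here means finding places where a pattern matches the text within some number of edits. The sketch below uses the classic dynamic-programming approach for approximate matching; it is only meant to illustrate the problem, is far slower than a dedicated tool, and is not how sassy is implemented. The function name `fuzzy_find` is made up.

```python
def fuzzy_find(pattern, text, k):
    """Find approximate occurrences of `pattern` in `text`.

    Returns (end, distance) pairs: `end` is the exclusive end offset of a
    substring of `text` whose edit distance to `pattern` is `distance` <= k.
    """
    m = len(pattern)
    # prev[i]: best edit distance between pattern[:i] and any substring of
    # text ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for end, char in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0 lets a match start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == char else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[m] <= k:
            hits.append((end, cur[m]))
        prev = cur
    return hits


exact_hits = fuzzy_find("ACGT", "TTACGTTT", k=0)
fuzzy_hits = fuzzy_find("ACGT", "TTACCTTT", k=1)  # text has one substitution
```

With `k=0` this reduces to exact substring search; raising `k` tolerates substitutions, insertions, and deletions, and neighbouring end positions may be reported for the same underlying hit.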
Some optimization on VBQ with the latest binseq update, especially in lossless mode. Some ways to trim the fat:
1. Reuse zstd decoders for each thread. I was creating a decoder for each vbq block which incurred redundant allocations
2. Zero-copy parsing of blocks, referencing similar to paraseq
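As a rough illustration of the second point, here is what zero-copy parsing looks like in Python terms. The block layout (a 4-byte little-endian length followed by the payload) is invented for this example and is not the VBQ layout; the point is only that each record is a view into the decoded block rather than a fresh allocation.

```python
import struct


def iter_records(block):
    """Yield views of length-prefixed records without copying the payloads."""
    view = memoryview(block)
    offset = 0
    while offset < len(view):
        # Hypothetical layout: 4-byte little-endian length, then the payload.
        (length,) = struct.unpack_from("<I", view, offset)
        offset += 4
        # Slicing a memoryview references the original buffer; no bytes move.
        yield view[offset:offset + length]
        offset += length


# Build a small example block from three payloads.
payloads = (b"ACGT", b"GG", b"TTTAA")
block = b"".join(struct.pack("<I", len(p)) + p for p in payloads)

records = list(iter_records(block))
lengths = [len(r) for r in records]
```

The same reasoning applies to the first point: anything that can be set up once per thread (a decoder, a scratch buffer) and reused across blocks avoids paying the allocation cost on every block.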
ARM64 Linux is, I think, pretty common in cloud computing environments. Might be worth building for it as well.
Excited to announce a new bqtools tutorial on sandbox.bio by @noamteyssier.bsky.social! Learn about the BINSEQ file format, and how it can replace FASTQ files for better data compression and faster parallel processing: sandbox.bio/tutorials/bq...
Built with uv so you don't have to worry about the dependencies or environments. Simple as:
```
uv tool install anntools-bio
anntools --help
```
I work with large collections of AnnDatas for single-cell work and got tired of opening notebooks for simple operations. Built a CLI tool to handle some common stuff directly from the terminal.
Quick ops: downsample, concat, pseudobulk, QC, metadata export, etc.
github.com/noamteyssier...
Side note: sandbox.bio is so cool.
Setting up an environment where you can learn and play around with these tools in the browser is no simple feat and I think it's an excellent educational resource for the bioinformatics community.
I'm very happy and proud to contribute to it!
BINSEQ is a high-performance format for sequencing data and bqtools is a CLI tool that lets you create and manipulate these files in the style of samtools.
Excited to release a tutorial with @robert.bio showcasing how to use it to encode, decode, and grep sequences in the browser on sandbox.bio!
The pattern counting is something I'm especially stoked about. I was actually very surprised to see that this feature isn't more common in grep-like tools (outside of bioinformatics as well).
I've had this problem for years and I end up writing bespoke tools that do some variation of it.
New bqtools release with some nice new features!
1. Support for fuzzy matching using sassy
2. Multi-Pattern counting (like `grep -c` but the count is for each individual pattern provided)
3. Pattern files (providing large lists of patterns as either regex or literals)
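To show what per-pattern counting means in practice, here is a small sketch in Python. It is not the bqtools implementation and the function name is made up; it assumes the `grep -c` analogy holds, so each pattern's count is the number of sequences that contain at least one match.

```python
import re


def count_patterns(sequences, patterns):
    """Count, for each pattern, how many sequences contain a match."""
    compiled = {p: re.compile(p) for p in patterns}
    counts = {p: 0 for p in patterns}
    for seq in sequences:
        for pattern, regex in compiled.items():
            if regex.search(seq):
                counts[pattern] += 1
    return counts


# Illustrative reads; patterns may be literals or regular expressions.
reads = ["ACGTACGT", "TTTTACGA", "GGGGCCCC", "ACGAACGT"]
pattern_counts = count_patterns(reads, ["ACGT", "ACGA", "G{4}"])
```

A plain `grep -c` would report one number for all patterns combined, which is why this kind of breakdown usually ends up as a bespoke script.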
github.com/ArcInstitute...
And stay on the lookout over the next couple of weeks (hopefully) for the release of an even bigger project built with binseq!
And if you're interested in building with binseq here is the place to start!
github.com/arcinstitute...