
Jacob Schreiber

@jmschreiber91.bsky.social

Studying genomics, machine learning, and fruit. My code is like our genomes -- most of it is junk. Guest Scientist, IMP Vienna. Board of Directors, NumFOCUS. Incoming Prof, UMass Chan Medical. Previously Stanford Genetics, UW CSE.

6,543 Followers  |  1,398 Following  |  773 Posts  |  Joined: 17.11.2023  |  2.2339

Latest posts by jmschreiber91.bsky.social on Bluesky

Breaking the silo: composable bioinformatics through cross-disciplinary open standards SciPy 2025 The practice of data science in genomics and computational biology is fraught with friction. This is in large part because bioinformatic tools tend to be tightly coupled to file input/output. As a res...

I'm also excited to be presenting Oxbow as part of my talk on composability at the #SciPy2025 Conference on Wednesday! Hope to see some of you there.

cfp.scipy.org/scipy2025/ta...

07.07.2025 21:22 - 👍 8    🔁 3    💬 2    📌 0

medium demand expected

02.07.2025 11:03 - 👍 5    🔁 0    💬 0    📌 0

Today marks the end of en-JUNE-eering, the month where I focused mostly on the nitty gritty of improving genomics ML infrastructure.

Here are some of the highlights:

30.06.2025 18:34 - 👍 20    🔁 5    💬 1    📌 0

(5) A new tomtom-lite command-line tool that allows quick querying of motifs without needing to go to the Tomtom website.

bsky.app/profile/jmsc...

30.06.2025 18:39 - 👍 2    🔁 0    💬 0    📌 0

(4) bpnet-lite: Load official Chrom/BPNet models into PyTorch for downstream tangermeme integration. Improved command-line tools + docs. Still concerns about perf of models trained from scratch -- will be resolved next version!

github.com/jmschrei/bpn...

bsky.app/profile/jmsc...

30.06.2025 18:38 - 👍 1    🔁 1    💬 1    📌 0

(3) tangermeme: significant quality-of-life improvements, fixing an issue with seqlet calling, plotting w/ annotations, and tomtom-lite integration across several functions.

github.com/jmschrei/tan...

bsky.app/profile/jmsc...

30.06.2025 18:37 - 👍 0    🔁 0    💬 1    📌 0

(2) bam2bw: a simple utility that allows you to go from BAMs -> un/stranded bigWigs without intermediary bedGraph files. Way less memory, disk, and hassle. Now extended to work on fragment files and .tsv/.gz, and depth normalize.

github.com/jmschrei/bam...

bsky.app/profile/jmsc...

30.06.2025 18:36 - 👍 0    🔁 0    💬 1    📌 0

(1) tomtom-lite: an implementation of the original tomtom algorithm that can be over 1000x faster. Now built into a variety of my other tools.

github.com/jmschrei/mem...

bsky.app/profile/jmsc...

30.06.2025 18:35 - 👍 2    🔁 2    💬 1    📌 0

Thank you, Google Flights, for recommending this 10 hour layover in Athens first by "convenience" when trying to find a Frankfurt -> Vienna flight.

30.06.2025 17:54 - 👍 6    🔁 0    💬 0    📌 0

NucleoBench: A Large-Scale Benchmark of Neural Nucleic Acid Design Algorithms One outstanding open problem with high therapeutic value is how to design nucleic acid sequences with specific properties. Even just the 5' UTR sequence admits 2 × 10^120 possibilities, making exhausti...

This evaluation of DNA design methods is very well written. If you're interested in the field, you should def take a look. Also, glad to see Ledidi performing so well!

www.biorxiv.org/content/10.1...

27.06.2025 12:57 - 👍 10    🔁 2    💬 0    📌 0

yet

24.06.2025 07:24 - 👍 1    🔁 0    💬 0    📌 0

very sad to report that I have begun adopting the verbal tics of my first advisor

23.06.2025 13:14 - 👍 4    🔁 0    💬 0    📌 0

8 months after submitting my first grant, and a month after my advisory council met, I'm thrilled to report that the council's recommendation is in. 🥳

Hopefully, soon I'll find out what that was.

20.06.2025 16:18 - 👍 8    🔁 0    💬 0    📌 0
In vivo mapping of mutagenesis sensitivity of human enhancers - Nature Human enhancers contain a high density of sequence features that are required for their normal in vivo function.

In vivo mapping of mutagenesis sensitivity of human enhancers

www.nature.com/articles/s41...

18.06.2025 21:20 - 👍 49    🔁 19    💬 0    📌 1

(6) Several minor code re-orgs and changes have been added. You can now use any dtype and device for the steps, allowing you to use a CPU if necessary or do half-precision for large-scale prediction.

18.06.2025 19:56 - 👍 0    🔁 0    💬 0    📌 0

Although there are some challenges with simply mapping seqlets to motif databases, this can be viewed as a fast + dirty alternative to the robust de novo motif discovery of TF-MoDISco. It'll just give you a sense for what your model has learned (if anything at all)!

18.06.2025 19:49 - 👍 0    🔁 0    💬 1    📌 0

When running the pipeline, the seqlets will be annotated using tomtom-lite + motif database, and counted so you get the top driving motifs. For example, for a CTCF model, here are the seqlet counts when mapped to JASPAR, with MET28 overlapping one of the fingers in CTCF.

18.06.2025 19:46 - 👍 0    🔁 0    💬 1    📌 0

Note that seqlets for negative attributions are trickier than for positive attributions because there are fewer negative attributions and real negative seqlets overall. More work will be done to make this more robust in the future.

18.06.2025 19:40 - 👍 0    🔁 0    💬 1    📌 0

(5) A new `bpnet seqlets` command has been added. This will take the attributions you just calculated, call seqlets on them, and return a BED file of coordinates. These seqlets are the high-attribution spans that your model thinks are driving model predictions.

18.06.2025 19:39 - 👍 0    🔁 0    💬 1    📌 0
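
As a rough illustration of what "calling seqlets" on attributions means, the sketch below finds contiguous spans whose attribution stays above a fixed threshold and formats them as BED lines. It is a deliberately naive stand-in, not the algorithm behind `bpnet seqlets` or tangermeme: the function names, the threshold, and the minimum span length are all assumptions made for the example.

```python
def call_spans(attributions, threshold=0.5, min_len=3):
    """Return (start, end) half-open spans where every value exceeds threshold.

    Hypothetical helper: a naive thresholding rule, not the real seqlet caller.
    """
    spans, start = [], None
    for i, value in enumerate(attributions):
        if value > threshold and start is None:
            start = i                      # a new span opens here
        elif value <= threshold and start is not None:
            if i - start >= min_len:
                spans.append((start, i))   # keep the span only if long enough
            start = None
    # Close a span that runs to the end of the sequence.
    if start is not None and len(attributions) - start >= min_len:
        spans.append((start, len(attributions)))
    return spans


def to_bed(chrom, offset, spans):
    """Format spans as BED lines (0-based start, exclusive end)."""
    return [f"{chrom}\t{offset + s}\t{offset + e}" for s, e in spans]


# Toy per-position attribution scores for a window starting at chr1:1000.
attr = [0.0, 0.1, 0.9, 0.8, 0.7, 0.1, 0.6, 0.6, 0.0, 0.9, 0.9, 0.9]
spans = call_spans(attr)
bed_lines = to_bed("chr1", 1000, spans)
```

The two-position bump at indices 6 and 7 is dropped because it is shorter than `min_len`, which is the kind of filtering a real caller would do with a statistical criterion rather than a hard cutoff.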

For each step, the pipeline command:

(1) creates a JSON for that individual step
(2) runs the JSON and saves results

This makes the procedure self-documenting, in that every step has every parameter specified. You can edit any of them and re-run it, or refer to them later.

18.06.2025 16:57 - 👍 1    🔁 0    💬 1    📌 0
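
The "write a JSON per step, then run from that JSON" pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: `run_step`, `toy_predict`, the file names, and the parameter names are hypothetical and do not reflect bpnet-lite's actual code or JSON schema.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, params, func, out_dir):
    """Write the step's full parameters to JSON, then run the step from that file."""
    out_dir = Path(out_dir)
    config_path = out_dir / f"{name}.json"
    config_path.write_text(json.dumps(params, indent=2))

    # Re-read the file so that what runs is exactly what was recorded on disk.
    recorded = json.loads(config_path.read_text())
    result = func(**recorded)

    # Save the result next to the config so the step can be inspected later.
    result_path = out_dir / f"{name}.result.json"
    result_path.write_text(json.dumps(result, indent=2))
    return config_path, result


def toy_predict(n_regions, batch_size):
    """Hypothetical stand-in for a real step; reports how many batches it would run."""
    n_batches = -(-n_regions // batch_size)  # ceiling division
    return {"n_batches": n_batches}


with tempfile.TemporaryDirectory() as tmp:
    config_path, result = run_step(
        "predict", {"n_regions": 1000, "batch_size": 64}, toy_predict, tmp
    )
    saved_params = json.loads(config_path.read_text())
```

Because the step always executes from the file it just wrote, editing that file and re-running reproduces or tweaks the step without touching the original command line.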

What does this pipeline command do?

1. Fit a model (still under development)
2. Predictions
3. Attributions
4. Seqlet Calling + Annotation (new)
5. TF-MoDISco
6. Marginalizations

18.06.2025 16:53 - 👍 2    🔁 0    💬 1    📌 0

This means that going from raw data to trained models and results is now as simple as

```
bpnet pipeline-json ... -o pipeline.json
bpnet pipeline pipeline.json
```

This works for both stranded and unstranded outputs, fragment files or reads, etc.

18.06.2025 10:18 - 👍 2    🔁 0    💬 1    📌 0

A new `pipeline-json` command has been added that you point to your input data (signal, optional controls, peaks, genome, etc) with optional processing commands, and it creates a default JSON that will run out-of-the-box.

You may want to edit some of the parameters after, but WAY easier now.

18.06.2025 10:15 - 👍 2    🔁 0    💬 1    📌 0

(4) bpnet-lite offers a Python API for prototyping and integrating BPNet models into your Python workflows, and a suite of command-line tools for common usage.

To expose all possible parameters, these command-line tools take in massive JSONs that can be annoying to copy/paste just to get started.

18.06.2025 10:13 - 👍 2    🔁 0    💬 1    📌 0

(3) These models work best when trained with GC-matched negatives. You can pass in a second BED file with these, or the tool can now find those negatives for you automatically, removing this manual step from the training pipeline.

18.06.2025 10:02 - 👍 1    🔁 0    💬 1    📌 0
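
For readers unfamiliar with GC matching, here is a minimal sketch of the general idea: bin sequences by GC content and, for each peak, draw a background sequence from the same bin. It is not bpnet-lite's implementation; the function names, the ten-bin scheme, and the one-negative-per-peak rule are assumptions chosen to keep the example short.

```python
import random
from collections import defaultdict


def gc_bin(seq, n_bins=10):
    """Map a sequence to an integer GC-content bin in [0, n_bins - 1]."""
    gc = sum(base in "GC" for base in seq.upper())
    return min(gc * n_bins // len(seq), n_bins - 1)


def match_negatives(peaks, candidates, n_bins=10, seed=0):
    """For each peak, draw one unused candidate from the same GC bin, if any remain."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for seq in candidates:
        pool[gc_bin(seq, n_bins)].append(seq)
    for seqs in pool.values():
        rng.shuffle(seqs)               # random draw order within each bin

    matched = []
    for peak in peaks:
        bucket = pool[gc_bin(peak, n_bins)]
        if bucket:                      # skip peaks with no candidate left in their bin
            matched.append((peak, bucket.pop()))
    return matched


# Toy 10-bp sequences; real inputs would be genomic windows.
peaks = ["GCGCGCGCAT", "ATATATATGC"]
candidates = ["CGCGCGCGTA", "TATATATACG", "AAAAAAAAAA", "GGGGGGGGGG"]
pairs = match_negatives(peaks, candidates)
```

Each peak ends up paired with a background sequence of similar GC content, so a model cannot separate peaks from negatives on base composition alone.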

(2) You can now pass in .tsv/.tsv.gz fragment files and an optional `-f` flag to denote that the data are fragments (recording both ends of each entry). This is most common when working with single-cell pseudo bulked data and training ChromBPNet models.

18.06.2025 09:58 - 👍 1    🔁 0    💬 1    📌 0
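
To make "recording both ends of each entry" concrete, the sketch below counts positions from a small gzipped fragment file, once treating entries as fragments and once as reads. It assumes the first three tab-separated columns are chrom, start, and end with 0-based, end-exclusive coordinates, and that `end - 1` is the second position to record. The `count_ends` helper is hypothetical and is not how bam2bw or bpnet-lite is implemented.

```python
import gzip
import io
from collections import Counter


def count_ends(lines, fragments=True):
    """Count recorded positions from BED-like lines of chrom, start, end.

    With fragments=True both ends of each entry are recorded; otherwise only
    the start is. Treating end - 1 as the second position is an assumption.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                    # skip blank lines and comments
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        counts[(chrom, int(start))] += 1
        if fragments:
            counts[(chrom, int(end) - 1)] += 1
    return counts


# Build a tiny gzipped fragment file in memory instead of reading from disk.
raw = "chr1\t100\t250\tAAAC\t1\nchr1\t100\t180\tAAAG\t1\n"
buffer = io.BytesIO(gzip.compress(raw.encode()))
with gzip.open(buffer, "rt") as handle:
    as_fragments = count_ends(handle)

as_reads = count_ends(raw.splitlines(), fragments=False)
```

With the fragment interpretation the two entries contribute four counts; as reads they contribute only the two shared start positions.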

(1) You can now pass in BAM/SAM files as raw inputs and bam2bw will automatically convert them to un/stranded bigWigs for the subsequent steps. Because bam2bw is really fast, no need to preprocess your data past BAMs (which can be remote!).

18.06.2025 09:55 - 👍 1    🔁 0    💬 1    📌 0
GitHub - kundajelab/basepairmodels Contribute to kundajelab/basepairmodels development by creating an account on GitHub.

Importantly, bpnet-lite is not the official repository for BPNet/ChromBPNet models.

BPNet: github.com/kundajelab/b...
ChromBPNet: github.com/kundajelab/c...

Significant work has been put into making the above packages robust and high performing, especially by @anusri.bsky.social

18.06.2025 09:49 - 👍 2    🔁 0    💬 1    📌 1

Last week I released bpnet-lite v0.5.0.

BPNet/ChromBPNet are powerful models for understanding regulatory genomics from @anshulkundaje.bsky.social's group, and now it's way easier to go from raw data to trained models and analysis + results in PyTorch

Try it out with `pip install bpnet-lite`

18.06.2025 09:48 - 👍 38    🔁 11    💬 1    📌 1
