James Ferguson

@psy-fer.bsky.social

Bioinformatician/Genomics Software Engineer @garvaninstitute.bsky.social Views my own. Mastodon @Psy_Fer_@genomic.social, https://genomic.social

138 Followers  |  167 Following  |  205 Posts  |  Joined: 14.11.2024  |  2.4443

Latest posts by psy-fer.bsky.social on Bluesky

Getting pretty close now!
Still moving through a big bug squashing phase, but I can see the light at the end.

12.02.2026 03:08 | 👍 1    🔁 0    💬 0    📌 0

Can also cheat and do like zip and stick the index in the file 😅

10.02.2026 11:35 | 👍 1    🔁 0    💬 0    📌 0

Trying really hard not to make many performance changes while just doing a faithful port. it's like 17s for 10k reads vs STAR doing it in like 3s. But i have not really focussed on this at all

10.02.2026 09:34 | 👍 0    🔁 0    💬 0    📌 0

Finally, ruSTAR and STAR are actually agreeing on things. haha. Claude is pretty psyched about this too. Onward and upwards

10.02.2026 06:42 | 👍 1    🔁 0    💬 2    📌 0

You're not gonna believe this...but ruSTAR has single threaded fastq.gz reading.....

10.02.2026 03:02 | 👍 2    🔁 0    💬 1    📌 0

that is cool!
Could this potentially solve the single threaded read bottleneck in tools that read fastq.gz files?

10.02.2026 02:37 | 👍 2    🔁 0    💬 1    📌 0

I think most of it is in the seed generation steps. Claude has had a really rough time trying to get this right. It has the method from STAR, reads the code into context, and still really struggles to get it right. I am going to write it myself and then see if it can build on that.

09.02.2026 10:41 | 👍 0    🔁 0    💬 1    📌 0

so in the seed finding/expansion phase, it's creating too many soft clippings. weird. this is causing a bunch of other issues. So need to fix this, then go from there.

for 10k reads, STAR runs in 3.5s and ruSTAR runs in ~17s. So there are also quite a lot of things to change for performance.

09.02.2026 10:40 | 👍 0    🔁 0    💬 1    📌 0

Claude thinks that whether a read maps or not is the be-all and end-all of testing 😅
I have told it that it needs to consider WHERE a read maps too. It is overjoyed at this insight, and thinks i'm brilliant (i'm not).
Well, at least we are on the same page that reads should align correctly 😂

09.02.2026 08:49 | 👍 1    🔁 0    💬 1    📌 0
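
The point about checking WHERE a read maps, not just whether it maps, can be sketched as a small concordance check between two aligners' outputs. This is a rough sketch only: `Placement`, `Concordance`, and `compare` are hypothetical names, not taken from ruSTAR or STAR, and it assumes one primary alignment per read name (multi-mappers would need more care).

```rust
use std::collections::HashMap;

/// Hypothetical minimal record of where a read was placed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Placement {
    chrom: String,
    pos: u64, // 1-based leftmost position, as in SAM
    cigar: String,
}

/// Counts of how two sets of placements relate to each other.
#[derive(Debug, Default, PartialEq, Eq)]
struct Concordance {
    same_placement: usize,
    different_placement: usize,
    only_in_truth: usize,
    only_in_candidate: usize,
}

/// Compare placements keyed by read name (assumption: one primary alignment per read).
fn compare(
    truth: &HashMap<String, Placement>,
    candidate: &HashMap<String, Placement>,
) -> Concordance {
    let mut c = Concordance::default();
    for (name, expected) in truth {
        match candidate.get(name) {
            // Mapped in both, and to the same place with the same CIGAR.
            Some(found) if found == expected => c.same_placement += 1,
            // Mapped in both, but somewhere else: "it maps" alone would miss this.
            Some(_) => c.different_placement += 1,
            None => c.only_in_truth += 1,
        }
    }
    c.only_in_candidate = candidate
        .keys()
        .filter(|name| !truth.contains_key(*name))
        .count();
    c
}
```

A test that only counts mapped reads would treat `different_placement` as a success, which is the gap described above.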

now fixing another bug in seed generation and scoring, where mismatches wouldn't impact the final score after seed stitching. This obviously impacts the alignments. So hopefully this improves things quite a lot. Claude actually found this one this time (after I told it to look at seed scoring)

09.02.2026 08:21 | 👍 1    🔁 0    💬 1    📌 0

zooming in...yep, that's about right :)

09.02.2026 08:14 | 👍 0    🔁 0    💬 0    📌 0

I am waaaaay off to the side here but near my "people". Cool visualisation

09.02.2026 08:13 | 👍 2    🔁 0    💬 1    📌 0
QCatch: A framework for quality control assessment and analysis of single-cell sequencing data Motivation: Single-cell sequencing data analysis requires robust quality control (QC) to mitigate technical artifacts and ensure reliable downstream results. While tools like alevin-fry and simpleaf (...

While we've been somewhat successful in spurring adoption of alevin-fry/simpleaf for scRNAseq processing, an impediment Dongze brought to our attention (now working directly w/ many experimentalists) was the need for a nice QC report for its output; hence QCatch www.biorxiv.org/content/10.1... 1/x

03.01.2026 18:18 | 👍 26    🔁 6    💬 1    📌 0

And i'm back at it fixing some critical bugs in the cigar string construction (it's still trying to do like 2S10M12M4M lol). It also started getting integer overflow when casting i32 into u32 on negative scores from gaps. Like...just follow the logic in STAR ya dummy!
Got a new testing framework now

09.02.2026 08:07 | 👍 0    🔁 0    💬 1    📌 0
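
One way the i32-to-u32 problem can show up in Rust: an `as` cast from a negative `i32` does not fail, it wraps around to a very large `u32`. A minimal sketch of two safer conversions follows. `clamp_score` and `checked_score` are hypothetical helpers for illustration, not ruSTAR or STAR code, and which behaviour is right depends on how the scores are used downstream.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: treat negative scores (e.g. from gap penalties) as zero.
fn clamp_score(score: i32) -> u32 {
    // After `max(0)` the value is non-negative, so the cast cannot wrap.
    score.max(0) as u32
}

/// Hypothetical helper: checked conversion that returns `None` for negative scores,
/// forcing the caller to decide what a negative score means.
fn checked_score(score: i32) -> Option<u32> {
    u32::try_from(score).ok()
}
```

Keeping scores as `i32` end to end and only converting at the output boundary is another option that avoids the question entirely.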

And tests in your code aren't enough. These things will go round and round in circles chasing their tails over relatively obvious (to us) issues. Like the cigar string in those posts. Obvious to us that it was wrong, yet it didn't pick up on it at all until I pointed it out.

08.02.2026 13:45 | 👍 1    🔁 0    💬 0    📌 0

Where are you in the tech sphere? i'm a scientist in bioinformatics, more on the computer science and tool-building side of things.

08.02.2026 10:26 | 👍 0    🔁 0    💬 1    📌 0

I'm a bioinformatician that builds these kinds of tools for work, so i know what i'm doing, and ooooh boy, someone who didn't know this stuff would have no chance. LLMs really couldn't do this without me

08.02.2026 10:25 | 👍 1    🔁 0    💬 0    📌 0

Took a break today. Played some music on my viola, played with my cats and played some video games with my mates. I'll get back into it tomorrow

08.02.2026 04:00 | 👍 1    🔁 0    💬 1    📌 0

People that buy those places probably aren't living in them. They rent them out to make someone else pay off their asset, and then gaslight them into thinking they are bad with money. A place near me went for like 3.5M and it's tiny and not even that nice. Housing market makes zero sense.

08.02.2026 00:37 | 👍 1    🔁 0    💬 0    📌 0

let results = vec![] // TODO: build the results vector

it's always the "hard" thing too, that actually gives you results. Bloody lazy clankers

07.02.2026 10:34 | 👍 1    🔁 0    💬 0    📌 0

holy malloc batman
yea gonna have to fix that 😂

07.02.2026 10:31 | 👍 0    🔁 0    💬 1    📌 0

So after burning an insane amount of tokens, we finally have something close to STAR in output

Still a lot of work to go in making sure the logic is actually sound. No matter how many times I tell it to match the logic from STAR, it goes off on some tangent and just doesn't do that.

07.02.2026 10:19 | 👍 2    🔁 0    💬 1    📌 0

basically this

07.02.2026 09:54 | 👍 1    🔁 0    💬 0    📌 0

Yea LLMs absolutely speed up development. I still don't think the quality is anywhere near as good as if you or I did it manually. In my other tools I like to ask the LLMs if I wrote something or if it wrote it (trick question, I wrote all of it) and it thinks it wrote most of it 😅

07.02.2026 09:44 | 👍 1    🔁 0    💬 1    📌 0

It can read docs. But it is kinda dumb about it too. It took a loooong time to get the syntax of the noodles library right.

In one of my tools I have cigar parsers. They aren't that hard to write, there are only so many op codes.

07.02.2026 09:41 | 👍 1    🔁 0    💬 1    📌 0
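
To illustrate why a CIGAR parser is small: the SAM format defines nine op codes (M, I, D, N, S, H, P, =, X), each preceded by a run length. The following is a minimal sketch, not the parser from any existing tool. `parse_cigar` is a hypothetical name, and it deliberately does not handle the `*` placeholder used when no CIGAR is available.

```rust
/// Parse a CIGAR string such as "5S32M" into (length, op) pairs.
/// Returns a descriptive error for malformed input.
fn parse_cigar(cigar: &str) -> Result<Vec<(u32, char)>, String> {
    let mut ops = Vec::new();
    // Run length accumulated so far for the op that has not been seen yet.
    let mut len: Option<u32> = None;
    for c in cigar.chars() {
        if let Some(digit) = c.to_digit(10) {
            // Build up the run length digit by digit, guarding against overflow.
            let so_far = len.unwrap_or(0);
            let next = so_far
                .checked_mul(10)
                .and_then(|v| v.checked_add(digit))
                .ok_or("run length overflows u32")?;
            len = Some(next);
        } else if "MIDNSHP=X".contains(c) {
            // An op code must be preceded by a run length.
            let n = len
                .take()
                .ok_or_else(|| format!("op '{}' has no length", c))?;
            ops.push((n, c));
        } else {
            return Err(format!("unknown op code '{}'", c));
        }
    }
    if len.is_some() {
        return Err("trailing length without an op code".to_string());
    }
    Ok(ops)
}
```

For example, `parse_cigar("5S32M")` yields `[(5, 'S'), (32, 'M')]`, while `"M10"`, `"10Q"`, and `"10"` are all rejected.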

here you go, done with loc

07.02.2026 09:36 | 👍 1    🔁 0    💬 1    📌 0

It even has a bunch of tests, but the issue is getting the logic right. So if it tried to do TDD, then it would still get the logic for the tests wrong, and then also write the code wrong 😂

Let me check where i'm at with the LOC

07.02.2026 09:32 | 👍 1    🔁 0    💬 1    📌 0

This is even more reason why LLMs are dumb as rocks. It will see that kind of CIGAR string, write 500 lines of code using it, testing it, and have no idea why its parsing logic isn't working and creating an edge case format bug.

07.02.2026 09:22 | 👍 2    🔁 0    💬 2    📌 0

me: yea pretty sure the CIGAR string construction is broken. Cigar strings don't go 5S10M12M10M...

claude: i'm a double dumbass

me: yea...

07.02.2026 09:21 | 👍 2    🔁 0    💬 2    📌 0
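
The 5S10M12M10M problem is adjacent runs of the same op that were never merged. A minimal sketch of a merge pass, assuming the CIGAR is held as (length, op) pairs, is below. `CigarOp`, `merge_adjacent`, and `to_cigar_string` are hypothetical names, and this is not how STAR itself builds its CIGAR; it only shows the invariant the output should satisfy.

```rust
/// One CIGAR operation: a run length and an op code such as 'M', 'I', 'D', 'S' or 'N'.
type CigarOp = (u32, char);

/// Merge adjacent runs that share an op code, so [10M, 12M] becomes [22M].
/// Zero-length runs are dropped.
fn merge_adjacent(ops: &[CigarOp]) -> Vec<CigarOp> {
    let mut merged: Vec<CigarOp> = Vec::with_capacity(ops.len());
    for &(len, op) in ops {
        if len == 0 {
            continue;
        }
        // Extend the previous run when the op code repeats.
        if let Some(last) = merged.last_mut() {
            if last.1 == op {
                last.0 += len;
                continue;
            }
        }
        merged.push((len, op));
    }
    merged
}

/// Render (length, op) pairs as a CIGAR string.
fn to_cigar_string(ops: &[CigarOp]) -> String {
    ops.iter().map(|(len, op)| format!("{}{}", len, op)).collect()
}
```

With this pass, the ops behind 5S10M12M10M render as 5S32M. A merge pass only fixes the formatting; if the separate M runs came from seeds that should have had a gap or junction between them, the underlying stitching logic still needs fixing.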

claude: The problem is that for reverse-strand alignments, we're comparing the FORWARD read sequence to the genome, when we should be comparing the REVERSE-COMPLEMENTED read sequence!

me: I literally said that like 2 context windows ago wtf

claude: oh yea, it was last on my "to check" list

07.02.2026 09:14 | 👍 1    🔁 0    💬 1    📌 0
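
The reverse-strand bug described above comes down to orienting the read before comparing it to the genome. A minimal sketch, assuming uppercase/lowercase ACGT with everything else mapped to N: `reverse_complement` and `count_mismatches` are hypothetical helpers, not ruSTAR or STAR functions.

```rust
/// Reverse-complement a DNA sequence; any base other than A/C/G/T becomes N.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&base| match base {
            b'A' | b'a' => b'T',
            b'C' | b'c' => b'G',
            b'G' | b'g' => b'C',
            b'T' | b't' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Count mismatches between a read and the forward-strand genome window it aligns to.
/// For reverse-strand alignments the read is reverse-complemented first.
fn count_mismatches(read: &[u8], genome_window: &[u8], reverse_strand: bool) -> usize {
    let oriented: Vec<u8> = if reverse_strand {
        reverse_complement(read)
    } else {
        read.to_vec()
    };
    oriented
        .iter()
        .zip(genome_window)
        .filter(|(r, g)| r != g)
        .count()
}
```

For a genome window ACGTT, a reverse-strand read AACGT has 0 mismatches once reverse-complemented, but 3 if the forward read is compared directly, which is the kind of inflated mismatch count the bug would produce.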
