I want bare metal instances that can launch within 2-3 seconds, for a better (local dev <-> remote execution) REPL workflow
vs. fast-launching containers (e.g., Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring tools like Nsight
17.01.2026 18:28
i've found some success sticking to SIMD-friendly scalar patterns
i get loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and compile at -O3: that's _usually_ enough. you can check the output with -S (emits readable ASM)
or use SIMDe, that works too
17.01.2026 02:34
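A minimal sketch of what a "SIMD-friendly scalar pattern" can look like in C. The kernel and the name `scale_add` are illustrative assumptions, not taken from the post: unit stride, no branches, no loop-carried dependency, and `restrict` to rule out aliasing. With GCC or Clang, `-fopenmp-simd` enables the pragma without pulling in the OpenMP runtime, and adding `-S` emits the assembly for inspection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: y[i] = a * x[i] + y[i].
 * Every iteration is independent of the others, which is what
 * lets the compiler map the scalar loop onto vector lanes. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    assert(x != NULL && y != NULL);

    /* Hint only: the compiler may still decline to vectorise. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop actually vectorises depends on the compiler and target, so checking the `-S` output is still the deciding step.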
1. Introduction - PTX ISA 9.1 documentation
This feels like a continuation of the reduction operators introduced in Hopper's TMA (cp.reduce.async.bulk). Fun fact! Data movement often dominates power usage vs compute because of physics: thicker longer wires = more power needed to transmit each bit. Makes a lot of sense to optimise here.
17.01.2026 02:02
Mmm! Nice corollary: software optimisations for prefix sums (re-parenthesizing) generalise across associative ops: +, ^, prefix-of-prefix.
I made a thread about it: bsky.app/profile/asht...
17.01.2026 01:31
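A rough illustration of why the re-parenthesizing carries over to other operators: the blocked scan below only ever regroups operands, never reorders them, so it should work for any associative operation. The names `binop`, `op_add`, `op_xor` and `scan_blocked` are hypothetical, and this is a scalar sketch rather than the NEON code from the write-up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operator type: any associative binary op on uint32_t. */
typedef uint32_t (*binop)(uint32_t, uint32_t);

uint32_t op_add(uint32_t a, uint32_t b) { return a + b; }
uint32_t op_xor(uint32_t a, uint32_t b) { return a ^ b; }

/* Inclusive scan computed block by block, then stitched with carries.
 * Correctness relies only on associativity of `op`. */
void scan_blocked(const uint32_t *in, uint32_t *out, size_t n,
                  size_t block, binop op)
{
    assert(block > 0);

    /* Local scans: each block starts fresh from its own first element. */
    for (size_t b = 0; b < n; b += block) {
        size_t end = (b + block < n) ? b + block : n;
        out[b] = in[b];
        for (size_t i = b + 1; i < end; i++) {
            out[i] = op(out[i - 1], in[i]);
        }
    }

    /* Carry pass: fold the previous block's final value into this block. */
    for (size_t b = block; b < n; b += block) {
        size_t end = (b + block < n) ? b + block : n;
        uint32_t carry = out[b - 1];
        for (size_t i = b; i < end; i++) {
            out[i] = op(carry, out[i]);
        }
    }
}
```

Passing `op_xor` instead of `op_add` gives a running XOR with the same structure, which is the generalisation the post points at.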
perf-portfolio/delta at main · ashtonsix/perf-portfolio
HPC research and demonstrations.
Full write-up, implementation (NEON) and benchmark results (Graviton4) here: github.com/ashtonsix/pe...
I love solving these kinds of performance puzzles, and I'm currently available for hire! Reach out if interested. 3/3
17.01.2026 00:55
The ILP trick:
# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...
# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...
By delaying the carry we allow the CPU to compute all local prefix sums in parallel, more than doubling throughput. 2/
17.01.2026 00:55
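The pseudocode above can be sketched in scalar C roughly as follows. This is an illustration of the idea only, not the NEON implementation from the write-up; `BLOCK` and `prefix_sum_blocked` are hypothetical names, and the two-pass layout is a simplification (a tuned version would presumably keep blocks in registers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 4  /* hypothetical width, matching the 4-element groups above */

void prefix_sum_blocked(const uint32_t *in, uint32_t *out, size_t n)
{
    assert(in != NULL && out != NULL);

    /* Local prefix sums: every block restarts from zero, so no block
     * waits on another and the CPU can overlap their add chains. */
    for (size_t b = 0; b < n; b += BLOCK) {
        uint32_t acc = 0;
        for (size_t i = b; i < b + BLOCK && i < n; i++) {
            acc += in[i];
            out[i] = acc;
        }
    }

    /* Late carry broadcast: the extra adds. out[b - 1] already holds
     * the full running total of everything before this block. */
    for (size_t b = BLOCK; b < n; b += BLOCK) {
        uint32_t carry = out[b - 1];
        for (size_t i = b; i < b + BLOCK && i < n; i++) {
            out[i] += carry;
        }
    }
}
```

The serial dependency does not disappear; it shrinks from one link per element to one link per block.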
I got SOTA (L1-hot, SIMD) on prefix sum by ADDING instructions (7.7 GB/s → 19.8 GB/s). Consider:
for i = 0..n: out[i] = out[i-1] + in[i]
This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain which disrupts Instruction Level Parallelism (ILP). 1/
17.01.2026 00:55
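For reference, the baseline loop written out as compilable C. The name `prefix_sum_serial` and the `uint32_t` element type are assumptions; the accumulator `acc` stands in for `out[i-1]` so the first element needs no special case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline inclusive prefix sum: out[i] = in[0] + ... + in[i]. */
void prefix_sum_serial(const uint32_t *in, uint32_t *out, size_t n)
{
    assert(in != NULL && out != NULL);

    uint32_t acc = 0;  /* plays the role of out[i-1] */
    for (size_t i = 0; i < n; i++) {
        acc += in[i];  /* needs the acc produced by the previous iteration */
        out[i] = acc;
    }
}
```

Each add consumes the result of the one before it, so the loop can go no faster than one add latency per element, however many execution units are idle.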
Opinions about product management, technology news and inclusivity in tech. Diversity is about demographics, inclusion is about creating a sense of belonging.
Mathematician. Likes nature, hiking, cycling, the Arctic, drawing, photography, cooking. Makes colorful simulations of mathematical and physical systems.
All posted photos are either my own work, or, in a few cases, by a fellow traveler.
indie tech artist
I made Shader Forge & Shapes
working on https://half-edge.xyz
shader sorceress
math dork
rare YouTuber/streamer
ex-founder of @NeatCorp
my kids:
@toast.acegikmo.com
@salad.acegikmo.com
@thor.acegikmo.com
dance music enjoyer & technology sister.
Brooklyn
music mixes: https://plyr.fm/u/piss.beauty
A website for exploring the output of compilers. aka godbolt.org
Supports C, C++, Rust, Fortran, COBOL and many many more.
Support us at https://patreon.com/mattgodbolt
Sometime verb, real person, lover of 8-bit computers, husband & father, trying to be a kind person. #blacklivesmatter; trans rights are human rights.
he/him
Daily juggler of jobs: HPC Consultant at Red Oak Consulting, Director of Outdoor Centre, Farmer, Teaching Bushcraft/Forest skills, Electrician, Father/Husband to a large clan, & Beekeeper. Yes it's a weird mix.
#HPC #Outdoors #beekeeping #farming #sparky
I work in HPC and AI, love playing guitar, listening to guitars, staring at guitars and the proud nephew of three magnificent uncles.
Work: research platforms engineer (HPC, cloud). Before: research software engineer. Play: muddy bikes, climbing. He/him.
I am a supercomputing enthusiast, but I usually don't know what I'm talking about. I post about large-scale infrastructure for #HPC and #AI.
25+ years using + researching + buying + architecting #supercomputing
Now engineering leader for future #AI infrastructure + #HPC capabilities at Microsoft
Posts about #supercomputers #AI #technology #F1 #LFC #aviation #travel …
https://www.hpcnotes.com
Not a HPC Guru, but I play one on social media
HPC group led by Florina Ciorba at the Department of Mathematics and Computer Science, University of Basel.
hpc.dmi.unibas.ch
We're the HPC Engineering team from AWS, and we publish stuff about running R&D workloads in the cloud. Follow us here, and on YouTube (hpc.news/techshorts), too.