@vassvik.bsky.social
Simulation and rendering nerd. Co-founder and CTO @JangaFX, working on EmberGen and more. Discord: vassvik. @vassvik@mastodon.gamedev.place
All the above on a desktop 4090.
Pretty shitty image quality though, hard to read anything xD
2026 will be mostly about fleshing out the remaining features and actually shipping something usable. Shouldn't be that much longer now, hopefully.
10.12.2025 13:37
At the moment I'm really happy about the performance itself, which is close to matching the performance of a typical dense solver despite actually being sparse. Having a really fast baseline gives a lot of room to add features that ultimately take some of that performance away.
10.12.2025 13:37
There's also an efficient downsampler, doing 32x32x32 -> 16x16x16 -> 8x8x8 -> 4x4x4 -> 2x2x2 -> 1x1x1 downsamples in one pass at high effective bandwidth.
10.12.2025 13:37
There are other neat bits, like a single-pass vorticity confinement that runs at slightly above 50% bandwidth, which is alright since it combines two passes that would each probably have run at 80-90% bandwidth.
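For reference, the underlying math is compact; a naive, unfused sketch of a vorticity confinement pass could look like the following (names like velocity_tex, dst_img and confinement_eps are placeholders, boundary handling is omitted, and the real single-pass version organizes the memory traffic very differently to hit those bandwidth numbers):

```glsl
#version 450
// Naive illustration of the vorticity confinement math only: curl,
// gradient of the vorticity magnitude, confinement force.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;
layout(binding = 0) uniform sampler3D velocity_tex;                  // placeholder binding
layout(binding = 1, rgba16f) uniform writeonly image3D dst_img;      // placeholder binding
uniform float confinement_eps; // confinement strength (grid spacing folded in)
uniform float dt;

vec3 vel(ivec3 c) { return texelFetch(velocity_tex, c, 0).xyz; }

// Central-difference curl of the velocity field at cell c (grid units).
vec3 curl(ivec3 c) {
    vec3 dx = (vel(c + ivec3(1,0,0)) - vel(c - ivec3(1,0,0))) * 0.5;
    vec3 dy = (vel(c + ivec3(0,1,0)) - vel(c - ivec3(0,1,0))) * 0.5;
    vec3 dz = (vel(c + ivec3(0,0,1)) - vel(c - ivec3(0,0,1))) * 0.5;
    return vec3(dy.z - dz.y, dz.x - dx.z, dx.y - dy.x);
}

void main() {
    ivec3 c = ivec3(gl_GlobalInvocationID);
    vec3 w = curl(c);
    // Gradient of the vorticity magnitude, pointing towards vortex cores.
    vec3 grad = 0.5 * vec3(
        length(curl(c + ivec3(1,0,0))) - length(curl(c - ivec3(1,0,0))),
        length(curl(c + ivec3(0,1,0))) - length(curl(c - ivec3(0,1,0))),
        length(curl(c + ivec3(0,0,1))) - length(curl(c - ivec3(0,0,1))));
    vec3 N = grad / max(length(grad), 1e-6);
    // The confinement force pushes the flow around the cores, reinforcing them.
    vec3 force = confinement_eps * cross(N, w);
    imageStore(dst_img, c, vec4(vel(c) + dt * force, 0.0));
}
```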
10.12.2025 13:37
The other big part of the sim itself: the advection.
There's cubic interpolation on the velocity, smoke and flames components, while the fuel and temperature components are linearly interpolated.
Runs at roughly 1/6 occupancy and roughly 1/6 of peak bandwidth, which seems alright given what it does.
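As a rough idea of the shape of that work, a minimal dense advection kernel, assuming a semi-Lagrangian scheme with plain trilinear sampling, might look like this (all names are placeholders; the real solver is sparse and uses cubic reconstruction for some channels):

```glsl
#version 450
// Minimal dense semi-Lagrangian advection sketch, purely illustrative.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;
layout(binding = 0) uniform sampler3D velocity_tex; // assumed: voxels/second, linearly filtered
layout(binding = 1) uniform sampler3D src_tex;      // quantity being advected
layout(binding = 2, rgba16f) uniform writeonly image3D dst_img;
uniform float dt;

void main() {
    ivec3 cell = ivec3(gl_GlobalInvocationID);
    vec3 inv_size = 1.0 / vec3(imageSize(dst_img));
    vec3 p = (vec3(cell) + 0.5) * inv_size; // cell center in normalized coords
    // Trace backwards along the velocity field and resample the source there.
    vec3 v = texture(velocity_tex, p).xyz;
    vec3 back = p - dt * v * inv_size;
    // Hardware trilinear here; a cubic path would instead reconstruct from a
    // 4x4x4 neighborhood (or a handful of filtered taps) around 'back'.
    imageStore(dst_img, cell, texture(src_tex, back));
}
```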
The tradeoffs have been in favor of larger sims, which makes it fairly easy to push towards high effective bandwidths even with a sparse solver (based on a relatively simple brick map structure). For smaller simulations, especially in a realtime game setting, the tradeoffs would be different.
10.12.2025 13:37
For the tool itself, whose main purpose will be generating assets in a semi-offline manner, the tradeoffs have been in favor of maximizing fidelity (broadly, voxel count) and performance (making it run fast). The voxel resolution in this capture is roughly 50-100x larger than what you'd use in a game.
10.12.2025 13:37
The solver has 3 main parts:
- Advection (movement)
- Injection
- Projection (vortices)
This shows an nsight capture of the last part, the projection.
In this case it's simulating a dense 512x256x256 grid, even though the solver is sparse. Effective bandwidth close to 90% across the board.
It's been roughly 4 years since I started working on the 3D sparse fluid solver for what will ultimately become EmberGen 2.0.
A quick look at the current performance state of the solver.
A previous retrospective thread from a couple years ago on that other place: x.com/vassvik/stat...
After nine years of development, meshoptimizer has reached its first major version, 1.0!
This release focuses on improvements in clusterization and simplification as well as stabilization. Here's a release announcement with more details on past, present and future; please RT!
meshoptimizer.org/v1
This lets me do a 2x2 and 4x4 reduction efficiently using very few subgroup shuffles, as long as the subgroup size is 16 or bigger. With a little extra work, and by using texture filtering, I can do even more with very little effort.
08.12.2025 21:12
For example, if I have a 256x1x1 workgroup I can use decode_morton2_8b to map gl_LocalInvocationID.x to values in a 16x16 grid, where contiguous groups of 4 threads span 2x2 values, and contiguous groups of 16 span 4x4 values, and so on.
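A minimal sketch of what that mapping could look like, using the textbook even/odd-bit compaction (the actual decode_morton2_8b may well be structured differently):

```glsl
// Decode an 8-bit 2D Morton code into (x, y) in a 16x16 grid.
uvec2 decode_morton2_8b(uint code) {
    uint x = code & 0x55u;        // even bits -> x
    uint y = (code >> 1) & 0x55u; // odd bits  -> y
    x = (x | (x >> 1)) & 0x33u;
    x = (x | (x >> 2)) & 0x0Fu;
    y = (y | (y >> 1)) & 0x33u;
    y = (y | (y >> 2)) & 0x0Fu;
    return uvec2(x, y);
}

// Usage in a 256x1x1 workgroup: thread i handles cell decode_morton2_8b(i)
// of a 16x16 tile, so threads 0..3 cover a 2x2 block, threads 0..15 a 4x4
// block, and so on -- which is what makes the subgroup reductions cheap.
layout(local_size_x = 256) in;
void main() {
    uvec2 cell = decode_morton2_8b(gl_LocalInvocationID.x);
    // ... load the value for 'cell' and reduce within the subgroup ...
}
```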
08.12.2025 21:12
Swizzling compute shader threads for doing mip downsampling, and swizzling between workgroups as well. At least that's what spurred all of this.
08.12.2025 21:12
Final version for the 2D Morton decode functions. Fairly happy with the structure in general.
A 64-bit version should be easy enough to produce, by just using the 32-bit version twice and combining the results.
Similar for the packed 16-bit and 8-bit versions:
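As a rough stand-in for the snippets themselves, a generalized 32-bit decode and a 64-bit decode built from it could look something like this (function names are placeholders, not necessarily what the final version uses):

```glsl
// Plain shift-or version of a generalized 32-bit decode (16 bits per axis).
uvec2 decode_morton2_32b(uint code) {
    uint x = code & 0x55555555u;
    uint y = (code >> 1) & 0x55555555u;
    x = (x | (x >> 1)) & 0x33333333u;
    x = (x | (x >> 2)) & 0x0F0F0F0Fu;
    x = (x | (x >> 4)) & 0x00FF00FFu;
    x = (x | (x >> 8)) & 0x0000FFFFu;
    y = (y | (y >> 1)) & 0x33333333u;
    y = (y | (y >> 2)) & 0x0F0F0F0Fu;
    y = (y | (y >> 4)) & 0x00FF00FFu;
    y = (y | (y >> 8)) & 0x0000FFFFu;
    return uvec2(x, y);
}

// 64-bit version from two 32-bit decodes: the low half of the code holds
// bits 0..15 of each axis and the high half bits 16..31, so decode both
// halves and merge them.
uvec2 decode_morton2_64b(uint lo, uint hi) {
    return decode_morton2_32b(lo) | (decode_morton2_32b(hi) << 16);
}
```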
07.12.2025 23:35
GPUs have "left shift and add" instructions, e.g. V_LSHL_ADD_U32 on RDNA (and likely the same on NV/SASS), and all the muls mentioned so far are basically x + (x << a) anyway.
So we could use those instead since the inputs are generally 32 bit in the generalized 32-bit case:
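To make the equivalence concrete, here is one compaction step of a 2D decode written three ways (illustrative only, not the code from the post; the precondition is that x only has bits set at even positions, so the add can't carry):

```glsl
// One compaction step of the 32-bit decode, written three equivalent ways.
// Precondition for all three: x only has bits at even positions, so
// x + (x << 1) cannot carry and equals x | (x << 1).

uint step_shift_or(uint x) { return (x | (x >> 1)) & 0x33333333u; }

// Multiply form: 3*x == x + (x << 1), which duplicates every bit one
// position up before the shift right realigns it.
uint step_mul(uint x)      { return ((x * 3u) >> 1) & 0x33333333u; }

// Shift-and-add form: the same thing spelled so the compiler can emit a
// single "left shift and add" instruction (e.g. V_LSHL_ADD_U32) for the mul.
uint step_lshl_add(uint x) { return (((x << 1) + x) >> 1) & 0x33333333u; }
```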
Spent some more time on this one, with some input from ryg as well.
So general integer multiplication isn't really ideal if either of the inputs is 32 bit, and it'll effectively be worse than a shift and an add no matter the platform.
If both inputs are 24 bits it's usually fast on the GPU, though.
gah, typo in the comment
07.12.2025 21:20
Took a bit of time before I could go over these. Really nice trick!
Can obviously apply these to the "specialized" packed versions as well, e.g. could get rid of 3 ops for decode_morton2_256x256, and so on, mostly replacing the | and >> combinations with a single multiply.
Static images really don't do it justice
07.12.2025 08:38
Everything about this could at least be half the size/length, still
07.12.2025 08:37
Would seeing it in person feel any less surreal?
07.12.2025 08:32
Or I get stuck in a critical loop, trying to evaluate the results and find the right context, and the right light, to see them in.
Does this make sense?
Are these results good?
Do these results make sense?
Evaluating things in the context for which they were written and not my own is often very hard, too
Often I'm reading a paper, but then I repeatedly get sidetracked by tangents seeded by something I read, or I get stuck trying to parse something that happens to be missing some crucial context that's necessary for it to click. Actually reaching the end can be a lot of work.
05.12.2025 23:49
Feels like I've come full circle.
I'm sort of curious how common or inevitable this is.
In a way I feel very much "content" in my own little bubble of niche interests and expertise, exploring things on my own without external influence or pressure, maybe to a fault.
Looking back at the time when I worked on a PhD (which I never finished), I very much felt rather academically "lonely", in the sense that very few others had the same level of interest and expertise in the things I worked on to give genuine feedback or have a serious discussion on the topic.
05.12.2025 23:37
At conferences most of the technical content barely interests me, unless it's an interesting story or generally just an excellent presentation (typically on a topic I do know something about in the first place).
The people and interactions are generally what I'm there for instead these days.
In almost everything I do these days I get an idea first, spend a lot of time on it in isolation, and then when I'm done or nearly done I might stumble upon something similar or maybe even the exact same thing in the wild.
Makes it hard to empathize with other people's work.
I can't really recall the last time I actually read a paper/article for the sake of learning something, or for staying up to date, and I can barely muster the motivation to skim new research papers even on topics I genuinely like.
Anything that isn't explicitly "mine" is really hard to focus on