Leviathan Gamer

@leviathangamer.bsky.social

25 Followers  |  14 Following  |  75 Posts  |  Joined: 25.11.2024

Latest posts by leviathangamer.bsky.social on Bluesky

It is for loading assets; the game freezes until it is done. On the original Xbox this is fairly quick most of the time when loading from a DVD disc. On faster storage like an HDD it is just short frame drops.

25.10.2025 19:20 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

It is pretty nice of Microsoft to switch everyone to Linux for free.

16.10.2025 14:03 β€” πŸ‘ 11    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

AMD leaked a driver that runs dp4a, which supports RDNA2 GPUs.

14.10.2025 17:20 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

What will probably happen is that DX13 (or alternatively DX12_3) will add it, AMD will use DGF to implement it, and Nvidia will use RTX Mega Geometry.

07.08.2025 12:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

it grows dramatically, which impacts the Switch 2. Does that make sense?

26.06.2025 02:48 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Tensor Core throughput, however, is just one portion of DLSS (the math-heavy portion). As we scale up, the Tensor Cores plow through the math-heavy portion and we get scheduling-limited in the non-Tensor-Core part, but if we scale down, the portion of DLSS that is Tensor Core heavy...

26.06.2025 02:47 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

is we are trying to determine if the Tensor Cores are Compute or Memory Limited. Since we can properly feed them, we are very Compute Limited across all Nvidia GPUs (that is why we do both per Clock and per SM, to make sure we can rule on all of them)...

26.06.2025 02:45 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

So for context, I got a Bachelor's in Computer Engineering and a Master's in Electrical Engineering, so I am coming at this from a chip design perspective. Maybe referring to it as Bytes per Clock per SM would make it easier to understand. The reason we are doing this...

26.06.2025 02:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

You can see there are 4 sub-partitions in the Block Diagram. Nvidia's peak throughput assumes all 4 instructions issued each clock go to either the CUDA Cores or the Tensor Cores. So at 2 to the CUDA Cores and 2 to the Tensor Cores, we are at half peak for both.

26.06.2025 02:42 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
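
A minimal sketch of the issue-slot arithmetic in the post above (my own illustration, not from Nvidia's documentation): with 4 sub-partitions each issuing one instruction per clock, any split between the CUDA and Tensor pipelines scales both below their quoted peaks.

```python
# Sketch of the issue-slot argument above (illustrative arithmetic only).
# An SM has 4 sub-partitions, each able to issue 1 instruction per clock.
SUBPARTITIONS = 4

def effective_peaks(slots_to_cuda: int, slots_to_tensor: int):
    """Fraction of quoted peak throughput each pipeline reaches when the
    4 issue slots per clock are split between CUDA and Tensor Cores."""
    assert slots_to_cuda + slots_to_tensor == SUBPARTITIONS
    return slots_to_cuda / SUBPARTITIONS, slots_to_tensor / SUBPARTITIONS

print(effective_peaks(4, 0))  # (1.0, 0.0) -> quoted CUDA peak, Tensor Cores idle
print(effective_peaks(0, 4))  # (0.0, 1.0) -> quoted Tensor peak, CUDA Cores idle
print(effective_peaks(2, 2))  # (0.5, 0.5) -> half peak for both, as in the post
```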

I never said that Ada/Blackwell don't change the "MMA Ops Compute Capability". I stated they don't change SM Bytes Per Clock, which they don't. The reason I mention this is because we are talking about whether it is Memory Limited, and it isn't.

26.06.2025 01:56 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

cycles to the Tensor Cores on the Switch at 60fps (or 8ms of frametime).

26.06.2025 01:53 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

1) I never said that you were right. Quote me or give up.
2) The peak throughput listed for Nvidia GPUs assumes you issue to the CUDA Cores 100% of the time and to the Tensor Cores 0% of the time. The peak Tensor Core throughput assumes the opposite.
3) This results in giving up half your clock...

26.06.2025 01:52 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

FP8 and FP4 are not used in the CNN model; FP8 is used in the Transformer model. Also, FP8 is half the size of FP16, so you get twice the throughput, because FP16 is 2 Bytes while FP8 is 1 Byte. The same applies to FP4.

26.06.2025 01:39 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
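
A quick sketch of the size-based scaling in the post above, assuming (as the thread does) that the limiting factor is bytes moved per clock:

```python
# If the FP16 Tensor Core path consumes a fixed number of bytes per clock,
# halving the element size doubles the elements (and ops) per clock.
BYTES_PER_CLOCK = 384  # per-SM figure quoted elsewhere in this thread

for name, size_bytes in [("FP16", 2), ("FP8", 1), ("FP4", 0.5)]:
    elements_per_clock = BYTES_PER_CLOCK / size_bytes
    print(f"{name}: {size_bytes} B/element -> {elements_per_clock:.0f} elements/clock")
# FP8 moves twice as many elements per clock as FP16, and FP4 twice that again.
```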

for frametime cost if the Tensor Cores were "free" now in Ampere. That would be an apples-to-oranges comparison, so why would they confuse their devs?
3) The RTX 2050 Mobile doesn't get Memory Bandwidth Limited in the DF video. It got crippled by post-processing done after the upscale.

26.06.2025 01:37 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

not allow you to co-issue to the CUDA Cores and Tensor Cores. This is why they mention frametime cost in the DLSS documentation. Think logically: if this were truly different, why would Nvidia have Turing and Ampere GPUs in the same chart...

26.06.2025 01:36 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

On Turing, you can't use Async Compute simultaneously with Ray Tracing Shaders, which means the RT Cores can't have workloads scheduled at the same time as the Tensor Cores which require Compute Shaders. Ampere now allows you to run RT Cores and Tensor Cores at the same time, but does...

26.06.2025 01:34 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

to read whitepapers. In Turing (and all subsequent Nvidia GPUs), you can't schedule an instruction to both the CUDA Cores and the Tensor Cores at the same time. People get this confused because of what the Whitepaper says, but they are just reading it wrong...

26.06.2025 01:32 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

per Clock per SM to work with for any data not currently loaded into your register file. You could probably reach close to 90% Tensor Core utilization before you might have Memory Bandwidth limits. 2) As for the Async Compute thing, I actually get this one a lot because people don't know how...

26.06.2025 01:31 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I will just answer everything here instead of branching at each point:
1) Blackwell/Ada do not improve FP16 Tensor Core throughput, so they consume the same amount, which is 384 Bytes per Clock, of which the Register Files can feed up to 384 Bytes per Clock. You also get an additional 128 Bytes...

26.06.2025 01:28 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0
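
A small sketch of the compute-versus-memory comparison, using only the per-SM, per-clock figures quoted in this thread; the comparison itself is my own illustration:

```python
# Per-SM, per-clock byte budget quoted in this thread.
TENSOR_FP16_DEMAND = 384   # Bytes/Clock the FP16 Tensor Core path consumes
REGISTER_FILE_FEED = 384   # Bytes/Clock the Register Files can supply
EXTRA_FEED         = 128   # additional Bytes/Clock for data not in the register file

supply = REGISTER_FILE_FEED + EXTRA_FEED
print("compute limited" if supply >= TENSOR_FP16_DEMAND else "memory limited")
# 512 B/clock of supply vs 384 B/clock of demand -> compute limited, per the post.
```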

I never said you can't see ghosting. I said you aren't seeing a 33ms Pixel response time, which, if you counted the 5 separate frames in the soccer ball picture, you would know necessitates a minimum of 33ms of Pixel response at that instant (4 frames / 120Hz).

26.06.2025 01:22 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
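
The arithmetic behind the 33ms figure, as a quick check: 5 visible frames imply 4 frame intervals of persistence at 120Hz.

```python
# Minimum pixel response implied by counting ghost trails: N visible frames
# means the oldest one has persisted for (N - 1) frame times.
refresh_hz = 120
visible_frames = 5  # the count from the soccer ball screenshot

min_response_ms = (visible_frames - 1) / refresh_hz * 1000
print(f"{min_response_ms:.1f} ms")  # 33.3 ms, matching the post
```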

You didn't answer my question. Is it possible you don't know that Pixel response times are highly variable depending on the content displayed? Also, your screenshot is a poor one to tell Pixel Response Times from; every LCD could look like that if captured at the right moment.

26.06.2025 00:45 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

They trained their main CNN Model around using DLSS before post-processing which covers all the presets the Switch 2 will use.

26.06.2025 00:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

anything else.

26.06.2025 00:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

You don't get memory limited with DLSS on any platform. SM Bytes per Clock for Tensor Cores has not changed since Ampere, and it obviously scales with SMs. Sure, you can use it like that, but now you have allocated 8ms of your 16.66ms Frame Time just to DLSS, which means you can't use it for...

26.06.2025 00:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0
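
A quick illustration of the frame-budget point above, using the 8ms figure from the post and a 60fps frame time:

```python
# Frame-time budget at 60fps if DLSS alone takes 8ms, as in the post.
frame_time_ms = 1000 / 60   # ~16.66 ms per frame
dlss_cost_ms = 8.0          # the hypothetical DLSS cost from the post

remaining_ms = frame_time_ms - dlss_cost_ms
print(f"DLSS uses {dlss_cost_ms / frame_time_ms:.0%} of the frame, "
      f"leaving {remaining_ms:.2f} ms for everything else")
# ~48% of the frame on DLSS, ~8.67 ms left for the game itself.
```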

That isn't the intended use that they train their presets on, which makes it look even worse.

26.06.2025 00:15 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

What do you mean? I have literally implemented Matrix Accelerators before. Tensor Cores get really Compute bound, as they need close to 200 GFLOPs per Frame for 4K DLSS Performance, which is a little heavy on the Switch 2.

26.06.2025 00:13 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
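
To relate the ~200 GFLOPs-per-frame figure to the ~8ms of Tensor Core frametime mentioned earlier in the thread, here is a rough back-of-the-envelope sketch (my own arithmetic, not a quoted spec):

```python
# Relating two numbers from this thread: ~200 GFLOPs of Tensor work per frame
# and the ~8 ms of Tensor Core frametime mentioned for the Switch 2 at 60fps.
work_per_frame_flops = 200e9   # ~200 GFLOPs per frame, from the post
tensor_time_s = 8e-3           # ~8 ms, the frametime figure from the thread

required_sustained_tflops = work_per_frame_flops / tensor_time_s / 1e12
print(f"~{required_sustained_tflops:.0f} TFLOPS sustained")  # ~25 TFLOPS
# Fitting that much matrix math into 8 ms needs ~25 TFLOPS sustained on the
# Tensor Cores, which is why the post calls it "a little heavy".
```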
Post image

I don't know what to tell you. I can easily see it. Just to check your eyes, how many ghost trails do you see in this 120Hz screenshot?

26.06.2025 00:10 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

DLSS becomes more Compute bound the lower you go in GPU power. DLSS is intended to be used (and the Desktop presets are tuned around) with post-processing handled afterwards. The Switch 2 is not intended to ever do that, as it costs too much frametime, and that is actually what the "Lite" mode is doing.

25.06.2025 23:25 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

No, you can definitely see it ghost up to that much. And 33.3ms is not even that much historically. The PSP, DS, some GBAs, and the Game Boy Color had that level of Pixel Response time or even worse. That isn't even counting the Game Boy's 180ms Pixel Response Time.

25.06.2025 23:19 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The first 15ms of every new frame would be a blurred transition between the two frames, leaving 1.66ms of a sharp, clear image before the next frame starts. That is 90% of the time blurry and 10% of the time sharp.

25.06.2025 21:54 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
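
The 90%/10% split above, worked out as a quick sketch: a ~15ms pixel transition inside a 16.66ms frame at 60fps.

```python
# The 90%/10% split: a ~15 ms pixel transition inside a 16.66 ms frame (60fps)
# leaves ~1.66 ms of fully settled image before the next frame starts.
frame_time_ms = 1000 / 60   # ~16.66 ms at 60fps
transition_ms = 15.0        # the pixel response figure from the discussion

sharp_ms = frame_time_ms - transition_ms
print(f"blurry {transition_ms / frame_time_ms:.0%}, sharp {sharp_ms / frame_time_ms:.0%}")
# -> blurry 90%, sharp 10%, matching the post.
```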
