Underfox

@underfox3.bsky.social

Physicist, Telecom Engineering lover, HPC Enthusiast. Prog Rock/Metal fan. --- Independent tech analyst focused on semiconductors, patent analysis and emerging technologies.

759 Followers  |  17 Following  |  667 Posts  |  Joined: 12.11.2023

Latest posts by underfox3.bsky.social on Bluesky

Github:

primecai.github.io/moc/

29.08.2025 08:56 - 👍 1    🔁 0    💬 0    📌 0

Each query dynamically selects a few informational blocks, as well as mandatory anchors, with causal routing that avoids loop closures. The model is able to allocate computation to relevant histories, preserving identities, actions, and scenes across minutes of content.

29.08.2025 08:56 - 👍 1    🔁 0    💬 1    📌 0
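
The routing described in the post above can be illustrated with a small sketch. This is not the authors' implementation: the mean-pooled chunk descriptors, the dot-product scoring, the default `top_k`, and every name used here (`route_query`, `mean_pool`, `anchors`) are assumptions made purely for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(chunk: Sequence[Vector]) -> List[float]:
    """Average the key vectors of a chunk into one descriptor (assumed pooling)."""
    dim = len(chunk[0])
    return [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def route_query(query: Vector, chunks: Sequence[Sequence[Vector]],
                query_chunk: int, top_k: int = 2,
                anchors: Sequence[int] = (0,)) -> List[int]:
    """Return the sorted chunk indices that one query attends to.

    Anchors and the query's own chunk are always kept (mandatory).
    Routed chunks are the top_k highest-scoring *earlier* chunks,
    so a query never attends to the future (causal routing).
    """
    selected = {a for a in anchors if a <= query_chunk}
    selected.add(query_chunk)
    candidates = [i for i in range(query_chunk) if i not in selected]
    ranked = sorted(candidates,
                    key=lambda i: dot(query, mean_pool(chunks[i])),
                    reverse=True)
    selected.update(ranked[:top_k])
    return sorted(selected)


# Five toy chunks of 2-D keys; chunk 0 plays the role of the anchor.
chunks = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 3.0], [0.0, 3.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
    [[0.0, 2.0], [0.0, 2.0]],
]
print(route_query([0.0, 1.0], chunks, query_chunk=4, top_k=1))  # [0, 2, 4]
```

Only the selected chunks would then take part in attention, which is where the compute saving over dense attention comes from in a scheme like this.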

Researchers have proposed Mixture of Contexts, a long video generation framework that learns to route each query to the most relevant segments of the video sequence, instead of relying on uniform or static sparse attention or a fixed selection strategy.

arxiv.org/pdf/2508.21058

29.08.2025 08:56 - 👍 2    🔁 0    💬 1    📌 0

This work could pave the way not only for automatic optimizations for ML and science kernels but also for the development of LLM-optimized AMD GPU drivers. Congrats to the authors for this excellent work.

29.08.2025 08:35 - 👍 1    🔁 0    💬 0    📌 0

SwizzlePerf is the first work to add rich output from a suite of profilers to the LLM's context, so that the feedback directly reflects cache-locality improvements and improves LLM-driven optimization.

29.08.2025 08:35 - 👍 1    🔁 0    💬 1    📌 0

This isn't the first time AMD researchers have ventured into AI-powered GPU optimization. The biggest and most important difference is that this work takes hardware-awareness into account.

29.08.2025 08:35 - 👍 0    🔁 0    💬 1    📌 0

By grouping cooperative blocks into a single XCD, the proposed workflow reduces off-chip traffic and stabilizes residency in the disaggregated caches, reducing the average energy per instruction even in kernels whose execution time is dominated by arithmetic throughput.

29.08.2025 08:35 - 👍 0    🔁 0    💬 1    📌 0
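
To make the remapping in the post above concrete, here is a minimal sketch that is not taken from the paper. It assumes the hardware dispatches workgroups to XCDs round-robin by workgroup ID, that there are 8 XCDs, and that the grid size is a multiple of the XCD count. `swizzle` and `xcd_of` are hypothetical helper names.

```python
NUM_XCDS = 8  # assumption: number of accelerator dies (XCDs) on the GPU


def xcd_of(hw_id: int, num_xcds: int = NUM_XCDS) -> int:
    """XCD that receives a workgroup under the assumed round-robin dispatch."""
    return hw_id % num_xcds


def swizzle(hw_id: int, num_blocks: int, num_xcds: int = NUM_XCDS) -> int:
    """Map a hardware workgroup ID to the logical tile it should compute.

    Contiguous logical tiles (which tend to share data) are assigned to
    workgroups that land on the same XCD, so they share that XCD's L2.
    """
    if num_blocks % num_xcds != 0:
        raise ValueError("sketch assumes num_blocks is a multiple of num_xcds")
    blocks_per_xcd = num_blocks // num_xcds
    xcd = hw_id % num_xcds    # die this workgroup runs on
    slot = hw_id // num_xcds  # its position among that die's workgroups
    return xcd * blocks_per_xcd + slot


# With 32 workgroups, hardware IDs 0, 8, 16, 24 all run on XCD 0 and
# compute the neighbouring logical tiles 0, 1, 2, 3.
print([swizzle(i, 32) for i in (0, 8, 16, 24)])  # [0, 1, 2, 3]
```

Without the remap, logical tiles 0 to 3 would be spread over four different XCDs under the same dispatch assumption, and each die would fetch the shared data into its own L2.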

While the primary focus of the presented work was performance, it is clear that the same remapping will also have pronounced benefits in terms of energy efficiency.

29.08.2025 08:35 - 👍 0    🔁 0    💬 1    📌 0

The results show that, across a wide range of ML and scientific GPU kernels, SwizzlePerf can achieve up to a 2.1x speedup and a 70% L2 hit rate improvement.

29.08.2025 08:35 - 👍 0    🔁 0    💬 1    📌 0

This paper presents SwizzlePerf, an LLM workflow that automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness.

arxiv.org/pdf/2508.20258

29.08.2025 08:35 - 👍 3    🔁 0    💬 1    📌 0

This work will be presented at the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO 25), which will be held October 18-22, 2025 in Seoul, Korea.

29.08.2025 06:04 - 👍 0    🔁 0    💬 0    📌 0

OmniSim is able to successfully simulate 11 designs previously unsupported by any HLS tool, achieving up to 35.9x speedup over traditional C/RTL co-simulation, and up to 6.61x speedup over the state-of-the-art yet less capable simulator, LightningSim, on its own benchmark suite.

29.08.2025 06:04 - 👍 1    🔁 0    💬 1    📌 0

OmniSim carefully orchestrates functionality and performance simulation threads to accurately model hardware-level behavior under arbitrary OS scheduling, achieving near-C simulation speed with near-RTL accuracy for both functionality and performance.

29.08.2025 06:04 - 👍 0    🔁 0    💬 1    📌 0

This paper presents OmniSim, a framework that extends the C-level simulation capability of HLS tools by enabling both functionality and performance simulation for complex dataflow designs that are currently unsupported or considered infeasible.

arxiv.org/pdf/2508.19299

29.08.2025 06:04 - 👍 2    🔁 0    💬 1    📌 0

The implemented proof of concept is capable of demonstrating softmax computation and invertible logic without the need to create a network of probabilistic devices, offering major scalability advantages.

29.08.2025 05:05 - 👍 1    🔁 0    💬 0    📌 0

For the first time, researchers reported the realization of multi-value probabilistic computing by leveraging the thermally activated diffusion of magnetic skyrmions through an effectively non-flat energy landscape defined by a discrete number of sites.

arxiv.org/pdf/2508.19623

29.08.2025 05:05 - 👍 1    🔁 0    💬 1    📌 0

Excerpt from: Y. Wong, G. Zocchi, Spontaneous spiral patterns etched on Germanium, arXiv, 2025

Link: arxiv.org/pdf/2508.16764

29.08.2025 03:59 - 👍 2    🔁 0    💬 0    📌 0

A thin metallic film on germanium, in the presence of water, results in a remarkable pattern-forming system, such as this beautiful spiral spontaneously etched on the surface with a total structure diameter of 680 μm.

29.08.2025 03:59 - 👍 2    🔁 0    💬 1    📌 1

These findings represent a major step toward lower-power and faster spintronic devices for memory and logic applications, creating new possibilities for electrical modulation of spin dynamics and ultrafast spin injection into two-dimensional quantum materials.

28.08.2025 23:31 - 👍 1    🔁 0    💬 0    📌 0

The experimental results, obtained with direct contacts as well as contacts involving tunnel barriers, show efficient gate control, with over 100% enhancement in the demagnetization rate compared to bare cobalt by modulating the junction resistance.

28.08.2025 23:31 - 👍 1    🔁 0    💬 1    📌 0

Researchers have demonstrated graphene spin-field-effect junctions where the electric field can control ultrafast spin currents and spin dynamics in thin-film ferromagnets.

PRL link: journals.aps.org/prl/pdf/10.1...

28.08.2025 23:31 - 👍 2    🔁 0    💬 1    📌 0

Notably, the experiments in this work also revealed that the topology-aware losses can contribute to improving the geometry of the interpolated data.

28.08.2025 23:19 - 👍 1    🔁 0    💬 0    📌 0

Given an input sequence of persistence diagrams and a sparse temporal sampling of the corresponding data, the proposed approach inverts the non-keyframe diagrams to produce plausible estimations of the missing data.

28.08.2025 23:19 - 👍 2    🔁 0    💬 1    📌 0

In this paper, researchers have developed a neural approach for the topology-aware interpolation of scalar fields, with losses based on persistence diagrams that constrain the topology and geometry of the output interpolations.

arxiv.org/pdf/2508.17995

28.08.2025 23:19 - 👍 2    🔁 0    💬 1    📌 0

This paper presents an ab initio transistor simulation of unprecedented scale, including electron-electron interactions within the self-consistent GW approximation and carefully optimized to take advantage of the Alps and Frontier supercomputers. #HPC

arxiv.org/pdf/2508.19138

28.08.2025 05:00 - 👍 8    🔁 2    💬 1    📌 0

Through the proposed single-pass plane sweeping strategy, the method achieves over 60 FPS for 90+ views and up to 228 FPS for 45 views on a single RTX 5090.

28.08.2025 05:34 - 👍 1    🔁 0    💬 0    📌 0

Nvidia researchers have developed a unified framework for real-time radiance field rendering on light field displays, supporting a wide range of radiance field representations within a shared architecture based on a single-pass plane sweeping strategy.

arxiv.org/pdf/2508.18540

28.08.2025 05:34 - 👍 3    🔁 0    💬 1    📌 0

The results show that the proposed approach yields up to 6.20x throughput and 5.93x energy improvements for general workloads and 1.59x and 1.12x improvement to throughput and energy, respectively, for ML workloads on an A100 GPU.

28.08.2025 05:21 - 👍 1    🔁 0    💬 0    📌 0

Furthermore, a dynamic partition manager is proposed that uses a state machine model to manage MIG configurations, aiming to maximize flexibility for future partition creation, and that is integrated with the schedulers.

28.08.2025 05:21 - 👍 1    🔁 0    💬 1    📌 0

The proposed scheduler is based on a new time series-based predictive technique that determines the memory footprints of dynamically unanalyzable jobs, while also allowing these jobs to be scheduled in constrained partitions.

28.08.2025 05:21 - 👍 0    🔁 0    💬 1    📌 0
