
HGPU group

@hgpu.bsky.social

High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC

86 Followers  |  10 Following  |  239 Posts  |  Joined: 15.11.2024  |  1.9269

Latest posts by hgpu.bsky.social on Bluesky

An MLIR pipeline for offloading Fortran to FPGAs via OpenMP

With the slowing of Moore's Law, heterogeneous computing platforms such as Field Programmable Gate Arrays (FPGAs) have gained increasing interest for accelerating HPC workloads. In this work …

#OpenMP #FPGA #Fortran #Package

hgpu.org?p=30356

16.11.2025 15:02 | 👍 0  🔁 0  💬 0  📌 0
HipKittens: Fast and Furious AMD Kernels

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, rece…

#AMD #Performance #Package

hgpu.org?p=30355

16.11.2025 15:01 | 👍 0  🔁 0  💬 0  📌 0
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel g…

#CUDA #LLM #CodeGeneration

hgpu.org?p=30354

16.11.2025 15:00 | 👍 0  🔁 0  💬 0  📌 0
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leve…

#CUDA #Compression #Package

hgpu.org?p=30353

16.11.2025 14:59 | 👍 0  🔁 0  💬 0  📌 0
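The snippet cuts off before the method, but a standard building block for lossless floating-point compression (popularized by Facebook's Gorilla, and the kind of per-element transform a GPU framework can parallelize) is XOR-ing consecutive bit patterns so slowly varying data yields long runs of zero bits. A minimal Python sketch, not taken from the paper:

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a double as its 64-bit integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_deltas(values):
    """XOR each value's bit pattern with its predecessor's. Similar
    consecutive values produce mostly-zero deltas, which a downstream
    entropy coder compresses well."""
    prev, out = 0, []
    for v in values:
        bits = float_to_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out

def undo_xor_deltas(deltas):
    """Exact inverse: a running XOR recovers the original bit patterns,
    so the scheme is lossless by construction."""
    prev, values = 0, []
    for d in deltas:
        prev ^= d
        values.append(struct.unpack("<d", struct.pack("<Q", prev))[0])
    return values

series = [1.0, 1.0, 1.000001, 1.000002]
assert undo_xor_deltas(xor_deltas(series)) == series  # lossless round-trip
assert xor_deltas(series)[1] == 0                     # repeated value -> zero delta
```

Real codecs (Gorilla, and presumably adaptive GPU schemes) follow the XOR stage with leading/trailing-zero encoding; the sketch stops at the transform itself.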
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. …

#CUDA #PTX #HIP #Benchmarking #Package

hgpu.org?p=30352

16.11.2025 14:58 | 👍 0  🔁 0  💬 0  📌 0
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automati…

#CUDA #CodeGeneration #Performance #Package

hgpu.org?p=30343

09.11.2025 16:29 | 👍 1  🔁 0  💬 0  📌 0
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs

Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP …

#CUDA #HIP #Compression #Package

hgpu.org?p=30342

09.11.2025 16:28 | 👍 0  🔁 0  💬 0  📌 0
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computatio…

#FP8 #Precision

hgpu.org?p=30341

09.11.2025 16:28 | 👍 0  🔁 0  💬 0  📌 0
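The "double quantization error" in the title refers to the extra rounding picked up when an already-quantized tensor is re-quantized onto a second grid. A toy Python illustration with a generic symmetric quantizer (the scales 0.05 and 0.07 are arbitrary; the paper's actual FP8 formats are not modeled here):

```python
def quantize(x, scale, levels=127):
    """Symmetric round-to-nearest quantizer: x -> scale * clamp(round(x/scale))."""
    q = round(x / scale)
    q = max(-levels, min(levels, q))
    return q * scale

# Quantizing an already-quantized value onto a second, incompatible grid
# ("double quantization") stacks a second rounding error on the first --
# the effect a casting-free recipe is designed to avoid.
x = 0.23
once  = quantize(x, 0.05)                   # direct: rounds to 0.25
twice = quantize(quantize(x, 0.07), 0.05)   # via an intermediate grid: 0.21 -> 0.2
err_once, err_twice = abs(once - x), abs(twice - x)
assert err_twice > err_once
```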
AMD MI300X GPU Performance Analysis

The rapid growth of large language models (LLMs) has driven the need for high-performance, scalable GPU hardware capable of efficiently serving models with hundreds of billions of parameters. While…

#AMD #HIP #Benchmarking #Performance

hgpu.org?p=30340

09.11.2025 16:27 | 👍 0  🔁 0  💬 0  📌 0
RDMA Point-to-Point Communication for LLM Systems

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point c…

#CUDA #RDMA #LLM #Package

hgpu.org?p=30339

09.11.2025 16:26 | 👍 0  🔁 0  💬 0  📌 0
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations

Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can…

#CUDA #DeepLearning #DL #Package

hgpu.org?p=30330

02.11.2025 16:05 | 👍 1  🔁 0  💬 0  📌 0
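The stability issues the abstract alludes to often come down to accumulation precision inside operators. A pure-Python sketch that emulates float32 rounding via `struct` and compares naive summation with compensated (Kahan) summation, a common precision-tuning remedy (illustrative only, not the paper's method):

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python double to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_f32(xs):
    """Naive accumulation with every intermediate rounded to float32."""
    acc = to_f32(0.0)
    for x in xs:
        acc = to_f32(acc + to_f32(x))
    return acc

def kahan_sum_f32(xs):
    """Compensated (Kahan) summation in emulated float32: the running
    compensation term c recovers the low-order bits each add discards."""
    acc = c = 0.0
    for x in xs:
        y = to_f32(to_f32(x) - c)
        t = to_f32(acc + y)
        c = to_f32(to_f32(t - acc) - y)
        acc = t
    return acc

xs = [1e-4] * 100_000
exact = 10.0  # value in real arithmetic
# Naive float32 accumulation drifts; Kahan stays near the exact answer.
assert abs(kahan_sum_f32(xs) - exact) < abs(sum_f32(xs) - exact)
```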
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks

Transformer-based models such as BERT and GPT2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To addre…

#CUDA #LLM #AutoTuning #PerformancePortability #Package

hgpu.org?p=30329

02.11.2025 16:04 | 👍 0  🔁 0  💬 0  📌 0
Serve Programs, Not Prompts

Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible des…

#LLM #NLP

hgpu.org?p=30328

02.11.2025 16:03 | 👍 1  🔁 0  💬 0  📌 0
Scalable GPU-Based Integrity Verification for Large Machine Learning Models

We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. …

#SYCL #oneAPI #Rust #Security #Package

hgpu.org?p=30327

02.11.2025 16:02 | 👍 0  🔁 0  💬 0  📌 0
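The snippet does not detail the mechanism, but a common way to standardize and parallelize integrity checks over large weight blobs is chunked hashing folded into a Merkle root: chunks can be hashed independently (e.g. one per GPU work-group), then combined. A Python sketch; the chunk size and SHA-256 choice are assumptions for illustration:

```python
import hashlib

def merkle_root(model_bytes: bytes, chunk_size: int = 65536) -> str:
    """Hash a blob in fixed-size chunks, then fold the digests pairwise
    into a single Merkle root. Chunking makes the work parallelizable;
    the root is the single value to sign or compare."""
    chunks = [model_bytes[i:i + chunk_size]
              for i in range(0, len(model_bytes), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last digest on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

weights = bytes(range(256)) * 1024                    # stand-in for model weights
assert merkle_root(weights) == merkle_root(weights)   # deterministic
assert merkle_root(weights) != merkle_root(weights[:-1] + b"\x00")  # tamper detected
```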
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language …

#CUDA #MachineLearning #ML #Package

hgpu.org?p=30326

02.11.2025 16:02 | 👍 1  🔁 0  💬 0  📌 0
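At a fixed bit width, the INT-versus-FP trade-off is about grid shape: uniform integer levels versus the nonuniform levels of a floating-point format, which spend more resolution near zero at the cost of coarse steps near the maximum. A toy Python comparison with per-block scaling (the grids are illustrative, loosely e2m1-like; not the paper's exact formats):

```python
def quantize_to_grid(x, grid, scale):
    """Round x to the nearest representable value scale*g, g in grid."""
    return min((scale * g for g in grid), key=lambda v: abs(v - x))

# Toy representable values of two hypothetical 4-bit formats:
INT4 = list(range(-7, 8))                         # uniform integer grid
FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]    # e2m1-style magnitudes
FP4 = sorted(set(FP4 + [-v for v in FP4]))        # make it symmetric

def block_error(block, grid):
    """Fine-grained (blockwise) quantization: scale so the block's max
    magnitude maps onto the grid's max, then sum absolute errors."""
    scale = max(abs(v) for v in block) / max(grid)
    return sum(abs(quantize_to_grid(v, grid, scale) - v) for v in block)

uniform_block = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # evenly spread values
outlier_block = [0.3, 0.4, 0.5, 0.45, 6.0]           # one large outlier

# The uniform grid suits evenly spread blocks; the FP grid's dense small
# levels cope better with outlier-dominated blocks -- the trade-off studied.
assert block_error(uniform_block, INT4) < block_error(uniform_block, FP4)
assert block_error(outlier_block, FP4) < block_error(outlier_block, INT4)
```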
Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels

Tensor Cores (TCs) are specialized hardware units designed for efficient matrix multiplication and are widely utilized in deep learning workloads. However, their adoption in more irregular high-per…

#CUDA #Chemistry #MolecularDocking #Package

hgpu.org?p=30318

26.10.2025 20:04 | 👍 1  🔁 0  💬 0  📌 0
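The core trick behind mapping reductions onto matrix units is expressing a sum as a matrix product with a ones vector. In plain Python (the `matmul` stands in for a hardware MMA instruction):

```python
def matmul(A, B):
    """Plain dense matrix multiply (stand-in for a Tensor Core MMA op)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def reduce_rows_via_matmul(A):
    """Row sums expressed as A @ ones -- the reformulation that lets
    hardware built for GEMM perform reductions."""
    ones = [[1.0] for _ in A[0]]
    return [row[0] for row in matmul(A, ones)]

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
assert reduce_rows_via_matmul(A) == [6.0, 15.0]
```

The irregular part the paper tackles (ragged, non-tile-shaped reductions in docking kernels) is about packing such sums into fixed-size MMA tiles; the identity above is the starting point.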
STARK: Strategic Team of Agents for Refining Kernels

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, threa…

#CodeGeneration #LLM

hgpu.org?p=30317

26.10.2025 20:04 | 👍 0  🔁 0  💬 0  📌 0
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines

We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototy…

#AMD #FPGA #CodeGeneration #AI

hgpu.org?p=30316

26.10.2025 20:03 | 👍 0  🔁 0  💬 0  📌 0
Collective Communication for 100k+ GPUs

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. T…

#CUDA #GPUcluster #LLM #Performance #Package

hgpu.org?p=30315

26.10.2025 20:03 | 👍 0  🔁 0  💬 0  📌 0
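At this scale the workhorse collective is typically a ring-style all-reduce, where each rank exchanges one chunk per step with a neighbor only, keeping per-rank traffic independent of the rank count. A small Python simulation of the textbook algorithm (not the paper's framework):

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce across n ranks: each rank's vector is
    split into n chunks (one element per chunk here). Two phases of
    n-1 neighbor-only steps: reduce-scatter, then all-gather."""
    n = len(vectors)
    data = [list(v) for v in vectors]        # data[rank][chunk]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of
    # chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n               # chunk rank r forwards
            data[(r + 1) % n][c] += data[r][c]
    # All-gather: circulate the finished chunks for another n-1 steps.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
assert ring_allreduce(ranks) == [[111, 222, 333]] * 3
```

At 100k+ GPUs the latency term (proportional to the ring length) dominates, which is why production frameworks layer trees and hierarchies on top of this basic pattern.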
Tutoring LLM into a Better CUDA Optimizer

Recent leaps in large language models (LLMs) have caused a revolution in programming tools (like GitHub Copilot) that can help with code generation, debugging, and even performance optimization. In this…

#CUDA #LLM #CodeGeneration #Package

hgpu.org?p=30314

26.10.2025 20:03 | 👍 0  🔁 0  💬 0  📌 0
Thesis: Compiler and Runtime Systems for Generative AI Models

Generative AI (GenAI) workloads have rapidly become the predominant data center GPU workload. However, designing efficient GPU kernels for GenAI presents significant challenges due to two central f…

#CUDA #LLM #DeepLearning #DL #Package

hgpu.org?p=30305

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Specializing kernels by including runtime information during just-in-time (JIT) compilation can improve performance at the expense of potentially generating more kernels. In this work, we contribu…

#SYCL #HIP #CUDA #Performance #Package

hgpu.org?p=30304

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
Anonymized Network Sensing using C++26 std::execution on GPUs

Large-scale network sensing plays a vital role in network traffic analysis and characterization. As network packet data grows increasingly large, parallel methods have become mainstream for network…

#CUDA #CXX

hgpu.org?p=30303

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
A Performance Portable Matrix Free Dense MTTKRP in GenTen

We extend the GenTen tensor decomposition package by introducing an accelerated dense matricized tensor times Khatri-Rao product (MTTKRP), the workhorse kernel for canonical polyadic (CP) tensor de…

#Kokkos #CUDA #OpenMP #Package

hgpu.org?p=30302

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
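For reference, the mode-0 MTTKRP of a 3-way tensor X (I×J×K) with factor matrices B (J×R) and C (K×R) computes M[i][r] = Σ_{j,k} X[i][j][k]·B[j][r]·C[k][r]; "matrix-free" means the J·K × R Khatri-Rao matrix is never materialized. A direct Python sketch of the kernel (reference semantics only, nothing like GenTen's optimized version):

```python
def mttkrp_mode0(X, B, C):
    """Dense matrix-free MTTKRP for mode 0 of a 3-way tensor:
        M[i][r] = sum over j,k of X[i][j][k] * B[j][r] * C[k][r]
    computed by streaming over tensor entries, without forming the
    Khatri-Rao product matrix."""
    I, J, K = len(X), len(X[0]), len(X[0][0])
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                x = X[i][j][k]
                for r in range(R):
                    M[i][r] += x * B[j][r] * C[k][r]
    return M

# All-ones factors reduce MTTKRP to a plain sum over each mode-0 slice:
X = [[[1, 2], [3, 4]]]
assert mttkrp_mode0(X, [[1], [1]], [[1], [1]]) == [[10]]
```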
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Operator fusion, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers, has become a key optimization for deep learning. However, existing tensor c…

#CUDA #ROCm #Performance #DeepLearning #DL #Package

hgpu.org?p=30301

19.10.2025 20:35 | 👍 2  🔁 0  💬 0  📌 0
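Operator fusion in one line: merge producer and consumer loops so intermediates stay in registers instead of round-tripping through global memory. A minimal Python analogy (lists stand in for GPU buffers; real fusers like the tensor compilers the abstract mentions do this over whole operator graphs):

```python
def unfused(xs):
    """Two separate 'operators': each makes a full pass and a full
    intermediate array (extra global-memory traffic on a GPU)."""
    tmp = [x * 2.0 for x in xs]         # op 1: scale -> materialized
    return [max(t, 0.0) for t in tmp]   # op 2: ReLU  -> second pass

def fused(xs):
    """One fused pass: the intermediate value never leaves 'registers'."""
    return [max(x * 2.0, 0.0) for x in xs]

assert fused([-1.0, 0.5, 2.0]) == unfused([-1.0, 0.5, 2.0])  # same math, one pass
```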
Thesis: High-Performance Computing: from Optimization to Automation

The digital revolution of our society is driven by major technological advancements, enabled not only by the growing capabilities of computers but also by the evolution of their uses. These develop…

#CUDA #HIP #HPC

hgpu.org?p=30292

12.10.2025 14:49 | 👍 1  🔁 1  💬 0  📌 0
Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

MLIR (Multi-Level Intermediate Representation) has rapidly become a foundational technology for modern compiler frameworks, enabling extensibility across diverse domains. However, ensuring the corr…

#MLIR #OpenCL #Testing #Package

hgpu.org?p=30291

12.10.2025 14:48 | 👍 0  🔁 0  💬 0  📌 0
ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the s…

#CUDA #CodeGeneration #LLM #DeepLearning #DL #Package

hgpu.org?p=30290

12.10.2025 14:48 | 👍 1  🔁 0  💬 0  📌 0
Accelerating cosmological simulations on GPUs: a portable approach using OpenMP

In this work we present the porting to Graphics Processing Units (GPUs, using OpenMP target directives) and optimization of a key module within the cosmological pinocchio code, a Lagrangian Pertu…

#OpenMP #HPC #Astrophysics #Package

hgpu.org?p=30289

12.10.2025 14:47 | 👍 2  🔁 0  💬 0  📌 0
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promis…

#CUDA #LLM #AI #DeepLearning #DL #PyTorch

hgpu.org?p=30288

12.10.2025 14:47 | 👍 1  🔁 0  💬 0  📌 0