
HGPU group

@hgpu.bsky.social

High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC

86 Followers  |  10 Following  |  239 Posts  |  Joined: 15.11.2024  |  1.9269

Latest posts by hgpu.bsky.social on Bluesky

An MLIR pipeline for offloading Fortran to FPGAs via OpenMP

With the slowing of Moore's Law, heterogeneous computing platforms such as Field Programmable Gate Arrays (FPGAs) have gained increasing interest for accelerating HPC workloads. In this work …

#OpenMP #FPGA #Fortran #Package

hgpu.org?p=30356

16.11.2025 15:02 | 👍 0  🔁 0  💬 0  📌 0
HipKittens: Fast and Furious AMD Kernels

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, rece…

#AMD #Performance #Package

hgpu.org?p=30355

16.11.2025 15:01 | 👍 0  🔁 0  💬 0  📌 0
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel g…

#CUDA #LLM #CodeGeneration

hgpu.org?p=30354

16.11.2025 15:00 | 👍 0  🔁 0  💬 0  📌 0
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leve…

#CUDA #Compression #Package

hgpu.org?p=30353

16.11.2025 14:59 | 👍 0  🔁 0  💬 0  📌 0
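The snippet cuts off before the method, but a standard building block for lossless floating-point compression (popularized by Facebook's Gorilla, and the kind of per-element transform a GPU framework can parallelize) is XOR-ing consecutive bit patterns so slowly varying data yields long runs of zero bits. A minimal Python sketch, not taken from the paper:

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a double as its 64-bit integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_deltas(values):
    """XOR each value's bit pattern with its predecessor's. Similar
    consecutive values produce mostly-zero deltas, which a downstream
    entropy coder compresses well."""
    prev, out = 0, []
    for v in values:
        bits = float_to_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out

def undo_xor_deltas(deltas):
    """Exact inverse: a running XOR recovers the original bit patterns,
    so the scheme is lossless by construction."""
    prev, values = 0, []
    for d in deltas:
        prev ^= d
        values.append(struct.unpack("<d", struct.pack("<Q", prev))[0])
    return values

series = [1.0, 1.0, 1.000001, 1.000002]
assert undo_xor_deltas(xor_deltas(series)) == series  # lossless round-trip
assert xor_deltas(series)[1] == 0                     # repeated value -> zero delta
```

Real codecs (Gorilla, and presumably adaptive GPU schemes) follow the XOR stage with leading/trailing-zero encoding; the sketch stops at the transform itself.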
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. …

#CUDA #PTX #HIP #Benchmarking #Package

hgpu.org?p=30352

16.11.2025 14:58 | 👍 0  🔁 0  💬 0  📌 0
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automati…

#CUDA #CodeGeneration #Performance #Package

hgpu.org?p=30343

09.11.2025 16:29 | 👍 1  🔁 0  💬 0  📌 0
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs

Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP …

#CUDA #HIP #Compression #Package

hgpu.org?p=30342

09.11.2025 16:28 | 👍 0  🔁 0  💬 0  📌 0
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computatio…

#FP8 #Precision

hgpu.org?p=30341

09.11.2025 16:28 | 👍 0  🔁 0  💬 0  📌 0
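The "double quantization error" in the title refers to the extra rounding picked up when an already-quantized tensor is re-quantized onto a second grid. A toy Python illustration with a generic symmetric quantizer (the scales 0.05 and 0.07 are arbitrary; the paper's actual FP8 formats are not modeled here):

```python
def quantize(x, scale, levels=127):
    """Symmetric round-to-nearest quantizer: x -> scale * clamp(round(x/scale))."""
    q = round(x / scale)
    q = max(-levels, min(levels, q))
    return q * scale

# Quantizing an already-quantized value onto a second, incompatible grid
# ("double quantization") stacks a second rounding error on the first --
# the effect a casting-free recipe is designed to avoid.
x = 0.23
once  = quantize(x, 0.05)                   # direct: rounds to 0.25
twice = quantize(quantize(x, 0.07), 0.05)   # via an intermediate grid: 0.21 -> 0.2
err_once, err_twice = abs(once - x), abs(twice - x)
assert err_twice > err_once
```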
AMD MI300X GPU Performance Analysis

The rapid growth of large language models (LLMs) has driven the need for high-performance, scalable GPU hardware capable of efficiently serving models with hundreds of billions of parameters. While…

#AMD #HIP #Benchmarking #Performance

hgpu.org?p=30340

09.11.2025 16:27 | 👍 0  🔁 0  💬 0  📌 0
RDMA Point-to-Point Communication for LLM Systems

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point c…

#CUDA #RDMA #LLM #Package

hgpu.org?p=30339

09.11.2025 16:26 | 👍 0  🔁 0  💬 0  📌 0
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations

Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can…

#CUDA #DeepLearning #DL #Package

hgpu.org?p=30330

02.11.2025 16:05 | 👍 1  🔁 0  💬 0  📌 0
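The stability issues the abstract alludes to often come down to accumulation precision inside operators. A pure-Python sketch that emulates float32 rounding via `struct` and compares naive summation with compensated (Kahan) summation, a common precision-tuning remedy (illustrative only, not the paper's method):

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python double to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_f32(xs):
    """Naive accumulation with every intermediate rounded to float32."""
    acc = to_f32(0.0)
    for x in xs:
        acc = to_f32(acc + to_f32(x))
    return acc

def kahan_sum_f32(xs):
    """Compensated (Kahan) summation in emulated float32: the running
    compensation term c recovers the low-order bits each add discards."""
    acc = c = 0.0
    for x in xs:
        y = to_f32(to_f32(x) - c)
        t = to_f32(acc + y)
        c = to_f32(to_f32(t - acc) - y)
        acc = t
    return acc

xs = [1e-4] * 100_000
exact = 10.0  # value in real arithmetic
# Naive float32 accumulation drifts; Kahan stays near the exact answer.
assert abs(kahan_sum_f32(xs) - exact) < abs(sum_f32(xs) - exact)
```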
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks

Transformer-based models such as BERT and GPT2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To addre…

#CUDA #LLM #AutoTuning #PerformancePortability #Package

hgpu.org?p=30329

02.11.2025 16:04 | 👍 0  🔁 0  💬 0  📌 0
Serve Programs, Not Prompts

Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible des…

#LLM #NLP

hgpu.org?p=30328

02.11.2025 16:03 | 👍 1  🔁 0  💬 0  📌 0
Scalable GPU-Based Integrity Verification for Large Machine Learning Models

We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. …

#SYCL #oneAPI #Rust #Security #Package

hgpu.org?p=30327

02.11.2025 16:02 | 👍 0  🔁 0  💬 0  📌 0
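The snippet does not detail the mechanism, but a common way to standardize and parallelize integrity checks over large weight blobs is chunked hashing folded into a Merkle root: chunks can be hashed independently (e.g. one per GPU work-group), then combined. A Python sketch; the chunk size and SHA-256 choice are assumptions for illustration:

```python
import hashlib

def merkle_root(model_bytes: bytes, chunk_size: int = 65536) -> str:
    """Hash a blob in fixed-size chunks, then fold the digests pairwise
    into a single Merkle root. Chunking makes the work parallelizable;
    the root is the single value to sign or compare."""
    chunks = [model_bytes[i:i + chunk_size]
              for i in range(0, len(model_bytes), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last digest on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

weights = bytes(range(256)) * 1024                    # stand-in for model weights
assert merkle_root(weights) == merkle_root(weights)   # deterministic
assert merkle_root(weights) != merkle_root(weights[:-1] + b"\x00")  # tamper detected
```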
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language …

#CUDA #MachineLearning #ML #Package

hgpu.org?p=30326

02.11.2025 16:02 | 👍 1  🔁 0  💬 0  📌 0
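At a fixed bit width, the INT-versus-FP trade-off is about grid shape: uniform integer levels versus the nonuniform levels of a floating-point format, which spend more resolution near zero at the cost of coarse steps near the maximum. A toy Python comparison with per-block scaling (the grids are illustrative, loosely e2m1-like; not the paper's exact formats):

```python
def quantize_to_grid(x, grid, scale):
    """Round x to the nearest representable value scale*g, g in grid."""
    return min((scale * g for g in grid), key=lambda v: abs(v - x))

# Toy representable values of two hypothetical 4-bit formats:
INT4 = list(range(-7, 8))                         # uniform integer grid
FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]    # e2m1-style magnitudes
FP4 = sorted(set(FP4 + [-v for v in FP4]))        # make it symmetric

def block_error(block, grid):
    """Fine-grained (blockwise) quantization: scale so the block's max
    magnitude maps onto the grid's max, then sum absolute errors."""
    scale = max(abs(v) for v in block) / max(grid)
    return sum(abs(quantize_to_grid(v, grid, scale) - v) for v in block)

uniform_block = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # evenly spread values
outlier_block = [0.3, 0.4, 0.5, 0.45, 6.0]           # one large outlier

# The uniform grid suits evenly spread blocks; the FP grid's dense small
# levels cope better with outlier-dominated blocks -- the trade-off studied.
assert block_error(uniform_block, INT4) < block_error(uniform_block, FP4)
assert block_error(outlier_block, FP4) < block_error(outlier_block, INT4)
```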
Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels

Tensor Cores (TCs) are specialized hardware units designed for efficient matrix multiplication and are widely utilized in deep learning workloads. However, their adoption in more irregular high-per…

#CUDA #Chemistry #MolecularDocking #Package

hgpu.org?p=30318

26.10.2025 20:04 | 👍 1  🔁 0  💬 0  📌 0
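The core trick behind mapping reductions onto matrix units is expressing a sum as a matrix product with a ones vector. In plain Python (the `matmul` stands in for a hardware MMA instruction):

```python
def matmul(A, B):
    """Plain dense matrix multiply (stand-in for a Tensor Core MMA op)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def reduce_rows_via_matmul(A):
    """Row sums expressed as A @ ones -- the reformulation that lets
    hardware built for GEMM perform reductions."""
    ones = [[1.0] for _ in A[0]]
    return [row[0] for row in matmul(A, ones)]

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
assert reduce_rows_via_matmul(A) == [6.0, 15.0]
```

The irregular part the paper tackles (ragged, non-tile-shaped reductions in docking kernels) is about packing such sums into fixed-size MMA tiles; the identity above is the starting point.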
STARK: Strategic Team of Agents for Refining Kernels

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, threa…

#CodeGeneration #LLM

hgpu.org?p=30317

26.10.2025 20:04 | 👍 0  🔁 0  💬 0  📌 0
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines

We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototy…

#AMD #FPGA #CodeGeneration #AI

hgpu.org?p=30316

26.10.2025 20:03 | 👍 0  🔁 0  💬 0  📌 0
Collective Communication for 100k+ GPUs

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. T…

#CUDA #GPUcluster #LLM #Performance #Package

hgpu.org?p=30315

26.10.2025 20:03 | 👍 0  🔁 0  💬 0  📌 0
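At this scale the workhorse collective is typically a ring-style all-reduce, where each rank exchanges one chunk per step with a neighbor only, keeping per-rank traffic independent of the rank count. A small Python simulation of the textbook algorithm (not the paper's framework):

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce across n ranks: each rank's vector is
    split into n chunks (one element per chunk here). Two phases of
    n-1 neighbor-only steps: reduce-scatter, then all-gather."""
    n = len(vectors)
    data = [list(v) for v in vectors]        # data[rank][chunk]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of
    # chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n               # chunk rank r forwards
            data[(r + 1) % n][c] += data[r][c]
    # All-gather: circulate the finished chunks for another n-1 steps.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
assert ring_allreduce(ranks) == [[111, 222, 333]] * 3
```

At 100k+ GPUs the latency term (proportional to the ring length) dominates, which is why production frameworks layer trees and hierarchies on top of this basic pattern.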
Tutoring LLM into a Better CUDA Optimizer

Recent leaps in large language models (LLMs) have caused a revolution in programming tools (like GitHub Copilot) that can help with code generation, debugging, and even performance optimization. In this…

#CUDA #LLM #CodeGeneration #Package

hgpu.org?p=30314

26.10.2025 20:03 | 👍 0  🔁 0  💬 0  📌 0
Thesis: Compiler and Runtime Systems for Generative AI Models

Generative AI (GenAI) workloads have rapidly become the predominant data center GPU workload. However, designing efficient GPU kernels for GenAI presents significant challenges due to two central f…

#CUDA #LLM #DeepLearning #DL #Package

hgpu.org?p=30305

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Specializing kernels by including runtime information during just-in-time (JIT) compilation can improve performance at the expense of potentially generating more kernels. In this work, we contribu…

#SYCL #HIP #CUDA #Performance #Package

hgpu.org?p=30304

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
Anonymized Network Sensing using C++26 std::execution on GPUs

Large-scale network sensing plays a vital role in network traffic analysis and characterization. As network packet data grows increasingly large, parallel methods have become mainstream for network…

#CUDA #CXX

hgpu.org?p=30303

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
A Performance Portable Matrix Free Dense MTTKRP in GenTen

We extend the GenTen tensor decomposition package by introducing an accelerated dense matricized tensor times Khatri-Rao product (MTTKRP), the workhorse kernel for canonical polyadic (CP) tensor de…

#Kokkos #CUDA #OpenMP #Package

hgpu.org?p=30302

19.10.2025 20:40 | 👍 0  🔁 0  💬 0  📌 0
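For reference, the mode-0 MTTKRP of a 3-way tensor X (I×J×K) with factor matrices B (J×R) and C (K×R) computes M[i][r] = Σ_{j,k} X[i][j][k]·B[j][r]·C[k][r]; "matrix-free" means the J·K × R Khatri-Rao matrix is never materialized. A direct Python sketch of the kernel (reference semantics only, nothing like GenTen's optimized version):

```python
def mttkrp_mode0(X, B, C):
    """Dense matrix-free MTTKRP for mode 0 of a 3-way tensor:
        M[i][r] = sum over j,k of X[i][j][k] * B[j][r] * C[k][r]
    computed by streaming over tensor entries, without forming the
    Khatri-Rao product matrix."""
    I, J, K = len(X), len(X[0]), len(X[0][0])
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                x = X[i][j][k]
                for r in range(R):
                    M[i][r] += x * B[j][r] * C[k][r]
    return M

# All-ones factors reduce MTTKRP to a plain sum over each mode-0 slice:
X = [[[1, 2], [3, 4]]]
assert mttkrp_mode0(X, [[1], [1]], [[1], [1]]) == [[10]]
```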
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Operator fusion, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers, has become a key optimization for deep learning. However, existing tensor c…

#CUDA #ROCm #Performance #DeepLearning #DL #Package

hgpu.org?p=30301

19.10.2025 20:35 | 👍 2  🔁 0  💬 0  📌 0
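Operator fusion in one line: merge producer and consumer loops so intermediates stay in registers instead of round-tripping through global memory. A minimal Python analogy (lists stand in for GPU buffers; real fusers like the tensor compilers the abstract mentions do this over whole operator graphs):

```python
def unfused(xs):
    """Two separate 'operators': each makes a full pass and a full
    intermediate array (extra global-memory traffic on a GPU)."""
    tmp = [x * 2.0 for x in xs]         # op 1: scale -> materialized
    return [max(t, 0.0) for t in tmp]   # op 2: ReLU  -> second pass

def fused(xs):
    """One fused pass: the intermediate value never leaves 'registers'."""
    return [max(x * 2.0, 0.0) for x in xs]

assert fused([-1.0, 0.5, 2.0]) == unfused([-1.0, 0.5, 2.0])  # same math, one pass
```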
Thesis: High-Performance Computing: from Optimization to Automation

The digital revolution of our society is driven by major technological advancements, enabled not only by the growing capabilities of computers but also by the evolution of their uses. These develop…

#CUDA #HIP #HPC

hgpu.org?p=30292

12.10.2025 14:49 | 👍 1  🔁 1  💬 0  📌 0
Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

MLIR (Multi-Level Intermediate Representation) has rapidly become a foundational technology for modern compiler frameworks, enabling extensibility across diverse domains. However, ensuring the corr…

#MLIR #OpenCL #Testing #Package

hgpu.org?p=30291

12.10.2025 14:48 | 👍 0  🔁 0  💬 0  📌 0
ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the s…

#CUDA #CodeGeneration #LLM #DeepLearning #DL #Package

hgpu.org?p=30290

12.10.2025 14:48 | 👍 1  🔁 0  💬 0  📌 0
Accelerating cosmological simulations on GPUs: a portable approach using OpenMP

In this work we present the porting to Graphics Processing Units (GPUs, using OpenMP target directives) and optimization of a key module within the cosmological pinocchio code, a Lagrangian Pertu…

#OpenMP #HPC #Astrophysics #Package

hgpu.org?p=30289

12.10.2025 14:47 | 👍 2  🔁 0  💬 0  📌 0
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promis…

#CUDA #LLM #AI #DeepLearning #DL #PyTorch

hgpu.org?p=30288

12.10.2025 14:47 | 👍 1  🔁 0  💬 0  📌 0