HGPU group

@hgpu.bsky.social

High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC

76 Followers  |  10 Following  |  164 Posts  |  Joined: 15.11.2024

Latest posts by hgpu.bsky.social on Bluesky

The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in compl…

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

#ROCm #Triton #AI #CodeGeneration #Package

hgpu.org?p=30073

03.08.2025 17:41 | 👍 0    🔁 1    💬 0    📌 0

Empirical autotuning methods such as Bayesian optimization (BO) are a powerful approach that allows us to optimize tuning parameters of parallel codes as black-boxes. However, BO is an expensive ap…

[Thesis] GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning

#OpenMP

hgpu.org?p=30072

03.08.2025 17:40 | 👍 0    🔁 1    💬 0    📌 0

Neural processing units (NPUs) are gaining prominence in power-sensitive devices like client devices, with AI PCs being defined by their inclusion of these specialized processors. Running AI worklo…

NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers

#CodeGeneration #LLM #NPU

hgpu.org?p=30071

03.08.2025 17:40 | 👍 0    🔁 0    💬 0    📌 0

Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and nonlinear solvers. Automatic differentiation (AD) is a powerful technique for evalua…

Performance Portable Gradient Computations Using Source Transformation

#Kokkos #HIP #CUDA #Performance

hgpu.org?p=30070

03.08.2025 17:39 | 👍 0    🔁 0    💬 0    📌 0

As the era of heterogeneous computing evolves, benchmarking tools are vital for measuring performance across diverse architectures. We present OpenDwarfs 2025, a reengineered and modernized version…

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

#OpenCL #Benchmarking #Package

hgpu.org?p=30069

03.08.2025 17:39 | 👍 0    🔁 0    💬 0    📌 0

Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. M…

Kevin: Multi-Turn RL for Generating CUDA Kernels

#CUDA #LLM #Performance #AI

hgpu.org?p=30055

20.07.2025 16:01 | 👍 0    🔁 1    💬 0    📌 0

The rapid development in scientific research provides a need for more compute power, which is partly being solved by GPUs. This paper presents a microarchitectural analysis of the modern NVIDIA Bla…

Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks

#CUDA #PTX #HPC #Performance #Benchmarking

hgpu.org?p=30053

20.07.2025 16:01 | 👍 4    🔁 1    💬 0    📌 0

This work focuses on the use of deep reinforcement learning (DRL) to automate code optimization within modern compiler infrastructures. Code optimization is a critical step in program transformatio…

Thesis: Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler

#Performance #Physics #QCD #MLIR

hgpu.org?p=30054

20.07.2025 16:00 | 👍 0    🔁 0    💬 0    📌 0

Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de-facto standard AdamW, the simp…

Pre-Training LLMs on a budget: A comparison of three optimizers

#CUDA #LLM #MachineLearning #ML

hgpu.org?p=30052

20.07.2025 15:59 | 👍 1    🔁 0    💬 0    📌 0

Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated a powerful potential and is widely…

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

#CUDA #HIP #TaskScheduling #Package

hgpu.org?p=30051

20.07.2025 15:57 | 👍 0    🔁 0    💬 0    📌 0

The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet, the inherent complexity of parallel programming creates a…

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

#CUDA #HPC #LLM #CodeGeneration #Package

hgpu.org?p=30038

13.07.2025 16:34 | 👍 0    🔁 0    💬 0    📌 0

As GPU-using tasks become more common in embedded, safety-critical systems, efficiency demands necessitate sharing a single GPU among multiple tasks. Unfortunately, existing ways to schedule multip…

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

#CUDA #TaskScheduling #Package

hgpu.org?p=30037

13.07.2025 16:32 | 👍 0    🔁 0    💬 0    📌 0

This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) an…

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs

#Qualcomm #Cloud #LLM #HPC #DeepLearning #DL

hgpu.org?p=30036

13.07.2025 16:32 | 👍 1    🔁 0    💬 0    📌 0

The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, i…

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

#CUDA #GPUcluster #Communication

hgpu.org?p=30035

13.07.2025 16:31 | 👍 0    🔁 0    💬 0    📌 0

Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggl…

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

#GPU #Kubernets #Package

hgpu.org?p=30034

13.07.2025 16:30 | 👍 0    🔁 0    💬 0    📌 0

Rare earth (RE)-free permanent magnets, as alternative substitutes for RE-containing magnets for sustainable energy technologies and modern electronics, have attracted considerable interest. We per…

Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing

#CUDA #Physics #MaterialsScience #CondensedMatter #MachineLearning #ML #Package

hgpu.org?p=30007

06.07.2025 12:23 | 👍 1    🔁 0    💬 0    📌 0

Efficient arithmetic on multi-precision integers is a cornerstone of many scientific and cryptographic applications that require computations on integers that exceed the native sizes supported by m…

Thesis: Efficient GPU Implementation of Multi-Precision Integer Division

#CUDA #Futhark #Package

hgpu.org?p=30008

06.07.2025 12:22 | 👍 0    🔁 0    💬 0    📌 0

Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores t…

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

#CUDA #Sparse #SpMM #DeepLearning #DL #Package

hgpu.org?p=30006

06.07.2025 12:22 | 👍 2    🔁 0    💬 0    📌 0

GPGPU architectures have become significantly diverse in recent years, which has led to an emergence of a variety of specialized programming models and software stacks to support them. While portab…

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

#CUDA #OpenMP #LLM #CodeGeneration #Benchmarking #Package

hgpu.org?p=30005

06.07.2025 12:21 | 👍 0    🔁 0    💬 0    📌 0

We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first syst…

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

#OpenMP #LLM #HPC #CodeGeneration

hgpu.org?p=30004

06.07.2025 12:21 | 👍 1    🔁 1    💬 0    📌 0

Graphics APIs have traditionally relied on shading languages, however, these languages have a number of fundamental defects and limitations. By contrast, GPU compute platforms offer powerful, featu…

No More Shading Languages: Compiling C++ to Vulkan Shaders

#Vulkan #Compilers #GLSL #Rendering #Raytracing #Package

hgpu.org?p=29983

29.06.2025 14:49 | 👍 0    🔁 0    💬 0    📌 0

To design next-generation Graphics Processing Units (GPUs), GPU architects rely on GPU performance analyses to identify key GPU performance bottlenecks and explore GPU design spaces. Unfortunately,…

GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis

#CUDA #Performance

hgpu.org?p=29982

29.06.2025 14:48 | 👍 0    🔁 0    💬 0    📌 0

In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and p…

Omniwise: Predicting GPU Kernels Performance with LLMs

#ROCm #LLM #Performance

hgpu.org?p=29981

29.06.2025 14:48 | 👍 0    🔁 0    💬 0    📌 0

The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national l…

Survey of HPC in US Research Institutions

#HPC #AI

hgpu.org?p=29980

29.06.2025 14:47 | 👍 0    🔁 0    💬 0    📌 0

The rapid evolution of LLMs threatens to overwhelm existing wireless infrastructure, necessitating architectural innovations for burgeoning mobile LLM services. This paper introduces WiLLM, the fir…

WiLLM: An Open Wireless LLM Communication System

#LLM #Package

hgpu.org?p=29979

29.06.2025 14:47 | 👍 0    🔁 0    💬 0    📌 0

A range of computational biology software (GROMACS, AMBER, NAMD, LAMMPS, OpenMM, Psi4 and RELION) was benchmarked on a representative selection of HPC hardware, including AMD EPYC 7742 CPU nodes, N…

Engineering Supercomputing Platforms for Biomolecular Applications

#CUDA #ROCm #Biology #Biomolecules #MolecularDynamics #HPC #Physics #Package

hgpu.org?p=29954

22.06.2025 12:46 | 👍 0    🔁 0    💬 0    📌 0

Large language model-specific inference engines (LLM inference engines for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered app…

A First Look at Bugs in LLM Inference Engines

#LLM #AI

hgpu.org?p=29953

22.06.2025 12:45 | 👍 0    🔁 0    💬 0    📌 0

Over the past decades, Field-Programmable Gate Arrays (FPGAs) have become a choice for heterogeneous computing due to their flexibility, energy efficiency, and processing speed. OpenCL is used in F…

A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline

#OpenCL #FPGA

hgpu.org?p=29952

22.06.2025 12:44 | 👍 0    🔁 0    💬 0    📌 0

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent …

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

#CUDA #Compilers #Sparse #MatrixMultiplication

hgpu.org?p=29951

22.06.2025 12:44 | 👍 0    🔁 0    💬 0    📌 0

Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). To reduce the latency incurred by inter-GPU comm…

LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters

#GPUcluster

hgpu.org?p=29950

22.06.2025 12:44 | 👍 0    🔁 0    💬 0    📌 0

@hgpu is following 10 prominent accounts