
HGPU group

@hgpu.bsky.social

High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC

88 Followers  |  10 Following  |  290 Posts  |  Joined: 15.11.2024

Latest posts by hgpu.bsky.social on Bluesky

Generating Literature-Driven Scientific Theories at Scale

Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain…

#LLM #Package

hgpu.org?p=30521

01.02.2026 22:37 — 👍 0    🔁 0    💬 0    📌 0

Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs

SMEs increasingly seek alternatives to cloud LLM APIs, which raise data privacy concerns. Dedicated cloud GPU instances offer improved privacy but with limited guarantees and ongoing costs, while p…

#CUDA #LLM #Package

hgpu.org?p=30520

01.02.2026 22:36 — 👍 0    🔁 0    💬 0    📌 0

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark co…

#Bioinformatics #AI #LLM #Package

hgpu.org?p=30519

01.02.2026 22:36 — 👍 0    🔁 0    💬 0    📌 0

ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler

As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such …

#OpenCL #LLM

hgpu.org?p=30518

01.02.2026 22:36 — 👍 0    🔁 0    💬 0    📌 0

Nsight Python: A Python-First Profiling Toolkit for Seamless GPU Kernel Analysis (Tool)

The proliferation of Python DSLs for developing kernels has democratized GPU programming. While kernel development is now Python-native, performance analysis and optimization still rely on external…

#CUDA #Triton #Profiling #Package

hgpu.org?p=30517

01.02.2026 22:35 — 👍 0    🔁 0    💬 0    📌 0

Towards Automated Kernel Generation in the Era of LLMs

The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Ac…

#CUDA #Triton #ROCm #LLM

hgpu.org?p=30511

25.01.2026 20:18 — 👍 0    🔁 0    💬 0    📌 0

A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially allevia…

#CUDA #LLM

hgpu.org?p=30510

25.01.2026 20:17 — 👍 0    🔁 0    💬 0    📌 0

SynPerf: A Hybrid Analytical-ML Framework for GPU Performance Prediction

The rapid expansion of Transformer-based large language models has dramatically increased the need for high-performance GPUs. As a result, there is growing demand for fast, accurate, and widely gen…

#Triton #CUDA #Performance #ML

hgpu.org?p=30509

25.01.2026 20:16 — 👍 0    🔁 0    💬 0    📌 0

Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache perform…

#CUDA #Performance

hgpu.org?p=30508

25.01.2026 20:15 — 👍 0    🔁 0    💬 0    📌 0

PhysProver: Advancing Automatic Theorem Proving for Physics

The combination of verifiable languages and LLMs has significantly influenced both the mathematical and computer science communities because it provides a rigorous foundation for theorem proving. R…

#Physics #LLM #Package

hgpu.org?p=30507

25.01.2026 20:14 — 👍 0    🔁 1    💬 0    📌 0

The New Compiler Stack: A Survey on the Synergy of LLMs and Compilers

This survey has provided a systematic overview of the emerging field of LLM-enabled compilation by addressing several key research questions. We first answered how LLMs are being integrated by prop…

#Compilers #LLM

hgpu.org?p=30502

11.01.2026 22:19 — 👍 0    🔁 0    💬 0    📌 0

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

Diffusion models have achieved remarkable success in image and video generation. However, their inherently multi-step inference process imposes substantial computational overhead, hindering real…

#CodeGeneration #LLM

hgpu.org?p=30501

11.01.2026 22:18 — 👍 0    🔁 0    💬 0    📌 0

Equivalence Checking of ML GPU Kernels

With the rapid progress of deep learning and large language models (LLMs), companies now spend enormous sums executing GPU kernels. These kernels have, therefore, become prime targets for aggressiv…

#CUDA #PTX #LLM

hgpu.org?p=30500

11.01.2026 22:17 — 👍 0    🔁 0    💬 0    📌 0

AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis

Modern AI models demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques like sparsity and quanti…

#Triton #CUDA #CodeGeneration #DSL #LLM

hgpu.org?p=30499

11.01.2026 22:16 — 👍 0    🔁 0    💬 0    📌 0

ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation

Parallel programming is central to HPC and AI, but producing code that is correct and fast remains challenging, especially for OpenMP GPU offload, where data movement and tuning dominate. Autonomou…

#CUDA #OpenMP #CodeGeneration #LLM #Package

hgpu.org?p=30498

11.01.2026 22:15 — 👍 0    🔁 0    💬 0    📌 0

SeedFold: Scaling Biomolecular Structure Prediction

Highly accurate biomolecular structure prediction is a key component of developing biomolecular foundation models, and one of the most critical aspects of building foundation models is identifying …

#Biology #Biomolecules

hgpu.org?p=30497

04.01.2026 20:25 — 👍 0    🔁 0    💬 0    📌 0

Hardware Acceleration for Neural Networks: A Comprehensive Survey

Neural networks have become a dominant computational workload across cloud and edge platforms, but rapid growth in model size and deployment diversity has exposed hardware bottlenecks increasingly …

#FPGA #TPU #NeuralNetworks #NN #Survey

hgpu.org?p=30496

04.01.2026 20:24 — 👍 0    🔁 0    💬 0    📌 0

Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission

Can a video be compressed at an extreme compression rate as low as 0.01%? To this end, we achieve a compression rate of 0.02% in some cases by introducing Generative Video Compression (GV…

#Compression #Video #AI

hgpu.org?p=30495

04.01.2026 20:24 — 👍 0    🔁 0    💬 0    📌 0

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled an…

#CUDA #HIP #HPC #LLM #Performance

hgpu.org?p=30494

04.01.2026 20:23 — 👍 0    🔁 0    💬 0    📌 0

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges – model architecture diversity, ker…

#CUDA #Triton #PTX #AI #Meta #LLM

hgpu.org?p=30493

04.01.2026 20:22 — 👍 0    🔁 0    💬 0    📌 0

Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation

Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-…

#CUDA #PTX #Triton #ProgrammingLanguages #Package

hgpu.org?p=30481

29.12.2025 11:33 — 👍 0    🔁 0    💬 0    📌 0

Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs

Recent advances in transformer-based foundation models have made them the default choice for many tasks, but their rapidly growing size makes fitting a full model on a single GPU increasingly diffi…

#CUDA #AI #Memory #Package

hgpu.org?p=30480

29.12.2025 11:32 — 👍 0    🔁 0    💬 0    📌 0

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-s…

#LLM #AI #Performance

hgpu.org?p=30479

29.12.2025 11:31 — 👍 0    🔁 0    💬 0    📌 0

Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany high…

#CUDA #ProgrammingLanguages

hgpu.org?p=30478

29.12.2025 11:31 — 👍 0    🔁 0    💬 0    📌 0

PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations

Advancements in large language models (LLMs) are showing promising impact in software development and programming assistance. However, these models struggle when operating on low-level backend code…

#CUDA #HIP #HLSL #AI #LLM #NLP

hgpu.org?p=30477

29.12.2025 11:30 — 👍 0    🔁 0    💬 0    📌 0

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA …

#CUDA #CUBLAS #MatrixMultiplication #Package

hgpu.org?p=30469

21.12.2025 21:27 — 👍 0    🔁 0    💬 0    📌 0

BoltzGen: Toward Universal Binder Design

We introduce BoltzGen, an all-atom generative model for designing proteins and peptides across all modalities to bind a wide range of biomolecular targets. BoltzGen builds strong structural reasoni…

#Biology #Bioinformatics #Biomolecules #Package

hgpu.org?p=30468

21.12.2025 21:27 — 👍 0    🔁 0    💬 0    📌 0

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks l…

#CUDA #CodeGeneration #LLM

hgpu.org?p=30467

21.12.2025 21:26 — 👍 0    🔁 0    💬 0    📌 0

PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backe…

#CUDA #PyTorch #Databases

hgpu.org?p=30466

21.12.2025 21:25 — 👍 0    🔁 0    💬 0    📌 0

ML Inference Scheduling with Predictable Latency

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may com…

#ML #MachineLearning #TaskScheduling

hgpu.org?p=30465

21.12.2025 21:24 — 👍 1    🔁 0    💬 0    📌 0
