An MLIR pipeline for offloading Fortran to FPGAs via OpenMP
#OpenMP #FPGA #Fortran #Package
hgpu.org?p=30356
@hgpu.bsky.social
High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC
HipKittens: Fast and Furious AMD Kernels
#AMD #Performance #Package
hgpu.org?p=30355
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization
#CUDA #LLM #CodeGeneration
hgpu.org?p=30354
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data
#CUDA #Compression #Package
hgpu.org?p=30353
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies
#CUDA #PTX #HIP #Benchmarking #Package
hgpu.org?p=30352
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
#CUDA #CodeGeneration #Performance #Package
hgpu.org?p=30343
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs
#CUDA #HIP #Compression #Package
hgpu.org?p=30342
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
#FP8 #Precision
hgpu.org?p=30341
AMD MI300X GPU Performance Analysis
#AMD #HIP #Benchmarking #Performance
hgpu.org?p=30340
RDMA Point-to-Point Communication for LLM Systems
#CUDA #RDMA #LLM #Package
hgpu.org?p=30339
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations
#CUDA #DeepLearning #DL #Package
hgpu.org?p=30330
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks
#CUDA #LLM #AutoTuning #PerformancePortability #Package
hgpu.org?p=30329
Serve Programs, Not Prompts
#LLM #NLP
hgpu.org?p=30328
Scalable GPU-Based Integrity Verification for Large Machine Learning Models
#SYCL #oneAPI #Rust #Security #Package
hgpu.org?p=30327
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
#CUDA #MachineLearning #ML #Package
hgpu.org?p=30326
Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels
#CUDA #Chemistry #MolecularDocking #Package
hgpu.org?p=30318
STARK: Strategic Team of Agents for Refining Kernels
#CodeGeneration #LLM
hgpu.org?p=30317
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines
#AMD #FPGA #CodeGeneration #AI
hgpu.org?p=30316
Collective Communication for 100k+ GPUs
#CUDA #GPUcluster #LLM #Performance #Package
hgpu.org?p=30315
Tutoring LLM into a Better CUDA Optimizer
#CUDA #LLM #CodeGeneration #Package
hgpu.org?p=30314
Thesis: Compiler and Runtime Systems for Generative AI Models
#CUDA #LLM #DeepLearning #DL #Package
hgpu.org?p=30305
Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation
#SYCL #HIP #CUDA #Performance #Package
hgpu.org?p=30304
Anonymized Network Sensing using C++26 std::execution on GPUs
#CUDA #CXX
hgpu.org?p=30303
A Performance Portable Matrix Free Dense MTTKRP in GenTen
#Kokkos #CUDA #OpenMP #Package
hgpu.org?p=30302
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
#CUDA #ROCm #Performance #DeepLearning #DL #Package
hgpu.org?p=30301
Thesis: High-Performance Computing: from Optimization to Automation
#CUDA #HIP #HPC
hgpu.org?p=30292
Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR
#MLIR #OpenCL #Testing #Package
hgpu.org?p=30291
ConCuR: Conciseness Makes State-of-the-Art Kernel Generation
#CUDA #CodeGeneration #LLM #DeepLearning #DL #Package
hgpu.org?p=30290
Accelerating cosmological simulations on GPUs: a portable approach using OpenMP
#OpenMP #HPC #Astrophysics #Package
hgpu.org?p=30289
EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models
#CUDA #LLM #AI #DeepLearning #DL #PyTorch
hgpu.org?p=30288