# The LLVM Project Blog

LLVM Project News and Details from the Trenches

# LLVMCGO25 - CARTS: Enabling Event-Driven Task and Data Block Compilation for Distributed HPC

Hello everyone! I'm Rafael, a PhD candidate at the University of Delaware. I recently flew from Philadelphia to Las Vegas to attend the CGO conference, where I had the chance to present my project and soak in new ideas about HPC. In this blog, I'll dive into the project I discussed at the conference and share some personal insights and lessons I learned along the way. Although comments aren't enabled here, I'd love to hear from you. Feel free to reach out at (_rafaelhg at udel dot edu_) if you're interested in collaborating, have questions, or just want to chat.

## Motivation: Why CARTS?

Modern High-Performance Computing (HPC) and AI/ML workloads are pushing our hardware and software to the limits. Some key challenges include:

* **Evolving Architectures:** Systems now have complex memory hierarchies that need smart utilization.
* **Hardware Heterogeneity:** With multi-core CPUs, GPUs, and specialized accelerators in the mix, resource management gets tricky.
* **Performance Pressure:** Large-scale systems demand efficient handling of concurrency, synchronization, and communication.

These challenges led to the creation of CARTS, a compiler framework that combines the flexibility of MLIR with the reliability of LLVM to optimize applications for distributed HPC environments.

## A Closer Look at ARTS and Its Inspirations

At the heart of CARTS is ARTS. Originally, ARTS stood for the **Abstract Runtime System**. I often get mixed up and mistakenly call it the **Asynchronous Runtime System**. To keep things light, we sometimes joke about it being the **Any Runtime System**.

ARTS is inspired by the Codelet model, a concept I could talk about all day! The Codelet model breaks a computation into small, independent tasks (or "codelets") that can run as soon as their data dependencies are met. If you're curious to learn more about this model (or find it delightfully abstract), I suggest you visit our research group website at CAPSL, University of Delaware and check out the Codelet Model website.

### What Does ARTS Do?

ARTS is designed to support fine-grained, event-driven task execution in distributed systems. Here's a simple breakdown of some key concepts:

* **Event-Driven Tasks (EDTs):** These are the basic units of work that can be scheduled independently. Think of an EDT as a small, self-contained task that runs once all its required data is ready.
* **DataBlocks:** These represent memory regions holding the data needed by tasks. ARTS tracks these DataBlocks across distributed nodes so that tasks have quick and efficient access to the data they need.
* **Events:** These are signals that tell the system when a DataBlock is ready or when a task has finished. They help synchronize tasks without the need for heavy locks.
* **Epochs:** These act as synchronization boundaries. An epoch groups tasks together, ensuring that all tasks within the group finish before moving on to the next phase.

By modeling tasks, DataBlocks, events, and epochs explicitly, ARTS makes it easier to analyze and optimize how tasks are executed across large, distributed systems.
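To make these concepts a little more concrete, here is a minimal OpenMP sketch (my own illustration, not taken from the CARTS sources) of the kind of input CARTS works from. Each `task` with `depend` clauses corresponds naturally to an EDT, the array plays the role of a DataBlock, the dependence edge between the tasks is what events express at runtime, and the enclosing `taskgroup` behaves like an epoch:

```cpp
#include <cstdio>

// Two dependent tasks over a shared buffer. Conceptually, each task is
// an EDT, `data` is a DataBlock, the depend edge is an event, and the
// taskgroup is an epoch-like boundary.
int main() {
  int data[4] = {0, 0, 0, 0};
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp taskgroup
    {
      #pragma omp task depend(out: data) // producer EDT
      for (int i = 0; i < 4; ++i)
        data[i] = i * i;

      #pragma omp task depend(in: data)  // consumer EDT, runs after the producer
      for (int i = 0; i < 4; ++i)
        std::printf("%d\n", data[i]);
    } // all tasks in the group have finished here
  }
  return 0;
}
```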
## The CARTS Compiler Pipeline

Building on ARTS, CARTS creates a task-centric compiler workflow. Here's how it works:

### Clang/Polygeist: From C/OpenMP to MLIR

* **Conversion Process:** Using the Polygeist infrastructure, we translate C/OpenMP code into MLIR. This process handles multiple dialects (like OpenMP, SCF, Affine, and Arith).
* **Extended Support:** We've enhanced it to handle more OpenMP constructs, including OpenMP tasks.

### ARTS Dialect: Simplifying Concurrency

* **Custom Language Constructs:** The ARTS dialect converts high-level OpenMP tasks into a form that directly represents EDTs, DataBlocks, events, and epochs.
* **Easier Analysis:** This clear representation makes it simpler to analyze and optimize the code.

### Optimization and Transformation Passes

* **EDT Optimization:** We remove redundant tasks and optimize task structures, for example turning a "parallel" task that contains only one subtask into a "sync" task.
* **DataBlock Management:** We analyze memory access patterns to decide which DataBlocks are needed and optimize their usage.
* **Event Handling and Classic Optimizations:** We allocate and manage events, applying techniques like dead code elimination and common subexpression elimination to clean up the code.

### Lowering to LLVM IR and Runtime Integration

* **Conversion to LLVM IR:** The ARTS-enhanced MLIR is converted into LLVM IR. This involves outlining EDT regions into functions and inserting ARTS API calls for task, DataBlock, epoch, and event management (a rough sketch of this outlining follows below).
* **Seamless Integration:** The final binary runs on the ARTS runtime, which schedules tasks dynamically based on data readiness.
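To give a feel for what that lowering step does, here is a schematic, hand-written sketch of EDT outlining. Every name below (`artsEdtCreate`, `artsSignalEdt`, the signatures, the `Guid` type) is a placeholder I chose for illustration; the actual ARTS runtime interface may differ:

```cpp
#include <cstdint>

// Placeholder handle and API names, chosen for illustration only.
using Guid = std::uint64_t;
using EdtFn = void (*)(std::uint32_t depc, Guid *deps);

Guid artsEdtCreate(EdtFn fn, std::uint32_t depc);          // hypothetical: create an EDT
void artsSignalEdt(Guid edt, std::uint32_t slot, Guid db); // hypothetical: satisfy a dependence

// Step 1: the compiler outlines the body of the EDT region into a
// plain function whose inputs arrive as DataBlock dependences.
static void edtBody(std::uint32_t depc, Guid *deps) {
  // ... the original task body, reading its input DataBlocks via deps ...
}

// Step 2: the region itself is replaced with runtime calls that create
// the EDT and wire up its dependences.
void loweredCallSite(Guid inputDb) {
  Guid edt = artsEdtCreate(edtBody, /*depc=*/1);
  artsSignalEdt(edt, /*slot=*/0, inputDb); // the EDT runs once its DataBlock is ready
}
```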
## Looking Ahead: Future Directions for CARTS

The journey with CARTS is just beginning. Here's a glimpse of what's next:

* **Comprehensive Benchmarking:** Testing the infrastructure with a variety of benchmarks to validate performance under diverse scenarios.
* **Expanded OpenMP Support:** Enhancing support for additional OpenMP constructs such as loops, barriers, and locks.
* **Advanced Transformation Passes:** Developing techniques like dependency pruning, task splitting/fusion, and affine transformations to further optimize task management and data locality.
* **Memory-Centric Optimizations:** Implementing strategies like cache-aware tiling, data partitioning, and optimized memory layouts to reduce cache misses and enhance data transfer efficiency.
* **Feedback-Directed Compilation:** Incorporating runtime profiling data to adapt optimizations dynamically based on actual workload and hardware behavior.
* **Domain-Specific Extensions:** Creating specialized operations for domains such as stencil computations and tensor operations to boost performance in targeted HPC applications.

## Wrapping Up

Conferences like CGO are not just about technical presentations; they're also about meeting people and sharing ideas. I really enjoyed the mix of technical sessions and informal conversations. One of my favorite moments was meeting a professor at the conference and joking about how we only seem to meet when we're away from Newark. It's these human connections, along with the valuable feedback on my work, that make attending such events worthwhile.

Here are a few personal takeaways:

* **Invaluable Feedback:** Presenting work-in-progress at LLVM CGO workshops has taught me that constructive criticism is the fuel for innovation.
* **Community Spirit:** Reconnecting with fellow researchers, whether through formal sessions or casual hallway conversations, enriches both our professional and personal lives. I encourage fellow PhD candidates and early-career researchers to take every opportunity to present your work; your ideas might not be 100% polished, but the community is there to help you refine them.

Presenting CARTS allowed me to share detailed technical insights, discuss the practical challenges of HPC, and even have a few laughs along the way. While the technical details might seem dense at times, I hope the mix of personal anecdotes and hands-on explanations makes the topic accessible and engaging. If you're interested in discussing more about ARTS, the Codelet model, or anything else related to HPC, please drop me an email at (_rafaelhg at udel dot edu_). I'd love to chat, collaborate, or simply hang out.

## Acknowledgements

* This work is supported by the US DOE Office of Science project "Advanced Memory to Support Artificial Intelligence for Science" at PNNL. PNNL is operated by Battelle Memorial Institute under Contract DE-AC06-76RL01830.
* Thanks to the LLVM Foundation for the travel award that made attending the CGO conference possible.
04.08.2025
LLVM Fortran Levels Up: Goodbye flang-new, Hello flang!

LLVM has included a Fortran compiler "Flang" since LLVM 11 in late 2020. However, until recently the Flang binary was not `flang` (like `clang`) but instead `flang-new`.

LLVM 20 ends the era of `flang-new`. The community has decided that Flang is worthy of a new name. The "new" name? You guessed it, `flang`. A simple change that represents a major milestone for Flang.

This article will cover the almost 10-year journey of Flang: the first concepts, multiple rewrites, the adoption of LLVM's Multi-Level Intermediate Representation (MLIR) and Flang entering the LLVM Project.

If you want to try `flang` right now, you can download it or try it in your browser using Compiler Explorer.

# Why Fortran?

Fortran was first created in the 1950s, and the name came from "Formula Translation". Fortran focused on the mathematics use case and freed programmers from writing assembly code that could only run on specific machines.

Instead they could write code that looked like a formula. You expect this today, but for the time it was a revolution. This feature led to heavy use in scientific computing: weather modelling, fluid dynamics and computational chemistry, just to name a few.

> Whilst many alternative programming languages have come and gone, it [Fortran] has regained its popularity for writing high performance codes. Indeed, over 80% of the applications running on ARCHER2, a 750,000 core Cray EX which is the UK national supercomputer, are written in Fortran.

* Fortran High-Level Synthesis: Reducing the barriers to accelerating High Performance Computing (HPC) codes on FPGAs (Gabriel Rodriguez-Canal et al., 2023)

Fortran has had a resurgence in recent years, gaining a package manager, an unofficial standard library and LFortran, a compiler that supports interactive programming (LFortran also uses LLVM).

For the full history of Fortran, IBM has an excellent article on the topic, and I encourage you to look at the "Programmer's Primer for Fortran" if you want to see the early form of Fortran. If you want to learn the language, fortran-lang.org is a great place to start.

# Why Would You Make Another Fortran Compiler?

There are many Fortran compilers. Some are vendor specific, such as the Intel Fortran Compiler or NVIDIA's HPC compilers. Then there are open source options like GFortran, which supports many platforms. Why build one more?

The two partners in the early days of Flang were the US National Labs and NVIDIA. For Pat McCormick (Flang project lead at Los Alamos National Laboratory), preserving the utility of Fortran code was imperative:

> These [Fortran] codes represent an essential capability that supports many elements of our [The United States'] scientific mission and will continue to do so for the foreseeable future. A fundamental risk facing these codes is the absence of a long-term, non-proprietary support path for Fortran.

GFortran might seem to counter that statement, but remember that a single project is a single point of failure and a single source of incompatibilities and disagreements. Having multiple implementations reduces that risk.

NVIDIA's Gary Klimowicz laid out their goals for Flang in a presentation to FortranCon in 2020:

* Use a permissive license like that of LLVM, which is more palatable to commercial users and contributors.
* Develop an active community of Fortran compiler developers that includes companies and institutions.
* Support Fortran tool development by basing Flang on existing LLVM frameworks.
* Support Fortran language experimentation for future language standards proposals.

Intentions echoed by Pat McCormick:

> The overarching goal was to establish an open-source, modern implementation and simultaneously grow a community that spanned industry, academia, and federal agencies at both the national and international levels.

Fortran as a language also benefits from having many implementations. For C++ language features, it is common to implement them on top of Clang and GCC, to prove the feature is viable and get feedback.

Implementing the feature multiple times in different compilers uncovers assumptions that may be a problem for certain compilers, or certain groups of compiler users. In the same way, Flang and GFortran can provide that diversity.

However, even when features are standardised, standards can be ambiguous and implementations do make mistakes. A new compiler is a chance to uncover these.

Jeff Hammond (NVIDIA) is very familiar with this, having tested Flang with many existing applications. They had this to say on the motivations for Flang and how users have reacted to it:

> The Fortran language has changed quite a bit over the past 30 years. Modern Fortran deserves a modern compiler ecosystem, that's not only capable of compiling all the old codes and all the code written for the current standard, but also supports innovation in the future.
>
> Because it's a huge amount of work to build a feature-complete modern Fortran compiler, it's useful to leverage the resources of the entire LLVM community for this effort. NVIDIA and ARM play leading roles right now, with important contributions from IBM, Fujitsu and LBNL [Lawrence Berkeley National Laboratory], e.g. related to test suites and coarrays. We hope to see the developer community grow in the future.
>
> Another benefit from the LLVM Fortran compiler is that users are more likely to invest in supporting a new compiler when it has full language support and runs on all the platforms. A broad developer base is critical to support all the platforms.
>
> What I have seen so far interacting with our Fortran users is that they are very excited about LLVM Flang and were willing to commit to supporting it in their build systems and CI systems, which has driven quality improvements in both the Flang compiler and the applications.
>
> Like Clang did with C and C++ codes when it started to become popular, Flang is helping to identify bugs in Fortran code that weren't noticed before, which is making the Fortran software ecosystem better.

# PGI to LLVM: The Flang Timeline

The story of Flang really starts in 2015, but the Portland Group (PGI) collaborated with US National Labs prior to this. PGI would later become part of NVIDIA and be instrumental to the Flang project.

* **1989** The Portland Group is formed, to provide C, Fortran 77 and C++ compilers for the Intel i860 market.
* **1990** Intel bundles PGI compilers with its iPSC/860 supercomputer.
* **1996** PGI works with Sandia National Laboratories to provide compilers for the Accelerated Strategic Computing Initiative (ASCI) Option Red supercomputer.
* **December 2000** PGI becomes a wholly owned subsidiary of STMicroelectronics.
* **August 2011** Away from PGI, Bill Wendling starts an LLVM based Fortran compiler called "Flang" (later known as "Fort"). Bill is joined by several collaborators a few months later.
* **July 2013** PGI is sold to NVIDIA.

In late 2015 there were the first signs of what would become "Classic Flang". Though at the time it was just "Flang", I will use "Classic Flang" here for clarity.
Development of what was to become "Fort" continued under the "Flang" name, completely separate from the Classic Flang project.

* **November 2015** NVIDIA joins the US Department of Energy Exascale Computing Project, including a commitment to create an open source Fortran compiler.

> "The U.S. Department of Energy's National Nuclear Security Administration and its three national labs [Los Alamos, Lawrence Livermore and Sandia] have reached an agreement with NVIDIA's PGI division to adapt and open-source PGI's Fortran frontend, and associated Fortran runtime library, for contribution to the LLVM project."

(this news is also the first appearance of Flang in an issue of LLVM Weekly)

* **May 2017** The first release of Classic Flang as a separate repository, outside of the LLVM Project. Composed of a PGI compiler frontend and a new backend that generates LLVM Intermediate Representation (LLVM IR).
* **August 2017** The Classic Flang project is announced officially (according to LLVM Weekly's report, the original mailing list is offline). During this time, plans were formed to propose moving Classic Flang into the LLVM Project.
* **December 2017** The original "Flang" is renamed to "Fort" so as not to compete with Classic Flang.
* **April 2018** Steve Scalpone (NVIDIA) announces at the European LLVM Developers' Conference that the frontend of Classic Flang will be rewritten to address feedback from the LLVM community. This new front end became known as "F18".
* **August 2018** Eric Schweitz (NVIDIA) begins work on what would become "Fortran Intermediate Representation", otherwise known as "FIR". This work would later become the `fir-dev` branch.
* **February 2019** Steve Scalpone proposes contributing F18 to the LLVM Project.
* **April 2019** F18 is approved for migration into the LLVM Project monorepo. At this point F18 was only the early parts of the compiler; it could not generate code (later `fir-dev` work addressed this). Despite that, it moved into `flang/` in the monorepo, awaiting the completion of the rest of the work.
* **June 2019** Peter Waller (Arm) proposes adding a Fortran mode to the Clang compiler driver.
* **August 2019** The first appearance of the `flang.sh` driver wrapper script (more on this later).
* **December 2019** The plan for rewriting the F18 git history to fit into the LLVM project is announced. This effort was led by Arm, with Peter Waller going so far as to write a custom tool to rewrite the history of F18. Kiran Chandramohan (Arm) proposes an OpenMP dialect for MLIR, with the intention of using it in Flang (discussion continues on Discourse during the following January).
* **February 2020** The plan for improvements to F18 to meet the standards required for inclusion in the LLVM monorepo is announced by Richard Barton (Arm).
* **April 2020** Upstreaming of F18 into the LLVM monorepo is completed.

At this point what was in the LLVM monorepo was F18, the rewritten frontend of Classic Flang. Classic Flang remained unchanged, still using the PGI based frontend. Around this time work started in the Classic Flang repo on the `fir-dev` branch that would enable code generation when using F18.

For the following events, remember that Classic Flang was still in use. The Classic Flang binary is named `flang`, just like the folder F18 now occupies in the LLVM Project.

**Note:** Some LLVM changes referenced below will appear to have skipped an LLVM release. This is because they were done after the release branch was created, but before the first release from that branch was distributed.
* **April 2020** The first attempt at adding a new compiler driver for Flang is posted for review. It used the name `flang-tmp`. This change was later abandoned in favour of a different approach.
* **September 2020** Flang's new compiler driver is added as an experimental option. This is the first appearance of the `flang-new` binary, instead of `flang-tmp` as proposed before.

> The name was intended as temporary, but not the driver.

* Andrzej Warzyński (Arm, Flang Driver Maintainer)

* **October 2020** Flang is included in an LLVM release for the first time in LLVM 11.0.0. There is an `f18` binary and the previously mentioned script `flang.sh`.
* **August 2021** `flang-new` is no longer experimental and replaces the previous Flang compiler driver binary `f18`.
* **October 2021** LLVM 13.0.0 is the first release to include a `flang-new` binary (alongside `f18`).
* **March 2022** LLVM 14.0.0 releases, with `flang-new` replacing `f18` as the Flang compiler driver.
* **April 2022** NVIDIA ceases development of the `fir-dev` branch in the Classic Flang project. Upstreaming of `fir-dev` to the LLVM Project begins around this date. `flang-new` can now do code generation if the `-flang-experimental-exec` option is used. This change used work originally done on the `fir-dev` branch.
* **May 2022** Kiran Chandramohan announces at the European LLVM Developers' Meeting that Flang's OpenMP 1.1 support is close to complete. The `flang.sh` compiler driver script becomes `flang-to-external-fc`. It allows the user to use `flang-new` to parse Fortran source code, then write it back to a file to be compiled with an existing Fortran compiler. The script can be put in place of an existing compiler to test Flang's parser on large projects.
* **June 2022** Brad Richardson (Berkeley Lab) changes `flang-new` to generate code by default, removing the `-flang-experimental-exec` option.
* **July 2022** Valentin Clément (NVIDIA) announces that upstreaming of `fir-dev` to the LLVM Project is complete.
* **September 2022** LLVM 15.0.0 releases, including Flang's experimental code generation option.
* **September 2023** LLVM 17.0.0 releases, with Flang's code generation enabled by default.

At this point the LLVM Project contained Flang as it is known today, sometimes referred to as "LLVM Flang". "LLVM Flang" is the combination of the F18 frontend and MLIR-based code generation from `fir-dev`. As opposed to "Classic Flang", which combines a PGI based frontend and its own custom backend.

The initiative to upstream Classic Flang was in some sense complete. Though with all of the compiler rewritten in the process, what landed in the LLVM Project was very different to Classic Flang.

* **April 2024** The `flang-to-external-fc` script is removed.
* **September 2024** LLVM 19.1.0 releases. The first release of `flang-new` as a standalone compiler.
* **October 2024** The community deems that Flang has met the criteria to not be "new" and the name is changed. Goodbye `flang-new`, hello `flang`!
* **November 2024** AMD announces its next generation Fortran compiler, based on LLVM Flang. Arm releases an experimental version of its new Arm Toolchain for Linux product, which includes LLVM Flang as the Fortran compiler.
* **March 2025** LLVM 20.1.0 releases. The first time the `flang` binary has been included in a release.

# Flang and the Definition of New

Renaming Flang was discussed a few times before the final proposal.
It was always contentious, so for the final proposal Brad Richardson decided to use the LLVM proposal process. Rarely used, but specifically designed for these situations.

> After several rounds of back and forth, I thought the discussion was devolving and there wasn't much chance we'd come to a consensus without some outside perspective.

* Brad Richardson

That outside perspective included Chris Lattner (co-founder of the LLVM Project), who quickly identified a unique problem:

> We have a bit of an unprecedented situation where an LLVM project is taking the name of an already established compiler [Classic Flang]. Everyone seems to want the older flang [Classic Flang] to fade away, but flang-new is not as mature and it isn't clear when and what the criteria should be for that.

Confusion about the `flang` name was a key motivation for Brad Richardson too:

> Part of my concern was that the name "flang-new" would get common usage before we were able to change it. I think it's now been demonstrated that that concern was valid, because right now [November 2024] fpm [Fortran Package Manager] recognizes the compiler by that name.
>
> My main goal at that point was just clear goals for when we would make the name change.

No single list of goals was agreed, but some came up many times:

* Known limitations and supported features should be documented.
* As much as possible, work that was expected to fix known bugs should be completed, to prevent duplicate bug reports.
* Unimplemented language features should fail with a message saying that they are unimplemented, rather than with a confusing failure or by producing incorrect code.
* LLVM Flang should perform relatively well when compared to other Fortran compilers.
* LLVM Flang must have a reasonable pass rate with a large Fortran language test suite, and results of that must be shown publicly.
* All reasonable steps should be taken to prevent anyone using a pre-packaged Classic Flang confusing it with LLVM Flang.

You will see a lot of relative language in those, like "reasonable". No one could say exactly what that meant, but everyone agreed that it was inevitable that one day it would all be true.

Paul T Robinson summarised the dilemma early in the thread:

> > the plan is to replace Classic Flang with the new Flang in the future.
>
> I suppose one of the relevant questions here is: Has the future arrived?

After that Steve Scalpone (NVIDIA) gave their perspective that it was not yet time to change the name. So the community got to work on those goals:

* Many performance and correctness issues were addressed by the "High Level Fortran Intermediate Representation" (HLFIR) (which this article will explain later).
* A cross-company team including Arm, Huawei, Linaro, NVIDIA and Qualcomm collaborated to make it possible to build the popular SPEC 2017 benchmark with Flang.
* Flang gained support for OpenMP up to version 2.5, and was able to compile OpenMP specific benchmarks like SPEC OMP and the NAS Parallel Benchmarks.
* Linaro showed that the performance of Flang compared favourably with Classic Flang and was not far behind GFortran.
* The GFortran test suite was added to the LLVM Test Suite, and Flang achieved good results.
* Fujitsu's test suite was made public and tested with Flang. The process to make IBM's Fortran test suite public was started.

With all that done, in October of 2024 `flang-new` became `flang`. The future had arrived.

> And it's merged! It's been a long (and sometimes contentious) process, but thank you to everyone who contributed to the discussion.
* Brad Richardson, closing out the proposal.

The goals the community achieved have certainly been worth it for Flang as a compiler, but did Brad achieve their own goals?

> What did I hope to see as a result of the name change? I wanted it to be easier for more people to try it out.

So once you have finished reading this article, download Flang or try it out on Compiler Explorer. You know at least one person will appreciate it!

# Fortran Intermediate Representation (FIR)

All compilers that use LLVM as a backend eventually produce code in the form of the LLVM Intermediate Representation (LLVM IR).

A drawback of this is that LLVM IR does not include language specific information. This means that, for example, it cannot be used to optimise arrays in a way specific to Fortran programs.

One solution to this has been to build a higher level IR that represents the unique features of the language, optimise that, then convert the result into LLVM IR. Eric Schweitz (NVIDIA) started to do that for Fortran in late 2018:

> FIR was originally conceived as a high-level IR that would interoperate with LLVM but have a representation more friendly and amenable to Fortran optimizations.

Naming is hard but Eric did well here:

> FIR was a pun of sorts. Fortran IR and meant to be evocative of the trees (Abstract Syntax Trees).

We will not go into detail about this early FIR, because MLIR was revealed soon after Eric started the project and they quickly adopted it.

> When MLIR was announced, I quickly switched gears from building data structures for a new "intermediate IR" to porting my IR design to MLIR and using that instead.
>
> I believe FIR was probably the first "serious project" outside of Google to start using MLIR.

The FIR work continued to develop, with Jean Perier (NVIDIA) joining Eric on the project. It became its own public branch `fir-dev`, which was later contributed to the LLVM Project.

The following sections will go into detail on the intermediate representations that Flang uses today.

# MLIR

The journey from Classic Flang to LLVM Flang involved a rewrite of the entire compiler. This provided an opportunity to pick up new things from the LLVM Project. Most notably MLIR.

"Multi-Level Intermediate Representation" (MLIR) was first introduced to the LLVM community in 2019, around the time that F18 was approved to move into the LLVM Project.

The problem that MLIR addresses is the same one that Eric Schweitz tackled with FIR: it is difficult to map high level details of programming languages into LLVM IR. You either have to attach them to the IR as metadata, try to recover the lost details later, or fight an uphill battle to add the details to LLVM IR itself. These details are crucial for producing optimised code in certain languages (Fortran array optimisations were one use case referenced).

This led languages such as Swift and Rust to create their own IRs that include information relevant to their own optimisations. After that IR has been optimised it is converted into LLVM IR and goes through the normal compilation pipeline.

To implement these IRs they have to build a lot of infrastructure, but it cannot be shared between the compilers. This is where MLIR comes in.

> The MLIR project aims to directly tackle these programming language design and implementation challenges—by making it very cheap to define and introduce new abstraction levels, and provide "in the box" infrastructure to solve common compiler engineering problems.
* "MLIR: A Compiler Infrastructure for the End of Moore's Law" (Chris Lattner, Mehdi Amini et al., 2020)

## Flang and MLIR

The same year MLIR debuted, Eric Schweitz gave a talk at the later US LLVM Developers' meeting titled "An MLIR Dialect for High-Level Optimization of Fortran". FIR by that point was implemented as an MLIR dialect.

> That [switching FIR to be based on MLIR] happened very quickly and I never looked back.
>
> MLIR, even in its infancy, was clearly solving many of the exact same problems that we were facing building a new Fortran compiler.

* Eric Schweitz

The MLIR community were also happy to have Flang on board:

> It was fantastic to have very quickly in the early days of MLIR a non-ML [Machine Learning] frontend to exercise features we built in MLIR in anticipation. It led us to course-correct in some cases, and Flang was a motivating factor for many feature requests. It contributed significantly to establishing and validating that MLIR had the right foundations.

* Mehdi Amini

Flang did not stop there, later adding another dialect, "High Level Fortran Intermediate Representation" (HLFIR), which works at a higher level than FIR. A big target of HLFIR was array optimisations, which were more complex to handle using FIR alone.

> FIR was a compromise on both ends to some degree. It wasn't trying to capture syntactic information from Fortran, and I assumed there would be work done on an Abstract Syntax Tree. That niche would later be filled by "High Level FIR" [HLFIR].

* Eric Schweitz

## IRs All the Way Down

The compilation process starts with Fortran source code.

```fortran
subroutine example(a, b)
    real :: a(:), b(:)
    a = b
end subroutine
```

(Compiler Explorer)

The subroutine `example` assigns array `b` to array `a`.

It is tempting to think of the IRs in a "stack" where each one is converted into the next. However, MLIR allows multiple "dialects" of MLIR to exist in the same file. (The steps shown here are the most important ones for Flang. In reality there are many more between Fortran and LLVM IR.)

In the first step, Flang produces a file that is a mixture of HLFIR, FIR and the built-in MLIR dialect `func` (function).

```mlir
module attributes {<...>} {
  func.func @_QPexample(%arg0: !fir.box<!fir.array<?xf32>> {fir.bindc_name = "a"},
                        %arg1: !fir.box<!fir.array<?xf32>> {fir.bindc_name = "b"}) {
    %0 = fir.dummy_scope : !fir.dscope
    %1:2 = hlfir.declare %arg0 dummy_scope %0 {uniq_name = "_QFexampleEa"} :
      (!fir.box<!fir.array<?xf32>>, !fir.dscope) -> (!fir.box<!fir.array<?xf32>>, !fir.box<!fir.array<?xf32>>)
    %2:2 = hlfir.declare %arg1 dummy_scope %0 {uniq_name = "_QFexampleEb"} :
      (!fir.box<!fir.array<?xf32>>, !fir.dscope) -> (!fir.box<!fir.array<?xf32>>, !fir.box<!fir.array<?xf32>>)
    hlfir.assign %2#0 to %1#0 : !fir.box<!fir.array<?xf32>>, !fir.box<!fir.array<?xf32>>
    return
  }
}
```

For example, the "dummy arguments" (the arguments of a subroutine) are declared with `hlfir.declare` but their type is specified with `fir.array`. As MLIR allows multiple dialects to exist in the same file, there is no need for HLFIR to have a `hlfir.array` that duplicates `fir.array`, unless HLFIR wanted to handle that differently.
The next step is to convert HLFIR into FIR:

```mlir
module attributes {<...>} {
  func.func @_QPexample(<...>) {
    <...>
    %c3_i32 = arith.constant 3 : i32
    %7 = fir.convert %0 : (!fir.ref<!fir.box<!fir.array<?xf32>>>) -> !fir.ref<!fir.box<none>>
    %8 = fir.convert %5 : (!fir.box<!fir.array<?xf32>>) -> !fir.box<none>
    %9 = fir.convert %6 : (!fir.ref<!fir.char<1,17>>) -> !fir.ref<i8>
    %10 = fir.call @_FortranAAssign(%7, %8, %9, %c3_i32) : (!fir.ref<!fir.box<none>>, !fir.box<none>, !fir.ref<i8>, i32) -> none
    return
  }
  <...>
}
```

Then this bundle of MLIR dialects is converted into LLVM IR:

```llvm
define void @example_(ptr %0, ptr %1) {
  <...>
  store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %37, ptr %3, align 8
  call void @llvm.memcpy.p0.p0.i32(ptr %5, ptr %4, i32 48, i1 false)
  %38 = call {} @_FortranAAssign(ptr %5, ptr %3, ptr @_QQclX2F6170702F6578616D706C652E66393000, i32 3)
  ret void
}
<...>
```

This LLVM IR passes through the standard compilation pipeline that clang also uses, eventually being converted into target specific Machine IR (MIR), into assembly and finally into a binary program.

* Fortran
* MLIR (including HLFIR and FIR)
* MLIR (including FIR)
* LLVM IR
* MIR
* Assembly
* Binary

At each stage, the optimisations most suited to that stage are done. For example, while you have HLFIR you could optimise array accesses, because at that point you have the most information about how the Fortran treats arrays.

If Flang were to do this later on, in LLVM IR, it would be much more difficult. Either the information would be lost or incomplete, or you would be at a stage in the pipeline where you cannot assume that you started with a specific source language.

# OpenMP to Everyone

**Note:** Most of the points made in this section also apply to OpenACC support in Flang. In the interest of brevity, I will only describe OpenMP in this article. You can find more about OpenACC in this presentation.

## OpenMP Basics

OpenMP is a standardised API for adding parallelism to C, C++ and Fortran programs.

Programmers mark parts of their code with "directives". These directives tell the compiler how the work of the program should be distributed. Based on this, the compiler transforms the code and inserts calls to an OpenMP runtime library for certain operations.

This is a Fortran example:

```fortran
SUBROUTINE SIMPLE(N, A, B)
  INTEGER I, N
  REAL B(N), A(N)
!$OMP PARALLEL DO
  DO I=2,N
    B(I) = (A(I) + A(I-1)) / 2.0
  ENDDO
END SUBROUTINE SIMPLE
```

(from "OpenMP Application Programming Interface Examples", Compiler Explorer)

**Note:** Fortran arrays are one-based by default, so the first element is at index 1. This example reads the previous element as well, so it starts `I` at 2.

`!$OMP PARALLEL DO` is a directive in the form of a Fortran comment (Fortran comments start with `!`). `PARALLEL DO` starts a parallel "region" that includes the code from `DO` to `ENDDO`. This tells the compiler that the work in the `DO` loop should be shared amongst all the threads available to the program.

Clang has supported OpenMP for many years now. The equivalent C++ code is:

```cpp
void simple(int n, float *a, float *b)
{
  int i;
  #pragma omp parallel for
  for (i=1; i<n; i++)
    b[i] = (a[i] + a[i-1]) / 2.0;
}
```

(Compiler Explorer)

For C++, the directive is in the form of a `#pragma` and attached to the `for` loop.

LLVM IR does not know anything about OpenMP specifically, so Clang does all the work of converting the intent of the directives into LLVM IR.
The output from Clang looks like this:

```llvm
define dso_local void @simple(int, float*, float*)(i32 noundef %n, ptr noundef %a, ptr noundef %b) <...> {
entry:
  <...>
  call void (<...>) @__kmpc_fork_call(@simple <...> (.omp_outlined) <...>)
  ret void
}

define internal void @simple(int, float*, float*) (.omp_outlined)(ptr <...> %.global_tid., ptr <...> %.bound_tid., ptr <...> %n, ptr <...> %b, ptr <...> %a) {
entry:
  <...>
  call void @__kmpc_for_static_init_4(<...>)
  <...>
omp.inner.for.body.i:
  <...>
omp.loop.exit.i:
  call void @__kmpc_for_static_fini(<...>)
  <...>
  ret void
}
```

(output edited for readability)

The body of `simple` no longer does all the work. Instead it uses `__kmpc_fork_call` to tell the OpenMP runtime library to run another function, `simple (.omp_outlined)`, to do the work.

This second function is referred to as a "micro task". The runtime library splits the work across many instances of the micro task and each time the micro task function is called, it gets a different slice of the work. The number of instances is only known at runtime, and can be controlled with settings such as `OMP_NUM_THREADS`.

The LLVM IR representation of `simple (.omp_outlined)` includes labels like `omp.loop.exit.i`, but these are not specific to OpenMP. They are just normal LLVM IR labels whose name includes `omp`.
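If it helps to picture what the outlined micro task is doing, here is a rough C++ rendering of the transformation. This is my own sketch, not Clang's actual output; in the real lowering, the chunking bookkeeping is performed by runtime calls such as `__kmpc_for_static_init_4` rather than written out by hand:

```cpp
#include <algorithm>

// Conceptual sketch only: each micro task instance computes its own
// slice of the iteration space [1, n).
void simple_micro_task(int thread_id, int num_threads, int n,
                       float *a, float *b) {
  // Hypothetical static chunking; the OpenMP runtime does this division
  // of work in the generated code.
  int chunk = (n - 1 + num_threads - 1) / num_threads;
  int begin = 1 + thread_id * chunk;
  int end = std::min(begin + chunk, n);
  for (int i = begin; i < end; i++)
    b[i] = (a[i] + a[i - 1]) / 2.0f;
}
```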
## Sharing Clang's OpenMP Knowledge

Shortly after Flang was approved to join the LLVM Project, it was proposed that Flang should share OpenMP support code with Clang.

> This is an RFC for the design of the OpenMP front-ends under the LLVM umbrella. It is necessary to talk about this now as Flang (aka. F18) is maturing at a very promising rate and about to become a sub-project next to Clang.
>
> TLDR; Keep AST nodes and Sema separated but unify LLVM-IR generation for OpenMP constructs based on the (almost) identical OpenMP directive level.

* "[RFC] Proposed interplay of Clang & Flang & LLVM wrt. OpenMP", Johannes Doerfert (Lawrence Livermore National Laboratory), May 2019 (only one part of this still exists online; this quote is from a copy of the other part, which was provided to me).

For our purposes, the "TLDR" means that although both compilers have different internal representations of the OpenMP directives, they both have to produce LLVM IR from that representation.

This proposal led to the creation of the `LLVMFrontendOpenMP` library in `llvm`. By using the same class, `OpenMPIRBuilder`, there is no need to repeat work in both compilers, at least for this part of the OpenMP pipeline.

As you will see in the following sections, Flang has diverged from Clang for other parts of OpenMP processing.

## Bringing OpenMP to MLIR

Early in 2020, Kiran Chandramohan (Arm) proposed an MLIR dialect for OpenMP, for use by Flang.

> We started the work for the OpenMP MLIR dialect because of Flang. … So, MLIR has an OpenMP dialect because of Flang.

* Kiran Chandramohan

This dialect would represent OpenMP specifically, unlike the generic LLVM IR you get from Clang.

If you compile the original Fortran OpenMP example without OpenMP enabled, you get this MLIR:

```mlir
module attributes {<...>} {
  func.func @_QPsimple(<...> {
    %1:2 = hlfir.declare %arg0 <...> {uniq_name = "_QFsimpleEn"} : <...>
    %3:2 = hlfir.declare %2 <...> {uniq_name = "_QFsimpleEi"} : <...>
    %10:2 = hlfir.declare %arg1(%9) <...> {uniq_name = "_QFsimpleEa"} : <...>
    %17:2 = hlfir.declare %arg2(%16) <...> {uniq_name = "_QFsimpleEb"} : <...>
    %22:2 = fir.do_loop <...> {
      <...>
      hlfir.assign %34 to %37 : f32, !fir.ref<f32>
    }
    fir.store %22#1 to %3#1 : !fir.ref<i32>
    return
  }
}
```

(output edited for readability)

Notice that the `DO` loop has been converted into `fir.do_loop`. Now enable OpenMP and compile again:

```mlir
module attributes {<...>} {
  func.func @_QPsimple(<...> {
    %1:2 = hlfir.declare %arg0 <...> {uniq_name = "_QFsimpleEn"} : <...>
    %10:2 = hlfir.declare %arg1(%9) <...> {uniq_name = "_QFsimpleEa"} : <...>
    %17:2 = hlfir.declare %arg2(%16) <...> {uniq_name = "_QFsimpleEb"} : <...>
    omp.parallel {
      %19:2 = hlfir.declare %18 {uniq_name = "_QFsimpleEi"} : <...>
      omp.wsloop {
        omp.loop_nest (%arg3) : i32 = (%c2_i32) to (%20) inclusive step (%c1_i32) {
          hlfir.assign %32 to %35 : f32, !fir.ref<f32>
          omp.yield
        }
      }
      omp.terminator
    }
    return
  }
}
```

(output edited for readability)

You will see that instead of `fir.do_loop` you have `omp.parallel`, `omp.wsloop` and `omp.loop_nest`. `omp` is an MLIR dialect that describes OpenMP.

This translation of the `PARALLEL DO` directive is much more literal than the LLVM IR produced by Clang for `parallel for`. As the `omp` dialect is specifically made for OpenMP, it can represent it much more naturally. This makes it easier to understand the code and to write optimisations.

Of course Flang needs to produce LLVM IR eventually, and to do that it uses the same `OpenMPIRBuilder` class that Clang does. From the MLIR shown previously, `OpenMPIRBuilder` produces the following LLVM IR:

```llvm
define void @simple_ <...> {
entry:
  call void (<...>) @__kmpc_fork_call( <...> @simple_..omp_par <...>)
  ret void
}

define internal void @simple_..omp_par <...> {
omp.par.entry:
  call void @__kmpc_for_static_init_4u <...>
  <...>
omp_loop.exit:
  call void @__kmpc_barrier(<...>)
  ret void
omp_loop.body:
  <...>
}
```

The LLVM IR produced by Flang and Clang is superficially different, but structurally very similar. Considering the differences in source language and compiler passes, it is not surprising that they are not identical.

## ClangIR and the Future

It is surprising that a compiler for a language as old as Fortran got ahead of Clang (the most well known LLVM based compiler) when it came to adopting MLIR.

This is largely due to timing: MLIR is a recent invention, and Clang existed before MLIR arrived. Clang also has a legacy to protect, so it is unlikely to migrate to a new technology right away.

The ClangIR project is working to change Clang to use a new MLIR dialect, "Clang Intermediate Representation" ("CIR"). Much like Flang and its HLFIR/FIR dialects, ClangIR will convert C and C++ into the CIR dialect. Work on OpenMP support for ClangIR has already started, using the `omp` dialect that was originally added for Flang.

Unfortunately, at time of writing, the `parallel` directive is not supported by ClangIR.
However, if you look at the CIR produced when OpenMP is disabled, you can see the `cir.for` element that the OpenMP dialect might replace:

```mlir
module <...> attributes {<...>} {
  cir.func @_Z6simpleiPfS_( <...> {
    %1 = cir.alloca <...> ["a", init] <...>
    %2 = cir.alloca <...> ["b", init] <...>
    %3 = cir.alloca <...> ["i"] <...>
    cir.scope {
      cir.for : cond {
        <...>
      } body {
        <...>
        cir.yield loc(#loc13)
      } step {
        <...>
        cir.yield loc(#loc36)
      } loc(#loc36)
    } loc(#loc36)
    cir.return loc(#loc2)
  } loc(#loc31)
} loc(#loc)
```

(on Compiler Explorer)

# Flang Takes Driving Lessons

**Note:** This section paraphrases material from "Flang Drivers". If you want more detail please refer to that document, or Driving Compilers.

"Driver" in a compiler context means the part of the compiler that decides how to handle a set of options. For instance, when you use the option `-march=armv8a+memtag`, something in Flang knows that you want to compile for Armv8.0-a with the Memory Tagging Extension enabled.

`-march=` is an example of a "compiler driver" option. These options are what users give to the compiler. There is actually a second driver after this, confusingly called the "frontend" driver, despite being behind the scenes. In Flang's case the "compiler driver" is `flang` and the "frontend driver" is `flang -fc1` (they are two separate tools, contained in the same binary).

They are separate tools so that the compiler driver can provide an interface suited to compiler users, with stable options that do not change over time. On the other hand, the frontend driver is suited to compiler developers, exposes internal compiler details and does not have a stable set of options.

You can see the differences if you add `-###` to the compiler command:

```
$ ./bin/flang /tmp/test.f90 -march=armv8a+memtag -###
 "<...>/flang" "-fc1" "-triple" "aarch64-unknown-linux-gnu"
   "-target-feature" "+v8a" "-target-feature" "+mte"
 "/usr/bin/ld" \
   "-o" "a.out" "-L/usr/lib/gcc/aarch64-linux-gnu/11"
```

(output edited for readability)

The compiler driver has split the compilation into a job for the frontend (`flang -fc1`) and the linker (`ld`). `-march=` has been converted into many arguments to `flang -fc1`. This means that if compiler developers decided to change how `-march=` was converted, existing `flang` commands would still work.

Another responsibility of the compiler driver is to know where to find libraries and header files. This differs between operating systems and even distributions of the same family of operating systems (for example Linux distributions). This created a problem when implementing the compiler driver for Flang. All these details would take a long time to get right.

Luckily, by this time Flang was in the LLVM Project alongside Clang. Clang already knew how to handle this and had been tested on all sorts of systems over many years.

> The intent is to mirror clang, for both the driver and CompilerInvocation, as much as makes sense to do so. The aim is to avoid re-inventing the wheel and to enable people who have worked with either the clang or flang entry points, drivers, and frontends to easily understand the other.

* Peter Waller (Arm)

Flang became the first in-tree project to use Clang's compiler driver library (`clangDriver`) to implement its own compiler driver. This meant that Flang was able to handle all the targets and tools that Clang could, without duplicating large amounts of code.

# Reflections on Flang

We are almost 10 years from the first announcement of what would become LLVM Flang.
In the LLVM monorepo alone there have been close to 10,000 commits from around 400 different contributors. Undoubtedly more in Classic Flang before that. So it is time to hear from users, contributors, and supporters, past and present, about their experiences with Flang.

> Collaborating with NVIDIA and PGI on Classic Flang was crucial in establishing Arm in High Performance Computing. It has been an honour to continue investing in Flang, helping it become an integral part of the LLVM project and a solid foundation for building HPC toolchains.
>
> We are delighted to see the project reach maturity, as this was the last step in allowing us to remove all downstream code from our compiler. Look out for Arm Toolchain for Linux 20, which will be a fully open source, freely available compiler based on LLVM 20, available later this year.

* Will Lovett, Director Technology Management at Arm.

(the following quote is presented in Japanese and English; in case of differences, Japanese is the authoritative version)

> 富士通は、我々の数十年にわたるHPCの経験を通じて培ったテストスイートを用いて、Flangの改善に貢献できたことを嬉しく思います。Flangの親切で協力的なコミュニティに大変感銘を受けました。
>
> 富士通は、より高いパフォーマンスと使いやすさを実現し、我々のプロセッサを最大限に活用するために、引き続きFlangに取り組んでいきます。Flangが改善を続け、ユーザーを増やしていくことを強く願っています。
>
> Fujitsu is pleased to have contributed to the improvement of Flang with our test suite, which we have developed through our decades of HPC experience. Flang's helpful and collaborative community really impressed us.
>
> Fujitsu will continue to work on Flang to achieve higher performance and usability, to make the best of our processors. We hope that Flang will continue to improve and gain users.

* 富士通株式会社 コンパイラ開発担当 マネージャー 鎌塚 俊 (Shun Kamatsuka, Manager of the Compiler Development Team at Fujitsu).

> Collaboration between Linaro and Fujitsu on an active CI using Fujitsu's testsuite helped find several issues and make Flang more robust, in addition to detecting any regressions early.
>
> Linaro has been contributing to Flang development for two years now, fixing a great number of issues found by the Fujitsu testsuite.

* Carlos Seo, Tech Lead at Linaro.

> SciPy is a foundational Python package. It provides easy access to scientific algorithms, many of which are written in Fortran.
>
> This has caused a long stream of problems for packaging and shipping SciPy, especially because users expect first-class support for Windows; a platform that (prior to Flang) had no license-free Fortran compilers that would work with the default platform runtime.
>
> As maintainers of SciPy and redistributors in the conda-forge ecosystem, we hoped for a solution to this problem for many years. In the end, we switched to using Flang, and that process was a minor miracle.
>
> Huge thanks to the Flang developers for removing a major source of pain for us!

* Axel Obermeier, Quansight Labs.

> At the Barcelona Supercomputing Center, like many other HPC centers, we cannot ignore Fortran.
>
> As part of our research activities, Flang has allowed us to apply our work in long vectors for RISC-V to complex Fortran applications which we have been able to run and analyze in our prototype systems. We have also used Flang to support an in-house task-based directive-based programming model.
>
> These developments have proved to us that Flang is a powerful infrastructure.

* Roger Ferrer Ibáñez, Senior Research Engineer at the Barcelona Supercomputing Center (BSC).

> I am thrilled to see the LLVM Flang project achieve this milestone.
> It is a unique project in that it marries state of the art compiler technologies like MLIR with the venerable Fortran language and its large community of developers focused on high performance compute.
>
> Flang has set the standard for LLVM frontends by adopting MLIR and C++17 features earlier than others, and I am thrilled to see Clang and other frontends modernize based on those experiences.
>
> Flang also continues something very precious to me: the LLVM Project's ability to enable collaboration by uniting people with shared interests even if they span organizations like academic institutions, companies, and other research groups.

* Chris Lattner, serving member of the LLVM Board of Directors, co-founder of the LLVM Project, the Clang C++ compiler and MLIR.

> The need for a more modern Fortran compiler motivated the creation of the LLVM Flang project and AMD fully supports that path.
>
> In following with community trends, AMD's Next-Gen Fortran Compiler will be a downstream flavor of LLVM Flang and will in time supplant the current AMD Flang compiler, a downstream flavor of "Classic Flang".
>
> Our mission is to allow anyone that is using and developing a Fortran HPC codebase to directly leverage the power of AMD's GPUs. AMD's Next-Gen Fortran Compiler's goal is fulfilling our vision by allowing you to deploy and accelerate your Fortran codes on AMD GPUs using OpenMP offloading, and to directly interface and invoke HIP and ROCm kernels.

* AMD, "Introducing AMD's Next-Gen Fortran Compiler"

# Getting Involved

Flang might not be new anymore, but it is definitely still improving. If you want to try Flang on your own projects, you can download it right now.

If you want to contribute, there are many ways to do so: bug reports, code contributions, documentation improvements and so on. Flang follows the LLVM contribution process and you can find links to the forums, community calls and anything else you might need here.

# Credits

Thank you to the following people for their contributions to this article:

* Alex Bradbury (Igalia)
* Andrzej Warzyński (Arm)
* Axel Obermeier (Quansight Labs)
* Brad Richardson (Lawrence Berkeley National Laboratory)
* Carlos Seo (Linaro)
* Daniel C Chen (IBM)
* Eric Schweitz (NVIDIA)
* Hao Jin
* Jeff Hammond (NVIDIA)
* Kiran Chandramohan (Arm)
* Leandro Lupori (Linaro)
* Luis Machado (Arm)
* Mehdi Amini
* Pat McCormick (Los Alamos National Laboratory)
* Peter Waller (Arm)
* Steve Scalpone (NVIDIA)
* Tarun Prabhu (Los Alamos National Laboratory)

# Further reading

* Learn Fortran
* The 'eu' in eucatastrophe – Why SciPy builds for Python 3.12 on Windows are a minor miracle
* Resurrecting Fortran
* The Fortran Package Manager's First Birthday
* How to write a new compiler driver? The LLVM Flang perspective
* Flang in the Exascale Supercomputing Project
11.03.2025
GSoC 2024: Improve Clang-Doc

Hi, my name is Peter, and this year I was involved in Google Summer of Code 2024. I worked on improving the Clang-Doc documentation generator.

Mentors: Petr Hosek and Paul Kirth

## Project Background

Clang-Doc is a documentation generator developed on top of libtooling, as an alternative to Doxygen. Development started in 2018 and continued through 2019; however, it has since stalled. Currently, the tool can generate HTML, YAML, and Markdown, but the generated output has usability issues. This GSoC project aimed to address the pain points regarding the HTML output, by adding support for various C++ constructs and reworking the CSS of the HTML output to be more user-friendly.

## Work Done

The original scope of the project was to improve the output of Clang-Doc's generation. However, during testing the tool was significantly slower than expected, which made developing features for the tool impossible. Documentation generation for the LLVM codebase was taking upwards of 10 hours on my local machine. Additionally, the tool utilized a lot of memory and was prone to crashing with an out-of-memory error. Similar tools such as Doxygen and Hdoc ran in comparatively less time for the same codebase. This pointed to a significant bottleneck within Clang-Doc's code path when generating large-scale software projects. Due to this, the project scope quickly changed to improving the runtime of Clang-Doc so that it could run much faster. It was only during the latter half of the project that the scope changed back to improving Clang-Doc's generation.

### Added More Test Cases to the Clang-Doc Test Suite

Clang-Doc previously had tests which did not cover the full scope of the HTML or Markdown output. I added more end-to-end tests to make sure that in the process of optimizing documentation generation we were not degrading the quality or functionality of the tool. In summary, I added four comprehensive tests covering features that we were not testing, such as the generation of Enums, Namespaces, and Records for HTML and Markdown.

### Improved Clang-Doc's Performance by 1.58 Times

Internally, Clang-Doc works by leveraging libtooling's ASTVisitor class to parse the source level declarations in each TU. The tool is architected using a Map-Reduce pattern. Clang-Doc parses each fragment of a declaration into an in-memory data format, which is then serialized into an internal format and stored as a key-value pair, identified by its USR. Afterwards, Clang-Doc deserializes and combines each of the fragment declarations back into the in-memory data format, which is used by each of the backends to generate the results.

Many experiments were conducted to identify the source of the bottleneck. First I tried benchmarking the code with different codebases, such as JSON and fmtlib, to identify certain code patterns that slowed the code path down. This didn't really work, since the bottlenecking only showed up for large codebases like LLVM. Next I leveraged the Windows profiler (since I was coding on Windows); however, the visualizations were not helpful and my system was not capable of profiling the 10 hour runtime required to generate the LLVM documentation.

Eventually, we were able to identify a major bottleneck in Clang-Doc by leveraging the TimeProfiler (similar to -ftime-trace in Clang) to identify where the performance bottleneck was. Clang-Doc was performing redundant work when it was processing each declaration. We settled on a caching/memoization strategy to minimize the redundant work. For example, if we had the following project:

```cpp
//File: Base.h
class Base {};

//File: A.cpp
#include "Base.h"
...

//File: B.cpp
#include "Base.h"
...
```

In this case, the ASTVisitor invoked by Clang-Doc would visit and serialize the Base class three times: once when parsing Base.h, and again when visiting A.cpp and then B.cpp. The problem was that there was no mechanism to identify declarations that we had already seen. Using a simple dictionary to keep track of the declarations that Clang-Doc had visited, as a basic form of memoization, ended up being a surprisingly effective optimization.
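A minimal sketch of that idea (my own illustration, with hypothetical type and function names; the actual Clang-Doc patch differs in detail): before serializing a declaration, check whether its USR has already been seen.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical types/names for illustration; the real Clang-Doc code
// keys on the declaration's USR (Unified Symbol Resolution string).
struct Declaration { std::string USR; /* ... */ };
void serialize(const Declaration &D) { /* expensive mapping step (elided) */ }

class Visitor {
  std::unordered_set<std::string> SeenUSRs; // memoization table

public:
  void visit(const Declaration &D) {
    // Skip declarations we have already serialized, e.g. a class in a
    // header included by many translation units.
    if (!SeenUSRs.insert(D.USR).second)
      return;
    serialize(D);
  }
};
```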
Here is a summary of the benchmarking numbers (plots omitted). The benchmarks were performed on a machine with a 6th gen Intel(R) Xeon(R) CPU @ 2.00GHz with 96 cores and 180GB of RAM. Clang-Doc is able to run concurrently; however, the benchmark here is with concurrency set to 2, because anything higher crashes the slow version of the tool with an out-of-memory error. It took around 6 hours to complete a full generation of LLVM documentation with the previous version of the tool, whereas the current version took around 4 hours.

Benchmarking by number of threads shows a pretty dramatic dropoff as more and more threads are utilized: the original time of 6 hours was cut down to 13 minutes at 64 threads. Considering that previous versions of the tool could not use the higher thread counts without crashing (even on a machine with 180GB of RAM), the performance gains are even more dramatic.

### Added Template Mustache HTML Backend

Clang-Doc originally used an ad-hoc method of generating HTML. I introduced a templating language as a way of reducing project complexity and improving ease of development. Two RFCs were made before arriving at the idea of introducing Mustache as a library. Originally the idea was to introduce a custom templating language; however, upon further discussion, it was decided that the complexity of designing and implementing a new templating language was too much. An LLVM community member (@cor3ntin) suggested using Mustache as a templating language. Mustache was the ideal choice since it is very simple to implement, and has a well defined spec that fits what was needed for Clang-Doc's use case.

The feedback on the RFC was generally positive. While there was some resistance regarding the inclusion of an HTML support library in LLVM, this concern stemmed partly from a lack of awareness that HTML generation already occurs in several parts of LLVM. Additionally, the introduction of Mustache has the potential to simplify other HTML-related use cases. In terms of engineering wins, this library was able to cut down the HTML backend significantly, dropping 500 lines of code compared to the original Clang-Doc HTML backend. The library was also designed for general-purpose use around LLVM, since there are numerous places in LLVM where various tools generate HTML in their own way. Using the Mustache templating library would be a nice way to standardize the codebase.
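For a flavor of what templating buys here, this is a small illustration of the Mustache approach (the template and field names are invented for this post, not Clang-Doc's actual templates): the HTML structure lives in the template, and the backend only supplies data.

```cpp
// Hypothetical documentation template, written as a C++ string literal.
// Given data like {"RecordName": "Base", "Methods": [{"Name": "draw",
// "ReturnType": "void"}]}, a Mustache engine substitutes {{RecordName}}
// and expands the {{#Methods}}...{{/Methods}} section once per method.
const char *RecordTemplate = R"(
  <h1>{{RecordName}}</h1>
  <ul>
    {{#Methods}}
    <li><code>{{ReturnType}} {{Name}}()</code></li>
    {{/Methods}}
  </ul>
)";
```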
There was no linking between declarations in the project, and there was no syntax highlighting for any language construct. With the new Mustache changes, an additional backend was added under the flag --format=mhtml that addresses these issues. Below is a comparison of the same output between the two backends. You can also visit the output project on my github.io page here. Note: this output is still a work in progress.

## Learning Insight

I've learned a lot in the past few months. Thanks to GSoC, I now have a much better idea of what it's like to participate in a large open-source project. I received a lot of feedback through PRs, making RFCs, and collaborating with other GSoC members. I learned a lot about how to interact with the community and solicit feedback. I also learned a lot about instrumenting and profiling code, having conducted many experiments to try to speed Clang-Doc up.

## Future Work

As my work concluded, I was named one of the maintainers of the project. I plan to keep working on Clang-Doc until an MVP product can be generated and evaluated for the LLVM project. My remaining tasks include landing the Mustache support library and Clang-Doc's Mustache backend, as well as gathering feedback from the LLVM community regarding Clang-Doc's current output. Additionally, I intend to add test cases for the Mustache HTML backend to ensure its robustness and functionality.

## Conclusion

Overall, the current state of Clang-Doc is much healthier than it was before. It now has much better test coverage across all of its outputs (Markdown, HTML, YAML), whereas the previous end-to-end tests were far less comprehensive. The tool is significantly faster, especially for large-scale projects like LLVM, making documentation generation and development a much better experience. The tool also has a simplified HTML backend that will be much easier to work with, enabling faster development.

## Acknowledgements

I'd like to thank my mentors, Paul and Petr, for their invaluable input whenever I encountered issues with the project. This year has been tough for me mentally, and I'd like to thank my mentors for being accommodating.
23.12.2024 00:00
GSoC 2024: Adding LLVM and Clang plugin support for Windows

Hello everyone! My name is Thomas, and for GSoC I've been working on adding plugin support for LLVM and Clang on Windows. This mainly involved implementing proper support for building LLVM and Clang as shared libraries (known as DLLs on Windows, dylibs on macOS, and DSOs/SOs on most other Unices) on Windows.

## Background

The LLVM CMake build system has some existing support for building LLVM as a shared library on Windows, but it suffers from limitations when trying to make symbol visibility for DLLs work the way it does on Linux. Most of the environments that LLVM runs on use ELF as the object and executable file format. Windows, however, uses PE as its executable file format and COFF as its object file format. This difference matters because it affects how dynamic libraries operate in these environments. ELF (and Mach-O) based targets implicitly export symbols across the module boundary, and visibility can be controlled explicitly via the GNU attribute applied to the symbol: `__attribute__((__visibility__("...")))`. PE/COFF environments require two different attributes: symbols meant to be exposed to other modules are decorated with `__declspec(dllexport)`, and symbols imported from other modules are decorated with `__declspec(dllimport)`. Additionally, the PE format maintains a list of public symbols and assigns each a numerical identity known as the ordinal. This is represented in the file format as a 16-bit field, limiting the number of exported symbols to 64K.

To support DLL builds on MinGW, a Python script would scan the object files generated during the build and extract the symbol names from them. To stay under the 64K limit, the symbols would be filtered by pattern matching, and the final set would then be used to create an import library for consumers. This technique not only potentially over-exported symbols and introduced a second source of truth for the code, but also relied on the linker to generate fix-up thunks, because the compiler could not determine whether a symbol originated from a shared or a static library. This adds overhead to any function call that goes through the import library, as it is no longer a simple indirect call. Such a thunk is also not possible for data symbols, such as static fields of classes, except on MinGW, which uses a custom runtime fixup.

## What We Did

Some initial work I did was to update the LLVM CMake build system to be able to build an LLVM DLL using clang-cl's /Zc:dllexportInlines- option. Inline-declared class methods are normally not compiled unless used, but when the `__declspec(dllexport)` attribute is applied to a class, all of its methods are compiled and exported by the compiler even if unused. This option negates that behaviour, preventing inline methods from being compiled and exported. It avoids emitting these methods in every translation unit that includes the declaration, greatly reducing compile times for DLL builds. More importantly, it almost halves the number of exported symbols, to 28k for the LLVM DLL and 20k for the Clang DLL. The cost of this improvement is that DLLs built with this option cannot be consumed by code built with MSVC, as that code expects these methods to be available externally. There is a Microsoft Developer Community issue to add the option to MSVC; please consider voting for it so that Microsoft may consider adding it to the MSVC toolchain.
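The usual way to reconcile the ELF and PE/COFF models is a per-library export macro. Here is a minimal sketch (the `MYLIB_*` names are hypothetical, not LLVM's actual annotations):

// Hypothetical export macro bridging ELF visibility and PE/COFF declspec.
// The build system defines MYLIB_EXPORTS only while building the DLL itself,
// so consumers see dllimport and the compiler can emit a call through the
// import table instead of relying on the linker-generated thunks above.
#if defined(_WIN32)
  #if defined(MYLIB_EXPORTS)
    #define MYLIB_API __declspec(dllexport)
  #else
    #define MYLIB_API __declspec(dllimport)
  #endif
#else
  #define MYLIB_API __attribute__((visibility("default")))
#endif

class MYLIB_API Widget {      // exports the class's members
public:
  void frob();
};

MYLIB_API int mylibVersion(); // a single exported function

Tools like `ids` automate adding exactly this kind of annotation to thousands of headers.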
Another major thing I worked on was extending `ids`, a Clang tooling based tool that Saleem Abdulrasool created to automate adding symbol visibility macros to classes, global functions, and variables in public LLVM and Clang headers. I made its file processing multi-threaded and added a config file system to make it simple to add exclusions for different headers and directories when adding macro annotations. I also changed it to automatically add an include of the header that defines the visibility macros whenever it annotates code in a file.

I managed to get plugins for Clang and LLVM working, including passing the LLVM and Clang test suites when building with clang-cl. Some of the changes to support this have already been merged into LLVM and Clang or are waiting in open PRs, but it will take some time to get all the changes merged across the whole LLVM and Clang codebase. The greatly reduced install size from using a non-statically-linked build of the LLVM tools and Clang could also help with a current limitation of the installer used for the official Windows distribution, which forced the number of targets included in the official distribution to be limited: the install would shrink from over 2GB to close to 500MB.

## Future Work

Once all the symbol visibility changes have been merged into LLVM and Clang, a next step would be to use the `ids` tool to annotate newly added classes and functions, either by integrating it into LLVM's pre-submit PR actions to generate symbol visibility macros for new code, or alternatively via something like gn syncbot that runs as a post-commit action. A build bot will also need to be set up to make sure the Windows shared library builds continue to build and function correctly. Clang also has some weak areas when it comes to tracking the source location of some constructs in its AST, such as explicit function template instantiations, that `ids` would benefit from having fixed.

## Acknowledgements

I'd like to thank Tom Stellard for doing a lot of the initial work I reused and built on top of, and my mentors Saleem Abdulrasool and Vassil Vassilev.

## Links

* GitHub issue tracking current progress on plugin and LLVM_BUILD_LLVM_DYLIB support for Windows
* Previous discussion of supporting LLVM_BUILD_LLVM_DYLIB on Windows
16.12.2024 00:00
Lightstorm: minimalistic Ruby compiler

Some time ago I was talking about an ahead-of-time Ruby compiler. We started the project with certain goals and hypotheses in mind, and while the original compiler is at nearly 90% completion, there are still those other 90% that need to be done. In the meantime, we decided to strip it down to a bare minimum and implement just enough features to validate the hypotheses. Just like the original compiler, we use MLIR to bridge the gap between the Ruby VM's bytecode and the codegen, but instead of targeting LLVM IR directly, we go through the EmitC dialect and target the C language, as this significantly simplifies OS/CPU support. We go into a bit more detail later. The source code of our minimalistic Ruby compiler is here: https://github.com/dragonruby/lightstorm. The rest of the article covers why we decided to build it, how we approached the problem, and what we discovered along the way.

## Motivation and the use case

Our use case is pretty straightforward: we are building a cross-platform game engine that's indie-focused, productive, and easy to use. The game engine itself is written in a mix of C and Ruby, but the main user interface is Ruby itself. As soon as game development is done and the game is ready for "deployment," the code is more or less static, so we asked ourselves whether we could pre-compile it into machine code to make it run faster. But given all the dynamism of Ruby, why would compilation make it faster? Here comes our hypothesis. But first, let's look at some high-level implementation details to see where the hypothesis comes from.

## Compilers vs Interpreters

While a language itself cannot be strictly qualified as compiled or interpreted, the typical implementations certainly are. For a "compiled" language, the compiler takes the whole program, analyzes it, and produces machine code targeting specific hardware (real or virtual), while an interpreter takes the program and executes it right away, one "instruction" at a time.

_The definition above is a bit handwavy: zoom out far enough and everything is a compiler, zoom in close enough and everything is an interpreter. But you've got the gist._

Most Ruby implementations are interpreter-based, and in our case we are using mruby. The mruby interpreter is a lightweight register-based virtual machine (VM). Let's look at a concrete example. The following piece of code:

42 + 15

is converted into the following VM bytecode, consisting of various operations (ops for short):

LOADI R1 42
LOADI R2 15
ADD R1 R2
HALT

The VM's interpreter loop looks as follows (pseudocode):

dispatch_next:
  Op op = bytecode.next_op();
  switch (op.opcode) {
    case LOADI: {
      vstack.store(op.dest, mrb_int(op.literal));
      goto dispatch_next;
    }
    case ADD: {
      mrb_value lhs = vstack.load(op.lhs);
      mrb_value rhs = vstack.load(op.rhs);
      vstack.store(op.dest, mrb_add(lhs, rhs));
      goto dispatch_next;
    }
    // More ops...
    case HALT:
      goto halt_vm;
  }
halt_vm:
  // ...

For each bytecode operation, the VM jumps/branches into the right opcode handler and then branches back to the beginning of the dispatch loop. In the meantime, each opcode handler uses a virtual stack (confusingly located on the heap) to store intermediate results.
If we unroll the above bytecode manually, the code can look like this:

goto loadi_1;
loadi_1: // LOADI R1 42
  mrb_value t1 = mrb_int(42);
  vstack.store(1, t1);
  goto loadi_2;
loadi_2: // LOADI R2 15
  mrb_value t2 = mrb_int(15);
  vstack.store(2, t2);
  goto add;
add: // ADD R1 R2
  mrb_value lhs = vstack.load(1);
  mrb_value rhs = vstack.load(2);
  mrb_value t3 = mrb_add(lhs, rhs);
  vstack.store(1, t3);
  goto halt;
halt: // shutdown VM

Many items in this example can be eliminated: specifically, we can avoid loads/stores from/to the heap, and we can safely eliminate `goto`s/branches:

mrb_value t1 = mrb_int(42);
mrb_value t2 = mrb_int(15);
mrb_value t3 = mrb_add(t1, t2);
vstack.store(1, t3);
goto halt;
halt: // shutdown VM

So here goes our hypothesis:

> ### Hypothesis
>
> By precompiling/unrolling the VM dispatch loop, we can eliminate many loads/stores and branches along with branch mispredictions, which should improve the performance of the end program.

We can also try to apply some optimizations based on high-level bytecode analysis, but due to Ruby's dynamism the static optimization surface is limited.

## Approach

As mentioned in the beginning, building a full-fledged AOT compiler is a laborious task which requires time and has certain constraints. For the minimalistic version we decided to change/relax some of the constraints as follows:

* the compiled code must be compatible with the existing ecosystem/runtime
* the existing runtime must not require any changes
* the supported language features should be easily "representable" in C

Unlike the original compiler, we are not targeting machine code directly, but C instead. This eliminates a lot of complexity, but it also means that we only support a subset of the language (e.g., blocks and exceptions are missing at the moment). This is obviously not ideal, but it serves an important purpose: **our goal at this point is to validate the hypothesis**.

A classical compilation pipeline looks as follows: to build a compiler, one needs to implement the conversions from the raw source file all the way down to machine code, plus the language runtime library. Since we are targeting the existing implementation, we have the benefit of reusing the frontend (parsing + AST) and the runtime library. Still, we need to implement the conversion from the AST to machine code. And this is where the power of MLIR kicks in: we built a custom dialect (Rite) which represents the mruby VM's bytecode, and then use a number of builtin dialects (`cf`, `func`, `arith`, `emitc`) to convert our IR into C code. At this point, we can just use clang to compile/link the code together with the existing runtime, and that's it. The final compilation pipeline looks as follows:

> With the benefit of MLIR we were able to build a functional compiler in just a couple thousand lines of code!

Now let's look at how it performs.

## Some numbers

Benchmarking is hard, so take these numbers with a grain of salt. We ran various (micro)benchmarks showing results in the range of 1% to 1200% speedups, but we are sticking to aobench, as it is very close to the game-dev workloads we are targeting. mruby also uses aobench as part of its benchmark suite, though we slightly modified it to replace `Number.each` blocks with explicit `while` loops.
Next we used the excellent simple-kpc library to capture CPU counters on an Apple M1 CPU; namely, we collect total cycles, total instruction count, branches, branch mispredictions, and loads/stores (`FIXED_CYCLES`, `FIXED_INSTRUCTIONS`, `INST_BRANCH`, `BRANCH_MISPRED_NONSPEC`, and `INST_LDST` respectively). Naturally, we also collect the total execution time. All the benchmarks compare vanilla bytecode interpretation against the "unrolled" compiled version. We are using mruby 3.0; while it's not the most recent version at the time of writing, it was the most recent version at the time we built the compiler.

The following chart shows the results of our measurements. The three versions we compare are the baseline on the left, the compiled version without optimizations in the middle, and the compiled version plus simple escape analysis and common subexpression elimination (CSE) on the right side. _The raw data and the formulas are here._

With all the current optimizations in place, both the number of cycles and the total execution time went down by roughly ~30%. We were able to eliminate ~17% of branches and ~28% of loads/stores, while branch misses were cut in half with a ~55% decrease. The numbers look promising, although the number of loads/stores and branches will certainly go up as we implement all the language features, due to the way blocks and exceptions are handled. On the other hand, we haven't touched the runtime implementation, which, together with LTO, should enable some more improvements due to more inlining.

## Where to next?

As mentioned in the beginning, some parts of the engine itself are written in C, some of them purely for performance reasons. We are looking into replacing those critical pieces with compiled Ruby. While we may still pay a performance penalty, we hope the ease of maintenance will be worthwhile. In the meantime, do not hesitate to give it a shot, and if you have any questions, reach out to Alex or Amir!

## Some links

* the compiler: https://github.com/DragonRuby/lightstorm
* the game engine: https://dragonruby.org
* our discord: https://discord.dragonruby.org
09.12.2024 00:00
GSoC 2024: Out-Of-Process Execution For Clang-Repl

Hello! I'm Sahil Patidar, and this summer I had the exciting opportunity to participate in Google Summer of Code (GSoC) 2024. My project revolved around enhancing Clang-Repl by introducing Out-Of-Process Execution. Mentors: Vassil Vassilev and Matheus Izvekov.

## Project Background

Clang-Repl, part of the LLVM project, is a powerful interactive C++ interpreter using Just-In-Time (JIT) compilation. However, it faced two major issues: high resource consumption and instability. Running both Clang-Repl and the JIT in the same process consumed excessive system resources, and any crash in user code would shut down the entire session. To address these problems, **Out-Of-Process Execution** was introduced. By executing user code in a separate process, resource usage is reduced and crashes no longer affect the main session. This significantly enhances both the efficiency and stability of Clang-Repl, making it more reliable and suitable for a broader range of use cases, especially on resource-constrained systems.

## What We Accomplished

As part of my GSoC project, I focused on implementing out-of-process execution in Clang-Repl and enhancing the ORC JIT infrastructure to support this feature. Here is a breakdown of the key tasks and improvements I worked on:

### Out-Of-Process Execution Support for Clang-Repl

**PR**: #110418

One of the primary objectives of my project was to implement **out-of-process (OOP) execution** capabilities within Clang-Repl, enabling it to execute code in a separate, isolated process. This feature leverages **ORC JIT's remote execution capabilities** to enhance code execution flexibility by isolating runtime environments. To enable OOP execution in Clang-Repl, I utilized the `llvm-jitlink-executor`, allowing Clang-Repl to offload code execution to a dedicated executor process. This setup introduces a layer of isolation between Clang-Repl's main process and the code execution environment.

* **New Command-Line Flags**: To facilitate out-of-process execution, I added two key command-line flags:
  * **`--oop-executor`**: This flag starts a separate JIT executor process. The executor handles the actual code execution independently of the main Clang-Repl process.
  * **`--oop-executor-connect`**: This flag establishes a communication link between Clang-Repl and the out-of-process executor. It allows Clang-Repl to transmit code to the executor and retrieve the results of execution.

With these flags in place, Clang-Repl can use `llvm-jitlink-executor` to execute code in an isolated environment. This approach cleanly separates the compilation and execution stages, increasing flexibility and ensuring a more secure and manageable execution process.

### Issues Encountered

* **Block Dependence Calculation in ObjectLinkingLayer** (Commit Link)

**Code Example**

clang-repl> int f() {return 1;}
clang-repl> int f1() {return f();}
clang-repl> f1();
error: disconnecting
clang-repl> JIT session error: FD-transport disconnected
JIT session error: disconnecting
JIT session error: FD-transport disconnected
JIT session error: Failed to materialize symbols: { (main, { __Z2fv }) }
disconnecting

During my work on `clang-repl`, I encountered an issue where the JIT session would crash during incremental compilation. The root cause was a bug in `ObjectLinkingLayer::computeBlockNonLocalDeps`.
The problem arose from the way the worklist was built: it was being populated within the same loop that records immediate dependencies and dependants, which caused some blocks to be missed from the worklist. This bug was fixed by **Lang Hames**.

### ORC JIT Enhancements

As part of the OOP execution work, several improvements were made to ORC JIT, the underlying framework responsible for dynamic compilation and execution of code in Clang-Repl. These improvements target better handling of incremental execution, especially for Mach-O and ELF platforms, and ensure that initializers are properly managed across different execution environments.

1. **Incremental Initializer Execution for Mach-O and ELF**

   **PRs**: #97441, #110406

   In a typical JIT execution environment, the `dlopen` function is used to handle code mapping, reference counting, and initializer execution for dynamically loaded libraries. However, this approach is often too broad for interactive environments like Clang-Repl, where we only need to execute newly introduced initializers rather than reinitializing everything. To address this, I introduced the **`dlupdate`** function in the ORC runtime. `dlupdate` is a targeted solution that focuses solely on running new initializers added during a REPL session. Unlike `dlopen`, which handles a variety of tasks and can incur unnecessary overhead, `dlupdate` only triggers the execution of newly registered initializers, avoiding redundant operations. This improvement is particularly beneficial in interactive settings like Clang-Repl, where code is frequently updated in small increments, and it significantly improves Clang-Repl's efficiency.

2. **Push-Request Model for ELF Initializers**

   **PR**: #102846

   A push-request model has been introduced to manage ELF initializers within the runtime state for each `JITDylib`, similar to how initializers are handled for Mach-O and COFF. Previously, ELF required a fresh request for initializers with each invocation of `dlopen`, but lacked mechanisms to register, deregister, or retain these initializers. This created issues during subsequent `dlopen` calls, because initializers were erased after the `rt_getInitializers` function was invoked, making further executions impossible. To resolve these issues, the following functions were introduced:

   * **`__orc_rt_elfnix_register_init_sections`**: Registers ELF initializers for the `JITDylib`.
   * **`__orc_rt_elfnix_register_jitdylib`**: Registers the `JITDylib` with the ELF runtime state.

   With the new push-request model, the management and tracking of initializers for each `JITDylib` state are now more efficient. By leveraging Mach-O's `RecordSectionsTracker`, only newly registered initializers are executed, greatly improving efficiency and reliability when working with ELF targets in `clang-repl`. This update is crucial for enabling out-of-process execution in `clang-repl` on ELF platforms, offering a more effective approach to managing incremental execution.

### Additional Improvements

Beyond the main enhancements to Clang-Repl and ORC JIT, I also worked on several other improvements:

1. **Auto-loading Dynamic Libraries in ORC JIT**

   **PR**: #109913 (ongoing)

   With this update, we introduced a new feature to the ORC executor and controller: **automatic loading of dynamic libraries in the ORC JIT**. This enables efficient resolution of symbols from both loaded and unloaded libraries.
   * How it works:
     * **Symbol Lookup:** When a lookup request is made, the system first attempts to resolve the symbol from already-loaded libraries.
     * **Unloaded Libraries Scan:** If the symbol is not found in any loaded library, the system then scans the unloaded dynamic libraries to locate it.
   * Key addition: **Global Bloom Filter.** A significant improvement in this update is the introduction of a global Bloom filter. When a symbol cannot be resolved in the loaded libraries, the symbol tables of the scanned libraries are incorporated into this filter. If the symbol is still not found, the Bloom filter's result is returned to the controller, allowing it to skip lookups for symbols that do not exist in the global table during future requests. Additionally, the system tracks symbols that were previously thought to be present but are actually absent in both loaded and unloaded libraries. With these enhancements, symbol resolution is significantly faster, as the Bloom filter helps prevent unnecessary lookups, improving efficiency for both loaded and unloaded dynamic libraries.

2. **Refactor of the `dlupdate` Function**

   **PR**: #110491

   This update simplifies the `dlupdate` function by removing the `mode` argument, streamlining the function's interface. The change improves the clarity and usability of `dlupdate` by reducing unnecessary parameters, improving the overall maintainability of the code.

## Benchmarks: In-Process vs Out-of-Process Execution

* Prime Finder
* Fibonacci Sequence
* Matrix Multiplication
* Sorting Algorithms

## Result

With these changes, `clang-repl` now supports out-of-process execution. We can run it using the following command:

clang-repl --oop-executor=path/to/llvm-jitlink-executor --orc-runtime=path/to/liborc_rt.a

## Future Work

* **Crash Recovery and Session Continuation:** Investigate and develop ways to enhance crash recovery so that if something goes wrong, the session can seamlessly resume without losing progress. This involves exploring options for automatically restarting the executor in the event of a crash.
* **Finalize Auto Library Loading in ORC JIT:** Wrap up the feature that automatically loads libraries in ORC JIT. This will streamline symbol resolution for both loaded and unloaded dynamic libraries by ensuring that any required dylibs containing symbol definitions are loaded as needed.

## Conclusion

With this project, **Clang-Repl** now supports **out-of-process execution** for both `ELF` and `Mach-O`, making it much more efficient and stable, especially on devices with limited resources. In the future, I plan to work on automating library loading and improving ORC JIT to make Clang-Repl's out-of-process execution even better.

## Acknowledgements

I would like to thank **Google Summer of Code (GSoC)** and the LLVM community for providing me with this amazing opportunity. Special thanks to my mentors, **Vassil Vassilev** and **Matheus Izvekov**, for their continuous support and guidance. I am also deeply grateful to **Lang Hames** for sharing his expertise on ORC JIT and helping improve `clang-repl`. This experience has been a major step in my development, and I look forward to continuing my contributions to open source.

## Related Links

* LLVM Repository
* Project Description
* My GitHub Profile
04.11.2024 00:00
GSoC 2024: The 1001 thresholds in LLVM

Hey everyone! My name is Shourya and I worked on LLVM this summer through GSoC. My project is called The 1001 Thresholds in LLVM. The main objective of this project was to study how varying different thresholds in LLVM affects performance parameters like compile time, bitcode size, execution time, and LLVM statistics.

# Background

LLVM has lots of thresholds and flags to avoid "costly cases". However, it is unclear whether these thresholds are useful, whether their values are reasonable, and what impact they really have. Since there are a lot of them, one cannot do a simple exhaustive search. An example of prior work in this direction is the introduction of a C++ class that can replace hardcoded values and offers control over the threshold; for example, one can increase the recursion limit via a command-line flag from the hardcoded "6" to a different number. As such, there is a need to explore the different thresholds in LLVM, understand what it means for a threshold to be hit, profile different thresholds, and select optimal values for them.

# What We Did

This work provides a tool that can efficiently explore these knobs and understand how modifying them affects metrics like compile time, the size of the generated program, or any statistic that LLVM emits, such as "Number of loops vectorized". (Note that execution time is currently not evaluated, because input-gen does not work on optimized IR; this is part of future work.) We first built a clang matcher that looks for the following patterns:

1. `const knob_name = knob_val`
2. `cl::init`
3. `enum { knob_name = knob_val }`

to identify the knobs in the codebase, and then used a custom Python tool (optimized to deal with I/O and cache bottlenecks) to collect the different stat values in parallel and store them in a JSON file. After manually selecting interesting knobs, we have so far conducted three studies in which we measure compile time and bitcode size along with various other statistics, presented as interactive graphs. Two of them (on 10,000 and 100 bitcode files) look at average statistics for each knob value, while the third (on 10,000 bitcode files) studies how each file is affected individually by changing knob values. We see some very interesting patterns in these graphs; for instance, in the following two graphs for jump-threading-threshold, we can observe improved statistics (top graph) and decreased average compile time (bottom graph) as the knob value is increased.

# Results

The per-file study shows that there is no single magic knob value: the optimum, with regard to compile time or code size, depends on the file that is currently being compiled. For instance, here we can see that different values of the licm-mssa-optimization-cap knob give good cumulative compile-time improvements for different files. In detail, most files benefit from a knob value of 300, while 60 is the best value for the second-largest group of files. We further show that an oracle that could pick the best knob value for each file would significantly improve the cumulative compile time. In this project we explored 93 thresholds in LLVM (a 100-file study for each can be found here) using the Clang matcher, and observed that the best settings for these thresholds are largely file-specific. This indicates that there is no universally optimal value, or even a set of values, that can be applied across different scenarios.
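For context, most of these knobs are declared through LLVM's `cl::opt` command-line machinery, which is what makes a sweep like ours possible without rebuilding the compiler. A minimal sketch (the flag name and default here are hypothetical):

#include "llvm/Support/CommandLine.h"

using namespace llvm;

// A typical LLVM threshold knob: a hardcoded default that can be
// overridden per compilation, e.g. via -mllvm -my-pass-threshold=42.
static cl::opt<unsigned> MyPassThreshold(
    "my-pass-threshold",
    cl::desc("Bail out of the transformation above this cost"),
    cl::init(6), cl::Hidden);

static bool shouldTransform(unsigned Cost) {
  return Cost <= MyPassThreshold; // the "costly case" guard
}

Because such a flag can be set per compilation, an external driver can already tune it file by file; the missing piece is deciding the value automatically.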
What is needed instead is an adaptive mechanism within LLVM, an oracle, that can dynamically determine appropriate threshold values during compilation. We also experimented with varying thresholds cumulatively by leveraging file-specific information through an LLVM pass. However, after discussions with the mentors, this approach was set aside due to the significant changes it would necessitate across other parts of the LLVM codebase. As a result, we have not yet categorized different thresholds, such as identifying optimal threshold values for specific file types (e.g., I/O-intensive files). Nonetheless, we provide a tool that can efficiently collect this data (LLVM statistics, bitcode size, and compile time) and help visualize it with interactive graphs as well as histograms that examine these variations on a per-file basis. Additionally, a correlation table between knob values and performance metrics further illustrates the significant impact this study could have on improving LLVM's overall performance.

# Future Work

The early results show that we need a better understanding of knob values to maximize various objectives. Our results provide the community with a first step toward developing a guided compilation model attuned to the file being compiled. We further intend to show how these knobs interact with each other, and whether modifying multiple knobs together compounds the benefits. One more area of work is input-gen, which would enable us to collect and study execution time among our performance parameters.

# Acknowledgements

This project would not have been possible without my amazing mentors, Jan Hückelheim and Johannes Doerfert, the LLVM Foundation admins, and the GSoC admins.

# Links

* Code
* Studies
* GSoC Project Page
21.10.2024 00:00
GSoC 2024: 3-way comparison intrinsics

Hello everyone! My name is Volodymyr, and in this post I would like to talk about the project I have been working on for the past couple of months as part of Google Summer of Code 2024. The aim of the project was to introduce 3-way comparison intrinsics to LLVM IR and add a decent level of optimizations for them.

# Background

Three-way comparison is an operation present in many high-level languages, such as C++ with its spaceship operator or Rust with the `Ord` trait. It operates on two values for which there is a defined comparison operation and returns `-1` if the first operand is less than the second, `0` if they are equal, and `1` otherwise. Previously, compilers that use LLVM expressed this operation as different sequences of instructions, which were optimized and lowered individually rather than as a single operation. Adding an intrinsic for this operation therefore helps us generate better machine code on some targets, as well as optimize patterns in the middle-end that we didn't optimize before.

# What was done

Over the course of the project I added two new intrinsics to LLVM IR: `llvm.ucmp` for unsigned 3-way comparison and `llvm.scmp` for signed comparison. They both take two arguments that must be integers or vectors of integers and return an integer or a vector of integers with the same number of elements. The arguments and the result do not need to have the same type. In the middle-end, the following passes received some support for these intrinsics:

* InstSimplify (#1, #2)
* InstCombine (#1, #2, #3, #4, #5)
* CorrelatedValuePropagation
* ConstraintElimination

I also added folds from idiomatic ways of expressing a 3-way comparison to a call to the corresponding intrinsic (see the small example at the end of the Results section). In the backend there are two different ways of expanding the intrinsics: as a nested select (i.e. `(x < y) ? -1 : (x > y ? 1 : 0)`) or as a subtraction of zero-extended comparisons (`zext(x > y) - zext(x < y)`). The second option is the default, but targets can opt into the first one through a TLI hook.

# Results

I think that overall the project was successful and brought a small positive change to LLVM. To demonstrate its impact on a small test case, the following C++ function using the spaceship operator was compiled twice, first with Clang 18.1 and then with Clang built from the main branch of the LLVM repository:

#include <compare>

std::strong_ordering cmp(unsigned int a, unsigned int b) {
    return a <=> b;
}

With Clang 18.1:

; ====== LLVM IR ======
define i8 @cmp(i32 %a, i32 %b) {
entry:
  %cmp.lt = icmp ult i32 %a, %b
  %sel.lt = select i1 %cmp.lt, i8 -1, i8 1
  %cmp.eq = icmp eq i32 %a, %b
  %sel.eq = select i1 %cmp.eq, i8 0, i8 %sel.lt
  ret i8 %sel.eq
}

; ====== x86_64 assembly ======
cmp:
  xor ecx, ecx
  cmp edi, esi
  mov eax, 0
  sbb eax, eax
  or al, 1
  cmp edi, esi
  movzx eax, al
  cmove eax, ecx
  ret

With freshly built Clang:

; ====== LLVM IR ======
define i8 @cmp(i32 %a, i32 %b) {
entry:
  %sel.eq = tail call i8 @llvm.ucmp.i8.i32(i32 %a, i32 %b)
  ret i8 %sel.eq
}

; ====== x86_64 assembly ======
cmp:
  cmp edi, esi
  seta al
  sbb al, 0
  ret

As you can see, the number of instructions in the generated code has gone down considerably (from 8 to 3, excluding `ret`). Although this is a small synthetic test, it can still make a noticeable impact if code like this is found on a hot path somewhere. The impact of these changes on real-world code is much harder to quantify.
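Here is the small example of an idiom promised above (my illustration, not taken from the patches): the classic branchless 3-way compare is among the patterns that can now be folded to a single intrinsic call rather than being optimized as independent comparisons and a subtraction.

// A hand-written 3-way compare; with the new folds, patterns like this
// can be canonicalized to one @llvm.ucmp call in the middle-end.
int cmp3(unsigned a, unsigned b) {
  return (a > b) - (a < b); // yields -1, 0, or 1
}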
Looking at llvm-opt-benchmark, there are quite a few places where the intrinsics are now being used, which suggests that some improvement has taken place, although it is unlikely to be significant in all but a few cases.

# Future Work

There are still many opportunities for optimization in the middle-end, some of which are already known and being worked on at the time of writing; others are yet to be discovered. I would also like to allow pointers and vectors of pointers as valid operands for the intrinsics, although that would be quite a minor change. In the backend I would also like to work on better handling of the intrinsics in GlobalISel, something I didn't have enough time for and that other members of the LLVM community have helped me with.

# Acknowledgements

None of this would have been possible without my two amazing mentors, Nikita Popov and Dhruv Chawla, and the LLVM community as a whole. Thank you for helping me on this journey; I am looking forward to working with you in the future.
07.10.2024 00:00
GSoC 2024: ABI Lowering in ClangIR

ClangIR is an ongoing effort to build a high-level intermediate representation (IR) for C/C++ within the LLVM ecosystem. Its key advantage lies in its ability to retain more source code information. While ClangIR is making progress, it still lacks certain features, notably ABI handling. Currently, ClangIR lowers most functions without accounting for ABI-specific calling convention details.

## Goals

The "Build & Run SingleSource Benchmarks with ClangIR - Part 2" Google Summer of Code 2024 project builds on my contributions from GSoC 2023 by addressing one of the main issues I encountered: target-specific lowering. It focuses on extending ClangIR's code generation capabilities, particularly ABI lowering for X86-64. Several tests rely on operations and types (e.g., `va_arg` calls and complex data types) that require target-specific information to compile correctly. The concrete steps to achieve this were:

1. **Implement foundational infrastructure** that can scale to multiple architectures while adhering to ClangIR design principles such as CodeGen parity, feature guarding, and AST backreferences.
2. **Handle basic calling convention scenarios** as a proof of concept to validate the foundational infrastructure.
3. **Add lowering for a second architecture** to further validate the infrastructure's extensibility to multiple architectures.
4. **Unify target-specific ClangIR lowering into the library**, as there are a few isolated methods handling target-specific code lowering, like `cir.va_arg`.
5. **Integrate calling convention lowering into the main pipeline** to ensure future contributions and continued development of this infrastructure.

## Contributions

The list of contributions (PRs) can be found here.

### Target Lowering Library

The most significant contribution of this project was the development of a modular `TargetLowering` library. This ensures that target-specific MLIR lowering passes can leverage this shared library for lowering logic. The library also follows ClangIR's feature guarding principles, ensuring that any contributor can refer to the original CodeGen for contributions, and any unimplemented feature asserts at a specific code point, making it easy to track missing functionality.

### Calling Convention Lowering Pass

As a proof of concept, the initial development of the `TargetLowering` library focused on implementing a calling convention lowering pass that targets multiple architectures. Currently, ClangIR ignores the target ABI during CodeGen to retain high-level information. For example, structs are not unraveled to improve argument-passing efficiency, and ABI-specific LLVM attributes are ignored. This pass addresses these issues by properly tagging LLVM attributes and rewriting function definitions and calls to handle unraveled structs (a source-level illustration appears at the end of this section). This was implemented for both X86-64 and AArch64, demonstrating the library's multi-architecture support.

## Shortcomings

### Target-Specific Lowering Unification

While some target-specific lowering code was moved into the library, it was copied and pasted rather than properly integrated. This is not ideal for leveraging the library's multi-architecture features.

### Inclusion in the Main Pipeline

This is still a work in progress, as the library is not yet mature enough to handle most pre-existing ClangIR tests. There are also feature guards with unreachable statements for many unimplemented features.
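As promised above, here is a rough source-level picture of the struct unraveling the calling convention pass performs (illustrative only; the real rewriting happens on CIR operations, and the exact coercion is ABI- and target-dependent):

// Illustrative only: what ABI lowering conceptually does to a small struct.
struct Point { int x, y; };

// High-level form, as ClangIR initially emits it: struct passed as-is.
int sumHighLevel(Point p) { return p.x + p.y; }

// ABI-lowered form on x86-64: the two ints are coerced into one 64-bit
// value passed in a single register; the callee unpacks the fields.
int sumLowered(long long packed) {
  int x = (int)(packed & 0xffffffff); // low 32 bits: Point::x
  int y = (int)(packed >> 32);        // high 32 bits: Point::y
  return x + y;
}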
## Future Work

Now that there is a base infrastructure for lowering target-agnostic CIR to target-specific CIR, there is a large amount of future work to be done, including:

* Improving DataLayout-related queries using MLIR's built-in tools.
* Implementing calling convention lowering for additional types, such as pointers.
* Extending the TargetLowering library to support more architectures.
* Unifying the remaining target-specific lowering code from other parts of ClangIR.

## Acknowledgements

I would like to thank my Google Summer of Code mentors, Bruno Cardoso Lopes and Nathan Lanza, for another great GSoC experience. I also want to thank the LLVM community and Google for organizing the program.
30.09.2024 00:00
GSoC 2024: Statistical Analysis of LLVM-IR Compilation

Welcome! My name is Andrew and I contributed to LLVM through the 2024 Google Summer of Code Program. My project is called Statistical Analysis of LLVM-IR Compilation. The objective of this project is to provide an analysis of how time is spent in the optimization pipeline. Generally, drastic differences in the percentage of time spent by a pass in the pipeline are considered abnormal.

# Background

In principle, an LLVM IR bitcode file, or module, contains IR features that determine the behavior of the compiler optimization pipeline. Depending on these features, the optimization pipeline, opt, can add significantly or only marginally to the compilation time. More specifically, optimizations succeed in more or less time; the user can wait a microsecond or a few minutes. LLVM compiler developers constantly edit the pipeline, so the performance of these optimizations can vary by compiler version (sometimes significantly). Having a large IR dataset such as ComPile allows for testing the LLVM compilation pipeline on a varied sample of IR. The size of this sample is sufficient to determine outlying IR modules. By identifying and examining such files using utilities being added to the LLVM IR Dataset Utils repo, the causes of unexpected compilation times can be determined. Developers can then modify and improve the compilation pipeline accordingly.

# Summary of Work

The utilities added in PR 37 write each IR module to a tar file corresponding to a programming language. Each file written to the tar files is indexed by its location in the HF dataset. This allows easy identification of files by tools which can be used for data extraction and analysis in the shell, notably clang. Tar file creation potentially uses less storage space than downloading the HF dataset to disk, and it allows code to be written which does not depend on the Python interpreter to load the dataset for access. The Makefile from PR 36 is responsible for carrying out the data collection. This data includes text segment size, user CPU instruction counts during compile time (analogous to time), IR feature counts sourced from the LLVM pass `print<func-properties>`, and the names and percentages of the passes with the maximum relative time. The data can be extracted in parallel or serially and is stored in a CSV file. An important data collection command in the Makefile is `clang -w -c -ftime-report $(lang)/bc_files/file$@.bc -o /dev/null`. The output from the command is large, but the part of interest is the first `Pass execution timing report`:

===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.2547 seconds (2.2552 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   2.1722 ( 96.5%)   0.0019 ( 47.5%)   2.1741 ( 96.4%)   2.1745 ( 96.4%)  VerifierPass
   0.0726 (  3.2%)   0.0000 (  0.0%)   0.0726 (  3.2%)   0.0726 (  3.2%)  AlwaysInlinerPass
   0.0042 (  0.2%)   0.0015 ( 39.2%)   0.0058 (  0.3%)   0.0058 (  0.3%)  AnnotationRemarksPass
   0.0014 (  0.1%)   0.0005 ( 13.3%)   0.0019 (  0.1%)   0.0020 (  0.1%)  EntryExitInstrumenterPass
   0.0003 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.0%)   0.0003 (  0.0%)  CoroConditionalWrapper
   2.2507 (100.0%)   0.0039 (100.0%)   2.2547 (100.0%)   2.2552 (100.0%)  Total

A user can visually see the distribution of these passes by using a profiling tool for .json files.
The .json file for a given bitcode file is obtained via `clang -c -ftime-trace <file>`. The visualization of this output can be filtered to the passes of interest, as in the following image: The CoroConditionalWrapper pass is accounted for by the "Total CoroConditionalWrapper" block. Clearly, that pass takes a far smaller amount of time than the others, as reported by the pass execution timing report. However, instead of seeing the pass as an insignificant percentage of time, the visualization allows for additional comparisons of the relative timings of each pass. The example image has the optimization passes of interest selected, but the .json file provides information on the entire compilation pipeline as well. Thus, the entire pipeline execution flow can be visualized.

# Current Status

Currently, there are three PRs that require approval to be merged. There has been ongoing discussion on their contents, so few steps should be left to merge them. In the current state, users of the utilities in PR 38 should be able to readily reproduce the quantitative results I obtained for my GSoC midterm presentation graphs. Users can also easily perform outlier analysis on the IR files (excluding Julia IR). Some of the results include the following:

Scatter plot of C IR files:

Table of outliers for C IR files:

# Future Work

It was discussed in PR 37 to consolidate the tar file creation into the dataset file writer Python script. This is a feature I wish to implement in order to speed up tar file creation by writing the bitcode files from memory to the tar instead of from memory, to disk, to tar. As mentioned, Julia IR was not analyzed; modifying the scripts to include Julia IR results is desirable to make complete use of the dataset. Adding additional documentation for demonstration purposes could help clarify ways to use the tools. Additionally, outlier analysis can be expanded by using more advanced outlier detection methods. Not all the data collected in the CSV files was used, so using those extra features (in particular the `print<func-properties>` pass) can allow for improved accuracy in outlier detection.

# Acknowledgements

I would like to thank my mentors Johannes Doerfert and Aiden Grossman for their constant support during and prior to the GSoC program. Additionally, I would like to acknowledge the work of the LLVM Foundation admins and the GSoC admins.

# Links

* PR 38
* PR 37
* PR 36
* LLVM IR Dataset Utils Repo
* ComPile Dataset
23.09.2024 00:00
GSoC 2024: Reviving NewGVN

This summer I participated in GSoC under the LLVM Compiler Infrastructure. The goal of the project was to improve the NewGVN pass so that it can replace GVN as the main value numbering pass in LLVM.

# Background

Global Value Numbering (GVN) consists of assigning value numbers such that instructions with the same value number are equivalent. NewGVN was introduced in 2016 to replace GVN. We now highlight a few aspects in which NewGVN is better than GVN.

A key advantage of NewGVN over GVN is that it is complete for loops, while GVN is only complete for acyclic code. NewGVN is complete for loops because when it first processes loops, it assumes that only the first iteration will be executed, later corroborating these assumptions; this is known as the optimistic assumption. In practice, the optimistic assumption boils down to assuming that backedges are unreachable and, consequently, that when evaluating phi instructions, the values carried by them can be ignored. For instance, in the example below, `%a` is optimistically evaluated to `0`. This leads to evaluating `%c` to `%x`, which in turn leads to evaluating `%a.i` to `0`. At this point, there are two possibilities: either the assumption was correct, the loop actually only executes once, and the value numbers computed so far are correct, or the instructions in the loop need to be reevaluated. Assume, for this example, that NewGVN could not prove that only one iteration is executed. Then `%a` once again evaluates to `0`, and all other registers evaluate to the same values as before. Thanks to the optimistic assumption, we were able to discover that `%a` is loop-invariant and, moreover, that it is equal to `0`.

define i32 @optimistic(i32 %x, i32 %y) {
entry:
  br label %loop
loop:
  %a = phi i32 [0, %entry], [%a.i, %loop]
  ...
  %c = xor i32 %x, %a
  %a.i = sub i32 %x, %c
  br i1 ..., label %loop, label %exit
exit:
  ret i32 %a
}

On the other hand, GVN fails to detect this equivalence because it would pessimistically evaluate `%a` to itself, and the previously described evaluation steps would never take place.

Another advantage of NewGVN is the value numbering of memory operations using MemorySSA. It provides a functional view of memory where instructions that can modify memory produce a new memory version, which is then used by other memory operations. This greatly simplifies the detection of redundancies among memory operations. For example, two loads of the same type from equivalent pointers and memory versions are trivially equivalent.

define i32 @foo(i32 %v, ptr %p) {
entry:
; 1 = MemoryDef(liveOnEntry)
  store i32 %v, ptr %p, align 4
; MemoryUse(1)
  %a = load i32, ptr %p, align 4
; MemoryUse(1)
  %b = load i32, ptr %p, align 4
; 2 = MemoryDef(1)
  call void @f(i32 %a)
; MemoryUse(2)
  %c = load i32, ptr %p, align 4
  %d = sub i32 %b, %c
  ret i32 %d
}

In the example above (annotated with MemorySSA), `%a` and `%b` are equivalent, while `%c` is not. All three loads are of the same type from the same pointer, but they don't all load from the same memory state. Loads `%a` and `%b` load from the memory defined by the store (Memory `1`), while `%c` loads from the memory defined by the function call (Memory `2`). GVN can also detect these redundancies, but it relies on the more expensive and less general MemoryDependenceAnalysis.

Despite these and other improvements, NewGVN is still not widely used, mainly because it lacks partial redundancy elimination (PRE) and because it is bug-ridden.
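For readers unfamiliar with PRE, here is a minimal source-level picture of a partial redundancy (my example, not from the project):

// `a + b` is computed only when p is true, yet recomputed unconditionally
// afterwards: it is *partially* redundant. PRE inserts `a + b` on the
// else path so the later computation can become a reuse (a phi of the
// two available values) instead of a recomputation.
int f(int a, int b, bool p) {
  int t = 0;
  if (p)
    t = a + b;   // available on this path only
  int u = a + b; // partially redundant
  return t + u;
}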
# Implementing PRE

Our main contribution was the development of a PRE stage for NewGVN (found here). Our solution relies on generalizing Phi-of-Ops. Phi-of-Ops performs a special case of PRE where the instruction depends on a phi instruction and an equivalent value is available on every reaching path. This is achieved in two steps: phi-translation and phi-insertion. Phi-translation consists of evaluating the original instruction in the context of each of its block's predecessors; phi operands are replaced by the value incoming from the predecessor. The value is available in a predecessor if the translated instruction is equivalent to a constant, a function argument, or another instruction that dominates the predecessor. Phi-insertion occurs after phi-translation if the value is available in every predecessor. At that point, a phi of the equivalent values is constructed and used to replace the original instruction. The full process is illustrated in the following example.

Our generalization eliminated the need for a dependent phi and introduced the ability to insert the missing values in cases where the instruction is only partially redundant. To prevent increases in code size (ignoring the inserted phi instructions), an insertion is only made if it is the only one required. The full process is illustrated in the following example.

Integrating PRE into the existing framework also allowed us to gain loop-invariant code motion (LICM) for free. The optimistic assumption, combined with PRE, allows NewGVN to speculatively hoist instructions out of loops. LICM in GVN, by contrast, relies on LoopInfo and can only handle very specific cases.

# Missing Features

The two main features our PRE implementation lacks are critical edge splitting and load coercion. Critical edge splitting is required to ensure that we do not insert instructions into paths where they won't be used; currently, our implementation simply bails in such cases. Load coercion allows us to detect equivalences of loaded values with different types, such as loads of `i32` and `float`, and then coerce the loaded type using conversion operations. The difficulty in implementing these features is that NewGVN is designed to perform analysis and transformation in separate steps, while these features involve modifying the function during the analysis phase.

# Results

We evaluated our implementation using the automated benchmarking tool Phoronix Test Suite, from which we selected a set of 20 C/C++ applications (listed below).

|             |                 |          |                  |
|-------------|-----------------|----------|------------------|
| aircrack-ng | encode-flac     | luajit   | scimark2         |
| botan       | espeak          | mafft    | simdjson         |
| zstd        | fftw            | ngspice  | sqlite-speedtest |
| crafty      | john-the-ripper | quantlib | tjbench          |
| draco       | jpegxl          | rnnoise  | graphics-magick  |

The default `-O2` pipeline was used; the only change between compilations was the value numbering pass. Despite the missing features, we observed that our implementation performs, on average, 0.4% better than GVN. However, it is important to mention that our solution hasn't been fine-tuned to consider the rest of the optimization pipeline, which resulted in some cases where our implementation regressed compared to both GVN and the existing NewGVN. The most severe case was jpegxl, where our implementation performed, on average, 10% worse than GVN. This was an outlier; excluding jpegxl, most regressions were at most 2%. Unfortunately, due to time constraints, we were unable to study these cases in more detail.
# Future Work

In the future, we plan to implement the aforementioned missing features and fine-tune the heuristics for when to perform PRE to prevent the regressions discussed in the results section. Once these issues are addressed, we'll upstream our implementation, bringing us a step closer to reviving NewGVN.
16.09.2024 00:00
GSoC 2024: Compile GPU kernels using ClangIR

Hello everyone! I'm 7mile. My GSoC project this summer is Compile GPU Kernels Using ClangIR. It's been an exciting journey in compiler development, and I'm thrilled to share the progress and insights gained along the way.

# Background

The ClangIR project aims to establish a new IR for Clang, built on top of MLIR. As part of the ongoing effort to support heterogeneous programming models, this project focuses on integrating OpenCL C language support into ClangIR. The ultimate goal is to enable the compilation of GPU kernels written in OpenCL C into LLVM IR targeting the SPIR-V architecture, laying the groundwork for future enhancements in SYCL and CUDA support.

# What We Did

Our work involved several key areas:

1. **Address Space Support**: One of the fundamental tasks was teaching ClangIR to handle address spaces, a vital feature for languages like OpenCL. Initially, we considered mimicking LLVM's approach, but this proved inadequate for ClangIR's goals. After thorough discussion and an RFC, we implemented a unified address space design that aligns with ClangIR's objectives, ensuring a clean and maintainable code structure.
2. **OpenCL Language and SPIR-V Target Integration**: We extended ClangIR to support the OpenCL language and the SPIR-V target. This involved enhancing the pipeline to accommodate the latest OpenCL 3.0 specification and implementing hooks for language-specific and target-specific customizations.
3. **Vector Type Support**: OpenCL vector types, a critical feature for GPU programming, were integrated into ClangIR. We leveraged ClangIR's existing cir.vector type to generate the necessary code, ensuring consistent compilation results.
4. **Kernel and Module Metadata Emission**: We added support for emitting OpenCL kernel and module metadata in ClangIR, a necessary step for proper integration with the SPIR-V target. This included creating structured attributes to represent metadata, following MLIR's preference for well-defined structures.
5. **Global and Static Variables with Qualifiers**: We implemented support for global and static variables with qualifiers like `global`, `constant`, and `local`, ensuring that these constructs are correctly represented and lowered in the ClangIR pipeline.
6. **Calling Conventions**: We adjusted the calling conventions in ClangIR to align with SPIR-V requirements, migrating from the default `cdecl` to SPIR-V-specific conventions like `SpirKernel` and `SpirFunction`. This also enables most OpenCL built-in functions, like `barrier` and `get_global_id`.
7. **User Experience Enhancements**: Finally, we ensured that the end-to-end kernel compilation experience using ClangIR is smooth and intuitive, with minimal manual intervention required.

# Results

The project successfully met its primary goals. OpenCL kernels from the Polybench-GPU benchmark suite can now be compiled using ClangIR into LLVM IR for SPIR-V. All patches have been merged into the main ClangIR repository, and the project's progress is documented in the overview issue. I believe the work not only advanced OpenCL support but also laid a solid foundation for future enhancements, such as SYCL and CUDA support in ClangIR. We successfully compiled and executed all 20 OpenCL C benchmarks from the polybenchGpu repository, passing the built-in result validation. Please refer to our artifact evaluation repository for detailed instructions on how to experiment with our work.
# Future Work

Looking ahead, two key areas require further development:

1. **Function Attribute Consistency**: For example, the `convergent` function attribute is crucial for preventing misoptimizations in SIMT languages like OpenCL. ClangIR currently lacks this attribute, which could lead to incorrect transformations in parallel computing contexts. Addressing this is a priority to ensure correct optimization behavior (a kernel sketch illustrating the hazard appears at the end of this post).
2. **Support for OpenCL Built-in Types**: Another critical area for future work is support for OpenCL built-in types such as `pipe` and `image`. These types are essential for handling data streams and image-processing tasks in specialized OpenCL applications. Supporting them will significantly improve ClangIR's adherence to the OpenCL standard, broadening its applicability and ensuring better compatibility with a wide range of OpenCL programs.

# Acknowledgements

This project would not have been possible without the guidance and support of the LLVM community. I extend my deepest gratitude to my mentors, Julian Oppermann, Victor Lomüller, and Bruno Cardoso Lopes, whose expertise and encouragement were instrumental throughout this journey. Additionally, I would like to thank Vinicius Couto Espindola for his collaboration on ABI-related work. This experience has been immensely rewarding, both technically and in terms of community engagement.

# Appendix

* Overview issue of OpenCL C support
* Artifact Evaluation Instructions
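As promised under Future Work, here is a minimal sketch of why `convergent` matters, assuming standard OpenCL C barrier semantics; the kernel is illustrative and not taken from ClangIR's test suite:

```c
__kernel void reduce_step(__global float *data, __local float *tmp) {
    size_t lid = get_local_id(0);
    tmp[lid] = data[get_global_id(0)];
    // barrier() is convergent: every work-item in the work-group must reach
    // it together. An optimizer unaware of convergence could sink this call
    // into the divergent branch below, so only work-item 0 would execute it,
    // deadlocking the work-group.
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0) {  // divergent control flow begins here
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += tmp[i];
        data[get_group_id(0)] = sum;
    }
}
```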
09.09.2024 00:00
GSoC 2024: Half-precision in LLVM libc

C23 defines new floating-point types, such as `_Float16`, which corresponds to the binary16 format from IEEE Std 754, also known as "half-precision" or FP16. C23 also defines new variants of the C standard library's math functions accordingly, such as `fabsf16` to get the absolute value of a `_Float16`.

The "Half-precision in LLVM libc" Google Summer of Code 2024 project aimed to implement these new `_Float16` math functions in LLVM libc, making it the first known C standard library to implement these C23 functions.

We split math functions into two categories: basic operations and higher math functions. The current implementation status of math functions in LLVM libc can be viewed at https://libc.llvm.org/math/index.html#implementation-status.

The exact goals of this project were to:

1. Set up generated headers properly so that the `_Float16` type and `_Float16` functions can be used with various compilers and architectures.
2. Add generic implementations of `_Float16` basic operations for supported architectures.
3. Add optimized implementations of `_Float16` basic operations for specific architectures, using special hardware instructions and compiler builtins whenever possible.
4. Add generic implementations of as many `_Float16` higher math functions as possible. We knew we would not have enough time to implement all of them.

## Work done

1. The `_Float16` type can now be used in generated headers, and declarations of `_Float16` math functions are generated with `#ifdef` guards so that they are enabled when supported.
   * https://github.com/llvm/llvm-project/pull/93567
2. All 70 planned `_Float16` basic operations have been merged.
   * https://github.com/llvm/llvm-project/issues/93566
3. The `_Float16`, `float`, and `double` variants of various basic operations have been optimized on certain architectures.
   * https://github.com/llvm/llvm-project/pull/98376
   * https://github.com/llvm/llvm-project/pull/99037
   * https://github.com/llvm/llvm-project/pull/100002
4. Out of the 54 planned `_Float16` higher math functions, 8 have been merged and 9 have an open pull request.
   * https://github.com/llvm/llvm-project/issues/95250

We ran into unexpected issues, such as:

* Bugs in Clang 11, which is currently still supported by LLVM libc and used in post-commit CI.
* Some post-commit CI workers having old versions of compiler runtimes that are missing some floating-point conversion functions on certain architectures.
* Inconsistent behavior of floating-point conversion functions across compiler runtime vendors (GCC's libgcc and LLVM's compiler-rt) and CPU architectures.

Due to these issues, LLVM libc currently enables the full set of `_Float16` functions only on x86-64 Linux. Some were disabled on AArch64 due to Clang 11 bugs, and all were disabled on 32-bit Arm and on RISC-V due to issues with compiler runtimes. Some are not available on GPUs because they take `_Float128` arguments, and the `_Float128` type is not available on GPUs. There is work in progress to work around the compiler runtime issues by using our own floating-point conversion functions.

## Work left to do

* Implement the remaining `_Float16` higher math functions.
* Enable the `_Float16` math functions that are disabled on AArch64 once LLVM libc bumps its minimum supported Clang version.
* Enable `_Float16` math functions on 32-bit Arm and on RISC-V once the issues with compiler runtimes are resolved.
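For readers unfamiliar with the new C23 functions, here is a minimal usage sketch. It assumes a toolchain and libc where `_Float16` and `fabsf16` are available (for example, LLVM libc on x86-64 Linux, per the status above):

```c
#include <math.h>   // fabsf16 is declared here in C23
#include <stdio.h>

int main(void) {
    _Float16 x = -1.5f16;       // the f16 literal suffix is new in C23
    _Float16 y = fabsf16(x);    // absolute value computed in half precision
    printf("%f\n", (double)y);  // widen to double; printf has no _Float16 specifier
    return 0;
}
```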
## Acknowledgements

I would like to thank my Google Summer of Code mentors, Tue Ly and Joseph Huber, as well as other LLVM maintainers I interacted with, for their help. I would also like to thank Google for organizing this program.
31.08.2024 00:00
GSoC 2024: GPU Libc Benchmarking

Hey everyone! My name is James, and I worked on LLVM this summer through GSoC. My project is called GPU Libc Benchmarking. The main objective of this project was to develop microbenchmarking infrastructure for libc on the GPU.

# Background

The LLVM libc project was designed as an alternative to glibc that aims to be modular, configurable, and sanitizer-friendly. Currently, LLVM libc is being ported to Nvidia and AMD GPUs to provide libc functionality (e.g. `printf()`, `malloc()`, and math functions) on the GPU. As of March 2024, programs can use GPU libc in offloading languages (CUDA, OpenMP) or through direct compilation and linking with the libc library.

# What We Did

During this project, we developed a microbenchmarking framework that is compiled directly for and run on the GPU, using libc functions to display output to the user. As this was a short project (90 hours), we mostly focused on developing the infrastructure and writing a few example usages (`isalnum()`, `isalpha()`, and `sin()`).

Our benchmarking infrastructure is based on Google Benchmark and measures the average cycles, minimum, maximum, and standard deviation of each benchmark. Each benchmark is run for multiple iterations to stabilize the results. Benchmark writers can measure against vendor implementations of libc functions by passing specific linking flags in the benchmark's CMake portion and registering the corresponding vendor function from the benchmark itself.

Below is an example of our benchmarking infrastructure's output for `sinf()`:

Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads
--- | --- | --- | --- | --- | --- | --- | ---
Sinf_1 | 764 | 369 | 2101 | 273 | 7 us | 323 | 32
Sinf_128 | 721 | 699 | 744 | 5 | 913 us | 16 | 32
Sinf_1024 | 661 | 650 | 689 | 9 | 7 ms | 31 | 32
Sinf_4096 | 666 | 663 | 669 | 5 | 28 ms | 28 | 32
SinfTwoPi_1 | 372 | 369 | 632 | 70 | 7 us | 39 | 32
SinfTwoPi_128 | 379 | 379 | 379 | 4 | 895 us | 0 | 32
SinfTwoPi_1024 | 335 | 335 | 338 | 5 | 7 ms | 20 | 32
SinfTwoPi_4096 | 335 | 335 | 335 | 4 | 28 ms | 0 | 32
SinfTwoPow30_1 | 371 | 369 | 510 | 70 | 7 us | 17 | 32
SinfTwoPow30_128 | 379 | 379 | 379 | 4 | 894 us | 0 | 32
SinfTwoPow30_1024 | 335 | 335 | 338 | 5 | 7 ms | 20 | 32
SinfTwoPow30_4096 | 335 | 335 | 335 | 4 | 28 ms | 0 | 32
SinfVeryLarge_1 | 477 | 369 | 632 | 70 | 7 us | 58 | 32
SinfVeryLarge_128 | 487 | 480 | 493 | 5 | 900 us | 14 | 32
SinfVeryLarge_1024 | 442 | 440 | 447 | 5 | 7 ms | 18 | 32
SinfVeryLarge_4096 | 441 | 441 | 442 | 4 | 28 ms | 14 | 32

Users can register benchmarks similarly to Google Benchmark, using a macro:

```cpp
uint64_t BM_IsAlnumCapital() {
  char x = 'A';
  return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumCapital, BM_IsAlnumCapital);
```

# Results

This project met its major goal of creating microbenchmarking infrastructure for the GPU. The original scope of the proposal also included a CPU component that would use vendor tools to measure GPU kernel properties, but this was removed after discussion with the mentors: offloading specific kernels to the GPU posed technical obstacles that would have required major changes to other parts of the code.

# Future Work

Because the project was short, we focused on implementing the core microbenchmarking infrastructure; future contributors can use it to add additional benchmarks.
In addition, there are improvements to the microbenchmarking infrastructure itself that could be added, such as more options for user input ranges, better random distributions for math functions, and a CPU component that can launch multiple kernels and compare results against functions running on the CPU. The existing code can be found in the LLVM repo.

# Acknowledgements

This project would not have been possible without my amazing mentor, Joseph Huber, the LLVM Foundation admins, and the GSoC admins.

# Links

* Landed PRs
* LLVM GitHub
* LLVM Homepage
* GSoC Project Page
09.08.2024 00:00