Luke Lau

@lukel97.bsky.social

LLVM at Igalia

100 Followers  |  56 Following  |  11 Posts  |  Joined: 23.11.2024

Latest posts by lukel97.bsky.social on Bluesky

In Pictures: HK police deploy armoured vehicle on Tiananmen anniversary

Police have deployed an armoured vehicle in Hong Kong's commercial heart, amidst an ongoing heavy security presence on the 36th anniversary of the Tiananmen Square crackdown. In full: buff.ly/f4hVB50

04.06.2025 10:30 | 👍 14    🔁 9    💬 0    📌 3
Picture of a presenter showing a slide that details outcomes of RISE-funded RISC-V software ecosystem projects.

I'm delighted to see two of @igalia.com's projects for RISE highlighted at the RISC-V Summit Europe.

Find out more about our work on both LLVM optimisation and testing/CI on the RISE blog (with more to come in the future!):
riseproject.dev/2025/05/08/p...
riseproject.dev/2024/10/15/w...

14.05.2025 10:50 | 👍 6    🔁 3    💬 0    📌 0

@camel-cdr.bsky.social rvv-bench is used here!

18.04.2025 10:33 | 👍 5    🔁 0    💬 1    📌 0

We're looking forward to EuroLLVM next week in Berlin. Be sure to check out talks from my colleague @lukel97.bsky.social and myself on:
* Work to further improve RISC-V vector codegen (extending the VL Optimizer), and
* Work done with the support of RISE to improve RISC-V LLVM testing.

12.04.2025 07:30 | 👍 9    🔁 4    💬 0    📌 0
FEX 2503 Tagged Here we are again, another month and some more cool changes with FEX. Let's dive in and see what has changed!

What if I told you 3DNow! square root reciprocals are defined for negative numbers?... Also the amazing FEX 2503 is out. Read about some of my work and the work of other FEX maintainers in the release notes: fex-emu.com/FEX-2503/ #fex #igalia #gaming #linux #arm64

06.03.2025 15:50 | 👍 4    🔁 2    💬 1    📌 0
ccache for LLVM builds across multiple directories TL;DR: ccache base_dir saves the day

Some notes on ccache+LLVM. Summary: if you do a lot of builds across different checkouts/worktrees/builddirs, be sure to set the base_dir option and -DLLVM_USE_RELATIVE_PATHS_IN_DEBUG_INFO=ON muxup.com/2025q1/ccach...

27.02.2025 18:39 | 👍 9    🔁 4    💬 0    📌 0
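To see why base_dir matters for the ccache tip above, here is a toy Python model (not ccache's actual hashing; the function names, the file path, and the /home/user checkouts are all hypothetical). It assumes a cache key derived from the compiler command line, and shows that absolute paths make two checkouts hash differently until paths under a common base directory are rewritten relative to the build directory. It does not model the debug-info side that -DLLVM_USE_RELATIVE_PATHS_IN_DEBUG_INFO=ON addresses.

```python
import hashlib
import os


def cache_key(argv, cwd, base_dir=None):
    """Toy cache key: a hash of the command line after optional path rewriting."""
    rewritten = []
    for arg in argv:
        # Assumption: absolute paths under base_dir are made relative to cwd.
        if (base_dir and os.path.isabs(arg)
                and os.path.commonpath([arg, base_dir]) == base_dir):
            arg = os.path.relpath(arg, cwd)
        rewritten.append(arg)
    return hashlib.sha256("\0".join(rewritten).encode()).hexdigest()


def compile_cmd(checkout):
    """Hypothetical compile command for one source file in a checkout."""
    src = os.path.join(checkout, "llvm/lib/Support/APInt.cpp")
    build_dir = os.path.join(checkout, "build")
    return ["clang++", "-c", src, "-o", "APInt.o"], build_dir


argv_a, cwd_a = compile_cmd("/home/user/llvm-a")
argv_b, cwd_b = compile_cmd("/home/user/llvm-b")

# Without base_dir the two checkouts never share cache entries.
print(cache_key(argv_a, cwd_a) == cache_key(argv_b, cwd_b))
# With base_dir both commands reduce to the same relative path.
print(cache_key(argv_a, cwd_a, "/home/user") == cache_key(argv_b, cwd_b, "/home/user"))
```

In this sketch both commands end up referring to ../llvm/lib/Support/APInt.cpp, which is what lets separate worktrees share results.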
Inside SiFive's P550 Microarchitecture RISC-V is a relatively young and open source instruction set. So far, it has gained traction in microcontrollers and academic applications. For example, Nvidia replaced the Falcon microcontrollers …

Hello you fine Internet folks,
Today's article is on SiFive's P550 microarchitecture. The P550 core is one of the fastest RISC-V cores available currently and is claimed to be comparable to ARM's Cortex A75.
Hope y'all enjoy!

old.chipsandcheese.com/2025/01/26/i...

open.substack.com/pub/chipsand...

26.01.2025 22:14 | 👍 12    🔁 5    💬 0    📌 0
Executable loading and startup performance on macOS Recently, I fixed a startup performance regression in Node.js on macOS after an extensive investigation. Along the way, I learned a lot about tools on macOS and Node.js compilation workflows that don't…

New blog post covering the mysterious 10ms startup regression of Node.js on macOS, the journey of investigating the issue with various performance tools, and figuring out the fix (which also helped make the binary smaller).

joyeecheung.github.io/blog/2025/01...

11.01.2025 22:25 | 👍 127    🔁 18    💬 3    📌 2
A Simple ELF - The Ivory Tower The Ivory Tower is a blog about software engineering and development philosophy by Anders Sundman.

A Simple ELF 4zm.org/2024/12/25/a...

27.12.2024 11:18 | 👍 0    🔁 0    💬 0    📌 0
build: build v8 with -fvisibility=hidden on macOS by joyeecheung · Pull Request #56275 · nodejs/node V8 should be built with -fvisibility=hidden, otherwise the resulting binary would contain unnecessary symbols. In particular, on macOS, this leads to 5000+ weak symbols resolved at runtime, leading...

After two months of chasing, finally found out what's happening behind this mysterious startup time regression on macOS from Node.js v20.x - it's missing -fvisibility=hidden 😅 (I guess that's what happens when the build configs become dusty enough) github.com/nodejs/node/...

16.12.2024 21:55 | 👍 59    🔁 8    💬 3    📌 2
Abnormally slow loop (25x) under OCaml 5 / macOS / arm64 · Issue #13262 · ocaml/ocaml Hello, I am using macOS Ventura 13.6.7 with an Apple M2 Max processor. A loop that writes values into an integer array is about 20x slower with OCaml 5 than with OCaml 4. Using Array.set versus Arr...

Recently I came across this treatise by Stephen Dolan

github.com/ocaml/ocaml/...

12.12.2024 00:03 | 👍 23    🔁 5    💬 2    📌 0

256 loads, since it's an LMUL 8 load with VLEN=256! I'm not sure how it compares to the scalar equivalent, but my guess is that the vlse8.v is loading one element at a time under the hood

11.12.2024 11:17 | 👍 0    🔁 0    💬 0    📌 0
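The "256 loads" figure in the reply above follows from the vector spec's VLMAX = LMUL * VLEN / SEW relation. A minimal Python sketch of that arithmetic, where the function and variable names are illustrative only and the values (VLEN=256, SEW=8, LMUL=8, a 65536-byte stride) come from the posts:

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """Maximum element count for one vector instruction: LMUL * VLEN / SEW."""
    # Fraction keeps fractional LMUL settings such as 1/2 exact.
    return int(Fraction(lmul) * vlen_bits / sew_bits)


# Values from the post: e8, m8 on a VLEN=256 core.
elements = vlmax(vlen_bits=256, sew_bits=8, lmul=8)

# One byte is read per element, each 65536 bytes apart.
stride_bytes = 65536
footprint_bytes = elements * stride_bytes

print(elements)
print(footprint_bytes // (1024 * 1024), "MiB touched")
```

The 16 MiB footprint matches the `.zero 256 * STRIDE` buffer in the screenshot's source.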
A screenshot of a terminal:
luke@bananapif3:~/slowest-instr$ cat main.S
	.section .rodata
str:	.asciz "Cycles: %d\n"
foo:	.zero 256 * STRIDE
	.section .text
	.global main

main:
	addi	sp, sp, -8
	sd	ra, 0(sp)

	rdcycle	s1
	rdcycle s2
	sub	s3, s2, s1 	# rdcycle overhead

	la	a0, foo
	li	a1, STRIDE
	vsetvli t0, zero, e8, m8, tu, mu
	rdcycle s1
	vlse8.v	v8, (a0), a1
	rdcycle	s2

	sub	s1, s2, s1
	sub	s1, s1, s3
	la	a0, str
	mv	a1, s1
	call	printf

	ld	ra, 0(sp)
	addi	sp, sp, 8
	ret
luke@bananapif3:~/slowest-instr$ clang main.S -DSTRIDE=65536 -march=rv64gv 
luke@bananapif3:~/slowest-instr$ perf stat -e cycles:u ./a.out 
Cycles: 66640979

 Performance counter stats for './a.out':

        78,064,581      cycles:u                                                           

       0.049648957 seconds time elapsed

       0.000000000 seconds user
       0.049907000 seconds sys

Trying to find the slowest possible RISC-V instruction. This single vlse8.v with a stride of 65536 bytes takes 66 million cycles on a Banana Pi F3. That's 0.04 seconds @1.6GHz
#risc-v

11.12.2024 09:40 | 👍 22    🔁 5    💬 4    📌 0
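As a quick check of the figures in the post above, the cycle count converts to time and to a per-element cost as follows. This is a rough Python sketch: the helper name is made up, the 256-element count is taken from the reply about LMUL 8 and VLEN=256, and the clock is assumed to sit at a constant 1.6 GHz.

```python
def cycles_to_seconds(cycles, clock_hz):
    """Convert a cycle count to wall time, assuming a fixed clock frequency."""
    return cycles / clock_hz


measured_cycles = 66_640_979   # "Cycles:" line printed by the program
total_user_cycles = 78_064_581  # cycles:u reported by perf stat
clock_hz = 1.6e9                # 1.6 GHz, as stated in the post
elements = 256                  # assumed: e8, m8 with VLEN=256

seconds = cycles_to_seconds(measured_cycles, clock_hz)
cycles_per_element = measured_cycles / elements
share_of_run = measured_cycles / total_user_cycles

print(f"{seconds:.4f} s for the single vlse8.v")
print(f"{cycles_per_element:.0f} cycles per element")
print(f"{share_of_run:.0%} of the user-mode cycles perf counted")
```

So the one instruction accounts for roughly 0.04 seconds and most of the cycles perf attributes to the process.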

The maximum possible vl is 2^16 I think, so that would fit in XLEN=32?

06.12.2024 16:28 | 👍 1    🔁 0    💬 1    📌 0

With that said I forgot how confusing the V extension hierarchy can be. After thinking about EEW=64 on XLEN=32 I think I need to go lie down a bit 😵‍💫

06.12.2024 16:21 | 👍 2    🔁 0    💬 0    📌 0

Otherwise EEW=64 is supported as usual, since there's also this bit at the bottom:

> The V extension requires the scalar processor implements the F and D extensions

06.12.2024 16:18 | 👍 2    🔁 0    💬 0    📌 0

Is it this bit here?

> The V extension supports all vector load and store instructions (Section Vector Loads and Stores), except the V extension does not support EEW=64 for index values when XLEN=32.

I'm interpreting that as applying to index values only, i.e. the indices passed to vluxei64.v and friends

06.12.2024 16:16 | 👍 2    🔁 0    💬 3    📌 0

Are you talking about zve32x? That doesn't include any fp support, but zve32f should mandate f and zve64f should mandate d I think

06.12.2024 04:48 | 👍 1    🔁 0    💬 1    📌 0
'RVV mask tricks'

# broadcast nth bit
vmand.mm v8, in, mNth
vcpop.m t0, v8
sub t0, x0, t0
vmv.v.x v8, t0

# prefix xor
viota.m v8, in
vand.vi v8, v8, 1
vmsne.vi v8, v8, 0
vmor.mm v0, v8, in # can often be omitted

# move nth bit to first
vmand.mm v8, in, mNth
vcpop.m t0, v8
vmv.v.x v8, t0
vmsof.m v0, v8

# move mask to GPR
vmv.x.s t0, v0
# move GPR to mask
vmv.s.x v0, t0
# assuming vl<=64, set SEW=64 before

# these two should really be dedicated instructions
# shift mask up by 1
vslide1up.vx v8, in, x0
vsrl.vi v8, v8, 7
vmadd.vx v0, 2, v8

# shift mask down by 1
vslide1down.vx  v8, in, x0
vsrl.vi v0, in, 1
vmacc.vx v0, 128, v8

Here are some slightly tricky RVV mask patterns.

03.12.2024 21:37 | 👍 7    🔁 3    💬 1    📌 0
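The "shift mask up by 1" pattern in the listing above treats the mask register as a vector of SEW=8 elements: each byte is doubled and picks up the top bit of the byte below it. Here is a small Python emulation to check that reading. It is a sketch under two assumptions: v0 already holds the input mask before the vmadd, and the literal 2 stands for a scalar register holding 2. The helper names are made up.

```python
def shift_mask_up_bytes(mask_bytes):
    """Emulate the three-instruction sequence on a mask viewed as bytes."""
    # vslide1up.vx v8, in, x0: element i takes element i-1, element 0 takes 0.
    slid = [0] + mask_bytes[:-1]
    # vsrl.vi v8, v8, 7: keep only the top bit of each slid byte (the carry).
    carry = [b >> 7 for b in slid]
    # vmadd.vx v0, 2, v8: v0 = 2 * v0 + v8, truncated to 8 bits (v0 = in assumed).
    return [(2 * b + c) & 0xFF for b, c in zip(mask_bytes, carry)]


def to_int(mask_bytes):
    """View the byte vector as one little-endian bit string."""
    return int.from_bytes(bytes(mask_bytes), "little")


mask = [0x81, 0xFF, 0x00, 0x80]
shifted = shift_mask_up_bytes(mask)
width_bits = 8 * len(mask)

# The byte-wise trick should equal a plain 1-bit left shift of the whole mask.
expected = (to_int(mask) << 1) & ((1 << width_bits) - 1)
print(to_int(shifted) == expected)
```

The bit shifted out of the last byte is dropped, matching a logical shift over the whole mask.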

Even better is being able to measure the numbers yourself without the need for vendor tables. RISC-V support for llvm-exegesis is landing soon IIUC, with RVV not too far behind either.

03.12.2024 03:02 | 👍 4    🔁 0    💬 0    📌 0
RVV benchmark

The RVV Agner Fog is camel-cdr.github.io/rvv-bench-re..., it's an incredibly useful resource. We use it all the time for LLVM!

03.12.2024 00:52 | 👍 3    🔁 0    💬 1    📌 0
