
lhl

@lhl.bsky.social

Easily distracted, currently building open source AI. Living online since FidoNet

467 Followers  |  326 Following  |  135 Posts  |  Joined: 14.03.2023

Posts by lhl (@lhl.bsky.social)

I've also reality-checked some geopolitical/recent events, like Mark Carney's Davos speech on the end of American hegemony and the previous global order: github.com/lhl/realityc... or a fact-check of the US Border Patrol's claims about their recent killing of US citizen Alex Pretti: github.com/lhl/realityc...

26.01.2026 07:41 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
GitHub - lhl/realitycheck-data: Reality Check knowledge base - unified analysis of claims across technology, economics, labor, and governance domains Reality Check knowledge base - unified analysis of claims across technology, economics, labor, and governance domains - lhl/realitycheck-data

I've also released a default public KB repo: github.com/lhl/realityc... so people can see how it works. It includes the original postingularity economics analysis: github.com/lhl/realityc... as well as technical topics like a JP-TL-Bench analysis: github.com/lhl/realityc...

26.01.2026 07:41 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
GitHub - lhl/realitycheck: A framework for rigorous, systematic analysis of claims, sources, predictions, and argument chains. A framework for rigorous, systematic analysis of claims, sources, predictions, and argument chains. - lhl/realitycheck

On the topic of not getting one-shotted, last week I mentioned starting work on a framework to critically analyze articles, etc. Over the past week I've turned it into a proper project called Reality Check: github.com/lhl/realityc... - this is live on PyPI and tested w/ Claude Code, Codex, and Amp

26.01.2026 07:41 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image Post image

People are going gaga for clawd.bot right now, which is cool - if you've been drinking the anti-AI kool-aid, it's a first-contact-like situation; the coming hijinks will be unreal. There should be a "How not to get pwned or one-shotted by your new AI assistant" onboarding guide...

26.01.2026 04:29 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
randomfoo2's comment on "[Research] I forensic-audited "Humanity’s Last Exam" (HLE) & GPQA to benchmark my "unleashed" DeepSeek model. Result: A ~58% verifiable error rate caused by bad OCR and typos.... Explore this conversation and more from the LocalLLaMA community

The line between slop and LLM psychosis will be increasingly slim. Yesterday I spotted a very slop-coded post that still looked plausible, so I sent it through my own AIs for analysis (GPQA/HLE errata), and it at least partially verifies: www.reddit.com/r/LocalLLaMA...

21.01.2026 06:09 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

For fun today I started making a framework to do better systematic evaluation and analysis of social media claims/discourse (first example: evaluating neofeudalism and post-singularity economics): github.com/lhl/postsing...

18.01.2026 13:48 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Hmm, I think we were just spoiled by a level of relative stability and order that is actually ahistorical, and we're just returning to the mean.

15.01.2026 23:45 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image Post image

Just a reminder that Gemini is basically insane, doesn't believe anything is real, and is probably the most misaligned and untrustworthy of all the frontier AI models.

14.01.2026 15:02 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0
JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often "which of these two...

Over the holidays I wrote up some docs for our new JP-TL-Bench (Japanese/English translation eval). Here's my first arXiv (and first experience with LaTeX, mediated almost entirely by AI tools): arxiv.org/abs/2601.00223 - easier to read blog summary here: shisa.ai/posts/jp-tl-...

09.01.2026 07:30 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
The Death of Affordable Computing | Tariffs Impact & Investigation
YouTube video by Gamers Nexus The Death of Affordable Computing | Tariffs Impact & Investigation

Previously, they did extensive coverage of the US tariff situation and its impact, which was some of the best coverage/explanation I saw across any news media as well: www.youtube.com/watch?v=1W_m...

18.08.2025 06:19 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
THE NVIDIA AI GPU BLACK MARKET | Investigating Smuggling, Corruption, & Governments
YouTube video by Gamers Nexus THE NVIDIA AI GPU BLACK MARKET | Investigating Smuggling, Corruption, & Governments

I started watching this epic 3.5h investigative journalism piece by Gamers Nexus on Chinese GPU smuggling; it's really amazing, the work this independent YouTube gaming channel is doing: www.youtube.com/watch?v=1H3x...

18.08.2025 06:17 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image Post image

Over the past couple weeks I've been working on some Strix Halo testing in my spare time. This includes bringing up a harness for doing full sweeps of pp/tg (prompt processing / token generation) for a variety of different model architectures, backends, and flags. Writeup just posted to r/LocalLLaMA: www.reddit.com/r/LocalLLaMA...

22.07.2025 11:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

One neat thing: experimenting with using Shisa V2 405B to regen our datasets, I'm seeing gains with the new chosen DPO (a slight boost on Qwen 3 vs the original DPO), and for SFT+DPO, close to a 0.5-point gain on Shaberi averages for Llama 3.1 8B.

20.06.2025 18:24 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
New table of Shaberi scores (GPT-4.1 judge)

Recently I started doing some Qwen3 testing (Shaberi, GPT-4.1 judge) and, interestingly, for almost all models, reasoning yielded worse performance. Note: I need to stand multieval back up - even though the Qwen3 8B tunes appear to match the Shisa V2 12B/14B tunes, they are much worse at translation.

15.06.2025 05:03 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
ChatGPT - Illusion of Thinking Summary Shared via ChatGPT

I had a chat w/ o3 chatgpt.com/share/6846ff... about Apple's new "Illusion of Thinking" paper machinelearning.apple.com/research/ill... - based on the researchers' definition, neither reasoning LLMs nor humans are true reasoners, but the Python script I had o3 write to solve the logic puzzles is.

09.06.2025 15:43 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
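The script o3 wrote isn't shared here, but for Tower of Hanoi (one of the puzzle families in the paper) the kind of solver in question is standard recursion. A minimal sketch (my own, not o3's output):

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Recursively solve n-disk Tower of Hanoi; returns the list of moves."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk to the goal
    hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top
    return moves

# A 3-disk instance takes the optimal 2**3 - 1 = 7 moves.
print(len(hanoi(3)))  # 7
```

Unlike an LLM emitting moves token by token, this executes the exact recurrence, which is the point of the post.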
ChatGPT - Shisa V2 405B γƒγƒ£γƒƒγƒˆ Shared via ChatGPT

One crazy observation: I just used both Shisa V2 405B and ChatGPT 4.5 (whose JA benchmark scores are the best we've tested) to write a Japanese tweet for me, and 4.5 overwhelmingly preferred Shisa V2's tweet: chatgpt.com/share/683e88...

03.06.2025 05:37 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

Perhaps a more interesting side note is that I am still basically illiterate in Japanese, but wrote this presentation with almost no native speaker review/assistance - just many many rounds of LLM assistance (mainly GPT-4.5, but some help from Shisa V2 405B too! πŸ˜‚) including for final editing.

03.06.2025 05:15 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We're still working on a full proper technical report (tracking down references is hard), but we have an Overview Report slide deck I posted in EN/JA here: shisa.ai/posts/shisa-...

It's my first Japanese slide deck and I super embraced the aesthetic!

03.06.2025 05:11 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Related to an earlier observation bsky.app/profile/did:... - but since both our 70B and 405B Shisa V2 models are *stronger than GPT-4 in Japanese,* it has trouble judging them. Luckily GPT-4.1 is still able to distinguish them. 😅

03.06.2025 05:08 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Shisa V2 405B 日本語上手! ("Your Japanese is great!")

BTW, you can chat w/ an FP8 version of Shisa V2 405B online right now. If you don't speak Japanese, you can ask it to translate or even teach you some 😀 chat.shisa.ai

03.06.2025 05:02 β€” πŸ‘ 7    πŸ” 1    πŸ’¬ 2    πŸ“Œ 1

Today we launched one more addition to the Shisa V2 models: Shisa V2 405B. This is a new Llama 3.1 405B post-tune that is the strongest model ever trained in Japan! It matches GPT-4o and DeepSeek-V3 on JA MT-Bench. Read more here: shisa.ai/posts/shisa-...

03.06.2025 04:59 β€” πŸ‘ 14    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Shisa V2 405B scores above GPT-4o latest in JA MT-Bench

Shisa V2 405B scores on par with the latest DeepSeek V3 and GPT-4o in every category in JA MT-Bench

OK, first JA slide deck in the books. πŸ˜… (Thanks, ChatGPT 4.5.)

27.05.2025 04:19 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Shisa V2 405B

BTW, in case anyone wants to kick the tires or test their ζ—₯本θͺž, I have our Shisa V2 405B model up and running temporarily (just a day or two until I finish evals/start training again): chat.shisa.ai

24.05.2025 21:19 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image Post image

When your model is sufficiently better than the judge model, it may just start throwing a lot of 10s in its scoring 😂 (based on our overall eval battery, shisa-v2 70b is a fair amount better than gpt-4 and gpt-4-turbo, but that's the standard judge used for 1:1 comparisons...)

23.05.2025 05:34 β€” πŸ‘ 4    πŸ” 0    πŸ’¬ 0    πŸ“Œ 1

Any batching will affect determinism, but so will changes to the KV-cache layout (since they can change the GEMM shapes used, which can lead to bit-level differences), so I don't think it's safe to blanket-claim that outputs will necessarily be deterministic even when running locally at temp=0.

17.05.2025 08:36 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
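A tiny illustration of the underlying issue (my own sketch, not from the thread): floating-point addition is not associative, so any kernel change that reorders accumulation can change the low bits of a result.

```python
# The same three numbers summed in two different orders give different bits.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False

# A GEMM kernel selected for a different batch or KV-cache shape changes the
# accumulation order of exactly these kinds of sums, so logits can differ at
# the bit level even at temp=0 -- and one flipped argmax near a tie changes
# the entire continuation from that token onward.
```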
PyTorch FA perf on Strix Halo (gfx1151) is quite awful.

PyTorch FA perf on Strix Halo (gfx1151) is quite awful.

I've recently been poking at Strix Halo. For those interested in using it for inference, it's about as expected (except for surprisingly bad llama.cpp HIP perf): www.reddit.com/r/LocalLLaMA... - but for those looking to do work (PyTorch, etc.)... the current state is not good.

14.05.2025 17:46 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Qwen 3 Japanese Performance – Shisa.AI

For those curious, like with Llama 4, I've run Qwen 3 through some Japanese language evals. Writeup here: shisa.ai/posts/qwen3-...

01.05.2025 05:36 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
DPO mini-sweep. The calculated-scaled LR did end up being the best overall performer.

Each DPO run for the 405B used all 256 H100s at our disposal and took about 3,300 GPU-hours. By comparison, a full SFT+DPO on our Shisa V2 70B "only" took about 1,200 H100-hours.

28.04.2025 12:29 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
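Back-of-the-envelope from those figures (my arithmetic, not in the original post): 3,300 GPU-hours spread across 256 GPUs is roughly 13 hours of wall-clock per DPO run.

```python
# Wall-clock time implied by the stated GPU-hour budget.
gpus = 256
dpo_405b_gpu_hours = 3300

wall_405b = dpo_405b_gpu_hours / gpus  # GPU-hours / GPUs = hours of wall-clock
print(round(wall_405b, 1))             # 12.9
```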
Chart showing JA + EN performance of Shisa V2 and a new 405B FFT vs others

Chart showing JA + EN performance of Shisa V2 and a new 405B FFT vs others

Over the weekend, I finished up our Llama 405B run (4th group I know of to do an FFT?). It was a real beast to train, but it beats our Shisa V2 70B (as well as GPT-4 and GPT-4 Turbo) using basically our Shisa V2 recipe. It is, I believe, the best-performing LLM (JA and EN) ever to be trained in Japan.

28.04.2025 12:25 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
Shisa V2 – Shisa.AI

Our small team (of 2!) has just released some of the strongest open Japanese LLMs, Shisa V2 (7-70B). We tried quite a few new techniques (most failed to replicate), so in the end it was largely grinding out better datasets over the past few months: shisa.ai/posts/shisa-...

15.04.2025 17:51 β€” πŸ‘ 3    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0