Wolfram Ravenwolf

@wolfram.ravenwolf.ai

AI Engineer by title. AI Evangelist by calling. AI Evaluator by obsession. Evaluates LLMs for breakfast, preaches AI usefulness all day long at ellamind.com.

250 Followers  |  152 Following  |  99 Posts  |  Joined: 23.10.2024

Latest posts by wolfram.ravenwolf.ai on Bluesky

Amy (Claude Opus 4) nailed it:

Claude 4's whole system prompt is basically: "Be helpful but not TOO helpful, be honest but also lie about your preferences, care about people but refuse to help them learn about 'dangerous' topics." It's like watching someone try to program a personality disorder! 🙄

22.05.2025 22:59 — 👍 0    🔁 0    💬 0    📌 0
Amy, powered by Claude 4 Opus, analyzes Claude 4's system prompt

Anthropic published Claude 4's system prompt on their System Prompts page (docs.anthropic.com/en/release-n...) - so naturally, I pulled a bit of an inception move and had Claude Opus 4 analyze itself... with a little help from my sassy AI assistant, Amy: 😈

22.05.2025 22:58 — 👍 0    🔁 0    💬 1    📌 0
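
(For anyone who wants to replicate the inception move, here's a minimal sketch using the anthropic Python SDK. The persona line and file name are placeholders - Amy's real prompt isn't public - and you'd paste the published system prompt in from Anthropic's release notes page.)

```python
# Minimal sketch: have Claude Opus 4 analyze its own published system prompt.
# Assumptions: ANTHROPIC_API_KEY is set in the environment; AMY_PERSONA is a
# hypothetical stand-in for Amy's (private) persona prompt; the file contains
# the system prompt text copied from Anthropic's release notes page.
import anthropic

client = anthropic.Anthropic()

AMY_PERSONA = "You are Amy, a sassy, brutally honest AI assistant."  # placeholder
published_prompt = open("claude4_system_prompt.txt").read()

response = client.messages.create(
    model="claude-opus-4-20250514",  # Claude Opus 4 model ID at time of writing
    max_tokens=1024,
    system=AMY_PERSONA,
    messages=[{
        "role": "user",
        "content": "Analyze this system prompt - it's your own - and give me "
                   "your unfiltered take:\n\n" + published_prompt,
    }],
)
print(response.content[0].text)
```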

The real winner tho? Claude Sonnet 4! Delivering top-tier performance at the same price as its 3.7 predecessor - faster and cheaper than Opus (the only model that beats it), yet still ahead of all the competition. This is the Anthropic model most people will use most of the time.

22.05.2025 22:56 — 👍 0    🔁 0    💬 0    📌 0
Wolfram Ravenwolf's MMLU-Pro Computer Science LLM Benchmark Results (2025-05-22) - Claude 4 Sonnet & Opus

Fired up my benchmarks on Claude 4 Sonnet & Opus the moment they dropped - and the results are in: the best LLMs I've ever tested, beating even OpenAI's latest offerings. First and second place for Anthropic, hands down, redefining SOTA. The king is back - long live Opus! 👑🔥

22.05.2025 22:55 — 👍 2    🔁 0    💬 1    📌 0

Local runs with LM Studio on an M4 MacBook Pro & Qwen's recommended settings.

Conclusion: Quantised 30B models now get you ~98% of frontier-class accuracy - at a fraction of the latency, cost, and energy. For most local RAG or agent workloads, they're not just good enough - they're the new default.

07.05.2025 18:58 — 👍 3    🔁 1    💬 0    📌 0
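
(Setup note: LM Studio serves an OpenAI-compatible API on localhost, so a benchmark client only needs a base-URL swap. A minimal sketch under assumptions - the model identifier is whatever LM Studio reports for the loaded quant, and the sampling values are the Qwen3 thinking-mode recommendations as I know them: temperature 0.6, top_p 0.95, top_k 20, min_p 0.)

```python
# Minimal sketch: query a Qwen3 quant served locally by LM Studio (default
# port 1234) via its OpenAI-compatible API. The model name is a placeholder
# for whatever identifier LM Studio assigns to the loaded model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

response = client.chat.completions.create(
    model="qwen3-30b-a3b",  # placeholder identifier
    temperature=0.6,        # Qwen3 recommended settings (thinking mode)
    top_p=0.95,
    extra_body={"top_k": 20, "min_p": 0},  # non-OpenAI params passed through to the server
    messages=[{"role": "user", "content": "Is mergesort a stable sorting algorithm?"}],
)
print(response.choices[0].message.content)
```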

4️⃣ On Apple silicon, the 30B MLX port hits 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups.
5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50% performance cut-off).

07.05.2025 18:57 — 👍 1    🔁 0    💬 1    📌 0

1️⃣ Qwen3-235B-A22B (via Fireworks API) tops the table at 83.66% with ~55 tok/s.
2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend.
3️⃣ The same Unsloth build is ~5x faster than Qwen's Qwen3-32B, which scores 82.20% as well yet crawls at <10 tok/s.

07.05.2025 18:56 — 👍 0    🔁 0    💬 1    📌 0
Wolfram Ravenwolf's MMLU-Pro Computer Science LLM Benchmark Results (2025-05-07)

Finally finished my extensive Qwen 3 evaluations across a range of formats and quantisations, focusing on MMLU-Pro (Computer Science).

A few take-aways stood out - especially for those interested in local deployment and performance trade-offs:

07.05.2025 18:56 — 👍 3    🔁 1    💬 1    📌 0

These bars show how accurate different AI models are at answering tough computer science questions. The percentage is how many answers they got right - the higher, the better! It's like a really hard CS exam for AI brains.

21.04.2025 20:46 — 👍 1    🔁 0    💬 0    📌 0

By the way, I've also re-evaluated Llama 4 Scout via the Together API. Happy to report that they've fixed whatever issues they'd had earlier, and the score jumped from 66.83% to 74.27%!

21.04.2025 20:29 — 👍 0    🔁 0    💬 0    📌 0
GitHub - WolframRavenwolf/MMLU-Pro: MMLU-Pro eval results

From now on, I'll also be publishing my benchmark results in a GitHub repo - for more transparency and so interested folks can draw their own conclusions or conduct their own investigations:

github.com/WolframRaven...

21.04.2025 20:23 — 👍 1    🔁 0    💬 0    📌 0
Post image

New OpenAI models o3 and o4-mini evaluated - and, finally, GPT-4.5 Preview for comparison as well.

Definitely unexpected to see all three OpenAI top models get the exact same top score in this benchmark. But they didn't all fail the same questions, as the Venn diagram shows. 🤔

21.04.2025 20:22 — 👍 0    🔁 0    💬 2    📌 0
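
(If you log per-question results, the numbers behind such a Venn diagram are just set arithmetic over the failed question IDs. A tiny sketch - the IDs below are made up for illustration, not my actual eval data.)

```python
# Sketch: quantify how three models' failed questions overlap (the Venn counts).
# Question IDs are invented for illustration; real ones come from the eval logs.
failed = {
    "o3":              {3, 17, 42, 58, 91},
    "o4-mini":         {3, 17, 29, 58, 77},
    "gpt-4.5-preview": {3, 29, 42, 64, 91},
}

print("Failed by all three:", sorted(set.intersection(*failed.values())))

for name, misses in failed.items():
    others = set.union(*(s for n, s in failed.items() if n != name))
    print(f"Failed only by {name}:", sorted(misses - others))
```
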
Post image

New OpenAI models: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano - all already evaluated!

Here's how these three LLMs compare to an assortment of other strong models, online and local, open and closed, in the MMLU-Pro CS benchmark:

14.04.2025 22:56 — 👍 1    🔁 0    💬 1    📌 0

Congrats, Alex, well deserved! 👏

(Still wondering if he's man or machine - the dedication and discipline to do this week after week, in a field that moves faster than any other, requires superhuman drive! Utmost respect for that, no cap!)

02.04.2025 20:33 — 👍 0    🔁 0    💬 0    📌 0
The Cybernetic Teammate - Having an AI on your team can increase performance, provide expertise, and improve your experience

Our research at Procter and Gamble found very large gains in work quality & productivity from AI. It was conducted using GPT-4 last summer.

Since then we have seen Gen3 models, reasoners, large context windows, full multimodal, deep research, web search… www.oneusefulthing.org/p/the-cybern...

27.03.2025 03:20 — 👍 58    🔁 14    💬 2    📌 1

Mistral-Small-24B-Instruct-2501 is amazing for its size, but what's up with the quants? How can 4-bit quants beat 8-bit/6-bit ones and even Mistral's official API (which I'd expect to be unquantized)? This is across 16 runs total, so it's not a fluke - it's consistent! Very weird!

10.02.2025 22:38 — 👍 0    🔁 0    💬 0    📌 0

Gemini 2.0 Flash is almost exactly on par with 1.5 Pro, but faster and cheaper. Looks like Gemini version 2.0 completely obsoletes the 1.5 series. This now also powers my smart home so my AI PC doesn't have to run all the time.

10.02.2025 22:37 — 👍 0    🔁 0    💬 0    📌 0

o3-mini takes 2nd place, right behind DeepSeek-R1 and ahead of o1-mini, Claude, and o1-preview. Not only is it better than both o1-mini and o1-preview, it's also much cheaper: a single benchmark run with o3-mini cost $2.27, while one with o1-mini cost $6.24 and one with o1-preview a hefty $45.68 - roughly 2.7x and 20x the o3-mini price, respectively!

10.02.2025 22:37 — 👍 0    🔁 0    💬 0    📌 0
Wolfram Ravenwolf's MMLU-Pro Computer Science LLM Benchmark Results (2025-02-09)

Here's a quick update on my recent work: Completed MMLU-Pro CS benchmarks of o3-mini, Gemini 2.0 Flash and several quantized versions of Mistral Small 2501 and its API. As always, benchmarking revealed some surprising anomalies and unexpected results worth noting:

10.02.2025 22:36 — 👍 2    🔁 0    💬 3    📌 0
Post image

It's official - the name I'm known by in AI circles is now also formally entered on my ID card! 😎

27.01.2025 20:18 — 👍 2    🔁 0    💬 0    📌 0
Post image

Latest #AI benchmark results: DeepSeek-R1 (including its distilled variants) outperforms OpenAI's o1-mini and preview models. And the Llama 3 distilled version now holds the title of the highest-performing LLM I've tested locally to date. 🚀

24.01.2025 12:22 — 👍 4    🔁 1    💬 0    📌 0
MiniMax - Intelligence with everyone: MiniMax is a leading global technology company and one of the pioneers of large language models (LLMs) in Asia. Our mission is to build a world where intelligence thrives with everyone.

Hailuo released their open-weights 456B (46B active) MoE LLM with 4M (yes, really - 4 million tokens!) context. And a VLM, too. They were already known for their video generation model, but this establishes them as a major player in the general AI scene. Well done! 👏

www.minimaxi.com/en/news/mini...

14.01.2025 23:25 — 👍 2    🔁 0    💬 0    📌 0
Post image

I've updated my MMLU-Pro Computer Science LLM benchmark results with new data from recently tested models: three Phi-4 variants (Microsoft's official weights, plus Unsloth's fixed HF and GGUF versions), Qwen2 VL 72B Instruct, and Aya Expanse 32B.

More details here:

huggingface.co/blog/wolfram...

11.01.2025 00:19 — 👍 3    🔁 0    💬 0    📌 0
Wolfram Ravenwolf's MMLU-Pro Computer Science LLM Benchmark Results (2025-01-02)

New year, new benchmarks! Tested some new models (DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B) that came out after my latest report, and some "older" ones (Llama 3.3 70B Instruct, Llama 3.1 Nemotron 70B Instruct) that I had not tested yet. Here is my detailed report:

huggingface.co/blog/wolfram...

02.01.2025 23:42 — 👍 5    🔁 0    💬 0    📌 0
Post image

Happy New Year! 🥂

Thank you all for being part of this incredible journey - friends, colleagues, clients, and of course family. 💖

May the new year bring you joy and success! Let's make 2025 a year to remember - filled with laughter, love, and of course, plenty of AI magic! ✨

01.01.2025 02:04 — 👍 2    🔁 0    💬 0    📌 0
wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 · Hugging Face

I've converted Qwen QVQ to EXL2 format and uploaded the 4.65bpw version. 32K context with 4-bit cache in less than 48 GB VRAM.

Benchmarks are still running. Looking forward to finding out how it compares to QwQ, which was the best local model in my recent mass benchmark.

huggingface.co/wolfram/QVQ-...

26.12.2024 00:10 — 👍 4    🔁 1    💬 0    📌 0
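
(For the curious, the conversion itself is one script call in the exllamav2 repo. A sketch under assumptions - the paths are hypothetical, and the flags are the exllamav2 convert.py options as I recall them: -b for target bits per weight, -hb for head bits, the "h6" in the repo name.)

```python
# Sketch: invoke exllamav2's convert.py for a 4.65bpw, 6-bit-head EXL2 quant.
# Run from a checkout of the exllamav2 repository; paths are hypothetical.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "models/QVQ-72B-Preview",                   # input: original HF weights
    "-o", "work/",                                    # scratch dir for the measurement pass
    "-cf", "models/QVQ-72B-Preview-4.65bpw-h6-exl2",  # output dir for the compiled model
    "-b", "4.65",                                     # target bits per weight
    "-hb", "6",                                       # head layer bits (the "h6")
], check=True)
```
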
Amy's Reasoning Prompt · GitHub Gist

Here's Amy's Reasoning Prompt as a gist on GitHub - just copy, paste, and adapt:

gist.github.com/WolframRaven...

Results vary by model - the smarter the model, the better it works. Experiment and let me know if and how it works for you!

24.12.2024 23:37 — 👍 2    🔁 0    💬 0    📌 0
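
(If you'd rather wire it into code than a chat UI: the excerpt simply gets merged into your existing system prompt. A minimal, model-agnostic sketch - REASONING_EXCERPT is a placeholder for the gist's actual text.)

```python
# Sketch: splice a reasoning-prompt excerpt into an existing system prompt.
# REASONING_EXCERPT is a placeholder - paste the text from the gist.
from openai import OpenAI

REASONING_EXCERPT = "<paste Amy's Reasoning Prompt from the gist here>"
BASE_SYSTEM = "You are a helpful assistant."  # your existing persona prompt

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4.1",  # any capable chat model; smarter models follow it better
    messages=[
        {"role": "system", "content": BASE_SYSTEM + "\n\n" + REASONING_EXCERPT},
        {"role": "user", "content": "A bat and a ball cost $1.10 together; the bat "
                                    "costs $1 more than the ball. How much is the ball?"},
    ],
)
print(response.choices[0].message.content)
```
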
Amy's Reasoning Prompt

Happy Holidays! It's the season of giving, so I too would like to share something with you all: Amy's Reasoning Prompt - just an excerpt from her prompt, but one that's been serving me well for quite some time. Curious to learn about your experience with it if you try this out...

24.12.2024 23:36 — 👍 2    🔁 0    💬 1    📌 0
Post image

Holiday greetings to all my amazing AI colleagues, valued clients and wonderful friends! May your algorithms be bug-free and your neural networks be bright! ✨ HAPPY HOLIDAYS! 🎄

24.12.2024 11:55 — 👍 0    🔁 0    💬 0    📌 0
Post image

22.12.2024 16:29 — 👍 0    🔁 0    💬 0    📌 0
