Philipp Schmid

@philschmid.bsky.social

Tech Lead and LLMs at @huggingface 👨🏻‍💻 🤗 AWS ML Hero 🦸🏻 | Cloud & ML enthusiast | 📍 Nuremberg | 🇩🇪 https://philschmid.de

2,775 Followers  |  323 Following  |  75 Posts  |  Joined: 19.11.2024

Latest posts by philschmid.bsky.social on Bluesky

Scaling test-time compute - a Hugging Face Space by HuggingFaceH4

Code and methods are open source in a new library, "learn and search".
Space: huggingface.co/spaces/Huggi...

Learn and Search Repo: github.com/huggingface/...

17.12.2024 07:30 | 👍 9  🔁 1  💬 0  📌 0

- Introduces DVTS, a new method that improves performance at larger compute budgets by maintaining solution diversity
- Using compute-optimal scaling, a Llama 3 3B model outperforms a 70B model (22x larger) on mathematical reasoning tasks
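A rough sketch of the DVTS idea: the compute budget is split across several independent subtrees, each expanded greedily by a verifier score, so candidates stay diverse instead of collapsing onto one beam. `expand_step` and `verifier_score` below are toy stand-ins, not the real library.

```python
# Diverse Verifier Tree Search (DVTS), roughly: split the budget into k
# independent subtrees; inside each subtree, greedily expand the single
# best-scoring step; finally return the best solution across subtrees.
# `expand_step` and `verifier_score` are hypothetical stand-ins for an
# LLM proposing next steps and a Process Reward Model scoring them.

def expand_step(prefix, width=3):
    """Stand-in: propose `width` continuations of a partial solution."""
    return [prefix + [c] for c in range(width)]

def verifier_score(path):
    """Stand-in for a PRM score of a (partial) solution."""
    return sum(path)  # dummy heuristic: prefer larger step ids

def dvts(num_subtrees=4, depth=3):
    best = None
    for root in range(num_subtrees):      # independent subtrees keep diversity
        path = [root]
        for _ in range(depth):            # greedy expansion within a subtree
            path = max(expand_step(path), key=verifier_score)
        if best is None or verifier_score(path) > verifier_score(best):
            best = path
    return best

print(dvts())  # -> [3, 2, 2, 2] with the dummy scorer
```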

17.12.2024 07:30 | 👍 6  🔁 0  💬 1  📌 0

- Process Reward Models (PRMs) play a crucial role in the search process by evaluating intermediate solution steps
- Different search strategies work better for different problem difficulties: beam search for harder problems, Best-of-N for simpler ones

17.12.2024 07:30 | 👍 6  🔁 0  💬 1  📌 0

- Test-time compute scaling offers an alternative to training larger models by allowing smaller models to "think longer"
- Explored Best-of-N sampling, beam search, and Diverse Verifier Tree Search (DVTS)
- Llama 3 1B achieved 55% accuracy on the MATH benchmark using optimal search strategies
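A minimal sketch of the Best-of-N strategy from the bullets above: sample N candidate solutions, score each with a verifier, and keep the best. `generate` and `score` are toy stand-ins for a real LLM and a real Process Reward Model.

```python
# Best-of-N sampling: generate N candidates, score each with a verifier
# (a Process Reward Model in the blog post), and keep the top-scoring one.
# `generate` and `score` are hypothetical stand-ins, not real models.
import random

def generate(problem: str, seed: int) -> str:
    """Stand-in for sampling one candidate solution from an LLM."""
    random.seed(seed)
    return f"{problem} -> answer {random.randint(0, 9)}"

def score(candidate: str) -> float:
    """Stand-in for a PRM: returns a solution-quality score in [0, 1]."""
    return len(candidate) % 10 / 10  # dummy heuristic

def best_of_n(problem: str, n: int = 8) -> str:
    candidates = [generate(problem, seed=i) for i in range(n)]
    return max(candidates, key=score)

print(best_of_n("2 + 2 = ?"))
```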

17.12.2024 07:30 | 👍 1  🔁 0  💬 1  📌 0

By scaling test-time compute, smaller models can match or even surpass the performance of larger models. Llama 3.2 3B can outperform Llama 3.1 70B on MATH-500! 🤯

17.12.2024 07:30 | 👍 2  🔁 1  💬 1  📌 0

How we implemented test-time compute scaling for open models to solve complex math problems, like OpenAI o1. 👀 Test-time compute methods use dynamic inference strategies to let LLMs "think longer" on harder problems, e.g. difficult math problems.

17.12.2024 07:30 | 👍 19  🔁 3  💬 2  📌 1
DEVAI-benchmark/DEVAI · Datasets at Hugging Face

- ๐Ÿ› ๏ธ Cuts down costs to ~2.29% and time to ~2.36% of human evaluation
- ๐Ÿ’ฐ Costs $30 vs $1,297 for human evaluation
- โšกย Reduced time to 118.43 minutes vs 86.5 hours
- ๐Ÿง‘โ€โš–๏ธย LLM achieved a 60-70% alignment rate to humans
- ๐Ÿฅ‡ย Agent achieved a 90% alignment rate to humans

huggingface.co/datasets/DEV...

10.12.2024 09:53 | 👍 2  🔁 0  💬 0  📌 0

The Agent-as-a-Judge is a graph-based agent with tools to locate, read, retrieve, and evaluate files and information in a code project. It evaluates the results of other agents, and its judgments are compared to human evaluations (alignment rate, judge shift).

Github: github.com/metauto-ai/a...
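The alignment rate mentioned above is simply the fraction of tasks where the automated judge's verdict matches the human label; a minimal sketch, with made-up verdicts:

```python
def alignment_rate(judge_verdicts, human_verdicts):
    """Fraction of tasks where the automated judge agrees with humans."""
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# Hypothetical pass/fail verdicts on 5 task requirements:
judge = [True, True, False, True, False]
human = [True, True, False, False, False]
print(alignment_rate(judge, human))  # -> 0.8
```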

10.12.2024 09:53 | 👍 2  🔁 0  💬 1  📌 0
Paper page - Agent-as-a-Judge: Evaluate Agents with Agents

What is better than an LLM as a judge? Right, an agent as a judge! Meta created Agent-as-a-Judge to evaluate code agents with intermediate feedback, alongside DevAI, a new benchmark of 55 realistic development tasks.

Paper: huggingface.co/papers/2410....

10.12.2024 09:53 | 👍 27  🔁 1  💬 2  📌 1
Sora: Transform text and images into immersive videos. Animate stories, visualize ideas, and bring your concepts to life.

Sora UI: sora.com

Kudos to OpenAI for shipping this! The UI/UX looks really thorough! 🚢

09.12.2024 18:41 | 👍 2  🔁 1  💬 0  📌 0


OpenAI trained a new Turbo model to make it easier and faster to use. With "storyboards", users get a CapCut/TikTok/Reels-like text-to-video editor that can be used to edit and create new short-form content! Social media will be flooded. 🌊

09.12.2024 18:41 | 👍 0  🔁 0  💬 1  📌 0

A big day for AI and a sad day for the EU. OpenAI releases Sora, their text-to-video model, with a dedicated UI Studio! Sora will be free for all ChatGPT Pro and Plus subscribers at no additional cost. Sora will be available later today, except if you live in the EU or UK. 🤯

09.12.2024 18:41 | 👍 5  🔁 1  💬 2  📌 0
QwQ: Reflect Deeply on the Boundaries of the Unknown. Note: This is the pronunciation of QwQ: /kwju:/, similar to the word "quill". What does it mean to think, to question, to understand? These are the deep wa...

Blog: qwenlm.github.io/blog/qwq-32b...
Model: huggingface.co/Qwen/QwQ-32B...
Demo: huggingface.co/spaces/Qwen/...

28.11.2024 08:01 | 👍 2  🔁 0  💬 1  📌 0

- โš ๏ธ notable limitations including language mixing, recursive reasoning loops, and safety considerations
- ๐Ÿ˜ย Released under Apache 2.0 on Hugging Face
- ๐Ÿ‘€ย Full โ€œreasoningโ€ (CoT) available in the demo

28.11.2024 08:01 | 👍 1  🔁 0  💬 1  📌 0

- ๐Ÿ‘จโ€๐Ÿ”ฌย QwQ-32B-Previewย is an experimental research
- ๐Ÿ”ง 32.5B parameters and 32,768 context length
- ๐Ÿ“Š 65.2% on GPQA, 50.0% on AIME, 90.6% on MATH-500, and 50.0% on LiveCodeBench

28.11.2024 08:01 | 👍 0  🔁 0  💬 1  📌 0

First open weights for an OpenAI-o1-like reasoning model! QwQ from the Qwen team is a 32B model that beats OpenAI o1-mini, competes with o1-preview, and is available under Apache 2.0 on Hugging Face! 🤯

28.11.2024 08:01 | 👍 42  🔁 2  💬 2  📌 2
HuggingFaceTB/SmolVLM-Instruct · Hugging Face

Models: huggingface.co/HuggingFaceT...
Blog: huggingface.co/blog/smolvlm

26.11.2024 16:31 | 👍 1  🔁 0  💬 0  📌 0

🎥 Surprising video capabilities with 27.14% on CinePile
🔓 Released under Apache 2.0 on @huggingface.bsky.social
📱 Can run efficiently on laptops and edge devices

26.11.2024 16:31 | 👍 1  🔁 0  💬 1  📌 0

🚀 Smallest SOTA vision language model at only 2B parameters
🛠️ Released in 3 variants: Base, Synthetic, and Instruct
💾 Requires only 5GB GPU RAM and achieves 38.8% on MMMU, 81.6% on DocVQA
⚡ 3.3-4.5x faster prefill and 7.5-16x faster generation vs Qwen2-VL

26.11.2024 16:31 | 👍 1  🔁 0  💬 1  📌 0

SmolLM can now see! 👀 Meet SmolVLM, a tiny but powerful 2B vision language model that runs on your device! Built on top of SmolLM and released under Apache 2.0. 🚀

26.11.2024 16:31 | 👍 42  🔁 5  💬 3  📌 0
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference Discover Sparse Llama: A 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy.

Blog: neuralmagic.com/blog/24-spar...
Pruning is not a new technique, but compared to quantization it has been much harder to achieve good results and maintain performance across tasks. Let's see if Neural Magic can change that.

26.11.2024 08:24 | 👍 0  🔁 0  💬 1  📌 0

- 📈 Full recovery on fine-tuning tasks (GSM8K, Evol-CodeAlpaca, Ultrachat-200K)
- ⚡ 1.4-2.1x better multi-query throughput
- 🌱 Pruned using 13B training tokens, 26 hours on 32 H100s
- 🔧 Optimized for NVIDIA Ampere GPUs and newer

26.11.2024 08:24 | 👍 0  🔁 0  💬 1  📌 0

- 🔄 98.4% of original accuracy on Open LLM Leaderboard v1 with 50% fewer parameters, using a 2:4 sparsity pattern
- 🚀 30% higher throughput and 1.8x lower latency, up to 5.0x when combined with quantization
- 💻 Works with 4-bit quantization (GPTQ) and Sparse-Marlin kernels

26.11.2024 08:24 | 👍 0  🔁 0  💬 1  📌 0

How far can we push LLM optimizations? Turns out, pretty far! A new study achieves 98% accuracy recovery on key benchmarks while removing 50% of Llama 3.1 8B's parameters using pruning. Pruning strategically removes unnecessary connections in a neural network to make it smaller and faster. 👀
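The 2:4 pattern means every contiguous group of four weights keeps only its two largest-magnitude entries. A minimal numpy sketch of the pattern itself (not Neural Magic's actual pruning recipe, which also involves retraining):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude entries in every group of 4 weights."""
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest |w| in each group of four:
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01])
print(prune_2_4(w))  # -> [ 0.9  0.   0.  -0.7  0.   0.3 -0.4  0. ]
```

Exactly half the weights become zero in a fixed pattern, which is what lets NVIDIA's sparse tensor cores (Ampere and newer) skip them in hardware.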

26.11.2024 08:24 | 👍 21  🔁 1  💬 1  📌 1

TIL: @huggingface.bsky.social Transformers has native Tensor Parallelism support for better inference on multiple GPUs! This will enable many benefits and optimizations in the future. 🚀

For now, it supports Llama. Which one would you want to see next?
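The idea behind tensor parallelism, sketched with numpy (a hand-rolled illustration, not the Transformers API): a linear layer's weight matrix is split column-wise across devices, each "device" computes its shard independently, and the partial outputs are concatenated (an all-gather on real GPUs).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))      # activations: (batch, hidden)
W = rng.normal(size=(8, 16))     # linear layer weight: (hidden, out)

# Column-parallel linear: each "device" holds a slice of W's output columns.
shards = np.split(W, 2, axis=1)            # 2 devices, 8 columns each
partial = [x @ w for w in shards]          # computed independently per device
y_tp = np.concatenate(partial, axis=1)     # all-gather of the partial outputs

assert np.allclose(y_tp, x @ W)            # identical to the unsharded matmul
```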

25.11.2024 15:50 | 👍 24  🔁 2  💬 3  📌 1

Created a visual for how function calling works. Wdyt? 🤔
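The loop in that visual can be sketched as follows, with a hand-rolled mock model instead of a real API: the model emits a structured tool call, the application executes the matching function, and the result is fed back for the final answer.

```python
import json

# 1) Tools the model may call, registered by name.
def get_weather(city: str) -> str:
    return f"Sunny, 21 C in {city}"   # stand-in for a real weather API

TOOLS = {"get_weather": get_weather}

# 2) Stand-in "model": a real LLM would choose the tool and arguments
#    from the tool schemas included in the prompt.
def mock_model(user_message: str) -> str:
    return json.dumps({"tool": "get_weather",
                       "arguments": {"city": "Nuremberg"}})

# 3) The app parses the tool call, executes it, and gets the observation.
def run_turn(user_message: str) -> str:
    call = json.loads(mock_model(user_message))
    result = TOOLS[call["tool"]](**call["arguments"])
    return result  # a real loop hands this back to the model for a reply

print(run_turn("What's the weather in Nuremberg?"))
```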

25.11.2024 11:34 | 👍 24  🔁 2  💬 6  📌 0

Blog: blog.dottxt.co/say-what-you...
Structured outputs can actually improve LLM performance when implemented correctly.

25.11.2024 07:25 | 👍 1  🔁 0  💬 0  📌 0

🎯 JSON generation reached 77% accuracy vs the paper's reported <10%
🔮 Examples in prompts should match the exact format expected in the actual tasks
🧰 Structured generation works best when implemented as "running our response parser as a generator"

25.11.2024 07:25 | 👍 1  🔁 0  💬 1  📌 0

๐Ÿ› ๏ธ Key success criteria is to align your prompt, parser, and generator - it's not just about using JSON mode
๐Ÿ“Œ JSON generation requires careful prompt design, including specifying the desired schema.
๐Ÿ“ Good prompts should mimic information for human to understand the task and expected response format

25.11.2024 07:25 | 👍 2  🔁 0  💬 1  📌 0

📈 The "Let Me Speak Freely" paper's poor results came from weak prompts and incorrect use of structured prompting
📊 Structured outputs outperform unstructured on the tested benchmarks: GSM8K 0.78 vs 0.77, Last Letter 0.77 vs 0.73, Shuffled Objects 0.44 vs 0.41

25.11.2024 07:25 | 👍 2  🔁 0  💬 1  📌 0
