David Berenstein

@davidberenstein.bsky.social

ML & DevRel @ Giskard & Pruna | ex HF 🤗 | 👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing

1,977 Followers  |  729 Following  |  160 Posts  |  Joined: 12.11.2024  |  1.5173

Latest posts by davidberenstein.bsky.social on Bluesky

🔥 Bespoke curator: Synthetic Data Curation for Post-Training & Structured Data Extraction

Create synthetic data pipelines with ease!
- Retries and caching included
- Inference via LiteLLM, vLLM, and popular batch APIs
- Asynchronous operations

🔗 URL: buff.ly/ajPRT1l

10.04.2025 12:00 | 👍 2  🔁 0  💬 0  📌 0

🔥 One > token > at > a > time < a < at < token < One 🔥

token-explorer is a simple tool that lets you explore different possible paths that an LLM might sample!

- Arrow keys to navigate, pop and append tokens
- View the token probabilities and entropies.

GitHub: buff.ly/FQgsczM

03.04.2025 12:22 | 👍 11  🔁 2  💬 0  📌 0
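
For readers unfamiliar with the probabilities and entropies mentioned in the token-explorer post, here is a minimal sketch of how both can be derived from raw next-token logits. It is not taken from token-explorer itself: the helper names `softmax` and `entropy_bits`, the candidate tokens, and the logits are assumptions made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability tokens contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical logits for four candidate next tokens.
candidates = ["cat", "dog", "car", "the"]
probs = softmax([2.0, 2.0, 0.5, -1.0])

for token, p in zip(candidates, probs):
    print(f"{token!r}: {p:.3f}")
print(f"entropy: {entropy_bits(probs):.3f} bits")
```

A low entropy means the model is close to certain about the next token, while a high entropy means several continuations are about equally likely.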
Preview
GitHub - argilla-io/synthetic-data-generator: Build datasets using natural language. Contribute to argilla-io/synthetic-data-generator development by creating an account on GitHub.

🍽️ Let's dissect the Synthetic Dataset Generator

💬 Natural language prompt to data

🦙 Ollama ensures secure local LLM inference

✍🏼 Argilla's data curation capabilities complete the workflow

🔗 GitHub: buff.ly/5pX49Xc

07.03.2025 13:00 | 👍 5  🔁 0  💬 0  📌 0

🔥 Text2SQL, explore and share any data analysis!

🤗 Hugging Face - Dataset Studio is an amazing new feature.

🚀 Get started yourself: buff.ly/pjpOKav

05.03.2025 10:01 | 👍 2  🔁 0  💬 0  📌 0
Preview
GitHub - MinishLab/vicinity: Lightweight Nearest Neighbors with Flexible Backends

🔥 Vicinity: SEVEN semantic search BACK-ENDS, ONE single INTERFACE!

🫸 New release to push vector search to the Hub and work with any serialisable objects.

🧑‍🏫 KNN, HNSW, USEARCH, ANNOY, PYNNDESCENT, FAISS, and VOYAGER.

🔗 Library:

04.03.2025 09:30 | 👍 3  🔁 0  💬 0  📌 0

🔥 NEW cool NO-CODE solution for clicking together AI WEB APPS!

🎨 Gradio released "gradio sketch"

🚼 Really easy way to create web apps with minimal code.

⚙️ Start with `pip install gradio` & `gradio sketch`

📢 Release: https://buff.ly/41aeLoA

27.02.2025 10:13 | 👍 0  🔁 0  💬 0  📌 0
Preview
vector_search_with_hub_as_backend.ipynb Run, share, and edit Python notebooks

Vector Search - let's keep it clean and lightweight! ⚡️

<100K records, no problem!
>100K, some scaling issues
ANN DuckDB index, sub-second response times

Notebook:

27.02.2025 08:31 | 👍 2  🔁 0  💬 0  📌 0
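
For small collections, the exact baseline that an ANN index is meant to speed up is simply scoring every record against the query. A minimal sketch, assuming cosine similarity over toy in-memory vectors: it does not reproduce the notebook's code or use DuckDB, and `cosine`, `top_k`, and the record ids are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, records, k=2):
    """Score every record against the query and return the k best (id, score) pairs."""
    scored = [(rid, cosine(query, vec)) for rid, vec in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy two-dimensional "embeddings"; real ones would come from an embedding model.
records = {"doc-a": [1.0, 0.0], "doc-b": [0.7, 0.7], "doc-c": [0.0, 1.0]}
print(top_k([1.0, 0.1], records))
```

The cost grows linearly with the number of records, which is why an approximate index becomes attractive for larger collections.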

🔥 The smolagents module has arrived in the agents course!

💻 Code agents optimised for software development
🔧 Tool calling agents that create modular, function-driven workflows
🔍 Retrieval agents designed to access and synthesise information

Course: https://buff.ly/4kcj6Ai

25.02.2025 15:40 | 👍 7  🔁 1  💬 0  📌 0

🧑‍🏫 Awesome. My talk for PyCon Italy 2025 got accepted!

Got data problems? Relax. Synthetic data is here to help.

Talk: https://buff.ly/3QzoZKj

25.02.2025 08:54 | 👍 3  🔁 0  💬 2  📌 0

🐳 Announcing Docker support to quickly set up your Synthetic Data Generator with Gradio + Ollama + Argilla!

🔥 Build genuinely useful datasets using natural language!

⚖️ Scale however you need.

🔐 Use them privately or share them with the world!

🧑‍💻 GitHub: https://buff.ly/49IDSmd

21.02.2025 08:00 | 👍 3  🔁 0  💬 2  📌 0
Preview
smolagents and tools gallery - a Hugging Face Space by davidberenstein1957 Discover amazing ML apps made by the community

With 80K agent builders joining the agents course, it is time to make agents explorable on the Hub!

You can now search and find the perfect agents and tools for your needs!

Powered by @Gradio!

Start searching:

20.02.2025 13:01 | 👍 3  🔁 1  💬 0  📌 0

Image Generation has landed in Arena form 🎨🤖!

1. Describe your desired image 🎨
2. Two anonymous models output images
3. Vote for the winner!

Images have been sourced from our Open Image Preference dataset!
Dataset: https://buff.ly/4il0du9
Arena: https://buff.ly/4142NwH

19.02.2025 11:05 | 👍 1  🔁 0  💬 2  📌 0

Are you the top of the Agents class?!

We just released a bonus unit on function calling (FC).

You will learn:
⑴ What is FC?
⑵ Thought → Act → Observe cycle in FC
⑶ Lightweight and efficient fine-tuning

Course: https://buff.ly/3Qn1DHB

18.02.2025 16:14 | 👍 7  🔁 0  💬 0  📌 0
Preview
Smol Agents and Hugging Face - Anote AI Day Summit 2025

📹 In case you've missed the hype around smolagents, here is a presentation I gave yesterday at an MLOps community event!

library: https://buff.ly/4hj6PrJ
slides: https://buff.ly/3WUzZ8D
video:

14.02.2025 07:18 | 👍 6  🔁 2  💬 0  📌 1
Preview
from bells and whistles to agents and tools

Slides for my MLOps community talk on smolagents!

Slides: https://buff.ly/3WUzZ8D

12.02.2025 11:02 | 👍 5  🔁 0  💬 0  📌 0

🚀 Find banger tools for your smolagents!

I created the Tools gallery, which makes tools specifically developed by/for smolagents searchable and visible. This will help with:
- inspiration
- best practices
- finding cool tools

Space: https://buff.ly/41cYctx

12.02.2025 09:15 | 👍 1  🔁 0  💬 0  📌 0

🔥 Come and get those AI agents certificates!

Join the cohort of 66K students: https://buff.ly/4hxb6rK

10.02.2025 14:38 | 👍 2  🔁 0  💬 0  📌 0

Documents or images to structured data using Vision Language Models

Outlines integrates with transformers and enables structured generation by restricting which tokens can be sampled at each step.

Blog: https://buff.ly/4jFHMkr

10.02.2025 13:00 | 👍 2  🔁 0  💬 0  📌 0
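
The idea of limiting token sampling probabilities can be illustrated with a toy example. This is a minimal sketch in plain Python, not the Outlines API or its implementation: the vocabulary, the logits, and the `constrained_sample` helper are all hypothetical. Disallowed tokens get a logit of negative infinity, so they end up with zero probability before sampling.

```python
import math
import random


def constrained_sample(logits, allowed, rng=random):
    """Sample a token id while giving every disallowed id zero probability."""
    if not any(i in allowed for i in range(len(logits))):
        raise ValueError("no allowed token ids for this step")
    # Mask disallowed positions with -inf so that exp() maps them to 0.0.
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)
    weights = [math.exp(x - peak) for x in masked]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Hypothetical vocabulary and logits; the unconstrained model prefers 'hello'.
vocab = ["{", "}", '"name"', "hello", ":"]
logits = [0.1, 0.3, 1.2, 3.0, 0.2]

# Assumed constraint for this step: a JSON object has to start with '{'.
allowed = {0}
print(vocab[constrained_sample(logits, allowed)])
```

In a real structured-generation setup the allowed set would be recomputed at every step from a schema or grammar; here it is fixed by hand to keep the example short.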

Local Docker deployments for the synthetic data generator 🫱🏾‍🫲🏼

We would love to hear your thoughts!

PR: https://buff.ly/4hRMny6

10.02.2025 10:13 | 👍 1  🔁 0  💬 0  📌 0
Preview
Spaces - Hugging Face Discover amazing ML apps made by the community

Spaces as tools: huggingface.co/spaces

07.02.2025 12:16 | 👍 0  🔁 0  💬 0  📌 0

"Why 🚀?", you may wonder.

The effortlessness of smolagents, combined with the power of 400,000 AI tools available on the Hub!

library: https://buff.ly/4hj6PrJ

07.02.2025 12:14 | 👍 5  🔁 0  💬 1  📌 0

WOW, this will rock the world! Hibiki is a model for simultaneous speech2speech translation.

And it actually works.

Available in French-English but super excited to see what the community will do.

Hub: https://buff.ly/3EtmM0f
Paper: https://buff.ly/4jIXNGd

06.02.2025 15:06 | 👍 2  🔁 0  💬 0  📌 0
Preview
Agentic RAG Stack (3/5) - Generate responses using a SmolLM A Blog post by David Berenstein on Hugging Face

Agentic RAG: Applied, visual, and step-by-step! 🐾

Get familiar with the Agents and tools, not the bells and whistles!

Retrieve - Augment and now GENERATE.

Parts:
1: https://buff.ly/40XNIxM
2: https://buff.ly/40HkB0x
3:

06.02.2025 09:47 | 👍 0  🔁 0  💬 0  📌 0

🤯 Bring your own AI data, even if you have none!

Describe your dataset for RAG, LLMs or Text Classification
Bring your own context!
Press play and wait

Space: https://buff.ly/3Y1S99z
GitHub: https://buff.ly/49IDSmd

06.02.2025 08:00 | 👍 4  🔁 0  💬 0  📌 0

Anyone can create free hosted tools for their AI agents! 🔥

Agentic RAG stack part 2 - augment
Reranking the retrieval results improves the retrieved content without adding much latency.

part2: https://buff.ly/40HkB0x
part1: https://buff.ly/40XNIxM
code: https://buff.ly/4hEajpj

05.02.2025 10:11 | 👍 1  🔁 0  💬 0  📌 0

🔥 How to find and install the latest AI apps from the AI app store

1. Go to https://buff.ly/42CnUbU
2. Search for the app you like
3. Go to the settings at the bottom
4. Open the URL
5. Press the search bar to install

More info: https://buff.ly/3Csqc2J

05.02.2025 07:40 | 👍 0  🔁 0  💬 0  📌 0
Preview
GitHub - argilla-io/synthetic-data-generator: Build datasets using natural language. Contribute to argilla-io/synthetic-data-generator development by creating an account on GitHub.

Or use the tool directly! github.com/argilla-io/s...

04.02.2025 16:17 | 👍 0  🔁 0  💬 0  📌 0
Preview
Fine-tune ModernBERT for RAG with Synthetic Data A Blog post by Sara Han Díaz on Hugging Face

Retrievers and rankers are a crucial part of optimising RAG.

Easier to fine-tune than LLMs. More predictable than prompts.

Training data is hard to find, so we offer private and free synthetic data on your own documents!

Blog:

04.02.2025 16:11 | 👍 3  🔁 0  💬 1  📌 0
Preview
Index and retrieve documents for vector search using Sentence Transformers and DuckDB A Blog post by David Berenstein on Hugging Face

Creating an agentic RAG stack on the Hugging Face Hub - part 1 - retrieval (1/5).

🚀 Web apps and microservices included!

Chunk, embed and index documents at a huge scale without overhead.

Blog:

04.02.2025 13:00 | 👍 0  🔁 0  💬 0  📌 0

Shit! 24B is the new small.

Mistral drops their new model on Hugging Face!

Great performance, and low latency.

Model: https://buff.ly/4hwAzBa
Code: https://buff.ly/3CEohrF

30.01.2025 17:10 | 👍 12  🔁 2  💬 0  📌 0

@davidberenstein is following 20 prominent accounts