
Dave Willner

@dwillner.bsky.social

Co-Founder at Zentropi. Formerly Head of Trust & Safety at OpenAI, of Community Policy at Airbnb, and of Content Policy at Facebook. Strictly cold takes.

9,415 Followers  |  1,786 Following  |  189 Posts  |  Joined: 06.05.2023

Latest posts by dwillner.bsky.social on Bluesky

[Link card: zentropi-ai/cope-a-9b · Hugging Face]

You could also just run that policy using CoPE as the labeler in production - the interpreting model is only 9B parameters and is open sourced, so we can run it for you or you can run it on your own infra! huggingface.co/zentropi-ai/...

01.08.2025 14:50 — 👍 1    🔁 0    💬 0    📌 0
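A minimal sketch of what running the open-sourced labeler on your own infra might look like, assuming a plain causal-LM interface through Hugging Face transformers. The prompt template and the label() helper here are illustrative assumptions, not the format documented on the model card:

```python
# Hypothetical sketch: run CoPE-A-9B locally as a binary policy labeler.
# The prompt shape below is an assumption -- check the model card at
# huggingface.co/zentropi-ai/cope-a-9b for the real expected template.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zentropi-ai/cope-a-9b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def label(policy: str, content: str) -> int:
    """Return 1 if the model judges `content` to match `policy`, else 0."""
    prompt = f"POLICY:\n{policy}\n\nCONTENT:\n{content}\n\nLABEL:"  # assumed format
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    answer = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    return 1 if answer.strip() == "1" else 0
```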

Thanks friend!

01.08.2025 14:48 — 👍 0    🔁 0    💬 0    📌 0

This is actually and truly huge. That workshop was ridiculous to hear about, and I think I saw like a thousand lightbulbs turn on in people's heads at the same time

31.07.2025 22:44 — 👍 9    🔁 2    💬 1    📌 0

This looks absolutely amazing and a quick perusal shows it might actually make running a labeler smooth enough that I might be able to do it once we figure out why my brain is melting

01.08.2025 12:49 — 👍 108    🔁 8    💬 5    📌 1

It means a lot to me that you like it 😀

01.08.2025 14:47 — 👍 5    🔁 0    💬 0    📌 0

The system offers you candidate policy revisions (and flags data labels you applied that it thinks might not follow from your policy); you then read and assess them, accepting or rejecting each one depending on whether it's closer to what you want.

01.08.2025 14:44 — 👍 2    🔁 0    💬 1    📌 0

Here again, you can end up with a policy you *don't want*, but it can't really be hallucinated in the traditional sense, since the policy is a set of definitions you're asserting for the purposes of this labeling exercise. There's no ground-truth "true" policy; it's a construct.

01.08.2025 14:44 — 👍 2    🔁 0    💬 1    📌 0

The automatic model improvement system is...very complicated to explain. But it basically uses a larger stock LLM - guided by a combination of a starting policy, a starting set of data labels, and CoPE's understanding of the two - to propose and test revisions to the policy text.

01.08.2025 14:44 — 👍 1    🔁 0    💬 1    📌 0
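A rough sketch of what a loop of that shape could look like: a larger LLM proposes rewrites of the policy text, and the small labeler's agreement with the existing labels scores each candidate. Every helper name here (propose_revision, label) is a hypothetical stand-in, not Zentropi's actual pipeline, and per the post above a human still reviews the candidates in the real system:

```python
# Hypothetical greedy policy-revision loop: keep a candidate rewrite only
# if the labeler reproduces more of the human labels under it.
from typing import Callable

Dataset = list[tuple[str, int]]  # (content, human_label) pairs

def agreement(policy: str, data: Dataset,
              label: Callable[[str, str], int]) -> float:
    """Fraction of human labels the labeler reproduces under this policy text."""
    return sum(label(policy, content) == y for content, y in data) / len(data)

def optimize_policy(policy: str, data: Dataset,
                    label: Callable[[str, str], int],
                    propose_revision: Callable[[str, Dataset], str],
                    rounds: int = 5) -> str:
    best, best_score = policy, agreement(policy, data, label)
    for _ in range(rounds):
        # Show the larger LLM where the current policy and the labels disagree.
        misses = [(c, y) for c, y in data if label(best, c) != y]
        candidate = propose_revision(best, misses)
        score = agreement(candidate, data, label)
        if score > best_score:
            best, best_score = candidate, score
    return best
```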

CoPE itself simply returns 0 or 1 in response to a content policy + an example, which is how it indicates whether it assesses that the example matches the policy. So it can be wrong (and definitely is sometimes), but it can't really confabulate per se in that way.

01.08.2025 14:44 — 👍 1    🔁 0    💬 1    📌 0
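Given that contract (policy + example in, 0 or 1 out), surfacing labels that "might not follow from your policy" reduces to a disagreement check between the labeler and the human labels. A minimal sketch, reusing the hypothetical label() helper from the block further up:

```python
# Sketch: flag human labels that the binary labeler disagrees with.
from typing import Callable, Iterable

def find_disagreements(
    policy: str,
    examples: Iterable[tuple[str, int]],  # (content, human_label) pairs
    label: Callable[[str, str], int],
) -> list[tuple[str, int, int]]:
    """Return (content, human_label, model_label) triples where the two differ."""
    flagged = []
    for content, human_label in examples:
        model_label = label(policy, content)
        if model_label != human_label:
            flagged.append((content, human_label, model_label))
    return flagged
```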

We wrote those policies ourselves along with contributors including @emilycapstick.bsky.social, @klyman.bsky.social, and others credited in the model card who aren't on Bluesky. That group also labeled the data using a combined LLM/manual process (too complex for here, covered in the paper).

01.08.2025 14:44 — 👍 1    🔁 0    💬 1    📌 0

We trained CoPE ourselves using a small number (~7k) of already-open-source examples of abuse data that we labeled from ~70 different policy perspectives. We've got a draft of a paper explaining in detail how we did that, which we're going to finish editing and put up on arXiv in the next two weeks.

01.08.2025 14:44 — 👍 1    🔁 0    💬 2    📌 0
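A sketch of the kind of training rows that setup implies: each example gets paired with several policy texts, yielding one binary label per (policy, example) pair. Field names are illustrative assumptions; the forthcoming paper has the real details:

```python
# Hypothetical training-row shape for multi-perspective policy labeling.
from dataclasses import dataclass

@dataclass
class TrainingRow:
    policy: str   # full policy text, one of the ~70 perspectives
    content: str  # the abuse-data example being judged
    label: int    # 1 = matches this policy, 0 = does not

def expand(example: str, judgments: dict[str, int]) -> list[TrainingRow]:
    """One training row per policy perspective that judged this example."""
    return [TrainingRow(policy, example, y) for policy, y in judgments.items()]
```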
[Link card: zentropi-ai/cope-a-9b · Hugging Face]

Great questions. The answer to this is pretty dense, but I will do my best! There are multiple models used here. The one that gets used a lot is CoPE-A-9B, which interprets and applies the policies to specific examples. We've open-sourced that model; you can find it at huggingface.co/zentropi-ai/...

01.08.2025 14:44 — 👍 2    🔁 0    💬 1    📌 0

It's definitely not a perfect system, but honestly the imperfections here are more around the potential for us to get it to work better than they are around either energy use or privacy, where it is neutral-to-good compared to the status quo ante.

01.08.2025 14:26 — 👍 4    🔁 0    💬 1    📌 0

-> We do have the data you give us for policy writing on the site (the auto-optimization is a pretty complex multistep process behind the scenes, we need the data to show you results on the site, etc.), but those sets don't need to be all that large, or even real user data.

01.08.2025 14:26 — 👍 0    🔁 0    💬 1    📌 0
[Link card: zentropi-ai/cope-a-9b · Hugging Face]

-> The API we provide for people to apply policies to specific pieces of content is zero-data-retention - we don't keep or train on any of it. We also open-sourced the model, so you can run it yourself if you need/want: huggingface.co/zentropi-ai/...

01.08.2025 14:26 — 👍 2    🔁 0    💬 1    📌 0

-> That's small enough that it's not *that much* bigger than traditional black box ML models that are already used very widely by big social networks (but which require humans to look at thousands of horrors repeatedly to train).

01.08.2025 14:26 — 👍 1    🔁 0    💬 1    📌 0

Can't speak to other attempts, but we thought about both issues pretty carefully:

-> The interpreting model (it reads and applies the policy, and is the part you'd run at volumes high enough to be an environmental issue) is only 9B parameters (less than 1% the size of GPT-4), so it is energy-light to train and to run.

01.08.2025 14:26 — 👍 4    🔁 0    💬 1    📌 0

If there's a way to build tools that can be useful here using LLMs, I think that's clearly good no matter what you think of "AI". Ultimately I'm trying to figure out ways to make a community of people I've worked with my entire adult life able to do their jobs better with less trauma.

01.08.2025 07:59 — 👍 2    🔁 0    💬 0    📌 0

For context, I was Facebook's 12th content moderator. Looking at very messed up content was my job for years. Done at the industrial scales of big platforms (not small-forum scale), it is bad for the folks who do it *and* leads to mediocre results. So we don't have to be "super" here to be useful.

01.08.2025 07:59 — 👍 1    🔁 0    💬 1    📌 0

That's not my claim here - we've fine-tuned a very small model to be pretty good (definitely not perfect) at following clearly written policies, and used that to build a system that helps policy writers figure out how to say what they mean much more quickly than puzzling through it alone.

01.08.2025 07:59 — 👍 1    🔁 0    💬 1    📌 0

OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY
OPEN SOURCE SAFETY

31.07.2025 22:46 — 👍 6    🔁 1    💬 0    📌 0

It's not going to magically fix everything. But I am cautiously optimistic that this, and things like it, will let us make real progress for the first time in a while.

31.07.2025 22:14 — 👍 3    🔁 0    💬 0    📌 0

Getting better at multi-turn is, like multilingual performance, one of those things that keeps coming up, so we will likely work on it at some point. The main constraint with our approach is that you need to express policies as explicit criteria about the content itself.

31.07.2025 21:30 — 👍 0    🔁 0    💬 0    📌 0
[Link card: Zentropi CoPE Demo - a Hugging Face Space by zentropi-ai. Enter content and a policy; the app returns "1" if the content meets any criteria and "0" if it does not.]

Aaah, got it. We've open-sourced our first (and current) version of the underlying classification model here - huggingface.co/spaces/zentr...

It's a fine-tune of Gemma 2, so it only has 8,000 tokens of context. It can work with multi-turn conversations, but wasn't specifically trained for them.

31.07.2025 21:28 — 👍 3    🔁 0    💬 1    📌 0
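A small sketch of one way to work within that 8,000-token window when labeling multi-turn conversations: keep the most recent turns that fit a token budget. The helper and the 6,000-token budget are assumptions for illustration, not part of the released model:

```python
# Sketch: fit a multi-turn transcript into the ~8k-token context of the
# Gemma 2 fine-tune before labeling. The budget leaves illustrative headroom
# for the policy text and prompt scaffolding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zentropi-ai/cope-a-9b")

def fit_conversation(turns: list[str], budget: int = 6000) -> str:
    """Drop the oldest turns until the transcript fits within `budget` tokens."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # walk newest-first
        n = len(tokenizer(turn)["input_ids"])
        if used + n > budget:
            break
        kept.append(turn)
        used += n
    return "\n".join(reversed(kept))  # restore chronological order
```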

@samidh.bsky.social and @jbellack.bsky.social have had some similar thoughts. We'd be glad to discuss if you're interested!

31.07.2025 21:16 — 👍 8    🔁 0    💬 1    📌 0
[Link card: zentropi-ai/cope-a-9b · Hugging Face]

Re: multilingual, we've open-sourced the labeling model at huggingface.co/zentropi-ai/.... We've also got a paper forthcoming shortly (I need to do my edits) on the training methodology. The hope here is that folks will be able to help improve the broader project and/or make tailored versions.

31.07.2025 20:59 — 👍 12    🔁 0    💬 1    📌 1

Samidh did some simple experiments with a less performant version back in December and it did seem to be probably-workable!

31.07.2025 20:39 — 👍 3    🔁 0    💬 1    📌 0

The workshop was successful enough that we're considering running another one, if you're interested! Re: language - CoPE-A is decent outside of English, but definitely worse than in English, and improving that is on our roadmap. There's no underlying methodological reason it can't be done; we just have to do the work!

31.07.2025 20:32 — 👍 10    🔁 0    💬 1    📌 0

Yay!

31.07.2025 20:01 — 👍 1    🔁 0    💬 0    📌 0

Just tested this on a few that I know Reddit's existing Hatred & Harassment automation has blind spots for; it built a focused, accurate labeler in under 10 minutes / a dozen examples, & the human-readable criteria it built could be dropped into a training manual / erratum / used to build a regex

31.07.2025 19:48 — 👍 12    🔁 5    💬 2    📌 0
