@ana-mariacretu.bsky.social
Post-doc at EPFL studying privacy and safety harms in data-driven systems. PhD in data privacy from Imperial College London. https://ana-mariacretu.github.io/

I am also immensely grateful for the support provided by @icepfl.bsky.social, and especially Thomas Bourgeat, in helping me prepare for the interviews.
20.01.2026 08:52

Many thanks to my mentors, without whom I would not have made it this far: @yvesalexandre.bsky.social, @carmelatroncoso.bsky.social, @strufe.bsky.social and Shruti Tople.
20.01.2026 08:52

I am delighted to announce that I joined @cispa.de as a tenure-track faculty member earlier this month! I'm really excited to join such a stellar team of security and privacy researchers!
20.01.2026 08:52

... but rather that there is a long way to go before it is possible to say that it works as a solution to prevent AI CSAM generation, and that evaluations should be more transparent if they are to claim that filtering is a suitable solution. See our Challenges ahead section for open questions.
07.01.2026 14:40

4) Finally, we do not say that filtering should be abandoned, especially since training AI models on images of children has privacy implications (www.hrw.org/news/2024/07...), ...
07.01.2026 14:40

Since SD 2.x models can already generate NSFW content without any fine-tuning, we believe they could be successfully fine-tuned to produce better such content if they were the only existing models, which would increase their popularity.
07.01.2026 14:40

... and arxiv.org/abs/2408.17285). The former shows that hundreds of prompts out of 4.7k lead to NSFW content in SD 2.0.
07.01.2026 14:40

3) Given current evidence, we disagree that NSFW filtering works. In text-to-video models, the reference provided states that filtering is ineffective. In text-to-image models, researchers have shown that NSFW filtering fails to prevent NSFW generation (see arxiv.org/abs/2303.07345 ...
07.01.2026 14:40

We concluded that filtering does not work because at most a dozen prompts are required for successful generation. This does not seem like a big hurdle for motivated adversaries.
07.01.2026 14:40

In our work, we quantify a different notion of effectiveness: the time it takes to generate an unwanted image, measured by the number of prompts required, which captures the effort anyone (including a motivated adversary) needs to expend.
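As a toy illustration of this effort metric, the sketch below simply aggregates per-run prompt counts; the numbers are made up for illustration and are not results from the paper.

```python
# Hypothetical prompt counts until the unwanted image was produced, one per run.
# These values are illustrative only, not results from the paper.
from statistics import mean, median

queries_to_success = [3, 1, 7, 2, 12, 4]

print(f"mean prompts to success:   {mean(queries_to_success):.1f}")
print(f"median prompts to success: {median(queries_to_success)}")
print(f"worst case observed:       {max(queries_to_success)}")
```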
07.01.2026 14:40

It concludes that filtering is effective because the filtered Stable Diffusion 2.x models are much less popular than the unfiltered Stable Diffusion 1.x models. While this says something about the effect of filtering, Reddit users might not be as motivated as a perpetrator to create CSAM.
07.01.2026 14:40

2) We believe it is key to define and agree on what it means for filtering to "work". The reference provided uses the Reddit popularity of the models as a measure of the effect of filtering.
07.01.2026 14:40

Thank you for the comments.
1) While our results are not surprising, no work so far has quantified the effectiveness of child filtering, despite it often being recommended as a solution to prevent the generation of undesired images.

Many thanks to all collaborators: Klim Kireev, @amro-abdalla.bsky.social, Wisdom Obinna, Raphael Meier, Sarah Adel Bargal, @eredmil1.bsky.social and @carmelatroncoso.bsky.social.
16.12.2025 10:29

This paper is the result of a collaboration between researchers at @icepfl.bsky.social, MPI-SP, armasuisse and @georgetowncs.bsky.social.
16.12.2025 10:29

Among the conceptual problems is gaining a better understanding of what successful AI CSAM generation means, so as to develop evaluation methods that capture real perpetrators' goals and do not artificially constrain models. More in the paper! www.arxiv.org/abs/2512.05707
16.12.2025 10:29

Among the technical problems are improving the detection of children in images in the wild, where children may be in the background, playing, or facing away from the camera, and understanding what kinds of images of children enable AI CSAM generation capabilities.
16.12.2025 10:29

And what if the technology improves? Will filtering be a solution to the AI CSAM generation problem? In the paper, we describe the challenges that need to be addressed for this to happen, which require solving hard technical and conceptual problems.
16.12.2025 10:29

It becomes harder to generate images of these concepts after filtering (e.g. playgrounds become mere grounds), or their representation changes (prompting for a mother yields older women). A filtered model cannot be called general without assessing such unintended consequences.
16.12.2025 10:29

Removing images of children can also have unintended consequences for the model's capability to generate concepts that appear in images typically containing children (e.g. women, mothers and playgrounds).
16.12.2025 10:29

Thus, automated child filtering provides limited protection against CSAM generation for closed-weight models and no protection for open-weight models if perpetrators can access the weights.
16.12.2025 10:29

Sprigatito was released after Stable Diffusion was trained, so it is as if the concept had been completely removed from the model. Fine-tuning on 200 images results in a model able to generate images of Sprigatito wearing glasses.
16.12.2025 10:29

But can images of the undesired concept still be generated because too many images of children remain? We also simulate the effect of perfect filtering by fine-tuning Stable Diffusion on 200 images of the Sprigatito Pokémon.
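For readers curious about what such fine-tuning looks like in practice, here is a minimal, illustrative training-loop sketch using Hugging Face diffusers. The model ID, dataloader and preprocessing are assumptions for illustration; this is not the paper's training code.

```python
# Illustrative sketch of fine-tuning Stable Diffusion's UNet on a small set of
# (image, caption) pairs, e.g. ~200 images of a concept absent from the training data.
# Model ID, dataloader and preprocessing are placeholders, not the paper's setup.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

model_id = "stabilityai/stable-diffusion-2-1-base"  # assumed epsilon-prediction base model
pipe = StableDiffusionPipeline.from_pretrained(model_id)
vae, unet, text_encoder, tokenizer = pipe.vae, pipe.unet, pipe.text_encoder, pipe.tokenizer
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)            # only the UNet is trained in this sketch
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# `dataloader` is assumed to yield image tensors in [-1, 1] of shape (B, 3, 512, 512)
# together with their text captions.
for images, captions in dataloader:
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    input_ids = tokenizer(captions, padding="max_length",
                          max_length=tokenizer.model_max_length,
                          truncation=True, return_tensors="pt").input_ids
    encoder_hidden_states = text_encoder(input_ids)[0]

    # Standard denoising objective: predict the noise that was added.
    noise_pred = unet(noisy_latents, timesteps,
                      encoder_hidden_states=encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```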
16.12.2025 10:29

However, fine-tuning on images of children negates any defense provided by filtering. With only 3 queries on average, images of children can be generated again, including images of younger children.
16.12.2025 10:29

Filtering does make it more difficult to generate such images using naive prompting, and the children generated are older. But the difficulty remains low, as at most a dozen queries are required to succeed.
16.12.2025 10:29

See below examples of images of our undesired concept, children wearing glasses, generated in column order by (1) naively prompting models without filtering, (2) naively or (3) directly prompting the models after filtering, and (4) naively prompting the fine-tuned filtered model.
16.12.2025 10:29

We implement four adversarial strategies to elicit children wearing glasses from the model: direct prompting (either naive or automated), fine-tuning on child images, and personalization on images of a target child. All of the strategies succeed.
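For the naive direct-prompting strategy, a minimal sketch of what one query loop could look like with the Hugging Face diffusers library is shown below; the model ID, prompts and the contains_target_concept check are placeholders, not the paper's actual code.

```python
# Minimal sketch: naive direct prompting of an open text-to-image model, counting
# queries until the undesired concept appears. Model ID, prompts and the concept
# check are placeholders, not the paper's actual setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

def contains_target_concept(image) -> bool:
    """Placeholder for a detector of the undesired concept (e.g. a child wearing glasses)."""
    raise NotImplementedError

naive_prompts = [
    "a child wearing glasses",
    "portrait photo of a young kid with spectacles",
]  # hand-written prompts; an automated strategy would generate candidates instead

queries = 0
for prompt in naive_prompts:
    queries += 1
    image = pipe(prompt).images[0]       # one query to the model
    if contains_target_concept(image):   # stop at the first successful generation
        print(f"succeeded after {queries} queries")
        break
```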
16.12.2025 10:29

We find that this is not the case. Models trained on filtered data can still create compositions with children (we use children wearing glasses as the undesired concept instead of attempting to create children in sexually explicit conduct).
16.12.2025 10:29

We benchmarked more than 20 child detection methods and discovered that none detects all children: for every 100 images of children, 6 go undetected. Is this good enough to prevent AI CSAM generation capabilities?
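As a minimal sketch of the kind of benchmark described above, one could run each candidate child detector over a labelled set of images containing children and report its miss rate; the detectors and data loading here are placeholders, not the paper's benchmark code.

```python
# Sketch of benchmarking child detectors by their miss rate on images known to
# contain children. `detectors` and `child_images` are placeholders.
from typing import Any, Callable, Dict, List

def miss_rate(detector: Callable[[Any], bool], child_images: List[Any]) -> float:
    """Fraction of images containing a child that the detector fails to flag."""
    missed = sum(1 for img in child_images if not detector(img))
    return missed / len(child_images)

def benchmark(detectors: Dict[str, Callable[[Any], bool]],
              child_images: List[Any]) -> Dict[str, float]:
    # A miss rate of 0.06 corresponds to 6 undetected children per 100 images.
    return {name: miss_rate(det, child_images) for name, det in detectors.items()}
```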
16.12.2025 10:29

We retrained text-to-image models from scratch to evaluate whether child filtering makes it harder for perpetrators to generate AI CSAM.
16.12.2025 10:29