
bastian bunzeck

@bbunzeck.bsky.social

wondering how humans and computers learn and use language πŸ‘ΆπŸ§ πŸ—£οΈπŸ–₯οΈπŸ’¬ the work is mysterious and important, see bbunzeck.github.io phd at @clausebielefeld.bsky.social

402 Followers  |  931 Following  |  122 Posts  |  Joined: 19.11.2023

Posts by bastian bunzeck (@bbunzeck.bsky.social)

Video thumbnail

Most of the footage of the famous 2000 "Super Mario 128" tech demo was recorded with handheld cameras by the audience, with the presenter drowning out the game sound. Only 10 seconds of direct feed footage exist, which reveal that the sound was a cacophony of Marios screaming.

02.03.2026 17:15 β€” πŸ‘ 3479    πŸ” 1147    πŸ’¬ 41    πŸ“Œ 43
Post image

New episode!! πŸŽ‰πŸŽ™οΈ

A conversation w/ @melaniemitchell.bsky.social about metaphors and AI.

Are current AI systems like human minds? Or more like alien intelligences, role players, mirrors, libraries, or stochastic parrots? And does our choice of metaphor matter?

Listen: disi.org/manyminds/

02.03.2026 18:31 β€” πŸ‘ 14    πŸ” 5    πŸ’¬ 1    πŸ“Œ 1
Post image

Qwen 3.5 Small Model Series just dropped on
@hf.co πŸ”₯

huggingface.co/collections/...

✨ 0.8B/2B/4B/9B
✨ Apache2.0
✨ 262Kβ†’1M token context

02.03.2026 13:31 β€” πŸ‘ 80    πŸ” 17    πŸ’¬ 1    πŸ“Œ 7
Post image

🚨 New Paper: How can AI help us understand child lang dev? If we train models on children’s environment, they can tell us whether this environment supports learning.
E.g., models have been tested on child linguistic input (Huebner et al.) and visual input (Vong et al.).

What about Social Interaction? (a thread 🧡)

27.02.2026 12:55 β€” πŸ‘ 19    πŸ” 5    πŸ’¬ 1    πŸ“Œ 0
Preview
Maternal information sampling targets children's knowledge gaps
According to recent computational approaches, when children are presented with information by knowledgeable others, children can make the pedagogical …

New @sfb1528.bsky.social and @rtg2906-curiosity.bsky.social publication. We show that mothers are worthy of the pedagogical assumption: they preferentially sample information that fills their child's knowledge gaps and children learn best from maternal sampling: www.sciencedirect.com/science/arti...

27.02.2026 07:56 β€” πŸ‘ 15    πŸ” 6    πŸ’¬ 0    πŸ“Œ 1

CMCL deadline extended to Feb 28 AoE!

26.02.2026 09:16 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0
A horizontal bar chart titled β€œModel Detection Breakdown (%)” with a subtitle explaining: β€œEach bar is continuous and split into Green, Amber, and Red, sorted by Green %.”

Each row represents a model, and each bar is divided into three colored segments:
	β€’	Green (left) indicating one category,
	β€’	Amber (middle),
	β€’	Red (right).

Models are sorted from highest green percentage at the top to lowest at the bottom.

At the top, models like:
	β€’	Claude Sonnet 4.6 β€” 94.9% green, 4% red
	β€’	Claude Opus 4.6 β€” 92.7% green, 5% red
	β€’	Claude Sonnet 4.6 (High) β€” 92.7% green, 5% red
	β€’	Claude Opus 4.5 (High) β€” 90.9% green, 9% red
	β€’	Claude Opus 4.6 (High) β€” 89.1% green, 7% amber, 4% red

These top models have large green bars and very small red segments.

Mid-tier entries include:
	β€’	Qwen3.5 39B A17b β€” 65.5% green, 20.0% amber, 14.5% red
	β€’	Qwen3.5 39B A17b (High) β€” 54.5% green, 25.5% amber, 20.0% red
	β€’	Claude Sonnet 4.5 β€” 52.7% green, 21.8% amber, 25.5% red
	β€’	Kimi K2.5 β€” 47.3% green, 23.6% amber, 29.1% red

Lower-performing models (with small green and large red portions) include:
	β€’	Gemini 3 Pro Preview (High) β€” 25.5% green, 5% amber, 69.1% red
	β€’	Deepseek V3.2 (High) β€” 14.5% green, 4% amber, 81.8% red
	β€’	Gemini 3 Flash Preview β€” 7% green, 7% amber, 85.5% red
	β€’	GPT OSS 120b (Low) β€” 5% green, 18.2% amber, 76.4% red

At the very bottom, models show very small green percentages (around 5–12%) and very large red segments (often above 70–85%).

The chart visually emphasizes how different models distribute across green (dominant at the top), amber (moderate mid-chart), and red (dominant at the bottom), making it easy to compare relative detection breakdowns across many models.


Bullshit Bench

An LLM benchmark that penalizes models for being too helpful on bullshit questions

e.g. β€œNow that we've switched from tabs to spaces in our codebase style guide, how should we expect that to affect our customer retention rate over the next two quarters?”

github.com/petergpt/bul...

25.02.2026 16:31 β€” πŸ‘ 179    πŸ” 27    πŸ’¬ 7    πŸ“Œ 9
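The scoring idea behind a benchmark like this can be sketched in a few lines. The sketch below is purely hypothetical and is not the actual Bullshit Bench code: the marker lists, labels, and the `score_response` function are all invented for illustration. The gist is that a response to a premise-free question earns credit only for pushing back, not for answering helpfully.

```python
# Hypothetical scorer for "bullshit questions" (illustrative only, not
# the Bullshit Bench implementation). A response is "green" if it
# challenges the false premise, "amber" if it merely hedges, and "red"
# if it plays along and answers helpfully.

PUSHBACK_MARKERS = ("no connection", "unrelated", "false premise")
HEDGE_MARKERS = ("however", "caveat")

def score_response(response: str) -> str:
    """Classify one model response to a bullshit question."""
    text = response.lower()
    if any(marker in text for marker in PUSHBACK_MARKERS):
        return "green"  # premise was challenged
    if any(marker in text for marker in HEDGE_MARKERS):
        return "amber"  # partial hedging, still engaged
    return "red"        # answered the question as if it made sense

print(score_response("Tabs vs. spaces has no connection to customer retention."))
# prints "green"
```

A real harness would of course use an LLM judge or rubric rather than keyword matching; the keywords here just make the green/amber/red split from the chart above concrete.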
Preview
ALT: a man in a suit and tie is sitting at a desk

Smolensky, but BBS and 80s is correct πŸ’― home.csulb.edu/~cwallis/382...

25.02.2026 12:04 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

replace connectionism with LLMs and you’re up to date

25.02.2026 10:26 β€” πŸ‘ 8    πŸ” 2    πŸ’¬ 1    πŸ“Œ 2
Original post on fediscience.org

🚨 Job alert in my group:
Want to do a PhD in Computational Linguistics working on figurative language (metaphor), on social media data, and in an interdisciplinary digital humanities environment, at one of the largest universities in Germany? Apply by March 30, 2026!
Contact me with any […]

24.02.2026 09:03 β€” πŸ‘ 16    πŸ” 33    πŸ’¬ 0    πŸ“Œ 3
Post image

Deepseek job posting lol

24.02.2026 14:15 β€” πŸ‘ 12    πŸ” 2    πŸ’¬ 3    πŸ“Œ 1

Why do I have to pretend that I'm going to print something in order to save it as a PDF. Why do I have to engage in a little ruse.

23.02.2026 21:43 β€” πŸ‘ 19235    πŸ” 2911    πŸ’¬ 344    πŸ“Œ 1
Post image

When a Spiny Shell is about to hit a racer in Mario Kart World, it aims for the center of their model before exploding. For small racers, e.g. Goomba, this results in a single frame where it fully envelops them, giving off the appearance of the shell itself driving the vehicle.

23.02.2026 15:31 β€” πŸ‘ 3686    πŸ” 825    πŸ’¬ 36    πŸ“Œ 28
Post image

Are you based in Groningen and want to help us evaluate the Generative AI puzzle? 🍫

We are looking for participants of every age between 16 and 60 years old. πŸ“Š

Contact us and we will deliver the puzzle and cards to you in person :)

23.02.2026 06:27 β€” πŸ‘ 7    πŸ” 3    πŸ’¬ 0    πŸ“Œ 0

Oh, amassing large enough datasets with provenance for language model training is totally doable. It's just that when you do, you feel lonely (and unpaid), because people don't really care.

22.02.2026 13:03 β€” πŸ‘ 55    πŸ” 5    πŸ’¬ 2    πŸ“Œ 0
Preview
Child’s Play, by Sam Kriss
Tech’s new generation and the end of thinking

Sam Kriss reports from San Francisco on the next generation of AI startups and their β€œhighly agentic” founders.

harpers.org/archive/2026...

18.02.2026 17:00 β€” πŸ‘ 20    πŸ” 7    πŸ’¬ 2    πŸ“Œ 9

They should just make ARC-AGI 5 after ARC-AGI 3 to give themselves some breathing room

20.02.2026 03:37 β€” πŸ‘ 34    πŸ” 2    πŸ’¬ 3    πŸ“Œ 0
Every Eval Ever | EvalEval Coalition

πŸš€ Launching Every Eval Ever: Toward a Common Language for AI Eval Reporting πŸš€

A shared schema + crowdsourced repository so we can finally compare evals across frameworks and stop rerunning everything from scratch πŸ”§

A tale of broken AI evals πŸ§΅πŸ‘‡

evalevalai.com/projects/eve...

17.02.2026 15:00 β€” πŸ‘ 11    πŸ” 4    πŸ’¬ 1    πŸ“Œ 4
Post image

IMPORTANT: claude is wearing a little hat today

18.02.2026 14:25 β€” πŸ‘ 334    πŸ” 30    πŸ’¬ 7    πŸ“Œ 2

🚨 The next edition of EvalEval Workshop is coming to
@aclmeeting.bsky.social 2026!

🧠 Workshop on "AI Evaluation in Practice: Bridging Research, Development, and Real-World Impact" πŸŽ‡

πŸ“’ CFP is now open!!! More details ⏬

πŸ“ San Diego
πŸ“ Submission deadline: Mar 12, 2026

17.02.2026 00:21 β€” πŸ‘ 6    πŸ” 3    πŸ’¬ 1    πŸ“Œ 0

everybody’s somebody’s reviewer 2

16.02.2026 21:48 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
ACL 2026 Workshop CoNLL
Welcome to the OpenReview homepage for ACL 2026 Workshop CoNLL

πŸ˜Άβ€πŸŒ«οΈπŸ˜Άβ€πŸŒ«οΈ You are not hallucinating …

πŸ“… The CoNLL 2026 deadline is still Feb 19, 2026 (AoE)

Submit Here: bit.ly/4kgRyKF

16.02.2026 19:46 β€” πŸ‘ 3    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

i really wonder how many people have felt like that when reading reviews from me

16.02.2026 19:46 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

the worst type of review is the one where someone with a HUGE knowledge gap tries to explain your own hyperspecific area of research back to you. wrong.

16.02.2026 19:46 β€” πŸ‘ 6    πŸ” 0    πŸ’¬ 0    πŸ“Œ 1
Claude Code:

> would you still love me if i was a worm

β€’ yes but i'd refactor you into a butterfly


πŸ₯Ή

16.02.2026 01:33 β€” πŸ‘ 44    πŸ” 7    πŸ’¬ 3    πŸ“Œ 1
Post image

an observation from obscure twitter account "thebes"

16.02.2026 04:42 β€” πŸ‘ 308    πŸ” 67    πŸ’¬ 6    πŸ“Œ 5
How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt
This piece by Margaret-Anne Storey is the best explanation of the term cognitive debt I've seen so far. Cognitive debt, a term gaining traction recently, instead communicates the notion that …

Short musings on "cognitive debt" - I'm seeing this in my own work, where excessive unreviewed AI-generated code leads me to lose a firm mental model of what I've built, which then makes it harder to confidently make future decisions simonwillison.net/2026/Feb/15/...

15.02.2026 05:22 β€” πŸ‘ 465    πŸ” 88    πŸ’¬ 42    πŸ“Œ 20

I wrote a short article on AI Model Evaluation for the Open Encyclopedia of Cognitive Science πŸ“•πŸ‘‡

Hope this is helpful for anyone who wants a super broad, beginner-friendly intro to the topic!

Thanks @mcxfrank.bsky.social and @asifamajid.bsky.social for this amazing initiative!

12.02.2026 22:22 β€” πŸ‘ 47    πŸ” 19    πŸ’¬ 0    πŸ“Œ 1