If you have suggestions for topics to cover in the next iteration of the course, please share them in this thread!
05.08.2025 17:43 · @anishathalye.bsky.social
Lecture videos: www.youtube.com/@MissingSeme..., Notes: missing.csail.mit.edu
05.08.2025 17:43
Missing Semester has grown past 100K subscribers on YouTube. Appreciate all the engagement and support!
We plan to teach another iteration of the course in January 2026, revising the curriculum and covering new topics like AI IDEs and vibe coding.
Incidentally, this is how I first got interested in ML. github.com/anishathalye...
21.06.2025 15:19
My favorite way to measure progress in AI: finding papers obsoleted by ChatGPT prompts
21.06.2025 15:16
Code/binary here: github.com/anishathalye...
17.06.2025 17:07
Ever get blinded when writing code late at night, when you alt-tab from your dark-mode terminal to your browser? Made a little macOS utility to solve this problem, just updated for the latest macOS.
No thanks to AI for hallucinating BrightnessKit.framework.
We did a workshop at AIUC that: (1) implements a RAG app on top of Cursor's docs, (2) reproduces the widely-publicized failure from last week, and (3) shows how to automatically catch and reproduce this failure. All slides/code are open-sourced here: github.com/cleanlab/aiu... (5/5)
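For context, a generic sketch of this kind of minimal RAG setup (not the workshop's actual code), assuming the OpenAI Python SDK and numpy; the model names and doc snippets are placeholders. Embed the documentation chunks, retrieve the most similar ones for a question, and answer only from that context.

import numpy as np
from openai import OpenAI

client = OpenAI()

# Placeholder documentation chunks standing in for a real docs corpus.
doc_chunks = [
    "Accounts can be used on multiple devices.",
    "Contact support to change your subscription plan.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vectors = embed(doc_chunks)

def answer(question: str, k: int = 2) -> str:
    q = embed([question])[0]
    # Cosine similarity between the question and every chunk; keep the top k as context.
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    context = "\n".join(doc_chunks[i] for i in np.argsort(sims)[::-1][:k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer support questions using only the context. If the context doesn't cover it, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content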
24.04.2025 18:21
What's the solution? I believe that one ingredient will be intelligent systems that evaluate the output of these LLMs in real time and keep them in check, building on and combining techniques like LLM-as-a-judge, per-token logprobs, and statistical methods. (4/5)
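A minimal sketch of what such a real-time check could look like, assuming the OpenAI chat completions API; the model names, judge prompt, and logprob threshold below are illustrative placeholders, not a recommended configuration.

from openai import OpenAI

client = OpenAI()

def answer_with_confidence(question: str, context: str) -> tuple[str, float]:
    # Generate an answer and compute the mean per-token log-probability as a
    # rough confidence signal.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model that returns logprobs
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        logprobs=True,
    )
    answer = resp.choices[0].message.content
    logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    return answer, sum(logprobs) / max(len(logprobs), 1)

def judge_says_supported(question: str, context: str, answer: str) -> bool:
    # LLM-as-a-judge: ask a second model whether the answer is supported by the context.
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": (
            "Is the answer fully supported by the context? Reply YES or NO.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer: {answer}"
        )}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def checked_answer(question: str, context: str) -> str:
    answer, mean_logprob = answer_with_confidence(question, context)
    # The -0.3 threshold is made up; in practice it would be tuned on labeled data.
    if mean_logprob < -0.3 or not judge_says_supported(question, context, answer):
        return "I'm not sure about this one; escalating to a human agent."
    return answer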
24.04.2025 18:21
Why do such failures occur? These next-token-prediction models are nondeterministic and can be fragile. And they're not getting consistently better over time: OpenAI's latest models like o3 and o4-mini show higher hallucination rates than previous versions. (3/5)
24.04.2025 18:21
It's been over a year since the well-publicized failures of Air Canada's support bot and NYC's MyCity bot. And these AIs are still failing spectacularly in production, with the most recent debacle being Cursor's AI going rogue and triggering a wave of cancellations. (2/5)
24.04.2025 18:21
We reproduced (and fixed!) Cursor's rogue customer support AI. (1/5)
24.04.2025 18:20
I wonder if there's anything special in the Cursor Tab completion model or system prompt that induces this behavior.
16.04.2025 22:04
Coincidence, or genius growth hack? Cursor self-propagating through developer setup instructions.
16.04.2025 22:02
2/2
It works surprisingly well in practice.
cleanlab.ai/blog/rag-eva...
Hoping to see more of these real-time reference-free evaluations to give end users more confidence in the outputs of AI applications.
Is AI any good at evaluating AI? Is it turtles all the way down? We benchmarked evaluation models like LLM-as-a-judge, HHEM, and Prometheus across 6 RAG applications. 1/2
07.04.2025 23:04
And some repos are even organically suggested by ChatGPT. (3/3)
17.02.2025 18:03
Some of this might be through web search / tool use, but for at least some of it, knowledge about the projects is actually part of the model weights. (2/3)
17.02.2025 18:03
A substantial portion of traffic for some of my open-source projects comes from ChatGPT these days. Sometimes even a majority, beating traffic from Google. Time to prioritize LLM optimization over search engine optimization. (1/3)
17.02.2025 18:02