
Yoav Goldberg

@yoavgo.bsky.social

6,546 Followers  |  932 Following  |  1,295 Posts  |  Joined: 27.04.2023

Latest posts by yoavgo.bsky.social on Bluesky

what's the difference in your view?

12.12.2025 17:17 — 👍 0  🔁 0  💬 0  📌 0

i discuss this in the gist text. this is the more correct way to frame it imo (the env provides observations, which the agent interprets as rewards based on its goals), and it also opens up possible variations in how to think about learning from the env.

06.12.2025 00:44 — 👍 0  🔁 0  💬 0  📌 0
Preview: rl-wrong-about-rewards.md (GitHub Gist)

I complain a lot about RL lately, and here we go again.

The CS view of RL is wrong in how it thinks about rewards, already at the setup level. Briefly, the reward computation should be part of the agent, not part of the environment.

More at length here:

gist.github.com/yoavg/3eb3e7...

05.12.2025 23:37 — 👍 13  🔁 2  💬 2  📌 0
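
A minimal sketch of the setup difference being argued for (hypothetical names, not code from the gist): in the usual formulation the environment returns a scalar reward, while in this framing it returns only observations, and the agent scores them against its own goal.

```python
# Illustrative sketch only; GridEnv / GoalAgent are made-up names.
# In a standard gym-style loop, env.step() would also return a reward; here it does not.

class GridEnv:
    """Environment in the proposed framing: it emits observations, no rewards."""
    def __init__(self):
        self.pos = 0

    def step(self, action):
        self.pos += action              # action is -1 or +1
        return {"pos": self.pos}        # observation only

class GoalAgent:
    """Agent that computes its own reward from observations, based on its goal."""
    def __init__(self, goal):
        self.goal = goal
        self.value = {}                 # toy tabular value store

    def reward(self, obs):
        # reward computation lives inside the agent, not the environment
        return 1.0 if obs["pos"] == self.goal else 0.0

    def update(self, obs):
        r = self.reward(obs)
        self.value[obs["pos"]] = self.value.get(obs["pos"], 0.0) + r

env, agent = GridEnv(), GoalAgent(goal=3)
for _ in range(5):
    obs = env.step(+1)
    agent.update(obs)
```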

yes it sucks to be the ICLR organizers today, totally agree

28.11.2025 00:01 — 👍 2  🔁 0  💬 0  📌 0

given that the data is already out and a large jsonl file is rumored to be floating around (which seems very plausible to me), i think the moral thing to do now would be to make the breached data publicly available for all rather than trying to hide it.

27.11.2025 23:32 — 👍 2  🔁 0  💬 1  📌 0

RL is ok. but the jump from
A) people can be thought of as agents who observe an environment, act, observe the outcome, and update their beliefs

to:

B) let's model all things as a POMDP with a numeric reward function!

is just way too big for me

27.11.2025 20:13 — 👍 4  🔁 0  💬 1  📌 0
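
To make the size of that jump concrete, here is a rough sketch, with illustrative interface names, of the standard POMDP ingredients that B) commits to, next to what A) actually says:

```python
from typing import Any, Protocol

# B): "model everything as a POMDP" commits to this whole tuple, including
# a single scalar reward function and a discount factor.
class POMDP(Protocol):
    discount: float                                               # gamma
    def transition(self, state: Any, action: Any) -> Any: ...     # sample s' ~ T(.|s, a)
    def observe(self, state: Any) -> Any: ...                     # sample o ~ O(.|s)
    def reward(self, state: Any, action: Any) -> float: ...       # R(s, a): a number

# A) only says: an agent observes, acts, observes the outcome, and updates
# its beliefs. Nothing here forces a scalar reward or a discount.
class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...
    def update_beliefs(self, observation: Any) -> None: ...
```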

the fascinating (to me) quality of hard-core RL researchers (e.g. Sutton) is the ability to hold an all-encompassing view of RL as the basis of intelligence, while at the same time working on super low-level stuff like tabular TD algorithms, and to strongly believe these are actually the same thing

27.11.2025 16:32 — 👍 20  🔁 0  💬 1  📌 0

Totally agree

19.11.2025 02:07 — 👍 1  🔁 0  💬 0  📌 0

and this email (which I assume was about official photos for some purpose of some organization) is read as censorship? (I'm just surprised, because I wouldn't have thought of it that way)

18.11.2025 17:58 — 👍 0  🔁 0  💬 1  📌 0

as someone who could see himself writing something like this by mistake, and who doesn't get what the issue is, I'd be glad if you could explain what is so upsetting here?

18.11.2025 16:24 — 👍 1  🔁 0  💬 1  📌 0

I didn't get the reference to a banknote, and I also didn't read the green specifically as a nod to the IDF.. but as I said, maybe that's because I'm really not a designer, so I don't think in those terms

18.11.2025 15:39 — 👍 3  🔁 0  💬 1  📌 0

Do you believe in the principle? Because from other posts it sounds like you don't really, and then why do you actually care how accurate it is? I personally do believe in it, and would indeed be glad if it improves going forward, and I believe it will, because it's just a bullet point that got abbreviated in an unclear way on a poster.

18.11.2025 15:37 — 👍 0  🔁 0  💬 2  📌 0

Not the ideal slogan, but also really not that far-fetched. Explicit recognition of Israel's right to exist as a Zionist entity, alongside the Arab states including a Palestinian one.

18.11.2025 15:11 — 👍 3  🔁 0  💬 1  📌 0

As a non-designer, it looks a bit amateurish to me but overall really fine. What's the problem? What does "the wrong green" even mean?

18.11.2025 15:10 — 👍 5  🔁 0  💬 1  📌 0

what's the latest-and-greatest attempt to reverse-engineer and document the inner workings of claude-code?

17.11.2025 10:23 — 👍 1  🔁 0  💬 0  📌 0

(hmm i guess we can amend to "increase in the proportion of knowledge we believe to be true")

17.11.2025 07:00 — 👍 0  🔁 0  💬 1  📌 0

i think memory is never "free", in the sense that the real bottleneck is not storage, but the ability to retrieve the right thing, while not retrieving a wrong (out of date) thing by mistake.

but assuming we do delete facts, is deleting considered learning in your definition?

17.11.2025 06:59 — 👍 0  🔁 0  💬 1  📌 0

is "increase" necessary? or is "change" enough? (although i guess that in an ideal form, you dont "forget" a wrong fact but add the fact that it is wrong, so you may consider it as increasing...)

16.11.2025 20:11 — 👍 1  🔁 0  💬 1  📌 0

yes, following instructions in a prompt is not learning. but if a wrapping system stores items to inject into future prompts, then you can consider the system as learning.

16.11.2025 20:00 — 👍 0  🔁 0  💬 1  📌 0
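
A minimal sketch of such a wrapping system (hypothetical names; `call_llm` stands in for whatever prompt-in, text-out model API is used):

```python
# Hypothetical wrapper: the LLM call itself does not learn, but the wrapper
# stores items and injects them into future prompts. That persisted state is
# what makes the overall system count as "learning" in the sense above.
class MemoryWrapper:
    def __init__(self, call_llm):
        self.call_llm = call_llm      # any function: prompt str -> response str
        self.memory = []              # items kept across calls

    def remember(self, item: str) -> None:
        self.memory.append(item)      # the "storing" step

    def ask(self, question: str) -> str:
        # inject stored items into the prompt; retention lives in self.memory,
        # not in the model weights
        context = "\n".join(self.memory)
        return self.call_llm(f"{context}\n\n{question}")
```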

it would be in-context induction, and the storing and retention in external memory would be the learning.

16.11.2025 19:34 — 👍 1  🔁 0  💬 1  📌 0

the storage, if it happens, is the learning part. the inference process is not learning.

16.11.2025 19:14 — 👍 0  🔁 0  💬 1  📌 0
Post image

or as i wrote two years ago:

gist.github.com/yoavg/59d174...

16.11.2025 17:27 — 👍 1  🔁 0  💬 0  📌 0

i don't think it is a very useful view. at a very minimum we see extremely elaborate neighbor-matching and interpolation mechanisms, so the "glorified" part should be elaborated on and studied.

16.11.2025 17:25 — 👍 0  🔁 0  💬 2  📌 0

i agree, where is "storing" in the above case?

16.11.2025 17:04 — 👍 0  🔁 0  💬 0  📌 0

ah, cool!

16.11.2025 15:54 — 👍 0  🔁 0  💬 0  📌 0

indeed kNN is also not learning. it's just a classification method. if you want to consider kNN as a learning method, then the learning part is just "store these pairs as is".

16.11.2025 15:54 — 👍 1  🔁 0  💬 0  📌 0
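
A tiny sketch of that reading of kNN (plain 1-NN, written so the "store these pairs as is" step is kept separate from classification):

```python
import math

class NearestNeighbor:
    """1-NN where the 'learning' is nothing more than storing the (x, y) pairs."""
    def __init__(self):
        self.pairs = []

    def fit(self, xs, ys):
        # the entire "learning" step: store the pairs as-is
        self.pairs = list(zip(xs, ys))

    def predict(self, x):
        # classification is a separate inference step over the stored pairs
        _, label = min(self.pairs, key=lambda p: math.dist(x, p[0]))
        return label

clf = NearestNeighbor()
clf.fit([(0.0, 0.0), (1.0, 1.0)], ["a", "b"])
print(clf.predict((0.9, 0.8)))   # -> "b"
```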

i am not sure that it is (or rather, if everything is retrieval, then this term is useless)

16.11.2025 15:20 — 👍 0  🔁 0  💬 2  📌 0

if we want to study the phenomenon, a non-misleading name may be better than a misleading one

16.11.2025 15:10 — 👍 2  🔁 0  💬 1  📌 0

to me "learning" *requires* that something is stored for later use. again i dont care *where* it is stored, but *that* it is stored.

16.11.2025 15:09 — 👍 0  🔁 0  💬 1  📌 0

this term is more accurate, but it does not help with what i have in mind, which is to find a better name for the process that happens in ICL. some suggested "induction", which is OK but also not perfect (because the model both induces and applies).

16.11.2025 15:07 — 👍 0  🔁 0  💬 1  📌 0
