
Alon Jacoby

@alon-j.bsky.social

PhD student @ Penn alonj.github.io

184 Followers  |  530 Following  |  10 Posts  |  Joined: 13.11.2024

Posts by Alon Jacoby (@alon-j.bsky.social)

NeurIPS Poster: Quantifying Uncertainty in the Presence of Distribution Shifts (NeurIPS 2025)

Uncertainty estimation fails under distribution shifts. Why? Partly because in stats, even Bayesian stats, we treat x as given. But intuitively, the data makes different models plausible. For reliable uncertainty, we need to account for this explicitly. Come chat with me about it tomorrow at my poster!

03.12.2025 00:59 β€” πŸ‘ 5    πŸ” 1    πŸ’¬ 1    πŸ“Œ 1
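A toy illustration of that failure mode (my own sketch, not the paper's method): prediction intervals calibrated on in-distribution residuals lose coverage once x shifts, because the model's adequacy itself depended on where x came from.

```python
# Toy illustration (not the paper's method): intervals calibrated on the
# training inputs lose coverage once the input distribution shifts.
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)  # nonlinear ground truth; a linear fit is only locally OK

# Train on x ~ N(0, 1), where a linear model is a decent approximation.
x_tr = rng.normal(0.0, 1.0, 2000)
y_tr = true_fn(x_tr) + rng.normal(0.0, 0.1, x_tr.size)
slope, intercept = np.polyfit(x_tr, y_tr, 1)

# 90% interval half-width, estimated from in-distribution residuals.
resid = y_tr - (slope * x_tr + intercept)
halfwidth = np.quantile(np.abs(resid), 0.9)

def coverage(x):
    """Fraction of fresh points whose y lands inside the fixed interval."""
    y = true_fn(x) + rng.normal(0.0, 0.1, x.size)
    pred = slope * x + intercept
    return np.mean(np.abs(y - pred) <= halfwidth)

print("coverage, in-distribution:", coverage(rng.normal(0.0, 1.0, 2000)))
print("coverage, shifted inputs :", coverage(rng.normal(3.0, 1.0, 2000)))
```

On the training distribution the interval covers roughly 90% as designed; on inputs shifted to x ~ N(3, 1) coverage collapses, even though nothing about the model or the noise changed.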
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite LLMs' advancements in recent times, their performance consistency across different...

It's also a good reminder that even really impressive models can be surprisingly susceptible to very simple surface-level perturbations.
The original FlenQA paper is here: arxiv.org/abs/2402.14848

07.05.2025 14:07 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

The Phi 4 Reasoning technical report is a good reminder that current models still suffer massive performance degradation when reasoning tasks get longer - even at just 3K tokens!
They use FlenQA (w/ @moshlevy.bsky.social) to show that their model improves massively here.
arxiv.org/abs/2504.21318

07.05.2025 14:07 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
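A rough sketch of the kind of perturbation FlenQA applies (my own toy reconstruction, not the benchmark's code; the `filler_sentences` below are invented): keep the task fixed and pad the prompt with irrelevant text until it reaches a target length.

```python
# Rough sketch of a FlenQA-style length perturbation: same task, more tokens.
# Not the benchmark's implementation; `filler_sentences` is made up here.
import random

filler_sentences = [
    "The committee met on Tuesday to discuss the budget.",
    "Rainfall in the region was slightly above average this year.",
    "The museum's new wing opened to the public in March.",
]

def pad_prompt(question: str, facts: list[str], target_words: int) -> str:
    """Interleave the relevant facts with filler until ~target_words words."""
    parts = list(facts)
    while sum(len(p.split()) for p in parts) < target_words:
        parts.insert(random.randrange(len(parts) + 1),
                     random.choice(filler_sentences))
    return " ".join(parts) + "\n\nQuestion: " + question

facts = ["Dana is older than Sam.", "Sam is older than Lee."]
for n in (50, 500, 3000):  # the reasoning required never changes
    prompt = pad_prompt("Is Dana older than Lee?", facts, n)
    print(n, "target words ->", len(prompt.split()), "actual words")
```

The point of the design is that the answer depends on the same two facts at every length, so any accuracy drop is attributable purely to the added, irrelevant tokens.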

✨New paper✨

Linguistic evaluations of LLMs often implicitly assume that language is generated by symbolic rules.
In a new position paper, @adelegoldberg.bsky.social, @kmahowald.bsky.social and I argue that languages are not Lego sets, and evaluations should reflect this!

arxiv.org/pdf/2502.13195

20.02.2025 15:06 β€” πŸ‘ 69    πŸ” 20    πŸ’¬ 1    πŸ“Œ 3

Obviously, be sensible: if you're not willing to send your code to third parties (OpenAI, Google, etc.), don't use `-s` (or `--summary`). Everything else is done locally.

02.02.2025 23:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
GitHub - alonj/pydift

If you specify `-s` when running the script, an LLM will summarize the diff (3 models are implemented, but you can easily add more). If this is useful to you because, like me, you need a worse version of git, check out github.com/alonj/pydift
or install via
`pip install pydift`

02.02.2025 23:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Sometimes I want to track small changes in code without too much hassle, so I made pydift: replace `python script.py` with `pydift script.py`, and diffs from previous runs will be saved automatically.

02.02.2025 23:19 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
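For the curious, a minimal sketch of the core idea behind such a wrapper (this is not pydift's actual source, and the `.pydift` cache directory is my assumption): diff the script against the copy cached from the previous run, save the diff, then execute the script as usual.

```python
# Minimal sketch of a pydift-style wrapper (not the actual pydift source):
# before each run, diff the script against the copy cached from the previous
# run, save any diff with a timestamp, then execute the script normally.
import difflib, pathlib, runpy, shutil, sys, time

def main():
    script = pathlib.Path(sys.argv[1])
    cache_dir = pathlib.Path(".pydift")  # hypothetical cache location
    cache_dir.mkdir(exist_ok=True)
    cached = cache_dir / script.name

    if cached.exists():
        diff = "".join(difflib.unified_diff(
            cached.read_text().splitlines(keepends=True),
            script.read_text().splitlines(keepends=True),
            fromfile="previous run", tofile="current run"))
        if diff:
            stamp = time.strftime("%Y%m%d-%H%M%S")
            (cache_dir / f"{script.stem}-{stamp}.diff").write_text(diff)

    shutil.copy(script, cached)  # remember this version for the next run
    runpy.run_path(str(script), run_name="__main__")  # then run the script

if __name__ == "__main__":
    main()
```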

How does one figure out exactly which samples were seen in training for a given OLMo checkpoint? Where is that information shared or stored?
Also, there used to be a CSV of checkpoints in the OLMo repo, but it's gone (I'm guessing since OLMo 2)...
Any help would be appreciated!

15.12.2024 13:48 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 1
NeurIPS Poster: Class Distribution Shifts in Zero-Shot Learning: Learning Robust Representations (NeurIPS 2024)

This is one of a few neat ideas in
@yulislavutsky.bsky.social's work on learning robust representations at @neuripsconf.bsky.social '24. Definitely worth reading if you're also interested in robustness: neurips.cc/virtual/2024...

11.12.2024 00:22 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Say we collected a multi-hop reasoning QA dataset. Inevitably, the samples will have some attributes that we didn't or can't control for (domain, length of text, difficulty, etc.).
By taking small enough sub-samples, also inevitably, the minority attributes will sometimes become the majority.

11.12.2024 00:22 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
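A quick simulation of that observation (a toy sketch, not the paper's experiment): an attribute that is a 20% minority overall ends up the majority in a noticeable fraction of small sub-samples, and essentially never in large ones.

```python
# Toy illustration: with small enough sub-samples, a 20% minority attribute
# is sometimes the majority within a sub-sample. Numbers here are invented.
import numpy as np

rng = np.random.default_rng(0)
attrs = rng.random(10_000) < 0.2  # True = minority attribute (20% overall)

for k in (5, 25, 100):
    sub = rng.choice(attrs, size=(20_000, k))   # 20k sub-samples of size k
    flipped = np.mean(sub.mean(axis=1) > 0.5)   # minority became majority
    print(f"sub-sample size {k:3d}: minority is majority in {flipped:.1%} of draws")
```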

We've come to expect LLMs to be generalist models that are accurate in zero-shot settings - such as QA in different domains, reasoning types, or even low-resource languages.
How can we ensure that models are accurate on samples from classes rarely seen in training?

11.12.2024 00:22 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I'm on my way to #NeurIPS2024. On Friday I'm presenting my latest paper with Yuval Benjamini. The gist is in the comments; come chat with me to hear more!

10.12.2024 22:06 β€” πŸ‘ 7    πŸ” 4    πŸ’¬ 1    πŸ“Œ 1

Yes, hi, hello.

04.12.2024 19:26 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0