@profsanjeevarora.bsky.social
Director, Princeton Language and Intelligence. Professor of CS.
Check out our new blogpost and policy brief on our recently updated lab website!
Are we actually capturing the bubble of risk for cybersecurity evals? Not really! Adversaries can modify agents by a small amount and get massive gains.
Would it make sense to also track how often this happened in pre-2023 cases? Humans "hallucinate" by making cut-and-paste mistakes, or other types of errors.
26.05.2025 02:32
The paper seems to reflect a fundamental misunderstanding of how LLMs work. One cannot (currently) tell an LLM to "ignore pretraining data from year X onwards." An LLM does not have data stored neatly inside it in a sortable format. It is not like a hard drive.
22.04.2025 02:21
Great comment by my colleague @randomwalker.bsky.social
16.03.2025 19:06
Understanding and extrapolating benchmark results will become essential for effective policymaking and for informing users. New work identifies indicators with high predictive power for modeling LLM performance. Excited for it to be out!
11.03.2025 20:07
What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️
Our new paper, "In-House Evaluation is Not Enough," has 3 calls to action to empower evaluators:
1️⃣ Standardized AI flaw reports
2️⃣ AI flaw disclosure programs + safe harbors
3️⃣ A coordination center for transferable AI flaws
1/🧵
Congratulations! Great result.
27.02.2025 17:25
A new path forward for open AI (note the space between the two words). Looking forward to seeing how it enables great research in the open.
29.01.2025 23:19
x.com/parksimon080...
Can VLMs do difficult reasoning tasks? Using a new dataset for evaluating simple-to-hard generalization (a form of OOD generalization), we study how to mitigate the dreaded "modality gap" between a VLM and its base LLM.
(note: the poster, Simon Park, applied to PhD programs this spring)
SimPO: a new method from Princeton PLI for improving chat models via preference data. Simpler than DPO, and widely adopted within weeks by top models on the chatbot arena. Excellent and elementary account by author @xiamengzhou.bsky.social (she's also on the job market!). tinyurl.com/pepcynax
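For the curious about what "simpler" means here (a sketch from my reading of the SimPO paper; the notation below is mine, not from the post): DPO's implicit reward requires keeping a frozen reference model in memory, whereas SimPO scores each response by its length-normalized log-probability and enforces a target reward margin γ, so no reference model is needed:

$$\mathcal{L}_{\mathrm{SimPO}} = -\log \sigma\!\Big(\tfrac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \tfrac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\Big)$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $|y|$ is response length in tokens, and $\sigma$ is the logistic function.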
I'll be giving a talk on my two recent preference learning works (led by Angelica Chen and @noamrazin.bsky.social) in the AI Tinkerers Paper Club today (11/26) at noon ET. Excited to share this talk with a broader audience! paperclub.aitinkerers.org/p/join-paper...
26.11.2024 12:55
Interesting thread from Geoffrey Irving about the fragility of interpreting LLMs' latent reasoning (whether self-reported or recovered by some mechanistic interpretability method). I have been pessimistic about trusting latent reasoning.
25.11.2024 14:50