Jiahai Feng @fjiahai - Bluesky Profile

Latest posts by fjiahai.bsky.social on Bluesky

When RLHFed models engage in “reward hacking,” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

19.12.2024 17:17 — 👍 8 🔁 3 💬 2 📌 0

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵

26.11.2024 22:37 — 👍 44 🔁 6 💬 3 📌 1

🙋‍♂️

24.11.2024 17:56 — 👍 1 🔁 0 💬 0 📌 0

@fjiahai is following 19 prominent accounts

Cassidy Laidlaw
@cassidylaidlaw

PhD student at UC Berkeley studying RL and AI safety. https://cassidylaidlaw.com

Jimmy Miller
@jimmyhmiller

Compiler engineer and co-host of the Future of Coding podcast. https://jimmyhmiller.com https://futureofcoding.org/episodes/

Nora Belrose
@norabelrose

AI, philosophy, spirituality. Head of interpretability research at EleutherAI, but posts are my own views, not Eleuther’s.

David Sussillo
@sussillodavid

Neural reverse engineer, scientist at Meta Reality Labs, Adjunct Prof at Stanford.

Doudna Lab
@doudna-lab

News from Jennifer Doudna's lab at UC Berkeley, Innovative Genomics Institute. Tweets from lab members and not Jennifer Doudna unless signed JD. Tweets represent personal views only. doudnalab.org

Jonathan Frankle
@jfrankle.com

Chief AI Scientist at Databricks. Founding team at MosaicML. MIT/Princeton alum. Lottery ticket enthusiast. Working on data intelligence.

David Duvenaud
@davidduvenaud

Machine learning prof at U Toronto. Working on evals and AGI governance.

Ananya Kumar
@ananyak

Research scientist at OpenAI working on reasoning and RL. Previously PhD student at Stanford University working with Percy Liang and Tengyu Ma.

Aditi Raghunathan
@adtraghunathan

Assistant Professor at CSD CMU. https://www.cs.cmu.edu/~aditirag/

Nicolas Papernot
@nicolaspapernot

Security and Privacy of Machine Learning at UofT, Vector Institute, and Google 🇨🇦🇫🇷🇪🇺 Co-Director of Canadian AI Safety Institute (CAISI) Research Program at CIFAR. Opinions mine

Rishi Sreedhar
@rishisr33dhar

Quantum Curious | Tensor Network Algorithms researcher at SandBoxAQ

Martin Bauer
@martinmbauer

I'm a theoretical physicist at Durham University

Sabine Hossenfelder
@hossenfelder

German Physicist

Noam Brown
@polynoamial

Researching reasoning at OpenAI | Co-created Libratus/Pluribus superhuman poker AIs, CICERO Diplomacy AI, and OpenAI o-series / 🍓

Luke Zettlemoyer
@lukezettlemoyer

Professor at UW; Researcher at Meta. LMs, NLP, ML. PNW life.

Sophia Sanborn
@naturecomputes

Searching for principles of neural representation | Neuro + AI @ enigmaproject.ai | Stanford | sophiasanborn.com

Jack Gallant
@gallantlab.org

Cognitive, Systems and Computational Neuroscientist, Professor at UC Berkeley, and lab head.

Alex Lew
@alexlew

Theory & practice of probabilistic programming. Current: MIT Probabilistic Computing Project; Fall '25: Incoming Asst. Prof. at Yale CS

Bao Pham
@baopham

PhD Student at RPI. Interested in Hopfield or Associative Memory models and Energy-based models.