๐Ÿ˜'s Avatar

๐Ÿ˜

@pkydrm.bsky.social

research scientist @MosaicML x @Databricks re: rlhf, humans in the loop, and figuring out what it means to have a good model 🤖🧑‍🎨✨

1,261 Followers  |  83 Following  |  38 Posts  |  Joined: 13.11.2024

Latest posts by pkydrm.bsky.social on Bluesky

What are your favorite recent papers on using LMs for annotation (especially in a loop with human annotators), synthetic data for task-specific prediction, active learning, and similar?

Looking for practical methods for settings where human annotations are costly.

A few examples in thread ↴

23.07.2025 08:10 | 👍 74    🔁 23    💬 14    📌 3
Post image 21.07.2025 14:20 | 👍 0    🔁 0    💬 0    📌 0

I am once again pitching my romantic comedy:

- two academics start dating
- discover they are each other's terrible reviewer
- hijinks ensue

Working title: Love is Double-Blind

18.06.2025 10:55 | 👍 2636    🔁 350    💬 99    📌 66

I'm extremely curious -- would you want digital tools that would help with this (e.g. planning, time organization) or embodied AI (e.g. physical assistance in-home, transportation)?

16.04.2025 17:27 | 👍 0    🔁 0    💬 1    📌 0

i wish i could shout this from the rooftops. relatedly, there's no need for robots to be limited by the human form.

similar/tangential thing came up in the 2010s with respect to self-driving: just because people only sense using their eyes doesn't mean cars have to only use cameras!

09.04.2025 15:47 | 👍 5    🔁 0    💬 0    📌 0

The Wikimedia Foundation, which owns Wikipedia, says its bandwidth costs have gone up 50% since Jan 2024, a rise they attribute to AI crawlers.

AI companies are killing the open web by stealing visitors from the sources of information and making them pay for the privilege

02.04.2025 09:12 | 👍 5687    🔁 2660    💬 68    📌 178

we are living in an empirical world and we are empirical girls

25.03.2025 20:39 | 👍 1    🔁 0    💬 0    📌 0

No labels, no problem! I am so excited for this release. We have been working on it for many months, and it's motivated by a common customer roadblock: insufficient labeled examples.

25.03.2025 20:39 | 👍 1    🔁 0    💬 0    📌 0

has anyone successfully gotten very involved with their local library system and, if so, how does one do so?

i know there are volunteer opportunities and it is my dream to one day organize a crafting circle, but i'm talking about how the library actually organizes / functions / prioritizes things!

22.01.2025 20:42 | 👍 1    🔁 0    💬 0    📌 0

@jfrankle.com @ericajiyuen.bsky.social

19.12.2024 16:26 | 👍 0    🔁 0    💬 0    📌 0

and a big shout out to my collaborators: Erica Ji Yuen, Kartik Sreenivasan, Yue (Andy) Zhang, Sam Havens, Michael Carbin, Matei Zaharia, Jonathan Frankle

19.12.2024 16:25 | 👍 0    🔁 0    💬 0    📌 0
Benchmarking Domain Intelligence

3/3 🔑 Want to see how different models perform on enterprise tasks? Full analysis in the blog here: databricks.com/blog/benchma...!

19.12.2024 16:25 | 👍 0    🔁 0    💬 0    📌 0

📊 DIBS measures real enterprise needs. We tested 14 models & found:

- Academic benchmarks mask enterprise gaps
- No single model wins across all tasks
- Open models are competitive on key capabilities
- Some enterprise tasks show clear paths forward, others are more complex

2/3

19.12.2024 16:25 | 👍 0    🔁 0    💬 0    📌 0

🧵 Super proud to finally share this work I led last quarter - the @databricks.bsky.social Domain Intelligence Benchmark Suite (DIBS)! TL;DR: Academic benchmarks ≠ real performance and domain intelligence > general capabilities for enterprise tasks. 1/3

19.12.2024 16:25 | 👍 5    🔁 4    💬 4    📌 1

@jfrankle.com @ericajiyuen.bsky.social

19.12.2024 16:24 | 👍 0    🔁 0    💬 0    📌 0

And of course a big shout out to my collaborators: Erica Ji Yuen, Kartik Sreenivasan, Yue (Andy) Zhang, Sam Havens, Michael Carbin, Matei Zaharia, and Jonathan Frankle for their help!

19.12.2024 16:23 | 👍 0    🔁 0    💬 0    📌 0
Benchmarking Domain Intelligence

3/3 🔑 Want to see how different models perform on enterprise tasks? Full analysis in the blog here: databricks.com/blog/benchma...!

19.12.2024 16:21 | 👍 1    🔁 0    💬 0    📌 0

📊 DIBS measures real enterprise needs. We tested 14 models & found:
- Academic benchmarks mask enterprise gaps
- No single model wins across all tasks
- Open models are competitive on key capabilities
- Some enterprise tasks show clear paths forward, others are more complex

2/3

19.12.2024 16:20 | 👍 0    🔁 0    💬 0    📌 0

very demure, very mindful, very 2019-era mujoco humanoid learning to walk

12.12.2024 14:00 | 👍 1    🔁 0    💬 0    📌 0

"technology built to address people's needs" is the north star.

side note: it would be amazing to see this attitude in the physical, embodied world as well. it's amazing to see how older adults in dense, walkable areas have such different lifestyles than those in car-centric suburbs.

12.12.2024 13:33 | 👍 0    🔁 1    💬 0    📌 0

would love to be added :-)

11.12.2024 19:33 | 👍 0    🔁 0    💬 1    📌 0

brat tulu is amazing

10.12.2024 23:52 | 👍 0    🔁 0    💬 1    📌 0

this is incredible research, and beautiful. would love to know more about what it's like to meaningfully interact with genie 2, or similar models, e.g. to modify the outputs of such a model in the service of a design vision.

05.12.2024 19:31 | 👍 0    🔁 0    💬 0    📌 0
24.11.2024 15:35 | 👍 1090    🔁 289    💬 18    📌 10

i know some labs are already starting to do this; i hope more continue to. it is challenging, complex technical work and we should think of it as a first-class contribution in the field. 5/5

26.11.2024 14:09 | 👍 0    🔁 0    💬 0    📌 0

🤞 we can start to more broadly value thoughtful, direction-setting benchmark work. it requires technical contributions, a keen sense of how people might meaningfully interact with a system, and the discernment to recognize where progress might yet be made. 4/5

26.11.2024 14:09 | 👍 0    🔁 0    💬 1    📌 0

i think as a field, we have a problematic tendency to focus on magnitude-related problems, like new architectures or training paradigms or other ways to maximize performance on whatever benchmarks we can. maybe this is because it is more akin to the training/experience many of us have. 3/5

26.11.2024 14:09 | 👍 0    🔁 0    💬 1    📌 0

in the LLM space, at this time, benchmarks/evaluations set the direction of that vector. it's extremely hard to make good benchmarks, and historically under-rewarded in the field. 2/5

26.11.2024 14:09 | 👍 0    🔁 0    💬 1    📌 0

i often talk about the importance of aligning both the magnitude AND direction of a workstream vector. 1/5

26.11.2024 14:09 | 👍 1    🔁 1    💬 1    📌 0

i do not study this, but i did just finish reading the anxious generation and so i'm very grateful that there are so many people who do indeed study such important things!

22.11.2024 00:51 | 👍 0    🔁 0    💬 0    📌 0

@pkydrm is following 20 prominent accounts