
@n0riskn0r3ward.bsky.social

404 Followers  |  93 Following  |  13 Posts  |  Joined: 01.11.2024

Latest posts by n0riskn0r3ward.bsky.social on Bluesky

To be clear, I don't mean the resulting loss is lower. I mean that after benchmarking models trained with different optimizers on the same training data, the model I got from using schedule_free_adamw was the champ by a large enough margin that I think it's plausible it wasn't random chance.

01.12.2024 04:05 — 👍 1    🔁 0    💬 0    📌 0

In my limited experience testing it (it was released ~this week), the schedule_free_adamw optimizer that's now in axolotl has outperformed the various adamw variants for me. The new adopt optimizer, on the other hand, hasn't delivered for me.

01.12.2024 04:05 — 👍 1    🔁 0    💬 1    📌 0
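For anyone curious what makes that optimizer different: the schedule-free method (Defazio & Mishchenko, 2024) replaces the learning-rate schedule with an interpolation/averaging trick. Here's a toy pure-Python sketch of the update on a 1-D quadratic — an illustration of the idea, not the schedulefree/axolotl implementation, and the hyperparameters are made up:

```python
def schedule_free_sgd(grad, w0, lr=0.5, beta=0.9, steps=500):
    """Toy schedule-free SGD sketch (after Defazio & Mishchenko, 2024).

    Keeps two sequences: z (the raw gradient-step iterate) and x (a
    running average that you actually evaluate/ship). Gradients are
    taken at an interpolation y between them, which is what removes
    the need for a decaying learning-rate schedule.
    """
    z = x = w0
    for t in range(steps):
        y = (1 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad(y)            # plain SGD step on z
        c = 1.0 / (t + 2)               # uniform averaging weight
        x = (1 - c) * x + c * z         # x is the averaged iterate
    return x

# Minimize f(w) = (w - 3)^2; the averaged iterate converges to 3.
w = schedule_free_sgd(lambda w: 2 * (w - 3), w0=0.0)
```

The real AdamWScheduleFree layers this on top of Adam-style preconditioning, which is why it drops in as an adamw replacement.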

Enjoy static.googleusercontent.com/media/resear...

27.11.2024 15:47 — 👍 2    🔁 0    💬 0    📌 0

Have you read the Google paper about using an optimization algorithm to make the optimal cookie, though? Because I not only read it, I baked dem cookies and can confirm: algo work real good

27.11.2024 15:25 — 👍 2    🔁 0    💬 1    📌 0

Yeah, giant pain but I really wanted to know...

27.11.2024 12:46 — 👍 1    🔁 0    💬 0    📌 0

So while it's a bit task-specific, there's more than enough context provided to the LLMs in the prompt to understand the task and how the output will be evaluated. The rubric is pass/fail on each dimension, with room for the LLM to overweight failures, like leaving out key info, in the final judgement.

27.11.2024 12:45 — 👍 1    🔁 0    💬 0    📌 0

Call it a summary/instruction-following eval. The judge prompt has a 21-point custom rubric for grading the outputs, and the original prompt for producing the summaries has a similarly lengthy description of the kinds of things I want included vs excluded, style guidelines, what to emphasize, etc.

27.11.2024 12:44 — 👍 1    🔁 0    💬 2    📌 0
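The "pass/fail per dimension, but overweight key failures" aggregation can be sketched in a few lines. The dimension names, the 80% bar, and the `critical` set below are all made up for illustration — the actual 21-point rubric is in the judge prompt:

```python
def score_rubric(verdicts, critical=("covers_key_info",)):
    """Aggregate pass/fail rubric verdicts from an LLM judge.

    `verdicts` maps rubric dimension -> bool (True = pass). Dimensions
    in `critical` are overweighted: failing any of them forces an
    overall fail, so e.g. leaving out key info is punished harder
    than a style slip, no matter how many other dimensions pass.
    """
    score = sum(verdicts.values()) / len(verdicts)
    hard_fail = any(not verdicts[d] for d in critical if d in verdicts)
    return {"score": score, "pass": score >= 0.8 and not hard_fail}

v = {"covers_key_info": True, "follows_style_guide": True,
     "no_hallucinated_facts": True, "within_length_limit": False}
result = score_rubric(v)   # 3/4 dimensions pass, no critical failure
```

Keeping each dimension binary is the design choice that matters: it makes the judge's job easier and the aggregate score auditable, versus asking the LLM for one opaque 1–10 number.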

Nice! All about the custom eval. A lot of work but so so worth it. I recently built my own eval as well (not for code, and primarily to evaluate performance of different fine tuning ablations/ideas).

bsky.app/profile/n0ri...

27.11.2024 12:25 — 👍 1    🔁 0    💬 1    📌 0

For context, these are o1-preview judgements from a custom LLM-as-a-judge prompt I spent an unreasonable amount of time crafting. Posted this to that other site, but going forward I will share more here.

27.11.2024 12:22 — 👍 1    🔁 0    💬 0    📌 0

Sonnet is still King 👑 for summarization:

Sonnet 3.6 vs 4o 11-20 (n=210):
Claude Sonnet 3.6: 54% (113 wins)
GPT-4o (11/20): 44% (92 wins)
Ties: 2% (5)

Sonnet 3.6 vs Gemini Exp 11-21 (n=202):
Claude Sonnet 3.6: 60% (122 wins)
Gemini-exp-1121: 38% (76 wins)
Ties: 2% (4)

27.11.2024 12:22 — 👍 2    🔁 0    💬 1    📌 1
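One quick way to check whether win margins like these could be chance is an exact two-sided binomial test against a coin-flip null, dropping ties. This is my own sanity check on the numbers above, not something from the original eval:

```python
from math import comb

def binom_two_sided_p(wins, n):
    """Exact two-sided p-value for `wins` out of `n` decisive
    head-to-head trials under the null that both models are equally
    good (p = 0.5). Ties are dropped before calling this."""
    k = max(wins, n - wins)                       # larger tail
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Sonnet 3.6 vs GPT-4o 11-20: 113 of 205 decisive trials (5 ties dropped)
p_gpt = binom_two_sided_p(113, 205)
# Sonnet 3.6 vs Gemini-exp-1121: 122 of 198 decisive trials (4 ties dropped)
p_gem = binom_two_sided_p(122, 198)
```

Under this test, the Gemini margin is clearly significant, while the 54–44 edge over 4o is suggestive but not conclusive on its own at n=210.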

Distillation is the way. Sample efficiency of a larger model when training + inference cost of a smaller distilled model + retaining the option to quantize the smaller model to fp8 for a speed boost on some GPUs + the option for better speculative decoding from an extra-small distilled model = a really good practical option IMO

26.11.2024 13:34 — 👍 5    🔁 0    💬 0    📌 0
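The core of soft-label distillation is a single loss term: KL divergence between temperature-softened teacher and student distributions, scaled by T² (Hinton et al., 2015). A minimal pure-Python sketch, with made-up logits, just to show the shape of it — real training would average this over batches of teacher/student logits:

```python
from math import exp, log

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Soft-label distillation loss: KL(teacher_T || student_T) * T^2.
    The T^2 factor keeps gradient magnitudes comparable to hard-label
    cross-entropy as the temperature changes."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * log(pi / qi) for pi, qi in zip(p, q))

# Identical logits -> zero loss; mismatched logits -> positive loss.
same = distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
diff = distill_loss([2.0, 0.5, -1.0], [0.0, 1.0, 0.5])
```

The soft targets carry the teacher's full output distribution per token, which is where the sample efficiency over training on hard labels alone comes from.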

@myexplodingpen.bsky.social is there any way to read your article without subscribing to Medium? Haven't been able to find much from Microsoft on the topic, or anyone else discussing this, but I can't read your piece...

24.11.2024 16:38 — 👍 0    🔁 0    💬 1    📌 0

Instructions unclear, but just in case all I have to do to get into the Stanford NLP PhD program is reply to this thread, I figure I'd better go ahead and reply.

22.11.2024 00:38 — 👍 1    🔁 0    💬 0    📌 0
