
Thom Lake

@thomlake.bsky.social

Principal Scientist at Indeed. PhD Student at UT Austin. AI, Deep Learning, PGMs, and NLP.

703 Followers  |  410 Following  |  6 Posts  |  Joined: 13.11.2024

Latest posts by thomlake.bsky.social on Bluesky


I'm at #Neurips2024 this week!

My work (arxiv.org/abs/2406.17692) w/ @gregdnlp.bsky.social & @eunsol.bsky.social exploring the connection between LLM alignment and response pluralism will be at pluralistic-alignment.github.io Saturday. Drop by to learn more!

11.12.2024 17:39 — 👍 28  🔁 6  💬 0  📌 0

Due to the split between the input statements and the query, the resulting model isn't a generic sequence processor like RNNs or transformers. However, if you were to process a sequence by treating each element in turn as the query, you'd get something that looks a lot like a transformer.

02.12.2024 16:37 — 👍 6  🔁 0  💬 0  📌 0
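A minimal NumPy sketch of the point in the post above (toy shapes; learned projections and separate output embeddings are omitted): if every position takes a turn as the query over all of the statements, the memory read collapses into plain single-head self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(memories, query):
    # One memory-network-style read: a single query attends over all statements.
    probs = softmax(memories @ query)        # (n_statements,)
    return probs @ memories                  # weighted sum of the memories

def self_attention(x):
    # Treat each element of the sequence as the query in turn:
    # this is unprojected, single-head self-attention.
    return np.stack([memory_read(x, x[t]) for t in range(len(x))])

x = np.random.randn(5, 16)                   # 5 statements, 16-dim encodings
print(self_attention(x).shape)               # (5, 16)
```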

MemNets first encode each input sentence/statement independently, with a position embedding. These are the "memories". Then you encode the query and apply cross-attention between it and the memories. Rinse and repeat for some fixed depth. No for-loop over time here.

02.12.2024 16:35 — 👍 6  🔁 0  💬 1  📌 0
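A rough PyTorch sketch of the read pattern described above, under simplifying assumptions (one shared embedding table and a residual state update, whereas the paper uses separate input/output embeddings; the names and sizes here are made up). The `hops` loop is the "rinse and repeat" depth; nothing iterates over sequence positions.

```python
import torch
import torch.nn.functional as F

def encode_statements(statements, word_emb, pos_emb):
    # statements: (n_statements, n_words) token ids.
    # Position-weighted bag of words per statement, encoded independently.
    return (word_emb[statements] * pos_emb).sum(dim=1)            # (n_statements, d)

def memnet(statements, query, word_emb, pos_emb, hops=3):
    memories = encode_statements(statements, word_emb, pos_emb)   # the "memories"
    u = word_emb[query].sum(dim=0)                                # query encoding, (d,)
    for _ in range(hops):                                         # fixed depth, no loop over time
        attn = F.softmax(memories @ u, dim=0)                     # cross-attention scores
        o = attn @ memories                                       # read from memory
        u = u + o                                                 # update the query state
    return u

# Toy usage with hypothetical sizes.
vocab, d, n_words = 50, 32, 6
word_emb = torch.randn(vocab, d)
pos_emb = torch.randn(n_words, d)
statements = torch.randint(0, vocab, (4, n_words))
query = torch.randint(0, vocab, (n_words,))
print(memnet(statements, query, word_emb, pos_emb).shape)         # torch.Size([32])
```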

The recurrence there refers to depth-wise weight tying (see Section 2.2):

> Layer-wise (RNN-like): the input and output embeddings are the same across different layers

02.12.2024 15:49 — 👍 0  🔁 0  💬 1  📌 0
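A toy NumPy sketch of the layer-wise tying that quote describes (the statement encodings are random stand-ins; `A`, `C`, and `H` follow the Section 2.2 notation): the same embedding matrices are reused at every hop and only the controller state changes, which is why it reads like an RNN unrolled over depth rather than time.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n = 16, 5                      # toy embedding size and number of statements
A = rng.normal(size=(d, d))       # input-memory embedding, shared across hops
C = rng.normal(size=(d, d))       # output-memory embedding, shared across hops
H = rng.normal(size=(d, d))       # linear map on the controller state
x = rng.normal(size=(n, d))       # pre-encoded statements (random stand-ins)
u = rng.normal(size=d)            # initial query / controller state

for _ in range(3):                # fixed number of hops, same weights every time
    m, c = x @ A, x @ C           # same A and C at every layer = layer-wise tying
    p = softmax(m @ u)            # attention over memories
    o = p @ c                     # read vector
    u = H @ u + o                 # RNN-like state update across depth

print(u.shape)                    # (16,)
```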
Preview
End-To-End Memory Networks: We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network (Weston et al., 2015) but unlike the model in that wo...

Memory networks came earlier, were attention-only, and had position embeddings, but were not word/token level: arxiv.org/abs/1503.08895

They were later elaborated with the key-value distinction, which is, AFAIK, where this terminology arises: arxiv.org/abs/1606.03126

02.12.2024 06:32 — 👍 1  🔁 0  💬 1  📌 0
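A minimal sketch of the key-value read that second paper introduces (NumPy, toy shapes): the query is matched against the keys, but the readout is a weighted sum over a separate set of values, the same split that transformer attention uses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def key_value_read(query, keys, values):
    # Key-value memory read: address by keys, return a mix of values.
    probs = softmax(keys @ query)          # match the query against the keys
    return probs @ values                  # weighted sum of the values

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 32))            # e.g. encoded questions or window centers
values = rng.normal(size=(8, 32))          # e.g. encoded answers or window words
query = rng.normal(size=32)
print(key_value_read(query, keys, values).shape)   # (32,)
```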
A scatter plot comparing language models by performance (y-axis, measured in average performance on 10 benchmarks) versus training computational cost (x-axis, in approximate FLOPs). The plot shows OLMo 2 models (marked with stars) achieving Pareto-optimal efficiency among open models, with OLMo-2-13B and OLMo-2-7B sitting at the performance frontier relative to other open models like DCLM, Llama 3.1, StableLM 2, and Qwen 2.5. The x-axis ranges from 4x10^22 to 2x10^24 FLOPs, while the y-axis ranges from 35 to 70 benchmark points.


Excited to share OLMo 2!

🐟 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc
🐠 better architecture and recipe for training stability
🐑 staged training, with new data mix Dolmino added during annealing
🦈 state-of-the-art OLMo 2 Instruct models

#nlp #mlsky

links below 👇

26.11.2024 20:59 — 👍 68  🔁 12  💬 1  📌 1

👋

25.11.2024 14:44 — 👍 0  🔁 0  💬 0  📌 0
Preview
NLP at UT Austin: Join the conversation

A starter pack for the NLP and Computational Linguistics researchers at UT Austin!
go.bsky.app/75g9JLT

22.11.2024 17:18 — 👍 22  🔁 7  💬 0  📌 0
