@kabir25.bsky.social
20 | ML · https://github.com/kabir2505/Deep-Learning-papers
Built a *tiny-Mixtral model* (~172M, 8 experts) from scratch
with
- Grouped Query Attention
- Rolling Buffer KV Cache
- Sparse MoEs (routing sketch below)
- Rotary Positional Embeddings
Trained it on TinyStories.
github.com/kabir2505/ti...
Logged the full summary here: www.notion.so/kabir25/LLMS...
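To make the "Sparse MoEs" bullet concrete: a minimal sketch of Mixtral-style top-2 expert routing in pytorch. Names and sizes (`SparseMoE`, `dim`, `n_experts`) are illustrative, not necessarily what the repo uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Mixtral-style sparse MoE: each token is routed to its top-2 of 8 experts."""
    def __init__(self, dim=512, hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)   # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, seq, dim)
        tokens = x.view(-1, x.size(-1))          # (n_tokens, dim)
        weights, idx = self.gate(tokens).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            rows, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to e
            if rows.numel():
                out[rows] += weights[rows, slot, None] * expert(tokens[rows])
        return out.view_as(x)
```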
08.04.2025 15:09 · 👍 0 🔁 0 💬 0 📌 0
Read a super interesting paper recently: "LLMs Know More Than They Show" (openreview.net/forum?id=KRn...). It dives into how large language models actually encode way more truthfulness internally than they let on in their outputs.
Takeaways:
• Truth is token-specific: Truthfulness signals are concentrated in certain tokens; probing those can significantly boost error detection.
• Generalization is tough: These probing techniques don't generalize across datasets, which means LLMs hold multiple fragmented notions of truth.
• Different errors, different signals: Internal states can even help predict what kind of error a model will make (factual, reasoning, etc.).
• Hidden knowledge: Sometimes, models internally know the right answer… but still generate the wrong one externally.
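In that spirit, a minimal sketch of a linear probe for error detection: train a classifier on the hidden state of one chosen token per generation. The file names here are placeholders for pre-collected activations and correctness labels; the paper's exact setup differs in its details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholders: hidden states of one chosen token (e.g. an exact-answer token)
# per generation, shape (n_samples, d_model), plus 0/1 correctness labels.
X = np.load("hidden_states.npy")
y = np.load("is_correct.npy")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("error-detection accuracy:", probe.score(X_te, y_te))
```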
08.04.2025 15:09 · 👍 0 🔁 0 💬 1 📌 0
which city in Maharashtra?
23.03.2025 11:25 · 👍 1 🔁 0 💬 1 📌 0
hello! are you hiring undergrad ml research interns?
20.03.2025 12:53 · 👍 1 🔁 0 💬 1 📌 0
implemented the Llama architecture from scratch in pytorch
github.com/kabir2505/De...
let's implement llama today
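Two Llama-specific pieces, sketched minimally (module names and sizes are mine, not necessarily the repo's): RMSNorm and the SwiGLU feed-forward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Llama normalizes by RMS only: no mean-centering, no bias
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLU(nn.Module):
    # Llama's gated feed-forward: silu(w1 x) * (w3 x), projected back by w2
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```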
09.03.2025 07:12 · 👍 0 🔁 0 💬 0 📌 0
Not enough ml/dl folks on my feed
06.03.2025 05:59 · 👍 1 🔁 0 💬 0 📌 0
hahahaha same
06.03.2025 03:20 · 👍 0 🔁 0 💬 0 📌 0
implemented wgan & wgan-gp in torch
github.com/kabir2505/De...
github.com/kabir2505/De...
onto some more gan models & vaes :)
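For reference, the piece that separates wgan-gp from plain wgan's weight clipping is the gradient penalty; a minimal sketch, assuming an image-shaped critic input:

```python
import torch

def gradient_penalty(critic, real, fake):
    # WGAN-GP: push the critic's gradient norm toward 1 at points
    # interpolated between real and fake samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(
        outputs=critic(interp).sum(), inputs=interp, create_graph=True
    )
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```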
Spent the day revisiting dropout, so I figured I'd turn it into a blog - kabir25.notion.site/Dropout-16e3...
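For context, the standard (inverted) variant fits in a few lines; a minimal sketch:

```python
import torch

def dropout(x, p=0.5, training=True):
    # Inverted dropout: scale kept units by 1/(1-p) at train time
    # so inference is a plain identity pass
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()
    return mask * x / (1.0 - p)
```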
01.01.2025 13:52 · 👍 0 🔁 0 💬 0 📌 0
Implemented *instruction fine-tuning* on a GPT-2 model on a small dataset & mine claimed Robert Frost wrote Pride and Prejudice
github.com/kabir2505/pr...
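A common way to serialize such a dataset is an Alpaca-style prompt template; a minimal sketch, not necessarily the exact template the repo uses:

```python
def format_example(ex):
    # ex: dict with "instruction", optional "input", and "output"
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{ex['instruction']}\n\n"
    )
    if ex.get("input"):
        prompt += f"### Input:\n{ex['input']}\n\n"
    return prompt + f"### Response:\n{ex['output']}"
```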
today's agenda..
21.12.2024 04:27 · 👍 1 🔁 0 💬 0 📌 0
my notes on the gpt-3 paper: kabir25.notion.site/GPT3-1603fc0...
20.12.2024 15:37 · 👍 1 🔁 0 💬 1 📌 0
good morning!!
20.12.2024 15:36 · 👍 2 🔁 0 💬 0 📌 0
this is insane, huge congrats!
19.12.2024 12:09 · 👍 1 🔁 0 💬 0 📌 0
If you are into ML theory (RL or not) with a proven track record, and you are interested in an industry research position, PM me. Feel free to spread the word.
19.12.2024 00:55 · 👍 74 🔁 31 💬 2 📌 0
today's read :)
18.12.2024 14:33 · 👍 0 🔁 0 💬 0 📌 1
NeurIPS FOMO is real 😫 wish I could teleport..
14.12.2024 16:03 · 👍 0 🔁 0 💬 0 📌 0
Built *BERT* from scratch in pytorch. Took a bit to understand MLM (Masked Language Modeling) and NSP (Next Sentence Prediction) but totally worth the grind.
Code: github.com/kabir2505/De...
Notes: kabir25.notion.site/BERT-1533fc0...
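The MLM corruption rule from the BERT paper (pick 15% of tokens; of those, 80% → [MASK], 10% → a random token, 10% → unchanged) in a minimal sketch. `mask_id` and shapes are illustrative, and special tokens aren't excluded here for brevity:

```python
import torch

def mask_tokens(input_ids, mask_id, vocab_size, p=0.15):
    # input_ids: (batch, seq) token ids
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < p
    labels[~selected] = -100                 # loss only on selected positions
    r = torch.rand(input_ids.shape)
    corrupted = input_ids.clone()
    corrupted[selected & (r < 0.8)] = mask_id               # 80% -> [MASK]
    swap = selected & (r >= 0.8) & (r < 0.9)                # 10% -> random
    corrupted[swap] = torch.randint(vocab_size, input_ids.shape)[swap]
    return corrupted, labels                 # remaining 10% stay unchanged
```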
bad recs = zero vibes = productivity tanked!
06.12.2024 09:30 · 👍 1 🔁 0 💬 0 📌 0
tackling my first nlp kaggle competition, any suggestions or references?
06.12.2024 04:55 · 👍 0 🔁 0 💬 0 📌 0
notes on bert: kabir25.notion.site/BERT-1533fc0...
Still a work in progress..
Diving into BERT today :)
05.12.2024 12:31 · 👍 0 🔁 0 💬 0 📌 1