Made a list of resources for open source language models with @soldaini.net ahead of the tutorial tomorrow at 9:30 AM.
github.com/allenai/awes...
@llm360.bsky.social
Working on fully open-source LLMs and training data. We believe in community-owned AI. https://www.llm360.ai
We've added you to the list!
02.12.2024 07:31
We've added you to the list!
25.11.2024 09:30
Can we join your list?
22.11.2024 01:28
We've added you to the list!
22.11.2024 01:27
Great, yes, added!
22.11.2024 01:26
Thanks Stella! We've added eleuther to the list.
21.11.2024 02:15
Thanks! We've added you to the list.
21.11.2024 02:15
We've made a starter pack for researchers/organizations working on open-source LLMs.
Please let us know if we missed you or if you'd like to be added!
go.bsky.app/FELkyDr
Thank you!
19.11.2024 23:00
The global deduplication process was hairy, and we want to share every detail.
The TxT360 dedup pipeline can be recreated and used for other datasets. We include our tips and tricks in a tell-all write-up in the release blog:
llm360-txt360.hf.space
huggingface.co/spaces/LLM36...
Building on FineWeb's global deduplication findings, we introduce a strategic upsampling recipe that lets TxT360 outperform FineWeb. Full details are in the Upsampling Experiment section of the release blog.
19.11.2024 22:42
LLM360 is committed to making open source AI accessible, transparent, and reproducible.
High-quality data is the first step toward better open source models, and we are excited to join the party by contributing the first globally deduplicated dataset, containing 5.7T tokens!
Banner image showing the TxT360 project.
Check out:
TxT360: a globally deduplicated dataset for LLM pretraining
99 Common Crawl snapshots
14 Curated Sources
A recipe to easily adjust data weighting and train the most performant models
Dataset:
huggingface.co/datasets/LLM...
Blog:
llm360-txt360.hf.space
Can we join?
19.11.2024 22:30