Join the OSCAR Project Discord Server!
Check out the OSCAR Project community on Discord - hang out with 365 other members and enjoy free voice and text chat.
We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community on Discord: https://t.co/toLKAPje4E
10.08.2023 15:50
Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of Inria, the ALMAnaCH team and Common Crawl. Special thanks for the contributions of @ujj.bsky.social, Rua Ismail, @sobamchan.bsky.social, Sebastian Nagel and Benoît Sagot.
10.08.2023 15:49
Terms of Use - Common Crawl
As Colossal OSCAR 1.0 is based on Common Crawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use:
https://commoncrawl.org/terms-of-use/
10.08.2023 15:46
Colossal OSCAR 1.0 is just a partial annotation of the WET files of 10 Common Crawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages.
10.08.2023 15:45
Colossal OSCAR 1.0 is by far our largest release to date, almost 10 times as big as previous releases. We're still working on statistics and documentation, so please bear with us while we finish these for you in the coming days and weeks.
10.08.2023 15:44
The OSCAR Project and DFKI are happy to announce the release of Colossal OSCAR 1.0, which is now available on the Hugging Face Hub at https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0
Colossal OSCAR 1.0 was put together by @pjox.bsky.social as part of the OpenGPT-X collaboration.
10.08.2023 15:44