
@iseeaswell.bsky.social

TUSL discord link: https://discord.gg/z3ya9EUS2U

38 Followers  |  28 Following  |  23 Posts  |  Joined: 11.12.2024  |  1.9031

Latest posts by iseeaswell.bsky.social on Bluesky

It could also have been short for 蒙古篆

09.11.2025 12:33 | 👍 0    🔁 0    💬 0    📌 0

In 1443, King Sejong’s announcement said that Hangul imitates “古篆” script; however, there is no reference to that script anywhere else, so no one knows what it is. But… doesn’t it look suspiciously like “蒙古” (Mongol), giving more evidence that he was referring to ‘Phags-pa script?

08.11.2025 12:54 | 👍 2    🔁 0    💬 2    📌 0
Join the TUSL: Tech for Under-Served Languages: Grassroots community Discord Server! This is a community for people doing NLP for under-served and low-resource languages! Anyone is welcome. | 111 members

updated link: discord.gg/kDNWDhHv

06.08.2025 08:53 | 👍 0    🔁 0    💬 0    📌 0

All are welcome. Please make this space your own, and add channels at will.

17.06.2025 17:46 | 👍 2    🔁 0    💬 0    📌 0

Our first task is to massively expand SMOL through community contribution. Anyone who contributes significant volunteer translations or post-edits will get on the Arxiv paper in the next refresh!

17.06.2025 17:46 | 👍 2    🔁 0    💬 1    📌 0

this is a space for grassroots collaboration. It doubles as a directory of speakers of such languages, so you can directly talk with and collaborate with community members.

17.06.2025 17:46 | 👍 1    🔁 0    💬 1    📌 0
Join the Tech for Under-Served Languages: Grassroots community Discord Server! This is a community for people doing NLP for under-served and low-resource languages! Anyone is welcome. | 35 members

Working on Low Resource Languages? Want to help with SMOL? join our new discord! discord.gg/YFTv7tkh

17.06.2025 17:46 | 👍 2    🔁 0    💬 2    📌 0

@colinacherry.bsky.social

19.02.2025 17:36 | 👍 0    🔁 0    💬 0    📌 0

By the way, GATITOS has now officially moved to the SMOL Huggingface repo

19.02.2025 17:36 | 👍 0    🔁 0    💬 1    📌 0

Finally, if you are a speaker of any SMOL languages, please take a look at the data and tell me what you think. Despite the quality checks, I am sure that some of the deliveries have quality issues, and I would love to understand and/or fix any affected sources. We are in this together!

19.02.2025 17:36 | 👍 2    🔁 0    💬 1    📌 0

I would also like to thank FAIR for being an academic leader in open-sourcing work with low-resource languages, including NLLB and Flores. Thank you for helping make the academic community feel collaborative!

19.02.2025 17:36 | 👍 0    🔁 0    💬 1    📌 0

I would like to thank our native-language consultants and translators -- too numerous to name -- for their invaluable help along the way. Several entire languages in SMOL only exist because of volunteer contributions!

19.02.2025 17:36 | 👍 1    🔁 0    💬 1    📌 0

SMOL also provides factuality ratings for 671 documents, with well-researched justifications.

19.02.2025 17:36 | 👍 0    🔁 0    💬 1    📌 0

SMOL has two sub-sources: SMOL-Doc, a document-level set, and SMOL-Sent, a sentence-level set. They join the token-level GATITOS to cover three levels of granularity!

19.02.2025 17:36 | 👍 0    🔁 0    💬 1    📌 0

And that’s just OOTB finetuning; we know that the community can think of more clever ways to train on SMOL. Multiway parallel data is tricky to deal with without overfitting.

19.02.2025 17:36 | 👍 0    🔁 0    💬 1    📌 0

Finetuning of Gemini 2.0 Flash on SMOL yields average improvements of about +4.0 ChrF, with some languages -- including Ewe, Kokborok, Manipuri, Ga, and Dombe -- seeing gains of over +20 ChrF.

19.02.2025 17:36 | 👍 0    🔁 0    💬 1    📌 0

SMOL comprises sentences and documents carefully selected for the biggest “Bang for Buck” ratio. It includes 6.1M translated tokens, and if you’ve been in this field a while you know that’s a lot!

19.02.2025 17:36 | 👍 1    🔁 0    💬 1    📌 0

😼 SMOL DATA ALERT! 😼 Announcing SMOL, a professionally-translated dataset for 115 very low-resource languages! Paper: arxiv.org/pdf/2502.12301
Huggingface: huggingface.co/datasets/goo...

19.02.2025 17:36 | 👍 14    🔁 8    💬 2    📌 1

Is Dravidian negation illa (இல்லை etc.) cognate to Semitic *lā (لَا etc.)? There was a lot of trade in that region, so it seems likely to me.

04.02.2025 21:58 | 👍 0    🔁 0    💬 0    📌 0

I want a “Duolingo for linguists” that doesn’t attempt to teach you useful everyday language but just speed-runs you through the grammar and so on

24.01.2025 08:40 | 👍 4    🔁 0    💬 0    📌 1

Google Translate gives “在这里,说得很多,但听到的却很少”, which has the suspicious property that the passive is marked with 得 in the first clause and 的 in the second. More credence to the theory that they are cognate with all Irish, specifically the Cork dialect, which is the oldest and purest form

27.12.2024 18:14 | 👍 2    🔁 0    💬 0    📌 0

Interesting, my brain didn’t consider that because that construction feels like an adjective rather than a verb, but it does seem to have more or less the same meaning!

27.12.2024 18:11 | 👍 2    🔁 0    💬 0    📌 0

I often find myself wanting to use the Irish impersonal aspect in Chinese, a.k.a. ่ฏดtar/ๅฌtear for "it is spoken"/"it is heard", so โ€ๅœจ่ฟ™้‡Œ่ฏดtarๅพˆๅคš๏ผŒ่€Œๅฌtearๅพˆๅฐ‘โ€œ โ€œhere, much is said, but little is heard". (While we're at it we could give Irish a nice Rechtschreibreform imported from Cyrl: "ๅฌtear" --> "ๅฌtัŒar")

27.12.2024 06:54 | 👍 4    🔁 0    💬 1    📌 0
