Proud to work with John Pavlopoulos and @antonisa.bsky.social on this publication!
Check out the data and code here: github.com/andhmak/rule...
4/4
@antoniosdimakis.bsky.social
PhD fellow at Archimedes Unit, Athena Research Center | PhD student at the National and Kapodistrian University of Athens | Interested in NLP for low-resource languages/terms, tokenization, and linguistics
Regions clustered based on the embeddings of their proverbs. Normalized proverbs produce much more meaningful groupings.
We implement our method for Greek and experiment on a proverb dataset. This very cheaply extends the NLU coverage of models pre-trained only on the standard variety to almost every Greek dialect.
After normalizing, we even find cultural insights that were previously obscured!
3/4
Table showing normalization quality for different setups, with the full setup obtaining good scores.
"Dialect Normalization using Large Language Models and Morphological Rules"
By applying rule-based, linguistically informed transformations to the input before passing it to an LLM with targeted few-shot prompting, we can obtain high-quality normalized outputs.
2/4
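For readers who want a feel for the pipeline shape, here is a minimal sketch of "rules first, then a few-shot prompt". It is not the code from the repository. The rewrite rules are toy placeholders rather than real Greek morphology, and the names `TOY_RULES`, `apply_rules`, `build_prompt`, and `normalize`, the prompt wording, and the `llm` callable are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical toy rules as (regex pattern, replacement) pairs.
# They are placeholders, not the morphological rules used in the paper.
TOY_RULES: List[Tuple[str, str]] = [
    (r"tz", "ts"),    # made-up consonant-cluster rewrite
    (r"ou\b", "o"),   # made-up word-final vowel rewrite
]


def apply_rules(text: str, rules: List[Tuple[str, str]] = TOY_RULES) -> str:
    """Apply each rewrite rule in order to the whole input string."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def build_prompt(sentence: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble a few-shot prompt from (dialect, standard) example pairs."""
    parts = ["Normalize the dialectal sentence into the standard variety."]
    for dialect, standard in examples:
        parts.append(f"Dialect: {dialect}\nStandard: {standard}")
    parts.append(f"Dialect: {sentence}\nStandard:")
    return "\n\n".join(parts)


def normalize(
    sentence: str,
    examples: List[Tuple[str, str]],
    llm: Callable[[str], str],
    rules: List[Tuple[str, str]] = TOY_RULES,
) -> str:
    """Rule-based pre-normalization first, then a few-shot LLM call.

    `llm` is any function mapping a prompt string to a completion string;
    which model sits behind it is left open in this sketch.
    """
    pre_normalized = apply_rules(sentence, rules)
    return llm(build_prompt(pre_normalized, examples)).strip()
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM API; how the few-shot examples are chosen is also left open here.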
Example of a dialectal sentence being normalized incorrectly when using a base LLM, and the same sentence normalized correctly using our method.
How can we make models understand dialectal input, even in dialects with very little data available?
Our work indicates that Rule-Based Normalization can significantly help.
If you're at #ACL2025, check out our poster on Monday at 6pm! aclanthology.org/2025.finding...
1/4