sapienzanlp/bookcoref · Datasets at Hugging Face
We're on a journey to advance and democratize artificial intelligence through open source and open science.
We hope our contribution can become a useful benchmark for the challenging setting of Coreference Resolution on books! We release our code and dataset:
GitHub: github.com/sapienzaNLP/bookcoref
HuggingFace: huggingface.co/datasets/sapienzanlp/bookcoref
Thank you for your attention! (6/6)
21.07.2025 14:24
Longdoc is the best-performing model in both off-the-shelf and fine-tuned settings. Although fine-tuning on BOOKCOREF Silver improves scores, 67 CoNLL F1 points are still far below current SOTA scores on small- and medium-sized datasets. (5/6)
21.07.2025 14:24
Our pipeline achieves 80.5 CoNLL F1 on our manually annotated split, BOOKCOREF Gold, warranting the use of BOOKCOREF Silver as a fine-tuning dataset.
We test current SOTA Coreference Resolution systems on our benchmark, both off-the-shelf and after fine-tuning. (4/6)
21.07.2025 14:24
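For readers unfamiliar with the metric quoted in the thread: CoNLL F1 is conventionally the unweighted mean of the F1 scores of three coreference metrics (MUC, B-cubed and CEAF-e). The sketch below is only an illustration of that convention, not the scorer used in the paper. It implements B-cubed as one of the three components and then averages three F1 values; the function names and the toy clusters are assumptions made for this example.

```python
def b_cubed(gold, pred):
    """B-cubed precision, recall and F1 for two clusterings.

    `gold` and `pred` are lists of sets of mention ids (illustrative format).
    """
    # Map every mention to the cluster that contains it.
    gold_of = {m: c for c in gold for m in c}
    pred_of = {m: c for c in pred for m in c}

    def avg_overlap(source, other):
        # For each mention, the fraction of its own cluster that is shared
        # with the cluster holding the same mention in the other clustering.
        if not source:
            return 0.0
        total = 0.0
        for mention, cluster in source.items():
            total += len(cluster & other.get(mention, set())) / len(cluster)
        return total / len(source)

    precision = avg_overlap(pred_of, gold_of)
    recall = avg_overlap(gold_of, pred_of)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """CoNLL F1: unweighted mean of the MUC, B-cubed and CEAF-e F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# Toy example: mention 3 is attached to the wrong entity in the prediction.
example_gold = [{1, 2, 3}, {4, 5}]
example_pred = [{1, 2}, {3, 4, 5}]
p, r, f = b_cubed(example_gold, example_pred)
print(f"B-cubed P={p:.3f} R={r:.3f} F1={f:.3f}")
```

MUC and CEAF-e are omitted here for brevity; a full evaluation would compute all three with an established scorer before averaging.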
We manually annotate 3 books for our test set and develop a 3-step pipeline to auto-annotate long books from text and character lists. We initialize coreferential clusters via explicit character mentions, refine them with an LLM, and expand using a local coreference model. (3/6)
21.07.2025 14:24
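The first step described in post (3/6), initializing clusters from explicit character mentions, can be pictured with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical input format (a mapping from each character to a list of known aliases) and uses plain exact string matching. The LLM refinement and the expansion with a local coreference model are not shown.

```python
import re
from collections import defaultdict


def init_clusters(text, character_names):
    """Seed one cluster per character from exact name matches.

    `character_names` maps a character id to its known aliases
    (hypothetical format). Returns {character: [(start, end), ...]}
    with character offsets into `text`.
    """
    clusters = defaultdict(list)
    for character, aliases in character_names.items():
        # Try longer aliases first so a full name wins over a first name.
        ordered = sorted(aliases, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b"
        )
        for match in pattern.finditer(text):
            clusters[character].append(match.span())
    return dict(clusters)


# Toy input, invented for illustration only.
text = "Elizabeth Bennet smiled. Mr. Darcy bowed, and Elizabeth laughed at Darcy."
names = {
    "Elizabeth": ["Elizabeth Bennet", "Elizabeth"],
    "Darcy": ["Mr. Darcy", "Darcy"],
}
clusters = init_clusters(text, names)
for character, spans in clusters.items():
    print(character, [text[start:end] for start, end in spans])
```

Exact matching only finds named mentions; pronouns and nominal mentions such as "she" or "the gentleman" are left for the later steps, which is presumably why the pipeline adds a local coreference model at the end.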
Coreference Resolution datasets currently focus on small- to medium-sized text, preventing the development of robust long-document Coreference Resolution systems. We introduce BOOKCOREF, a large-scale dataset obtained through our BOOKCOREF Pipeline to fill this gap. (2/6)
21.07.2025 14:24
🚨 Our paper "BOOKCOREF: Coreference Resolution at Book Scale" was accepted at the #ACL2025 main conference!
Kudos to all my co-authors: @giulianomartinelli.bsky.social, @perelluis13.bsky.social and Roberto Navigli, within the Sapienza NLP group.
Paper: arxiv.org/abs/2507.12075
🧵 A brief thread: (1/6)
21.07.2025 14:24
Hi, would also love to be added if there's still space! Thank you!
25.11.2024 07:47
http://cljournal.org
Computational Linguistics, established in 1974, is the official flagship journal of the Association for Computational Linguistics (ACL).
Google Chief Scientist, Gemini Lead. Opinions stated here are my own, not those of Google. Gemini, TensorFlow, MapReduce, Bigtable, Spanner, ML things, ...
PhD in ML/AI | Researching Efficient ML/AI (vision & language) & Interpretability | @SapienzaRoma @EdinburghNLP | https://alessiodevoto.github.io/ | ex @NVIDIA
PhD in AI at Sapienza University of Rome. Information Extraction, NLP, and such deviousness.
Researcher & faculty member @DPKM dedicated to the field of AI, with the focus on knowledge technologies (knowledge graphs, semweb, RAG) & their use in e-gov, skills matching, research ecosystem, digital humanities and education. Partner @km-a.bsky.social.
Researcher at @fbk-mt.bsky.social | inclusive and trustworthy machine translation | #NLP #Fairness #Ethics | she/her
Human Centered AI, AI for Accessibility
Postdoc at UT Austin ISchool, previously a PhD at UMD CLIP
she/her
PhD student in NLP at Sapienza | Prev: Apple MLR, @colt-upf.bsky.social, HF Bigscience, PiSchool, HumanCentricArt #NLProc
www.santilli.xyz
Computer Scientist with a thing for language. UniTn PhD student working in the LanD group at FBK. Focusing on NLG and misinformation.
https://drusso98.github.io/
Secular Bayesian.
Professor of Machine Learning at Cambridge Computer Lab
Talent aficionado at http://airetreat.org
Alum of Twitter, Magic Pony and Balderton Capital
PhD Student at UC San Diego | LLM Agents, Reinforcement Learning, Human-AI Collaboration, Multi-Agent Systems
PhD student @ EdinburghNLP | undergrad+masters @ Georgia Tech
Ph.D. candidate @ UMD CS Clip lab | ex Intern @ Meta FAIR & Microsoft | Multilingual and Multimodal NLP. Machine Translation and Speech Translation. https://h-j-han.github.io/
PhD student in NLP at Cambridge | ELLIS PhD student
https://lucasresck.github.io/
PhD student @ UAEU & KU Leuven
Postdoc at Princeton PLI. Formerly PhD at Stanford CS. Working on behavioral machine learning. https://kawine.github.io/
PhD candidate in CS at Northeastern University | NLP + HCI for health | she/her