Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Wei Pang, Kevin Qinghong Lin, @edwardjian.bsky.social, Xi He, @philiptorr.bsky.social
tl;dr: great stuff!
arxiv.org/abs/2505.21497
@edwardjian.bsky.social
CS PhD student at University of Waterloo. Visiting Researcher at ServiceNow Research. Working on AI and DB.
Excited to share that UI-Vision has been accepted at ICML 2025!
We have also released the UI-Vision grounding datasets. Test your agents on them now!
Dataset: huggingface.co/datasets/Ser... (loading sketch below)
#ICML2025 #AI #DatasetRelease #Agents
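For anyone who wants to try it right away, here is a minimal sketch of pulling the grounding data with the Hugging Face `datasets` library. The repo id in the link above is truncated, so the id, split, and field names below are placeholders rather than the real schema; check the dataset card for the actual layout.

```python
# Minimal sketch: load the UI-Vision grounding data from the Hugging Face Hub.
# NOTE: the dataset link in the post is truncated, so the repo id, split, and
# field names here are placeholders; see the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("ServiceNow/ui-vision", split="test")  # hypothetical repo id / split
sample = ds[0]
print(sample.keys())  # e.g. screenshot, instruction, target box, per the actual card
```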
Huge thanks to my wonderful coauthor
@shravannayak.bsky.social and the amazing team at @servicenowresearch.bsky.social and @mila-quebec.bsky.social.
We want UI-Vision to be the go-to benchmark for desktop GUI agents.
📢 Data, benchmarks & code coming soon!
💡 Next: scaling training data & models for long-horizon tasks.
Let's build, benchmark & push GUI agents forward!
Interacting with desktop GUIs remains a challenge.
🖱️ Models struggle with click & drag actions due to poor grounding and limited motion understanding.
UI-TARS leads across models!
🧠 Closed models (GPT-4o, Claude, Gemini) excel at planning but fail to localize.
Detecting functional UI regions is tough!
Even top GUI agents miss functional regions.
Closed-source VLMs shine with stronger visual understanding.
Cluttered UIs bring down IoU (metric sketched below).
We're the first to propose this task.
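For readers unfamiliar with the metric: IoU (intersection over union) measures how much a predicted region overlaps the ground-truth one. A minimal, self-contained sketch, not the benchmark's own evaluation code:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction that only partially covers a cluttered region scores low:
print(iou((0, 0, 100, 40), (50, 0, 150, 40)))  # ~0.33
```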
Grounding UI elements is challenging!
Even top VLMs struggle with fine-grained GUI grounding.
GUI agents like UI-TARS (25.5%) & UGround (23.2%) do better but still fall short.
⚠️ Small elements, dense UIs, and limited domain/spatial understanding are major hurdles.
We propose three key benchmark tasks to evaluate GUI agents:
🔹 Element Grounding – Identify a UI element from a text description (scoring sketch below)
🔹 Layout Grounding – Understand UI layout structure & group elements
🔹 Action Prediction – Predict the next action given a goal, past actions & screen state
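As a rough illustration of how element grounding is commonly scored in GUI benchmarks (an assumption on my part, not necessarily UI-Vision's exact protocol): a predicted click point counts as a hit if it falls inside the ground-truth element's bounding box.

```python
def click_hit(pred_xy, gt_box):
    """True if the predicted click point lands inside the ground-truth box (x1, y1, x2, y2)."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(preds, gt_boxes):
    """Fraction of predicted click points that land inside their ground-truth boxes."""
    hits = sum(click_hit(p, b) for p, b in zip(preds, gt_boxes))
    return hits / len(gt_boxes)
```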
UI-Vision consists of
✅ 83 open-source desktop apps across 6 domains
✅ 450 human demonstrations of computer-use workflows
✅ Dense, human-annotated bounding boxes for UI elements and rich action trajectories
Most GUI benchmarks focus on web or mobile.
🖥️ But what about desktop software, where most real work happens?
UI-Vision fills this gap by providing a large-scale benchmark with diverse and dense annotations to systematically evaluate GUI agents.
Super excited to announce UI-Vision: the largest and most diverse benchmark for evaluating agents on real-world desktop GUIs in offline settings.
Paper: arxiv.org/abs/2503.15661
Website: uivision.github.io
🧵 Key takeaways below