Sebastian Joseph

@sebajoe.bsky.social

CS Ph.D. Student at UT Austin

10 Followers  |  3 Following  |  6 Posts  |  Joined: 02.06.2025

Latest posts by sebajoe.bsky.social on Bluesky

AstroVisbench · AstroVisBench

My amazing co-authors: Syed Murtaza Husain, Stella Offner, @stephajuneau.bsky.social, Paul Torrey, Adam Bolton, Juan Frias, @niall2.bsky.social, @gregdnlp.bsky.social, and @jessyjli.bsky.social.
Full support from @nsfsimonscosmicai.bsky.social.

🌐: astrovisbench.github.io
📄: arxiv.org/abs/2505.20538

02.06.2025 15:41  |  👍 2    🔁 0    💬 0    📌 0

We think this dataset is a great target for AI for science efforts. It zeroes in on an important part of the scientific workflow that is achievable in the near term, and it aims to produce tools for astronomers rather than to replace them or automate all of science.

02.06.2025 15:41  |  👍 1    🔁 0    💬 1    📌 0

Even the best LLMs struggle to execute scientific workflows.

SOTA models including Gemini 2.5 Pro, Claude Opus 4, o3-mini, and QwQ crash 30-60% of the time and produce visualizations without error in fewer than 16% of cases.

02.06.2025 15:41  |  👍 1    🔁 0    💬 1    📌 0

We generate code from a model, run it, and evaluate the following:

Processing tasks: we compare key variable values.
Visualizations: we use a VLM judge (well correlated w/ pro astronomers) that compares a visualization's scientific utility to that of the ground truth.
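
The processing-task check above (run the generated code, then compare key variable values) can be sketched roughly as follows. This is only an illustration, not the AstroVisBench harness: the names `run_snippet`, `values_match`, and `score_processing`, the tolerance, and the toy snippets are all assumptions, and the VLM judge for visualizations is not covered. A real setup would also run untrusted model code in a sandbox rather than with a bare `exec`.

```python
import math
from typing import Any, Dict, List, Optional


def run_snippet(code: str) -> Optional[Dict[str, Any]]:
    """Execute code in a fresh namespace; return it, or None if the code crashes."""
    namespace: Dict[str, Any] = {}
    try:
        exec(code, namespace)  # assumption: no sandboxing in this toy sketch
    except Exception:
        return None
    return namespace


def values_match(a: Any, b: Any, rel_tol: float = 1e-6) -> bool:
    """Compare two values, using a relative tolerance for numbers (hypothetical rule)."""
    if isinstance(a, bool) or isinstance(b, bool):
        return a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(
            values_match(x, y, rel_tol) for x, y in zip(a, b)
        )
    return a == b


def score_processing(
    generated_code: str, reference_code: str, key_variables: List[str]
) -> Dict[str, Any]:
    """Run both snippets and count how many key variables agree."""
    reference = run_snippet(reference_code)
    if reference is None:
        raise ValueError("reference code must run without error")
    generated = run_snippet(generated_code)
    if generated is None:
        return {"crashed": True, "matched": 0, "total": len(key_variables)}
    matched = sum(
        1
        for name in key_variables
        if name in generated
        and name in reference
        and values_match(generated[name], reference[name])
    )
    return {"crashed": False, "matched": matched, "total": len(key_variables)}


# Toy example: the "ground truth" and a model's attempt at the same processing step.
reference_code = "flux = [1.0, 2.0, 4.0]\nmean_flux = sum(flux) / len(flux)"
generated_code = "flux = [1.0, 2.0, 4.0]\nmean_flux = sum(flux) / 3"
result = score_processing(generated_code, reference_code, ["flux", "mean_flux"])
```

Counting crashes separately from value mismatches mirrors the distinction the thread draws between code that fails to run and code that runs but gives the wrong result.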

02.06.2025 15:41  |  👍 1    🔁 0    💬 1    📌 0

We created AstroVisBench from expert-curated Jupyter notebooks for astronomy tasks, from which we constructed 432 sets of processing and plotting tasks. It tests a diverse set of visualizations and long-tail API use.

02.06.2025 15:41  |  👍 1    🔁 0    💬 1    📌 0

How good are LLMs at 🔭 scientific computing and visualization 🔭?

AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.

SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵

02.06.2025 15:41  |  👍 9    🔁 2    💬 1    📌 4
