
Juan Rodriguez

@joanrod.bsky.social

AI Researcher. Working on Multimodal AI at ServiceNow, Mila joanrod.github.io

24 Followers  |  24 Following  |  9 Posts  |  Joined: 29.11.2024  |  1.6625

Latest posts by joanrod.bsky.social on Bluesky


We're really excited to release this large collaborative work for unifying web agent benchmarks under the same roof.

In this TMLR paper, we take an in-depth look at #BrowserGym and #AgentLab. We also present some unexpected performance results from Claude 3.5-Sonnet.

12.12.2024 17:55 | 👍 20    🔁 11    💬 1    📌 2

LLMs have a lot of potential for science, but scientists can be particularly sensitive to factuality, nuances, and hallucinations. The new ScholarQABench benchmark in this paper looks pretty useful for the community to monitor progress on LLMs for science. arxiv.org/html/2411.14199

25.11.2024 01:20 | 👍 1    🔁 1    💬 0    📌 0

Also, we are currently at NeurIPS in Vancouver! We will be presenting this work in the RBFM workshop on Saturday! Come say hi, and let's spark some collaborations! 🚀

10.12.2024 18:34 | 👍 0    🔁 0    💬 0    📌 0

This was a monumental collaboration, and a huge thank you to all the co-authors, ServiceNow Research, Mila, and all the institutions involved for their incredible support! 🙏

10.12.2024 18:34 | 👍 0    🔁 0    💬 1    📌 0

We hope this effort aids the community in building more robust models for these tasks while emphasizing the importance of open and transparent data usage and release.

10.12.2024 18:34 | 👍 0    🔁 0    💬 1    📌 0

We evaluated several VLMs, both open and closed source, on BigDocs-Bench to build a leaderboard.

📊 Models trained on BigDocs outperformed all other models on BigDocs-Bench tasks and delivered robust performance on established benchmarks.
✅ Human evaluations confirmed their strong performance!

10.12.2024 18:34 | 👍 0    🔁 0    💬 1    📌 0

To validate the quality of the BigDocs datasets, we trained several VLMs on BigDocs-7.5M and evaluated their performance on document-specific and general VLM benchmarks.

The results? Training on BigDocs provides significant boosts compared to training on other datasets! 📈✨

10.12.2024 18:34 | 👍 0    🔁 0    💬 1    📌 0

We introduce BigDocs-Bench, a set of benchmarks that focus on:

📄 Document Understanding
🌐 Web and GUI reasoning
👨‍💻 Code Generation

We also tackle complex outputs like SVG, LaTeX code, Markdown, and HTML, including very long and structured formats. Here are some examples

10.12.2024 18:34 | 👍 0    🔁 0    💬 1    📌 0


By sharing this journey, we aim to bring more transparency to how datasets are built, especially as data remains the most opaque aspect of model performance in today's fast-moving AI landscape. 🌟

10.12.2024 18:34 | 👍 0    🔁 0    💬 1    📌 0

Building BigDocs was no small feat! We curated a large-scale dataset from diverse, license-friendly sources and documented the entire process.

10.12.2024 18:34 | 👍 1    🔁 0    💬 1    📌 0
🎉 Excited to introduce BigDocs!
An open, transparent multimodal dataset designed for:
📄 Documents
🌐 Web content
🖥️ GUI understanding
👨‍💻 Code generation from images
We're also launching BigDocs-Bench:
➡️ Document, Web, GUI Visual reasoning
➡️ Converting images into JSON, Markdown, LaTeX, SVG, and more!

10.12.2024 18:34 | 👍 16    🔁 8    💬 1    📌 2
