
Harsh Trivedi

@harsh3vedi.bsky.social

🤖 Building AI agents & interactive environments: 🌍 AppWorld (https://appworld.dev) #NLProc PhD @stonybrooku. Past intern Allen AI & visitor CILVR at NYU. 🐦 https://x.com/harsh3vedi 🌐 https://harshtrivedi.me/

775 Followers  |  231 Following  |  7 Posts  |  Joined: 18.11.2024

Latest posts by harsh3vedi.bsky.social on Bluesky


Our AI & Scientific Discovery Workshop (@ NAACL 2025) broadly welcomes papers on all aspects of the scientific discovery process through the lens of AI / NLP.

Paper submission deadline: Jan 30, 2025 (in about 2 weeks).
We're excited to see you there!

15.01.2025 21:08 — 👍 3    🔁 1    💬 0    📌 3

Hey Marc! Thanks for this starter pack. Can you please add me to it as well?

29.11.2024 15:57 — 👍 7    🔁 0    💬 0    📌 0
AppWorld: Reliable Evaluation of Interactive Agents in a World of Apps and People. Happening at 11 AM EST online on Dec 2, 2024


🚨 Happening next Monday, 2 Dec, @cohere.com! ✨
👋 Anyone can join remotely at this link:
👉 cohere.com/events/coher...
🙏 Thank you @sebruder.bsky.social for helping arrange it!!
📅 Upcoming talks: appworld.dev/talks

29.11.2024 15:43 — 👍 8    🔁 1    💬 0    📌 0
A plot: the x-axis is the baseline score of rankers, in nDCG@10; the y-axis is the delta in model score after an expansion is applied.

There are three sets of results, one dataset for each shift type: TREC DL (no shift), FiQA (domain shift), and ArguAna (query shift). For each set of results, the chart shows a scatter plot with a trend line. We observe the same trend for all: as the baseline score increases, the delta from using expansion decreases.

On TREC DL, the worst models have a base score of ~40 and improve by 10 points with expansion; the best models have a score of >70, and their performance decreases by 5 points with expansion.

On FiQA, the worst models have a base score of ~15 and improve by 5 points with expansion; the best models have a score of ~45, and their performance decreases by 3 points with expansion.

On ArguAna, the worst models have a base score of ~25 and improve by >20 points with expansion; the best models have a score of >55, and their performance decreases by 1 point with expansion.
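The negative relationship the plot describes can be sanity-checked numerically. A minimal sketch, using only the approximate endpoint values quoted in the alt text above (not the actual per-model data), computes the slope of each trend line:

```python
# Approximate (baseline nDCG@10, delta-with-expansion) endpoints for the
# worst and best models on each dataset, taken from the plot description.
endpoints = {
    "TREC DL": [(40, 10), (70, -5)],   # no shift
    "FiQA":    [(15, 5),  (45, -3)],   # domain shift
    "ArguAna": [(25, 20), (55, -1)],   # query shift
}

def trend_slope(points):
    """Slope of the line through the two (baseline, delta) endpoints."""
    (x1, y1), (x2, y2) = points
    return (y2 - y1) / (x2 - x1)

slopes = {name: trend_slope(pts) for name, pts in endpoints.items()}
for name, slope in slopes.items():
    print(f"{name}: slope = {slope:.2f}")
# Every slope is negative: gains from expansion shrink, and eventually
# turn into losses, as the baseline ranker gets stronger.
```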


Using LLMs for query or document expansion in retrieval (e.g., HyDE and Doc2Query) has scores going 📈

But do these approaches work for all IR models and for different types of distribution shifts? Turns out it's actually more 📉 🚨

πŸ“ (arxiv soon): orionweller.github.io/assets/pdf/L...

15.09.2023 18:57 — 👍 42    🔁 6    💬 3    📌 3

Great opportunity to see how (your) new coding agent methods stack up on real-world user tasks

21.11.2024 23:51 — 👍 3    🔁 1    💬 0    📌 0

Meet Tülu 3, a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms.
We invented new methods for fine-tuning language models with RL and built upon best practices to scale synthetic instruction and preference data.
Demo, GitHub, paper, and models 👇

21.11.2024 17:15 — 👍 111    🔁 31    💬 2    📌 7

another starter pack, this time for folks (past & current) from Ai2 (@ai2.bsky.social) 😍

go.bsky.app/Qjyc97J

21.11.2024 16:10 — 👍 22    🔁 5    💬 2    📌 0

I thought I'd create a Starter Pack for people working on LLM Agents. Please feel free to self-refer as well.

go.bsky.app/LUrLWXe

#LLMAgents #LLMReasoning

20.11.2024 14:08 — 👍 15    🔁 5    💬 11    📌 0

🚨 We are refreshing the 🌎 AppWorld (appworld.dev) leaderboard with all the new coding and/or tool-use LMs.

❓ What would you like to see included?

🔌 Self-plugs are welcome!!

x.com/harsh3vedi/s...

21.11.2024 14:11 — 👍 7    🔁 2    💬 0    📌 1

Hi Nikolai! Mind adding me to this starter pack? Thanks!

21.11.2024 13:05 — 👍 1    🔁 0    💬 1    📌 0
EMNLP 2024 Tutorial: Language Agents: Foundations, Prospects, and Risks

Had a great time doing the language agent tutorial (language-agent-tutorial.github.io) with Yu Su, Shunyu Yao, and Tao Yu 😀 #EMNLP2024

Check out our slides here: tinyurl.com/language-age...

18.11.2024 18:28 — 👍 33    🔁 5    💬 0    📌 0

Hi! Can you please add me to this list? Thank you!

18.11.2024 22:56 — 👍 1    🔁 0    💬 0    📌 0

Hi Michael! Can you please add me to this list? Thank you!

18.11.2024 22:55 — 👍 2    🔁 0    💬 0    📌 0

Hi Maria! Can you please add me to the list? Thank you!

18.11.2024 22:53 — 👍 0    🔁 0    💬 0    📌 0
