Shikhar Murty

@shikharmurty.bsky.social

Final-year PhD student in Computer Science @Stanford. Work on:
- Compositionality, syntax (language structure)
- Web agents: synthetic data, tree search, exploration (language interpretation)

469 Followers  |  124 Following  |  24 Posts  |  Joined: 19.11.2024

Latest posts by shikharmurty.bsky.social on Bluesky

“casual interception” as defined in \citep{}…

14.02.2025 23:41 | 👍 2  🔁 0  💬 0  📌 0

Ever dreamed of AI agents learning through unsupervised interaction with the open world? Our latest preprint introduces NNetNav-Live, which collects training data through exploration on real websites and hindsight labeling, producing a SOTA OSS agent.

06.02.2025 19:22 | 👍 4  🔁 2  💬 1  📌 0

controlling a browser / computer!
but requires a bit more tooling to set it up.

06.02.2025 19:00 | 👍 1  🔁 0  💬 0  📌 0

Please check out our paper for more details: arxiv.org/pdf/2410.02907

And our code if you want a NNetNav-ed model for your own domain:
github.com/MurtyShikhar...

Done with collaborators: @zhuhao.me, Dzmitry Bahdanau and @chrmanning.bsky.social

06.02.2025 17:42 | 👍 0  🔁 0  💬 0  📌 0

We find that cross-website robustness is limited, and performance almost always goes up when in-domain NNetNav data is incorporated. This makes it even more important to work on unsupervised learning for agents - how are you going to collect human data for *any* website? [6/n]

06.02.2025 17:42 | 👍 1  🔁 0  💬 1  📌 0

We use this data to fine-tune (SFT) Llama-3.1-8B. Our best models outperform zero-shot GPT-4 on both WebArena and WebVoyager, and reach SoTA performance among unsupervised methods on both benchmarks [5/n]

06.02.2025 17:42 | 👍 0  🔁 0  💬 1  📌 0
Link preview: stanfordnlp/nnetnav-live · Datasets at Hugging Face

We use NNetNav to collect around 10k workflows for over 20 websites including 15 live websites, and 5 self-hosted websites.

Data is available on 🤗: huggingface.co/datasets/sta...
huggingface.co/datasets/sta...
[4/n]

06.02.2025 17:42 | 👍 0  🔁 0  💬 1  📌 0

Main ideas behind NNetNav exploration:
1. Complex goals have intermediate subgoals, so complex trajectories must have meaningful sub-trajectories.
2. Use an LM instruction relabeler + judge to test whether the trajectory so far is meaningful. If yes, continue exploring; otherwise, prune. [3/n]
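
The two ideas in this post can be sketched roughly as a single exploration loop. This is only a minimal illustration under stated assumptions, not the NNetNav implementation: `explore_with_pruning`, `check_every`, and the toy policy, relabeler, and judge are all hypothetical names, and in the real system the relabeler and judge would be LM calls rather than string rules. Saving a demonstration at every checkpoint is also an assumption made for the sketch.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]


def explore_with_pruning(
    propose_action: Callable[[Trajectory], str],
    relabel: Callable[[Trajectory], str],
    is_meaningful: Callable[[Trajectory, str], bool],
    max_steps: int = 8,
    check_every: int = 2,
) -> List[Tuple[Trajectory, str]]:
    """Run one exploration episode, pruning once the prefix stops being meaningful.

    Returns (sub-trajectory, hindsight instruction) pairs, one per checkpoint
    that the judge accepted.
    """
    trajectory: Trajectory = []
    demos: List[Tuple[Trajectory, str]] = []
    for step in range(1, max_steps + 1):
        trajectory.append(propose_action(trajectory))
        if step % check_every != 0:
            continue
        # Hindsight labeling: describe what the trajectory so far accomplishes.
        instruction = relabel(trajectory)
        if not is_meaningful(trajectory, instruction):
            break  # prune: stop exploring from this prefix
        demos.append((list(trajectory), instruction))
    return demos


# Toy stand-ins (assumptions for illustration only).
SCRIPT = ["click search box", "type 'shoes'", "press enter", "scroll", "scroll", "scroll"]


def toy_policy(traj: Trajectory) -> str:
    return SCRIPT[len(traj)] if len(traj) < len(SCRIPT) else "noop"


def toy_relabel(traj: Trajectory) -> str:
    return "Goal reached by: " + " -> ".join(traj)


def toy_judge(traj: Trajectory, instruction: str) -> bool:
    # Treat a repeated action as a sign the episode has stopped making progress.
    return len(traj) < 2 or traj[-1] != traj[-2]


demos = explore_with_pruning(toy_policy, toy_relabel, toy_judge, max_steps=6, check_every=2)
for sub_trajectory, instruction in demos:
    print(len(sub_trajectory), instruction)
```

In this toy run the judge accepts the prefixes of length 2 and 4 and prunes at length 6, so each accepted prefix becomes its own labeled demonstration.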

06.02.2025 17:42 | 👍 0  🔁 0  💬 1  📌 0
Post image

NNetNav uses a structured exploration method to efficiently search and collect traces on live websites, which are retroactively labeled into instructions, finding a strikingly diverse set of workflows for any website (e.g., this plot) [2/n]

06.02.2025 17:42 | 👍 0  🔁 0  💬 1  📌 0

Want to make a browser agent for *any* domain like banking or healthcare?
We propose methods for training LLMs with open-ended, unsupervised interaction on live websites:
✅ OSS SoTA on WebVoyager
✅ world's smallest high-performing web-agent
Try it here: nnetnav.dev

06.02.2025 17:42 | 👍 9  🔁 2  💬 1  📌 0

going to stay off twitter for my own mental health. something has gone horribly wrong with that platform.

28.12.2024 22:07 | 👍 5  🔁 0  💬 0  📌 0

Couldn't make it to NeurIPS due to work, but do check out our workshop happening in West Ballroom B. Lots of cool things to come, including a very fun panel!

15.12.2024 20:29 | 👍 2  🔁 0  💬 0  📌 0

Come visit our poster "MoEUT: Mixture-of-Experts Universal Transformers" on Friday at 4:30 in East Exhibit Hall A-C #1907 at #NeurIPS2024. With Kazuki Irie, Jürgen Schmidhuber, Christopher Potts and @chrmanning.bsky.social.

12.12.2024 22:46 | 👍 14  🔁 5  💬 1  📌 0
Link preview: NeurIPS 2024 Tutorials

The extraordinary recent takeover of ML/AI by #NLP is well-known but insufficiently reflected on.

Look at the @neuripsconf.bsky.social tutorials in 2024!

neurips.cc/virtual/2024...

14 tutorials; 6 have "LLM" in the title; 4 more cover foundation models, with large NLP coverage. That's > 70% 😲

09.12.2024 19:29 | 👍 64  🔁 14  💬 1  📌 0

🚨 Thrilled to share that Compositional Generalization Across Distributional Shifts with Sparse Tree Operations received a spotlight award at #NeurIPS2024! 🌟 I'll present a poster on Tuesday and give an invited lightning talk at the System 2 Reasoning Workshop on Sunday. 🧵👇

09.12.2024 15:06 | 👍 12  🔁 4  💬 1  📌 1
Image description: AgentLab diagram.

The image describes AgentLab, a framework for efficient parallel experiments with agents. It highlights:

Core Agent Features: Dynamic Prompting and a Unified LLM API for interacting with large language models.

BrowserGym Platform: A tool for testing agents on benchmarks like WebArena, WorkArena, MiniWoB, and others.

Key Features: Reproducibility, a Unified Leaderboard, an analysis tool called Xray, and a Dataset for sharing agent traces.

Blue elements represent AgentLab components.

🧵-1
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.

03.12.2024 21:02 | 👍 18  🔁 15  💬 2  📌 0

Folks, I'm not going to be at NeurIPS this year, but we have an *awesome* workshop that i'm super proud of.

Go attend, and use the link below to ask all of your burning questions about LLM reasoning, agents and compositionality!

03.12.2024 19:45 | 👍 1  🔁 0  💬 0  📌 0

🎊 Excited for #neurips2024 and our "System 2 Reasoning at Scale" workshop. We have an exciting lineup of speakers who will answer your most burning questions about AI and reasoning 🚀

🔥 Got spicy questions? Submit & vote here:
app.sli.do/event/dJNU63...

03.12.2024 17:43 | 👍 4  🔁 3  💬 1  📌 1

I also wear the AI agents researcher hat. Can't say i'm similarly impressed by reviewers in that community...

27.11.2024 23:32 | 👍 1  🔁 0  💬 0  📌 0

ACL syntax track reviewers >> almost any other conference.

These folks care about their sub-field and i learn something new every time!

27.11.2024 19:44 | 👍 12  🔁 2  💬 1  📌 1

Now, reviewers are upset if we only finetune sub-10B parameter models!

26.11.2024 22:28 | 👍 0  🔁 0  💬 1  📌 0

for more context: we are training the probe on sentences from PTB / BLiMP

25.11.2024 05:52 | 👍 1  🔁 0  💬 0  📌 0

thx for sharing, though semantic parsing almost certainly benefits from modeling syntax :)

25.11.2024 03:49 | 👍 1  🔁 0  💬 1  📌 0

SRL probe still rewards hidden states that model dependency relations, no? would like a probe that's agnostic to how well the underlying network models syntax

24.11.2024 22:38 | 👍 1  🔁 0  💬 1  📌 0

could i get added? thx for making this!!

24.11.2024 05:25 | 👍 2  🔁 0  💬 0  📌 0

What is a probing task that is purely about semantics?
Context: I have a probe trained to predict dependency relations, and would like to train another one on a semantics-only task (for research purposes)

24.11.2024 05:00 | 👍 5  🔁 1  💬 3  📌 0

To be fair, after some prompt engineering:

German:
(S
(NP (DT Der) (NN Mann))
(VP (VB mag)
(NP (JJ schwarze) (NNS Katzen))))

Japanese:
(S
(NP (NN Otoko) (PP wa))
(VP
(NP (JJ kuro) (NN neko) (PP ga))

21.11.2024 06:04 | 👍 3  🔁 0  💬 0  📌 0
Post image

Asked GPT-4o to draw parse trees in two languages:

21.11.2024 05:49 | 👍 5  🔁 0  💬 1  📌 0

Hot take (since it's still just friends on this platform):

It's crazy how the classic "sample and rerank" baseline from machine translation and IR got re-branded as "scaling up inference-time compute".
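
For readers who have not met the baseline, "sample and rerank" amounts to drawing several candidates and keeping the one a scorer likes best. The sketch below is only an illustration under stated assumptions: `sample_and_rerank`, `toy_sampler`, and `toy_score` are hypothetical names, and in practice the sampler would be a model and the scorer a reranker or reward model.

```python
import random
from typing import Callable, List, Tuple


def sample_and_rerank(
    sample: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Draw n candidates from the sampler and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    best = max(candidates, key=score)
    return best, candidates


# Toy stand-ins (assumptions for illustration only).
def toy_sampler(rng: random.Random) -> str:
    return rng.choice(
        ["the cat", "the black cat", "a cat sat", "the man likes black cats"]
    )


def toy_score(text: str) -> float:
    # Pretend longer outputs are better; a real scorer would be a learned model.
    return float(len(text.split()))


best, candidates = sample_and_rerank(toy_sampler, toy_score, n=16)
print(best)
```

Raising `n` is the only knob here, which is the sense in which more inference-time compute buys a better pick from the same sampler.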

21.11.2024 05:06 | 👍 14  🔁 0  💬 1  📌 0

nothing but blue skies, for posting puns

20.11.2024 22:54 | 👍 4  🔁 0  💬 1  📌 0
