
Sergey Feldman

@sergeyf.bsky.social

ML/AI at AI2 http://semanticscholar.org, http://alongside.care, http://data-cowboys.com

311 Followers  |  405 Following  |  34 Posts  |  Joined: 01.07.2023

Latest posts by sergeyf.bsky.social on Bluesky

Screenshot of the Ai2 Paper Finder interface

Meet Ai2 Paper Finder, an LLM-powered literature search system.

Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍

26.03.2025 19:07 — 👍 117    🔁 23    💬 6    📌 9
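The iterative workflow described above roughly corresponds to a loop like the one below. This is a minimal sketch under my own assumptions, NOT Ai2 Paper Finder's actual implementation; `search`, `judge`, and `rewrite` are hypothetical callables that a real system would back with retrieval and an LLM.

```python
from typing import Callable

# Minimal sketch of an iterative literature-search loop. Hypothetical
# stand-ins, not Ai2 Paper Finder's implementation.
def find_papers(
    question: str,
    search: Callable[[str], list[str]],        # query -> candidate papers
    judge: Callable[[str, str], bool],         # (question, paper) -> relevant?
    rewrite: Callable[[str, list[str]], str],  # refine query from results
    rounds: int = 3,
) -> list[str]:
    found: list[str] = []
    query = question
    for _ in range(rounds):
        for paper in search(query):
            if paper not in found and judge(question, paper):
                found.append(paper)
        # Reformulate based on what has been found so far, the way a
        # researcher iterates on their search terms.
        query = rewrite(question, found)
    return found
```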
Ai2 ScholarQA logo, with a red sign that says "Updated!"

Hope you're enjoying Ai2 ScholarQA as your literature review helper 🥳 We're excited to share some updates:

🗂️ You can now sign in via Google to save your query history across devices and browsers.
📚 We added 108M+ paper abstracts to our corpus - expect to get even better responses!

More below…

05.03.2025 18:21 — 👍 11    🔁 4    💬 1    📌 0
Ai2 ScholarQA logo

Can AI really help with literature reviews? 🧐
Meet Ai2 ScholarQA, an experimental solution that allows you to ask questions that require multiple scientific papers to answer. It gives more in-depth and contextual answers with table comparisons and expandable sections 💡
Try it now: scholarqa.allen.ai

21.01.2025 19:30 — 👍 33    🔁 12    💬 1    📌 6
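For flavor, here is a generic retrieve-then-synthesize sketch of the kind of multi-paper question answering the post describes. This is NOT ScholarQA's actual pipeline; `retrieve` and `llm` are hypothetical stand-ins.

```python
from typing import Callable

# Generic retrieve-then-synthesize sketch of multi-paper QA; not
# ScholarQA's actual pipeline. `retrieve` and `llm` are stand-ins.
def multi_paper_answer(
    question: str,
    retrieve: Callable[[str, int], list[str]],  # passages from many papers
    llm: Callable[[str], str],
    k: int = 20,
) -> str:
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the numbered excerpts below, "
        f"citing them like [3].\n\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```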
The CEO using AI to fight insurance-claim denials says he wants to remove the 'fearfulness' around getting sick. Claimable has helped patients file hundreds of health-insurance appeals. Its CEO says its success rate of overturning denials is about 85%.

Building on the story I shared yesterday about fighting potential insurance-company AI with AI: Claimable uses AI to tackle insurance-claim denials. With an 85% success rate, it generates tailored appeals via clinical research and policy analysis. 🩺 #HealthPolicy

13.12.2024 16:17 — 👍 15    🔁 6    💬 1    📌 3

(3) They also studied multiple rounds of the above, i.e., iterative self-improvement. Saturation happens after 2 or 3 rounds. I'm surprised it's not 1!

(4) Ensemble heuristic: simple verification-ensemble heuristics can improve performance. (A loop combining (3) and (4) is sketched after this post.)

6/6

13.12.2024 03:35 — 👍 0    🔁 0    💬 0    📌 0
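Putting (3) and (4) together, my reading of the loop is sketched below. All the callables are placeholders for the paper's actual generation, verification, and fine-tuning setups, so treat this as an assumption-laden outline rather than the authors' code.

```python
from typing import Callable, Sequence

# Sketch of iterative self-improvement with an ensemble verifier, as I
# read the paper; every callable is a placeholder for the real setup.
def self_improve(
    model,
    prompts: Sequence[str],
    sample: Callable,               # (model, prompt, n) -> n responses
    verifiers: Sequence[Callable],  # each: (model, prompt, response) -> bool
    finetune: Callable,             # (model, kept (prompt, response) pairs) -> model
    rounds: int = 3,                # accuracy reportedly saturates by round 2-3
):
    for _ in range(rounds):
        kept = []
        for p in prompts:
            for r in sample(model, p, 128):
                # Ensemble heuristic: majority vote across verification styles.
                if sum(v(model, p, r) for v in verifiers) > len(verifiers) / 2:
                    kept.append((p, r))
        model = finetune(model, kept)  # train on self-filtered generations
    return model
```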


(2) CoT Verification is More Stable than MC: "Some MC verification incurs non-positive gap even for medium-sized models such as Qwen-1.5 14/32B, while CoT verification always has a positive gap for medium/large-sized models"

5/n

13.12.2024 03:35 — 👍 0    🔁 0    💬 1    📌 0

Results
(1) Small Models cannot Self-improve: for models such as Qwen-1.5 0.5B, Qwen-2 0.5B, and Llama-2 7B, gap(f) is non-positive for nearly all verification methods, even though these models have non-trivial generation accuracy.

4/n

13.12.2024 03:35 — 👍 1    🔁 0    💬 1    📌 0

(3) Then they compute the gap, which is the average accuracy difference between the filtered generations (those judged correct by self-verification in step 2) and the original 128 responses. (A minimal version is sketched after this post.)

3/n

13.12.2024 03:35 — 👍 0    🔁 0    💬 1    📌 0
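A minimal sketch of that gap computation, assuming boolean correctness and keep masks over the 128 responses; the variable names are mine, not the paper's.

```python
import numpy as np

def verification_gap(is_correct: np.ndarray, is_kept: np.ndarray) -> float:
    """gap = accuracy of self-verified responses - accuracy of all responses.

    Both arguments are boolean arrays over the 128 sampled responses.
    """
    if not is_kept.any():  # the verifier rejected everything
        return float("nan")
    return float(is_correct[is_kept].mean() - is_correct.mean())
```

A positive gap means self-verification keeps better-than-average generations; the results quoted above are stated in terms of whether this quantity is positive.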

(2) For each of the 128 responses, they sample one verification per response in one of 3 styles: (a) correct vs. incorrect, (b) CoT + a score from 1 to 10, or (c) "tournament" style, which you can find in the paper.

2/n

13.12.2024 03:35 — 👍 0    🔁 0    💬 1    📌 0

Super awesome paper that directly addresses questions I've had for a while: arxiv.org/abs/2412.02674

Their experiments:

(1) They get 128 responses from an LLM for some prompt, with top-p = 0.9, temperature = 0.7, a max length of 512 tokens, and 4-shot in-context examples. (A concrete rendering is below.)

1/n

13.12.2024 03:35 — 👍 2    🔁 0    💬 1    📌 0
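As a hypothetical rendering of that setup with Hugging Face transformers, using one of the small models named in the results above; the model choice and prompt text are placeholders, while the decoding knobs come from the post.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical rendering of the sampling setup in (1); model choice and
# prompt contents are placeholders, decoding parameters are from the post.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

prompt = "..."  # 4-shot in-context examples followed by the question
inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,                 # p = 0.9
    temperature=0.7,           # t = 0.7
    max_new_tokens=512,        # max length of 512
    num_return_sequences=128,  # the 128 responses verified in later steps
)
responses = tok.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```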

Check out our #NeurIPS2024 poster (presented by my collaborators Jacob Chen and Rohit Bhattacharya) about "Proximal Causal Inference With Text Data" at 5:30pm tomorrow (Weds)!

neurips.cc/virtual/2024...

11.12.2024 01:10 — 👍 12    🔁 4    💬 1    📌 0

Can you imagine how good BESTERSHIRE sauce tastes?!?!

09.12.2024 20:39 — 👍 20    🔁 2    💬 4    📌 0

Windows has an issue:

Person: fuck this I'm going to Linux

Narrator: and they quickly learned to hate two operating systems.

26.11.2024 16:48 — 👍 9970    🔁 756    💬 368    📌 93

Thanks!

26.11.2024 02:31 — 👍 0    🔁 0    💬 0    📌 0

If you know papers or blog posts that address these, I'd be happy to have the links. Thanks!

22.11.2024 17:59 — 👍 1    🔁 0    💬 0    📌 0

(7) Others found a good recipe for distilling: first fine-tune the biggest model on a small amount of gold data, then use that fine-tuned model to make silver data. Does that work for IR distilling? If we fine-tune a 405B model before using it as the silver-data source, what should we use as gold? And how much gold do I need?

22.11.2024 17:59 — 👍 1    🔁 0    💬 1    📌 0

(6) You can get better LLM labels if you do all-pairs comparisons on the passage set (citation needed, but I read a few papers showing this). Obviously much more expensive. Should I spend my fixed compute/money budget on all-pairs O(few_queries * passages^2) or pointwise O(more_queries * passages)? (A back-of-envelope comparison is below.)

22.11.2024 17:59 — 👍 0    🔁 0    💬 1    📌 0
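A back-of-envelope version of that trade-off, with made-up numbers:

```python
# Made-up LLM-call budget to compare the two labeling schemes in (6).
budget = 10_000_000          # total LLM calls we can afford
passages = 100               # passages per query

all_pairs_per_query = passages * (passages - 1) // 2   # 4950 comparisons
pointwise_per_query = passages                          # 100 judgments

print(budget // all_pairs_per_query)   # ~2,020 queries with all-pairs labels
print(budget // pointwise_per_query)   # 100,000 queries with pointwise labels
```

At this passage count, all-pairs buys roughly 50x fewer labeled queries; whether the higher per-query label quality makes up for that is exactly the open question.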

(5) Does the type of base model to be distilled into matter much? Should I distill into roberta-large or some modern 0.5B LM?

22.11.2024 17:59 — 👍 0    🔁 0    💬 1    📌 0

(4) In our experience at AI2, LLM-generated search queries are weirdly out of distribution and non-human in various ways. Does this matter? Do we have to get human queries?

22.11.2024 17:59 — 👍 0    🔁 0    💬 1    📌 0

(3) Can we do better than human-labeled data, since we have no gaps in the labels and can get more data at will?

22.11.2024 17:59 — 👍 0    🔁 0    💬 1    📌 0

(2) How do we distill well? Do we use the same loss functions we used when the gold data came from human labelers? (One candidate loss is sketched below.)

22.11.2024 17:59 — 👍 0    🔁 0    💬 1    📌 0
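One candidate answer to (2), sketched under my own assumptions: rather than the binary losses used with human relevance judgments, distill the LLM's graded scores listwise with a per-query KL loss. This is a common recipe in the retrieval-distillation literature, not necessarily the right one here.

```python
import torch
import torch.nn.functional as F

def listwise_kd_loss(
    student_scores: torch.Tensor,   # (batch, passages_per_query) raw scores
    teacher_scores: torch.Tensor,   # same shape, from the LLM labeler
    temperature: float = 1.0,
) -> torch.Tensor:
    # Match the student's per-query score distribution to the teacher's.
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    student_logp = F.log_softmax(student_scores / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```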

(1) Say I have 10,000 queries and 100 passages/docs for each query, labeled or ranked by the best LLM (with an optimized prompt or fine-tuning): how close can we get to the LLM's performance? The result would be a plot with the number of distilled-model parameters on the x-axis and NDCG vs. the LLM on the y-axis. (A toy version of the metric is below.)

22.11.2024 17:59 — 👍 1    🔁 0    💬 2    📌 0
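A toy version of the y-axis computation in (1), treating the LLM's scores as ground truth; synthetic numbers stand in for real model outputs.

```python
import numpy as np
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)
n_queries, n_passages = 10_000, 100

# Pretend LLM relevance scores: the "gold" ranking for each query.
llm_scores = rng.random((n_queries, n_passages))
# Pretend distilled-model scores: the teacher's scores plus noise.
student_scores = llm_scores + rng.normal(scale=0.1, size=llm_scores.shape)

# One point on the plot: NDCG of the student against the LLM's labels.
print(ndcg_score(llm_scores, student_scores))
```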

Here are some research questions I'd like to get answers to. We are using LLMs to make training data for smaller, portable search or retrieval relevance models. (thread)

22.11.2024 17:59 — 👍 2    🔁 0    💬 1    📌 0
Image of an email from a student asking if sources "from the late 1900s" are acceptable.

I will never recover from this student email.

27.11.2023 21:48 — 👍 9507    🔁 2366    💬 386    📌 464

#mlsky

20.11.2023 23:22 — 👍 0    🔁 0    💬 0    📌 0

www.semanticscholar.org/paper/Ground...

I really like this paper. They study whether LLMs do reasonable things like ask follow-up questions and acknowledge what the users are saying. The answer is "not really".

20.11.2023 23:22 — 👍 1    🔁 1    💬 1    📌 0

bsky.app/profile/did:...

07.11.2023 23:10 — 👍 1    🔁 0    💬 0    📌 0

An actually useful task for GPT-4: formatting my bibliography.

24.10.2023 16:03 — 👍 5    🔁 1    💬 0    📌 0
Do you use ChatGPT or similar tools to complete tasks? : r/ProlificAc

Do crowdworkers use ChatGPT to write their responses? I'm still not sure, but when I asked on Reddit, I got a flood of fascinating responses from the workers themselves, including some practical tips for researchers looking to prevent this.

20.10.2023 17:31 — 👍 13    🔁 6    💬 3    📌 1
