Danny To Eun Kim

@teknology.bsky.social

PhD student @CMU LTI NLP | IR | Evaluation | RAG https://kimdanny.github.io

1,035 Followers  |  422 Following  |  22 Posts  |  Joined: 12.11.2024  |  2.0889

Latest posts by teknology.bsky.social on Bluesky


We tend to conflate "autonomy" with "reliability" in AI agents. But autonomy without trust is catastrophically dangerous.

Our new paper formalizes uncertainty quantification (UQ) for LLM agents and proposes a new lens: agent uncertainty as a conditional uncertainty reduction process.
📄 huggingface.co/papers/2602....

07.02.2026 16:33 | 👍 4    🔁 1    💬 1    📌 0

🎭 How do LLMs (mis)represent culture?
🧮 How often?
🧠 Misrepresentations = missing knowledge? spoiler: NO!

At #CHI2026 we are bringing ✨TALES✨ a participatory evaluation of cultural (mis)reps & knowledge in multilingual LLM-stories for India

📜 arxiv.org/abs/2511.21322

1/10

02.02.2026 21:38 | 👍 45    🔁 21    💬 1    📌 2

#ChatGPT has begun putting ads in its responses.
Check out our paper on “how fair ranking can positively impact the LLM response and content/ad exposure”.
dl.acm.org/doi/10.1145/...

17.01.2026 06:20 | 👍 4    🔁 0    💬 0    📌 0

#ChatGPT has begun putting ads in its responses.
Check out our paper on “Ads detection and integration in the era of LLMs”.
ceur-ws.org/Vol-4038/pap...

17.01.2026 06:16 | 👍 1    🔁 0    💬 0    📌 0

as AI increasingly supports shopping and ads, it’s worth remembering that retrieval often shapes who gets exposure in final generated output. in a recent paper, @teknology.bsky.social uses methods from fair ranking to assess and address exposure bias in downstream generation.

841.io/doc/fairrag....

31.12.2025 14:00 | 👍 9    🔁 3    💬 0    📌 1
advertisement generation and detection in RAG

Excited to present at #CLEF2025 #Touché Lab (Session 2) shared task "Advertisement in RAG"🇪🇸!
@webis.de
🗓️Sept 9 (Tue)
⏲️5:20PM (CEST) / 11:20AM (EDT)
📍Florentino Sanz Room
🧠 https://arxiv.org/abs/2507.00509
Join us for insights on #RAG + advertising!

09.09.2025 00:02 | 👍 1    🔁 0    💬 0    📌 1
An aerial view of Tokyo at night with lots of lights

Some exciting news! 🤗 After 3 amazing years at TREC, the Tip-of-the-Tongue (ToT) shared task will be a core task at NTCIR-19 in 2026. The new track will focus on tip-of-the-tongue information needs in English and East Asian languages.

More details coming soon. See you all in Tokyo next year!

01.09.2025 16:12 | 👍 5    🔁 3    💬 0    📌 0

Gentle reminder 📢
All run submissions for the Tip-of-the-Tongue (ToT) Track are due next Wednesday (Aug 27).

More info: trec-tot.github.io/guidelines
#TREC2025 #TRECToT #TREC2025ToT

19.08.2025 16:45 | 👍 2    🔁 2    💬 0    📌 1

This year's TREC Tip of the Tongue (ToT) track will be amazing! Based on our rigorous experiments on synthetic ToT query generation presented at #SIGIR2025, we extended the track to open-domain ToT queries.
We provide code for baseline systems, and submissions are due by August 27th!

04.08.2025 17:52 | 👍 1    🔁 1    💬 0    📌 0

To Eun Kim just presented the work on "Tip of the Tongue Query Elicitation for Simulated Evaluation" at #SIGIR2025. The approach will be used in the #TREC2025 Tip-of-the-Tongue track, and we had some sweets at the poster :)

The paper is available online: dl.acm.org/doi/10.1145/...

15.07.2025 14:30 | 👍 12    🔁 3    💬 0    📌 0

Hello TREC-ToTers!

We have released the test queries for the TREC 2025 Tip-of-the-Tongue (TREC-ToT) Track. Please see the guidelines for more information: trec-tot.github.io/guidelines. Run submission deadline will tentatively be in August. #TREC2025 #TRECToT #TREC2025ToT

Please spread the word!

13.07.2025 16:47 | 👍 3    🔁 3    💬 0    📌 1

❓How do LLMs respond to fair ranking in RAG?
🤩 See how fair ranking boosts downstream utility while promoting fairer attribution of cited sources.
Catch our oral presentation at #ICTIR2025!
#SIGIR2025 @841io.bsky.social

12.07.2025 13:32 | 👍 7    🔁 0    💬 0    📌 1
Dory from Finding Nemo with the quote: "I remember it like it was yesterday. Of course, I don't remember yesterday."

Do not forget to participate in the #TREC2025 Tip-of-the-Tongue (ToT) Track :)

The corpus and baselines (with run files) are now available and easily accessible via the ir_datasets API and the HuggingFace Datasets API.

More details are available at: trec-tot.github.io/guidelines

27.06.2025 14:46 | 👍 11    🔁 7    💬 0    📌 0
An overview of the work “Research Borderlands: Analysing Writing Across Research Cultures” by Shaily Bhatt, Tal August, and Maria Antoniak. The authors survey and interview interdisciplinary researchers (§3) to develop a framework of writing norms that vary across research cultures (§4) and operationalise them using computational metrics (§5). They then use this evaluation suite for two large-scale quantitative analyses: (a) surfacing variations in writing across 11 communities (§6); (b) evaluating the cultural competence of LLMs when adapting writing from one community to another (§7).

🖋️ Curious how writing differs across (research) cultures?
🚩 Tired of “cultural” evals that don't consult people?

We engaged with interdisciplinary researchers to identify & measure ✨cultural norms✨ in scientific writing, and show that ❗LLMs flatten them❗

📜 arxiv.org/abs/2506.00784

[1/11]

09.06.2025 23:29 | 👍 72    🔁 30    💬 1    📌 5
TREC 2025 Tip-of-the-Tongue (ToT) Track Tip of the tongue: The phenomenon of failing to retrieve something from memory, combined with partial recall and the feeling that retrieval is imminent.

Hello TREC-ToTers! 👋🏽

Excited to announce the release of the TREC 2025 Tip-of-the-Tongue (TREC-ToT) Track guidelines: trec-tot.github.io/guidelines. We will release test queries in July, and the run submission deadline will be in August. #TREC2025 #TRECToT #TREC2025ToT

Please register to participate:

09.05.2025 21:02 | 👍 4    🔁 2    💬 0    📌 1

Related paper here!

bsky.app/profile/841i...

29.04.2025 21:29 | 👍 0    🔁 0    💬 0    📌 0

Ever trusted a metric that works great on average, only for it to fail in your specific use case?

In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think.

📄 aclanthology.org/2025.finding...
#NLP #Evaluation

(🧵1/9)

29.04.2025 17:10 | 👍 23    🔁 5    💬 1    📌 2
Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation Modern language models frequently include retrieval components to improve their outputs, giving rise to a growing number of retrieval-augmented generation (RAG) systems. Yet, most existing work in RAG...

If you're interested in OpenAI including shopping results, you might also be interested in @teknology.bsky.social's paper relating retrieval diversity/fairness and generation by downstream RAG models. This has implications for individuals selling products online.
arxiv.org/abs/2409.11598

28.04.2025 19:34 | 👍 9    🔁 2    💬 0    📌 1

If you're working on a recall-oriented task or with ranking systems evaluated across varied users, content, or intents, check it out. 5/5

dl.acm.org/doi/10.1145/...

07.04.2025 16:15 | 👍 1    🔁 2    💬 0    📌 0
A Venn diagram showing that recall and robustness, each of which has many different conceptions, intersect when thinking about recall as "totality" and robustness as "worst-case performance". It's in this intersection that lexicographic recall (lexirecall) lives.

📢 New Paper: "Recall, Robustness, and Lexicographic Evaluation" (ACM TORS)
F Diaz, M Ekstrand (@md.ekstrandom.net), B Mitra (@bmitra.bsky.social)

For IR, NLP, and ML researchers working on ranking systems evaluated for recall and robustness. 🧵 1/5 dl.acm.org/doi/10.1145/...

07.04.2025 16:15 | 👍 14    🔁 6    💬 1    📌 0
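
A rough sketch of what a "worst-case first" lexicographic comparison of two rankings might look like, for readers who want something concrete. This is one reading of the idea in the image description above, not the paper's definition: the function names, the tie handling, and the assumption that both rankings contain every relevant item are all illustrative choices.

```python
def relevant_positions_deepest_first(ranking, relevant):
    """1-based ranks of the relevant items, deepest (worst) position first."""
    positions = [i for i, doc in enumerate(ranking, start=1) if doc in relevant]
    if len(positions) != len(relevant):
        # Assumption of this sketch: every relevant item appears in the ranking.
        raise ValueError("ranking must contain every relevant item")
    return sorted(positions, reverse=True)


def lexi_compare(ranking_a, ranking_b, relevant):
    """Return 1 if A is preferred, -1 if B is preferred, 0 if tied.

    Compare the worst-placed relevant item first; on a tie, fall back to
    the next-worst, and so on (a lexicographic comparison).
    """
    pos_a = relevant_positions_deepest_first(ranking_a, relevant)
    pos_b = relevant_positions_deepest_first(ranking_b, relevant)
    for a, b in zip(pos_a, pos_b):
        if a != b:
            return 1 if a < b else -1
    return 0


# Hypothetical toy data: two relevant documents, two rankings.
relevant = {"d1", "d2"}
ranking_a = ["d1", "x", "d2", "y"]   # worst relevant item at rank 3
ranking_b = ["d1", "d2", "x", "y"]   # worst relevant item at rank 2
preference = lexi_compare(ranking_a, ranking_b, relevant)
```

Here `ranking_b` is preferred because its worst-placed relevant document sits higher, even though both rankings put `d1` first.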

Here's an overview of TREC 2024 TOT track runs with the test queries:
trec.nist.gov/pubs/trec33/...

07.03.2025 16:29 | 👍 0    🔁 0    💬 0    📌 0

Yes! That's exactly the case of TOT retrieval for academics :)

05.03.2025 22:08 | 👍 0    🔁 0    💬 0    📌 0
Overview Tip of the tongue: The phenomenon of failing to retrieve something from memory, combined with partial recall and the feeling that retrieval is imminent.

These approaches powered the TREC 2024 TOT track test queries and will continue into the 2025 track (trec-tot.github.io).
Joyful collaboration with Yifan He, @841io.bsky.social, Jaime Arguello, and @bmitra.bsky.social!

#SIGIR #TREC #TOT

05.03.2025 01:37 | 👍 4    🔁 2    💬 1    📌 0
GitHub - kimdanny/llm-tot-query-elicitation Contribute to kimdanny/llm-tot-query-elicitation development by creating an account on GitHub.

📂 Our Code & Data

🔗LLM-Elicitation: github.com/kimdanny/llm...
🔗Human query collection interface with visual stimuli set: github.com/kimdanny/hum...

05.03.2025 01:36 | 👍 2    🔁 1    💬 1    📌 0

⚡️Multi-Domain Coverage
Combining both methods allows TOT query evaluation in multiple domains. We tested simulated evaluation in Movie, Landmark, and Person domains. Moreover, we build a broader, more inclusive TOT test collection.

05.03.2025 01:36 | 👍 2    🔁 1    💬 1    📌 0
Human TOT query elicitation interface

Solution2️⃣: Human-Elicitation
We designed an interface with visual prompts to induce a TOT state in human participants. Their queries closely match authentic TOT queries and capture genuine TOT experiences in a controlled setting.

05.03.2025 01:35 | 👍 3    🔁 1    💬 1    📌 0
System rank correlation as a validation method for synthetic TOT queries.

Solution1️⃣: LLM-Elicitation
We built a TOT user simulator to produce synthetic queries. Results show high system rank correlation and linguistic similarity compared to real queries. This scalable simulated evaluation method overcomes data scarcity by simulating new queries on demand.

05.03.2025 01:35 | 👍 3    🔁 1    💬 1    📌 0
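
For readers unfamiliar with system rank correlation as a validation check, here is a minimal sketch. It assumes the check means scoring the same retrieval systems on real and on synthetic queries, then asking whether the two query sets order the systems the same way. It uses Kendall's tau (tau-a, no tie correction) as the correlation; the post does not say which coefficient was used. The system names and scores are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} dicts over shared systems."""
    systems = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both score sets order the pair (s, t) the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical effectiveness scores for four systems (placeholder numbers).
real = {"bm25": 0.12, "dense": 0.21, "rerank": 0.30, "llm": 0.34}
synthetic = {"bm25": 0.15, "dense": 0.19, "rerank": 0.33, "llm": 0.31}

tau = kendall_tau(real, synthetic)
```

In this toy data the two query sets disagree on only one of the six system pairs, so `tau` is high. A value near 1 would suggest the synthetic queries rank systems much as the real ones do.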

🤔Why the Problem?
TOT query data collection relies heavily on community question answering websites (e.g., Reddit). This causes data availability issues and domain bias (most TOT queries end up being about movies or books).

05.03.2025 01:33 | 👍 4    🔁 1    💬 1    📌 0

👅Tip-of-the-Tongue (TOT) search is a complex form of known-item search, shaped by the expression of partial recall, personal context, and uncertain memories. However, TOT research has long been hindered by the scarcity of high-quality TOT queries.

05.03.2025 01:33 | 👍 5    🔁 1    💬 1    📌 0
Tip of the Tongue Query Elicitation for Simulated Evaluation Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scena...

🚨New Breakthrough in Tip-of-the-Tongue (TOT) Retrieval Research!

We address data limitations and offer a fresh evaluation method for these complex queries.

Curious how TREC TOT track test queries are created? Check out this thread 🧵 and our paper 📄: arxiv.org/abs/2502.17776

05.03.2025 01:32 | 👍 17    🔁 7    💬 2    📌 1

@teknology is following 20 prominent accounts