
Aaron Tay

@aarontay.bsky.social

I'm a librarian + blogger from Singapore Management University. Social media, bibliometrics, analytics, academic discovery tech.

3,251 Followers  |  330 Following  |  2,353 Posts  |  Joined: 05.07.2023

Latest posts by aarontay.bsky.social on Bluesky

Really love the Japanese show "The Last Three Times We'll Meet"...

07.12.2025 14:27 — 👍 0    🔁 0    💬 0    📌 0

People who do that end up going the "AI search" or at least semantic search route

06.12.2025 14:50 — 👍 0    🔁 0    💬 0    📌 0

If you worry about huge GPT-style LLMs destroying the environment, then you might be okay with encoder embeddings, because they're usually much smaller than decoder-style GPTs, but any use of GPT-type models in search is a no from you (6)

06.12.2025 12:50 — 👍 0    🔁 0    💬 0    📌 0

I'm not saying there's a right answer here; you have to decide for yourself what exactly you are objecting to. E.g. if you worry "gen AI" is making students lazy, you might be okay with any AI as long as all it does is rank results (5)

06.12.2025 12:48 — 👍 0    🔁 0    💬 1    📌 0

Is it "AI" if, like OpenAlex, it is purely lexical search but some of the metadata is extracted or assigned using "AI" (e.g. assignment of topics)? (4)

06.12.2025 12:45 — 👍 1    🔁 0    💬 1    📌 0

But maybe you go, embeddings from encoder models are OK. But what if I told you the system uses GPT-style LLMs to judge relevance & generate the ranking? Is that where you draw the line? (3)

06.12.2025 12:38 — 👍 2    🔁 0    💬 1    📌 0

If a search engine is not doing lexical search but "semantic search" (dense embedding matching) to create a ranked list of hits, is that "AI"? Many would say it's OK, even though embeddings are transformer-based & trained using "fill in the blank" cloze tests, a cousin of GPT-style autocomplete training (2)

06.12.2025 12:33 — 👍 4    🔁 2    💬 1    📌 0

Watching knee-jerk reactions to Scholar Labs makes me again wonder what exactly an "AI-powered search engine" is. Or how much "AI" can you tolerate before you say "I'm out"? (1)

06.12.2025 12:27 — 👍 7    🔁 0    💬 1    📌 0

Google Scholar is going to be hard to beat as a discovery tool, due to its special access to full text for matching. I love OpenAlex, but their main focus seems to be being an open source of bibliometric data, not a discovery tool. For that to happen, others will need to build on it. But...

06.12.2025 12:23 — 👍 1    🔁 0    💬 1    📌 0

I wouldn't be so sure OpenAlex, Lens and other "open alternatives" are much better for the humanities. The last time I looked at law, it was worse than Google Scholar. Scholarly infrastructure is very journal- & STEM-centric; parsers fail on footnotes etc.

06.12.2025 12:18 — 👍 1    🔁 0    💬 1    📌 0

People will talk about trying OpenAlex, Lens.org etc. Unfortunately it's all the same (probably worse than GS); we live in a very journal-centric academic infrastructure

06.12.2025 12:16 — 👍 0    🔁 0    💬 0    📌 0

I've seen many libguides classify OpenAlex and Lens.org as "AI search engines". Shrugs. Depends on what you define as "AI"

06.12.2025 12:06 — 👍 1    🔁 0    💬 0    📌 0

"AI when used effectively doesn't work like that. It is excellent at asking not only questions we want answered, but the ones we didn't even know we want answered" ... Really insightful (2)

06.12.2025 07:18 — 👍 1    🔁 0    💬 0    📌 0

Our assumption about the claim forms our idea of the type of evidence we are looking for, the retrieval of which often sinks us further into our assumptions...(1)

06.12.2025 07:17 — 👍 2    🔁 1    💬 1    📌 0

Do note that some, like Elicit, don't have this problem because they use learned sparse retrieval methods (think keyword + semantic expansion) that work like keyword search and are stored in an inverted index. That's why they can allow the usual filters support.elicit.com/en/articles/... elicit.com/blog/semanti...

06.12.2025 06:29 — 👍 0    🔁 0    💬 0    📌 0
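The gist of why learned sparse retrieval plays nicely with filters can be sketched with an ordinary inverted index. The term weights and expansion terms below are invented for illustration; real systems train a model (e.g. a SPLADE-style encoder) to produce them, but once the weights exist, scoring and filtering work just like classic keyword search.

```python
from collections import defaultdict

# Hypothetical learned-sparse document representations: each doc maps terms
# (including model-added expansion terms) to weights.
doc_terms = {
    0: {"citation": 1.2, "reference": 0.4},  # "reference" = expansion term
    1: {"embedding": 0.9, "vector": 0.7},
    2: {"citation": 0.8, "impact": 0.5},
}

# Build an ordinary inverted index over the weighted terms.
inverted = defaultdict(list)
for doc_id, terms in doc_terms.items():
    for term, weight in terms.items():
        inverted[term].append((doc_id, weight))

def search(query_terms, doc_filter=None):
    """Score docs by summed term weights; metadata filters apply as usual."""
    scores = defaultdict(float)
    for term, q_weight in query_terms.items():
        for doc_id, d_weight in inverted.get(term, []):
            if doc_filter is None or doc_filter(doc_id):
                scores[doc_id] += q_weight * d_weight
    return sorted(scores, key=scores.get, reverse=True)

results = search({"citation": 1.0})  # ranked doc ids
```

Because candidates come straight out of posting lists, a pre-filter (here a simple `doc_filter` callback) can be applied during scoring with no graph-traversal complications.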

E.g. searching for "diamond rings" with price < $50: 3 million items pass the filter, but brute force over 3M vectors is extremely slow.
But HNSW lands you in expensive-jewelry territory, far from any cheap items, and then you must traverse vast portions of the graph to reach the cheap-items region = hard

06.12.2025 04:21 — 👍 0    🔁 0    💬 1    📌 0

To add on: if your low-correlation query + filter search results in an objectively low number of candidates (high selectivity), say less than 1,000, you might as well skip HNSW & just brute-force the dot products. The problem is when even with low-correlation queries you still get, say, a million candidates (9)

06.12.2025 04:19 — 👍 0    🔁 0    💬 1    📌 0
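That hybrid strategy is easy to sketch: count the filter's survivors first, and only brute-force the dot products when the candidate set is small. The 1,000 threshold, random vectors, and `pub_year` metadata below are illustrative, not from any real system.

```python
import numpy as np

rng = np.random.default_rng(3)
vecs = rng.normal(size=(5000, 32))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)       # unit vectors
pub_year = rng.integers(1990, 2026, size=5000)            # fake metadata

BRUTE_FORCE_THRESHOLD = 1000  # tunable; the post suggests ~1,000

def filtered_search(query, k=10):
    candidates = np.flatnonzero(pub_year > 2024)
    if len(candidates) <= BRUTE_FORCE_THRESHOLD:
        # Highly selective filter: just score every survivor exactly.
        scores = vecs[candidates] @ query
        return candidates[np.argsort(-scores)[:k]]
    # Otherwise fall back to a graph index (ACORN-style search, not shown).
    raise NotImplementedError("large candidate set: use the ANN index")

hits = filtered_search(vecs[0])
```

Here only ~1/36 of docs pass `pub_year > 2024`, so the exact branch runs and returns exact (not approximate) top-k results for free.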

A lot of fascinating things, like "low correlation between filter & query makes things harder". E.g. searching for diamond rings when the filter is for low price is much trickier than when the filter is for high price, because cheap diamond rings are rare. See weaviate.io/blog/speed-u... and www.elastic.co/search-labs/... (8)

06.12.2025 04:13 — 👍 0    🔁 0    💬 1    📌 0

The solution people are using seems to be ACORN. Roughly speaking, when a node, say B, is filtered out, the search will look at all neighbours of B and dynamically calculate closeness. This allows it to create a possible path A to C. You can do clever things like choosing when to trigger such "jumping" (7)

06.12.2025 03:35 — 👍 0    🔁 0    💬 1    📌 0
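A rough sketch of that "jumping" idea on a three-node toy graph. This only captures the neighbour-expansion intuition, not the actual ACORN algorithm, which adds predicate-aware construction and heuristics for when to expand.

```python
# Toy proximity graph: A -> B -> C is the only path by embedding closeness.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
passes_filter = {"A": True, "B": False, "C": True}  # B fails the filter

def expanded_neighbours(node):
    """If a neighbour fails the filter, consider that neighbour's own
    neighbours instead, so the search can 'jump' over the filtered node."""
    out = []
    for n in graph[node]:
        if passes_filter[n]:
            out.append(n)
        else:
            out.extend(m for m in graph[n] if passes_filter[m] and m != node)
    return out

# B is filtered out, but expanding through it lets A see C directly.
assert "C" in expanded_neighbours("A")
```

The cost is extra distance computations per visited node, which is why real implementations are selective about when they trigger the expansion.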

You can create specific HNSW graphs for each filter, but then you would need to maintain x HNSW graphs for x independent filter conditions! (6)

06.12.2025 03:31 — 👍 0    🔁 0    💬 1    📌 0

Or you could just filter out papers in the HNSW graph that don't fit the filter (pub year > 2024), then do HNSW, but then you would have broken connections which can prevent you from finding the right answer. E.g. A -> B -> C in closeness, but B doesn't meet the filter, so A can't "see" C (5)

06.12.2025 03:24 — 👍 0    🔁 0    💬 1    📌 0
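The broken-connection problem in the A -> B -> C example can be shown with a minimal reachability check: remove B before traversal and C becomes invisible from A, even though C itself passes the filter.

```python
# Toy graph: A -> B -> C is the only path by embedding closeness.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

def reachable(graph, start, allowed):
    """Nodes reachable from `start` when traversal may only visit `allowed`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in graph[node] if n in allowed)
    return seen

# With every node allowed, A reaches C through B.
assert "C" in reachable(graph, "A", {"A", "B", "C"})
# Pre-filter B out (it fails pub year > 2024, say) and C is lost.
assert "C" not in reachable(graph, "A", {"A", "C"})
```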

You could just do normal HNSW and check, when it's done, which ones fit pub year > 2024, but many won't fit, so if you want the top 10, how many should you get from the HNSW process? (4)

06.12.2025 03:21 — 👍 0    🔁 0    💬 1    📌 0
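A sketch of that post-filtering strategy, with NumPy brute force standing in for the ANN stage (the `pub_year` metadata and the over-fetch factor of 4 are made up): you fetch more than k candidates and hope enough survive the filter, and there is no principled way to pick the over-fetch factor in advance.

```python
import numpy as np

rng = np.random.default_rng(2)
vecs = rng.normal(size=(500, 32))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit vectors
pub_year = rng.integers(2000, 2026, size=500)         # fake metadata

def post_filter_search(query, k=10, overfetch=4):
    # Stand-in for the ANN stage: grab k * overfetch nearest candidates...
    candidates = np.argsort(-(vecs @ query))[: k * overfetch]
    # ...then apply the filter afterwards and hope enough survive.
    survivors = [i for i in candidates if pub_year[i] > 2024]
    return survivors[:k]

hits = post_filter_search(vecs[0])
```

With only ~1 in 26 docs passing `pub_year > 2024`, 40 candidates typically yield only one or two survivors, which is exactly the "how many should you get?" problem in the post.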

But now imagine you have a filter condition, say pub year > 2024. How can you do both HNSW (which only connects by embedding closeness) and satisfy the filter condition? Say you want the top 10 closest. (3)

06.12.2025 03:00 — 👍 0    🔁 0    💬 1    📌 0

Hierarchical Navigable Small Worlds (HNSW) creates a multi-layer connected graph, with higher layers being more coarse and lower layers more granular, and you traverse the nodes, getting closer and closer to the "closest doc". (3)

share.google/d1qtwozxASfz...

06.12.2025 02:57 — 👍 0    🔁 0    💬 1    📌 0
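The traversal idea can be sketched as a greedy walk on a single proximity graph. This is one flat layer only; real HNSW stacks several such layers (coarse on top, granular below) and builds the graph incrementally and approximately, rather than from exact nearest neighbours as done here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(200, 16))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

# Toy proximity graph: connect each node to its 8 exact nearest neighbours.
sims = vecs @ vecs.T
neighbours = {i: list(np.argsort(-sims[i])[1:9]) for i in range(len(vecs))}

def greedy_search(query, entry=0):
    """Walk the graph, always moving to the neighbour closest to the query,
    stopping at a local optimum (no neighbour is closer than where we are)."""
    current = entry
    while True:
        best = max(neighbours[current], key=lambda n: vecs[n] @ query)
        if vecs[best] @ query <= vecs[current] @ query:
            return current
        current = best

q = rng.normal(size=16)
q /= np.linalg.norm(q)
found = greedy_search(q)
```

Each step strictly increases similarity to the query, so the walk terminates; the multi-layer structure in real HNSW exists largely to give this greedy walk a good entry point.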

Such techniques require converting queries into embeddings and finding the document embeddings that are semantically closest (typically using the dot product). But how do you do that quickly? A common technique is HNSW: Hierarchical Navigable Small World (2)

06.12.2025 02:53 — 👍 0    🔁 0    💬 1    📌 0
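As a baseline for what HNSW speeds up, here is the brute-force version of "semantically closest by dot product" in plain NumPy (random unit vectors standing in for real document embeddings): one dot product per document, which is exact but O(N·d) per query.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 64)).astype(np.float32)  # 1,000 docs
# Normalise so the dot product equals cosine similarity.
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def brute_force_top_k(query_vec, k=10):
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_embeddings @ q       # one dot product per document
    return np.argsort(-scores)[:k]    # indices of the k closest docs

query = rng.normal(size=64).astype(np.float32)
top = brute_force_top_k(query, k=10)
```

At a thousand docs this is instant; at hundreds of millions it is not, which is where approximate indexes like HNSW come in.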

As academic search starts to do semantic search via dense embeddings, you may notice many systems seem to lack pre-filters, or have a limited set compared to their lexical-search versions. This is not a UI issue but a technical issue that is slowly being solved (1)

06.12.2025 02:49 — 👍 4    🔁 1    💬 1    📌 0

Yeah, definitions are always messy, but I doubt anyone who knows even a minimal amount of info retrieval would call BM25 "semantic". I usually don't like to get into definitional debates, but sometimes misunderstanding can lead to wrong behaviour.

04.12.2025 13:16 — 👍 1    🔁 0    💬 1    📌 0

I guess a lot of this is harmless. But sometimes it can have consequences. E.g. if you think Semantic Scholar has semantic search and type in natural language ("is there an open access citation advantage"), you get way fewer hits than you should! Others are not too bad because they drop stop words

04.12.2025 13:05 — 👍 1    🔁 0    💬 0    📌 0

This type of non-Boolean lexical search (the most popular is BM25) is actually common in web search & most contexts outside academic search. Academic search is Boolean + ranking with tf-idf/BM25. E.g. Google up to maybe 10 years ago was mostly lexical. Scite's ranking, I think, is BM25

04.12.2025 12:59 — 👍 1    🔁 1    💬 1    📌 0

Another misconception is that all lexical search = Boolean. So if we see a search doesn't return all the query terms (after stemming), we call it "semantic" or non-lexical. Actually, it's probably just doing BM25/tf-idf retrieval scoring, which gives a score based on token matches but doesn't need all terms to match

04.12.2025 12:56 — 👍 1    🔁 0    💬 1    📌 0
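To make that concrete, here is a minimal BM25 sketch (the toy corpus and the usual k1/b defaults are my own choices, not from any particular engine). A document missing some query terms still gets a score: the missing terms simply contribute nothing, rather than vetoing the document as a Boolean AND would.

```python
import math

# Toy corpus: each doc is a list of tokens.
docs = [
    ["open", "access", "citation", "advantage"],
    ["citation", "analysis", "methods"],
    ["deep", "learning", "for", "search"],
]

k1, b = 1.5, 0.75
N = len(docs)
avgdl = sum(len(d) for d in docs) / N

def idf(term):
    n = sum(1 for d in docs if term in d)  # docs containing the term
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for t in query:
        tf = doc.count(t)
        if tf == 0:
            continue  # missing terms contribute 0; they don't veto the doc
        score += idf(t) * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = ["open", "access", "citation"]
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
```

Doc 0 (all three terms) ranks first, but doc 1 still scores above zero on "citation" alone, which is exactly the behaviour that gets misread as "semantic".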
