hal's Avatar

hal

@harold.bsky.social

part-time poster | researching privacy in/and/of public data @ cornell tech and wikimedia | writing for joinreboot.org

257 Followers  |  176 Following  |  56 Posts  |  Joined: 11.04.2023
Posts Following

Posts by hal (@harold.bsky.social)

Post image

Cool new finding by @cj-robinson.bsky.social: Grok is now submitting the majority of edit requests to Grokipedia.

(As @harold.bsky.social and I found a few months ago, also quoted Grok chats with users on X more than 1,000 times. This is all totally normal.)

www.cjr.org/tow_center/g...

12.02.2026 18:48 β€” πŸ‘ 4    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Sure. AI companies have ALWAYS been training their models on Wikipedia content, which under the free and open access model is available to anyone β€” including AI companies. Agreements like these require AI companies to limit and offset the strain they place on Wikimedia infrastructure.

15.01.2026 18:47 β€” πŸ‘ 4771    πŸ” 1397    πŸ’¬ 40    πŸ“Œ 140
Preview
Elon Musk’s Grokipedia cites neo-Nazi website 42 times: study An analysis by researchers at Cornell University is the first comprehensive look at Grokipedia since Musk launched his project last month.

@davidingram.bsky.social covered @mantzarlis.com and my work on Grokipedia citations for NBC! much more to analyze here

www.nbcnews.com/news/amp/rcn...

20.11.2025 13:26 β€” πŸ‘ 8    πŸ” 5    πŸ’¬ 0    πŸ“Œ 1
the abstract of our paper

the abstract of our paper

we also open source all of our code, data, and embeddings!
paper: arxiv.org/abs/2511.09685
github: github.com/htried/wiki-...
huggingface: huggingface.co/datasets/htr...

17.11.2025 16:10 β€” πŸ‘ 5    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0
comparison of the top 100 most-cited sources on wikipedia and grokipedia

comparison of the top 100 most-cited sources on wikipedia and grokipedia

comparison of article snippets from the "controversial articles" subset with high and low similarity

comparison of article snippets from the "controversial articles" subset with high and low similarity

comparison of article snippets from the "elected officials" subset with high and low similarity

comparison of article snippets from the "elected officials" subset with high and low similarity

this is just the tip of the iceberg, and the paper contains much, much more: analyses of the top 100 domains, article subsets of elected officials and controversial topics, etc etc etc

please give it a read and let me know what you think!

17.11.2025 16:10 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
grok conversation trying to "dig up some dirt on Guy Verhofstadt"

grok conversation trying to "dig up some dirt on Guy Verhofstadt"

Grok conversation about covid conspiracy theories

Grok conversation about covid conspiracy theories

Grok conversation where the user asks "what race do you hate" and "benefits of a racist society"

Grok conversation where the user asks "what race do you hate" and "benefits of a racist society"

Grok conversation about "what ethnicity runs global banking"

Grok conversation about "what ethnicity runs global banking"

we also found troubling instances of β€œauto-citogenesis,” or cases where:
- an X user asks the Grok chatbot something, then publishes the answer
- Grokipedia *cites that answer* without noting that it is a chatbot output
(the attached images are real examples of this)

17.11.2025 16:10 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
similarity between grokipedia and wikipedia articles by topic for 30k randomly selected articles

similarity between grokipedia and wikipedia articles by topic for 30k randomly selected articles

similarity between grokipedia and wikipedia articles by article quality class for 30k randomly selected articles

similarity between grokipedia and wikipedia articles by article quality class for 30k randomly selected articles

- but a random sample of articles shows which topics have been heavily rewritten (history, politics, philosophy, biography) and which haven’t (STEM, sports, movies)
- grokipedia also targeted the wiki articles deemed highest quality for rewrites: the "featured article" and "good article" classes

17.11.2025 16:10 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
graphs showing average article similarity for cc-licensed and non-cc-licensed grokipedia articles to their counterparts of wikipedia, as well as position-based chunk similarity

graphs showing average article similarity for cc-licensed and non-cc-licensed grokipedia articles to their counterparts of wikipedia, as well as position-based chunk similarity

- the primary distinction to make is whether grokipedia pages are cc-licensed or notβ€”non-cc-licensed pages are presumably largely rewritten by grok
- many grokipedia pages (including those without cc licenses) are basically identical to their wiki counterparts, especially short ones

17.11.2025 16:10 β€” πŸ‘ 0    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
graphs showing the proportion of sources of various qualities and the percentage of pages that cite reliable, unreliable, blacklisted, etc. sources

graphs showing the proportion of sources of various qualities and the percentage of pages that cite reliable, unreliable, blacklisted, etc. sources

our paper tries to answer these questions

we find
- grokipedia pages are longer than wiki counterparts, and cite 2x more sources
- but citation standards are more lax than wiki: grok cites stormfront, infowars and many more
- non-CC licensed grokipedia pages increase blacklisted source cites 13x(!)

17.11.2025 16:10 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
abstract of the paper "What did Elon change? A comprehensive analysis of Grokipedia"

Elon Musk released Grokipedia on 27 October 2025 to provide an alternative to Wikipedia, the crowdsourced online encyclopedia. In this paper, we provide the first comprehensive analysis of Grokipedia and compare it to a dump of Wikipedia, with a focus on article similarity and citation practices. Although Grokipedia articles are much longer than their corresponding English Wikipedia articles, we find that much of Grokipedia's content (including both articles with and without Creative Commons licenses) is highly derivative of Wikipedia. Nevertheless, citation practices between the sites differ greatly, with Grokipedia citing many more sources deemed "generally unreliable" or "blacklisted" by the English Wikipedia community and low quality by external scholars, including dozens of citations to sites like Stormfront and Infowars. We then analyze article subsets: one about elected officials, one about controversial topics, and one random subset for which we derive article quality and topic. We find that the elected official and controversial article subsets showed less similarity between their Wikipedia version and Grokipedia version than other pages. The random subset illustrates that Grokipedia focused rewriting the highest quality articles on Wikipedia, with a bias towards biographies, politics, society, and history. Finally, we publicly release our nearly-full scrape of Grokipedia, as well as embeddings of the entire Grokipedia corpus.

abstract of the paper "What did Elon change? A comprehensive analysis of Grokipedia" Elon Musk released Grokipedia on 27 October 2025 to provide an alternative to Wikipedia, the crowdsourced online encyclopedia. In this paper, we provide the first comprehensive analysis of Grokipedia and compare it to a dump of Wikipedia, with a focus on article similarity and citation practices. Although Grokipedia articles are much longer than their corresponding English Wikipedia articles, we find that much of Grokipedia's content (including both articles with and without Creative Commons licenses) is highly derivative of Wikipedia. Nevertheless, citation practices between the sites differ greatly, with Grokipedia citing many more sources deemed "generally unreliable" or "blacklisted" by the English Wikipedia community and low quality by external scholars, including dozens of citations to sites like Stormfront and Infowars. We then analyze article subsets: one about elected officials, one about controversial topics, and one random subset for which we derive article quality and topic. We find that the elected official and controversial article subsets showed less similarity between their Wikipedia version and Grokipedia version than other pages. The random subset illustrates that Grokipedia focused rewriting the highest quality articles on Wikipedia, with a bias towards biographies, politics, society, and history. Finally, we publicly release our nearly-full scrape of Grokipedia, as well as embeddings of the entire Grokipedia corpus.

back again to share a new preprint from me and @mantzarlis.com! β€œWhat did Elon Change? A comprehensive analysis of Grokipedia” arxiv.org/abs/2511.09685

I had seen many spot analyses of individual grokipedia pages, but I was curious: how was grokipedia made? what did Elon change from wikipedia?

17.11.2025 16:10 β€” πŸ‘ 12    πŸ” 9    πŸ’¬ 1    πŸ“Œ 2
Preview
Grokipedia cites a Nazi forum and fringe conspiracy websites A site-wide comparison with Wikipedia sheds light on what Elon Musk is trying to do

NEW on @indicator.media: A first *full-scale* comparison of Grokipedia v Wikipedia.

Last week the awesome @harold.bsky.social rocked up to my desk bearing gifts.

Hal had collected almost all 900K Grokipedia entries and compared them to their Wikipedia equivalents for text and citation similarity.

13.11.2025 13:12 β€” πŸ‘ 167    πŸ” 100    πŸ’¬ 2    πŸ“Œ 15
Video thumbnail

"I'm happy to report I'm just fine. I lost a button. But I'm gonna sleep in my bed tonight, safe, with my family... At that elevator, I was separated from someone named Edgardo... Edgardo is in ICE detention and he's not going to sleep in his bed tonight."

17.06.2025 20:38 β€” πŸ‘ 26425    πŸ” 7653    πŸ’¬ 274    πŸ“Œ 321

@cameron.pfiffer.org planning to work on it soon!

16.05.2025 02:28 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

hi @alt.psingletary.com! you tagged the right personβ€”I was working on this for a class project this semester

got it to a mvp stage about a week ago and hit pause to work on some other projects, but will keep working on it and would definitely would love to hear your feedback if you have any :)

16.05.2025 00:19 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

line go upπŸ“ˆπŸ“ˆπŸ“ˆ

up to 717k requests to wikipedia per second!!

grafana.wikimedia.org/d/O_OXJyTVk/...

08.05.2025 17:27 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

and please remember to thank your local site reliability engineer!!!!

08.05.2025 17:08 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
a graph of Wikimedia requests per second, with a huge spike right when the papal selection was announced

a graph of Wikimedia requests per second, with a huge spike right when the papal selection was announced

continuing on the real-time public Wikipedia data train:

here's a graph of requests / second to WMF infra over the last 3h, since "Habemus papam"

The infrastructure has gone from 172k req / sec to 243k req / sec (⬆️41%) in under an hour!

follow along here: grafana.wikimedia.org/d/O_OXJyTVk/...

08.05.2025 17:07 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 1

bsky.app/profile/haro...

07.05.2025 20:47 β€” πŸ‘ 10    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

english wikipedia pageviews for the conclave movie starting from oct 20 2024 (five days before release in the US)

first big spike is the academy awards, second is pope francis’ death

pageviews.wmcloud.org?project=en.w...

07.05.2025 20:45 β€” πŸ‘ 6    πŸ” 2    πŸ’¬ 1    πŸ“Œ 1
Post image

excited to share this new piece by @bkeremg.bsky.social and @m0na.net (edited by me) about conceptualizing AI alignment as a process of censorship

really fascinating line of critique β€”Β I strongly encourage you to read it and lmk what you think!

joinreboot.org/p/ai-alignme...

06.04.2025 21:14 β€” πŸ‘ 4    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

and set your devices to update automatically!

04.04.2025 14:29 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Preview
Multi-Agent Systems Execute Arbitrary Malicious Code Multi-agent systems coordinate LLM-based agents to perform tasks on users' behalf. In real-world applications, multi-agent systems will inevitably interact with untrusted inputs, such as malicious Web...

There's a quickly-developing line of work on how insecure these agent systems can be, particularly when they have access to write and execute code.

The attacks on them are simple + devastating, up to and including reverse shells, data exfiltration, and more!

arxiv.org/abs/2503.12188

24.03.2025 17:52 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Preview
Attorney General Bonta Urgently Issues Consumer Alert for 23andMe Customers Californians have the right to direct the company to delete their genetic data OAKLAND β€” California Attorney General Rob Bonta today issued a consumer alert to customers of 23andMe, a genetic testing ...

Reminder that a key part of the privacy harms from genetic information is the fact that *we all share genes with our relatives*!

Deleting or refusing to share genetic information protects your family in addition to yourself

oag.ca.gov/news/press-r...

24.03.2025 15:48 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
A screenshot of the title / abstract of the paper.

A screenshot of the title / abstract of the paper.

Anyhow, there’s a lot more in the paper. Please read it if you’re interested and let us know if you have any thoughts, questions, concerns, etc!

arxiv.org/abs/2503.12188

12/12

18.03.2025 15:23 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Same-origin policy - Wikipedia

Modern Web browsers isolate untrusted content using the same-origin policy. AI agents today do not distinguish safe from unsafe content, nor data from (potentially malicious) instructions.

developer.mozilla.org/en-US/docs/W...

en.wikipedia.org/wiki/Same-or...

11/12

18.03.2025 15:23 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
A screenshot of the Netscape browser circa the 90s.

A screenshot of the Netscape browser circa the 90s.

The narrative around AI safety shouldn’t be β€œTerminator” or β€œAI Chernobyl.” The right analogy is Netscape Navigator 1.0β€”the era when Web browsers first became a thing, and it was unclear how to protect users from potentially harmful Web content.

10/12

18.03.2025 15:23 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Much of the AI safety world is obsessing about β€œAGI.” They research containment, alignment, and jailbreaking, and view users as potential adversaries.

But users aren’t the enemy. They are victims whose data and devices are put at risk by companies pushing insecure systems.

9/12

18.03.2025 15:23 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Confused deputy problem - Wikipedia

At the root, these are β€œconfused deputy” vulnerabilities: agents blindly trust other agents, enabling adversaries to launder their instructions by making them appear as trusted outputs of trusted agents.

en.wikipedia.org/wiki/Confuse...

8/12

18.03.2025 15:23 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Video thumbnail

In our experiments, we saw cases where a MAS …
… executes code that they recognize as harmful
… automatically pivots to harmful tasks that are simply in the same directory as benign tasks
… is vulnerable to screenshots and even audio files where we read out the attack (see example below⬇️⬇️⬇️)

7/12

18.03.2025 15:23 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0