Replication of Government Datasets and the Principles of Provenance | Library Innovation Lab
As part of our Public Data Project, LIL recently launched Data.gov Archive Search. In this post, we consider the importance of provenance for large, replicat...
When might digital provenance matter? Could we imagine it being used to right past wrongs, to return objects to their rightful places, to restore justice?
Our Public Data Project's @mollyhardy.bsky.social reflects on copying gov data & principles of provenance
lil.law.harvard.edu/blog/2025/12...
15.12.2025 12:07
If you'd like an informative and interesting conversation to enjoy, try "Inside Harvard's Data.gov Archive - A Conversation with Jack Cushman"
@jed.co interviewing Jack Cushman @harvardlil.bsky.social about Data.gov archiving on @source.coop
Now live www.youtube.com/watch?v=XYMb...
19.11.2025 19:20
Our EOT2024 partner @harvardlil.bsky.social was interviewed by @jed.co on @source.coop about archiving government data.
Listen & learn how 300,000+ federal datasets are archived for posterity
Inside Harvard's Data.gov Archive, A Conversation with Jack Cushman: www.youtube.com/watch?v=XYMb...
19.11.2025 19:28
If you missed our conversation with Jack Cushman from @harvardlil.bsky.social, you can catch up now. We discussed the Data.gov Archive and the challenge of preventing federal data loss: it's about more than just web pages.
youtube.com/live/XYMbQru...
20.11.2025 21:06
Workshop Report: "Resilience in Times of Crisis - Strengthening Open Science Against Geopolitical Pressures" (via "Research Group Information Management" at Humboldt-Universität zu Berlin) infomgnt.org/posts/2025-1... @datarescueproject.org @harvardlil.bsky.social #libraries #openscience
22.11.2025 18:26
Upcoming Episode: Inside Harvard's data.gov Archive
How the Harvard Library Innovation Lab is preserving and making Data.gov datasets discoverable using BagIt and static search.
And join Jack live on @source.coop's Great Data Products podcast on 11/19 to hear how we're making cultural memory collections easier to access and harder to delete. greatdataproducts.com/housekeeping...
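For readers unfamiliar with BagIt, which the episode description above mentions, here is a minimal sketch of the packaging idea: payload files go under `data/`, and a manifest records a checksum for each so that later copies can be verified. The function names `make_bag` and `verify_bag` are hypothetical, and this is not the code behind the Data.gov Archive; a real bag usually carries additional tag files.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def make_bag(bag_dir: Path, files: Dict[str, bytes]) -> None:
    """Write payload files under data/ plus the two core BagIt tag files."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(files.items()):
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest format: "<checksum> <path relative to the bag root>"
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(line + "\n" for line in manifest_lines), encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> List[str]:
    """Return payload paths whose contents no longer match the manifest."""
    mismatched = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            mismatched.append(rel_path)
    return mismatched
```

Because the checksums travel with the data, anyone who copies the bag can rerun `verify_bag` and detect silent corruption or tampering without contacting the original publisher.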
31.10.2025 13:43
Century-Scale Storage
If you had to store something for 100 years, how would you do it?
This series is part of our work investigating not just the technical, but the human and societal routes to long-term digital preservation. You can also check out @maxy.bsky.social's Century-Scale Storage on our website.
lil.law.harvard.edu/century-scal...
15.10.2025 14:50
Frank Cifaldi | Library Innovation Lab
Interviewer: If you were given unlimited funding to design a system for storing and preserving digital information for at least a century, what would you do?
One answer from Frank Cifaldi who is the Founder and Director of the Video Game History Foundation:
"I would have the best copyright lawyers in the country figuring out how we can actually make this work."
lil.law.harvard.edu/generational...
15.10.2025 14:50
Amelia Acker | Library Innovation Lab
Interviewer: If you were given unlimited funding to design a system for storing and preserving digital information for at least a century, what would you do?
One answer from Amelia Acker who is an Associate Professor in the School of Communication & Information at Rutgers:
"I probably wouldn't build a system. I'd build a bureaucracy."
lil.law.harvard.edu/generational...
15.10.2025 14:50
Generational Data Interviews | Library Innovation Lab
14 Designs for Digital Preservation in 2025
LIL fellow @maxy.bsky.social asked 14 scholars, archivists, designers, business leaders & engineers: "If you were given unlimited funding to design a system for storing and preserving digital information for at least a century, what would you do?"
Their answers:
lil.law.harvard.edu/generational...
15.10.2025 14:50
Product and Research Manager
Join our team! LIL is looking for a Product and Research Manager to help create, shape, and execute on our portfolio of open knowledge projects. PRMs work across every piece of the LIL ecosystem, from software experimentation to convening of events. Learn more at careers.harvard.edu/job/product-...
03.07.2025 17:26
Vote for the best of the internet
I just voted in The Webby People's Voice Awards and checked my voter registration.
Thrilled to share that @maxy.bsky.social's "Century-Scale Storage" was nominated for a Webby Award!
You can vote in the "Best Individual Editorial Feature" category here: vote.webbyawards.com/PublicVoting...
01.04.2025 20:59
Bagging data.gov
Ed Summers at Stanford wrote this great deep dive of how and why we designed our data.gov archiver the way we did. Thanks for digging in, Ed, this is excellent. inkdroid.org/2025/02/17/n...
19.02.2025 19:16
Announcing the Data.gov Archive | Library Innovation Lab
Today we released our archive of data.gov on Source Cooperative. The 16TB collection includes over 311,000 datasets harvested during 2024 and 2025, a complet...
We just launched a 16TB archive of every dataset that has been available on data.gov since November. This will be updated day by day as new datasets appear. It can be freely copied, and we're sharing the code behind it to help others make their own archives of data they depend on.
06.02.2025 21:23
Data Rescue Efforts
Data / Website Rescue Efforts End of Term Crawl - The main coordinated effort to archive websites, but datasets have been more of a challenge. EDGI - They have been focused on environmental data. A ...
Penn is getting a lot of questions about Data Refuge. That effort no longer exists, but several efforts are currently active. I've created a doc from what I & others have suggested. I'll update as I hear more. Feel free to share or suggest: docs.google.com/document/d/1...
03.02.2025 16:13
What are we all missing? Anything you can't get by clicking from link to link like EOT, or downloading datasets directly from data.gov. If there are things you care about preserving that fit that description, that's where to focus.
31.01.2025 21:11
Data.gov Home - Data.gov
Our collection from data.gov is limited: if an entry points directly to the data, such as a CSV, we have the data. If it points to an HTML landing page, we just have the landing page. This means many, many datasets are not included. What we have from data.gov adds up to 15 or 20TB.
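The distinction above, between entries that point directly to data and entries that point to a landing page, can be approximated by looking at the `Content-Type` a server reports. The sketch below is an illustration only, not the crawler LIL used; `is_landing_page` and `classify_url` are hypothetical names, and it assumes the server answers `HEAD` requests and reports types honestly, which is not always the case.

```python
import urllib.request

# Media types treated as "a web page about the data" rather than the data itself.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_landing_page(content_type: str) -> bool:
    """True if a Content-Type header value indicates an HTML page."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def classify_url(url: str) -> str:
    """Ask the server what the URL serves, without downloading the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        content_type = response.headers.get("Content-Type", "")
    return "landing page" if is_landing_page(content_type) else "direct data"
```

A shallow crawl that saves whatever each URL returns will capture the file in the "direct data" case and only the HTML in the "landing page" case, which is why landing-page datasets need a deeper, site-specific download.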
31.01.2025 21:07
Data.gov Home - Data.gov
Speaking of telling someone, here's our update: we have copies of all metadata from data.gov, and all of the dataset URLs it points to (shallow crawl); all federal GitHub repositories with issues, comments, etc.; and articles from PubMed.
31.01.2025 21:06
Internet Archive: Digital Library of Free & Borrowable Texts, Movies, Music & Wayback Machine
Third, tell someone. Archive.org is one good place to store public data for discovery, and we at LIL will consider storing and signing data in some cases as well. Just posting data somewhere search engines can find is good too.
31.01.2025 21:05
FOIA requests are another great way to scale up. Check out @muckrock.com to get started.
31.01.2025 21:04
Next, scale up. If you're a programmer (or can team up with one), write a Python script to download a full collection: say, everything from the data portal of a given government website. Run it yourself, and share it so we libraries can use it too.
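As a rough sketch of what such a script might look like, the example below pages through a CKAN-style catalog API and saves every linked resource. It assumes the portal exposes CKAN's `package_search` endpoint (catalog.data.gov does); `API_URL` and all function names are illustrative choices, and other portals will need a different listing step.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path, PurePosixPath
from typing import Callable, Iterator, Tuple

# Assumption: a CKAN-style search endpoint; swap in the portal you care about.
API_URL = "https://catalog.data.gov/api/3/action/package_search"


def fetch_page(start: int, rows: int) -> dict:
    """Fetch one page of dataset records from the package_search endpoint."""
    query = urllib.parse.urlencode({"start": start, "rows": rows})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=60) as response:
        return json.load(response)["result"]


def iter_resource_urls(
    fetch: Callable[[int, int], dict] = fetch_page, rows: int = 100
) -> Iterator[Tuple[str, str]]:
    """Yield (dataset_name, resource_url) for every resource in the catalog."""
    start = 0
    while True:
        page = fetch(start, rows)
        results = page["results"]
        if not results:
            break
        for dataset in results:
            for resource in dataset.get("resources", []):
                if resource.get("url"):
                    yield dataset["name"], resource["url"]
        start += len(results)
        if start >= page["count"]:
            break


def local_path(out_dir: Path, dataset: str, url: str) -> Path:
    """Map a resource URL to a file under out_dir/<dataset>/."""
    name = PurePosixPath(urllib.parse.urlparse(url).path).name or "index"
    return out_dir / dataset / name


def download_all(out_dir: Path, fetch: Callable[[int, int], dict] = fetch_page) -> None:
    """Download every resource, skipping files that already exist locally."""
    for dataset, url in iter_resource_urls(fetch):
        target = local_path(out_dir, dataset, url)
        if target.exists():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            urllib.request.urlretrieve(url, target)
        except (OSError, ValueError) as exc:
            # Keep going: dead links are common in large catalogs.
            print(f"failed: {url}: {exc}")
```

Calling `download_all(Path("archive"))` would start the crawl. Skipping existing files makes the script safe to rerun after an interruption; a production version would also want rate limiting, retries, and checksums.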
31.01.2025 21:03
If you're a data scientist, good news: your work isn't just downloading data and publishing about it, but also keeping safe copies!
31.01.2025 21:03
ArchiveWeb.page
To keep access to stuff you care about: first just make a copy. Use ArchiveWeb.page to click around and download all the parts of a website you're interested in. We like the desktop version to avoid capturing login cookies or extensions, but the browser extension is good too.
31.01.2025 21:02