If you're attending #iPres2025, make sure to check out @petervwyatt.bsky.social 's tutorial on Monday: "A forensic spotlight on PDF/A"!
twelve.eventsair.com/QuickEventWe...
@tallison314159.bsky.social
Files, search, crawling, security. #ApacheTika among others...
Is AI fueling the old 'Dead Internet' conspiracy theory?
Yes! AI is building a fake internet just for you.
#ai #psychology #cybersecurity #society #internet
www.toxsec.com/p/ai-is-buil...
In belated celebration of World Digital Preservation Day, I'm throwing a "What's new with Apache Tika/Office hours" meetup: November 13, noon EST.
Everyone interested in files is welcome to join!
#ApacheTika #wdpd2025 #digipres #fileForensics #reverseEngineering
www.meetup.com/apache-tika-...
We're officially announcing our speakers for DistrictCon Year 1! Check out our incredible lineup: www.districtcon.org/speakers
This also includes our Day 1 & Day 2 Keynotes from Ian Levy and Dan Ridge.
And don't forget, GA tickets go on sale November 16! See you in January! 🪩
It's your responsibility - but how do you even get started fixing search? A blog for Search Product Managers and other search leads thesearchjuggler.com/its-your-res...
22.10.2025 10:12
...even if the PDF is embedded in an email that was then added to a zip file.
22.10.2025 10:28
We treat PDF's incremental updates as a special type of attachment. This means that with just `java -jar tika-app.jar -Z input-file.zip output_dir`
you'll be able to recover the earlier versions of a PDF if it was saved with incremental updates...
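For anyone new to incremental updates: a PDF saved this way appends each revision to the end of the file, and every revision ends with its own `%%EOF` marker. Here is a minimal Python sketch of that idea only. It is not how Tika does it; the function name `pdf_revisions` is hypothetical, and it assumes a non-linearized file with no stray `%%EOF` bytes inside streams.

```python
def pdf_revisions(data: bytes) -> list[bytes]:
    """Return one byte prefix per %%EOF marker, oldest revision first.

    Assumption: every %%EOF in the file closes a revision. Real files
    (linearized PDFs, markers inside streams) need a proper parser.
    """
    marker = b"%%EOF"
    revisions = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        # Keep the line ending that follows the marker, if there is one.
        while end < len(data) and data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        revisions.append(data[:end])
        pos = data.find(marker, end)
    return revisions
```

Each returned prefix is a candidate earlier version of the file; the last one covers everything up to the final marker.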
So, the news for #ApacheTika and #ipres2025: I implemented fully recursive extraction of raw embedded files from the commandline.
issues.apache.org/jira/browse/...
goddamn is there anything Wikipedia editors can’t do www.nytimes.com/2025/10/17/n...
18.10.2025 04:53
Everyone tests in production. Some people just don’t know it yet
17.10.2025 14:13
Looking forward to some baking with #ApacheTika! News soon on some #conferenceDrivenDevelopment.
#ipres2025
Amazing work, as always, @seeinglogic.bsky.social ! #AIxCC
12.10.2025 11:47
And, y, I'm late to the game, but I'm really excited for this course, @softwaredoug.bsky.social !
08.10.2025 14:43
F3: The Open-Source Data File Format for the Future
Packaging WASM code to read an evolving file format with the data. Interesting approach and a good idea to test the sandbox abilities of the execution engine. Also mentions a lot of alternatives to Parquet/ORC.
We have PDF. What else do we need? 🤣🤣🤣
08.10.2025 13:24
A biological 0-day? Threat-screening tools may miss AI-designed proteins. arstechnica.com/science/2025...
04.10.2025 12:13
The new bugfix release 2.0.35 of #Apache #PDFBox is available pdfbox.apache.org/download.html
02.10.2025 20:44
Anyone in #bugbounty looking to connect?
02.10.2025 19:44
New 7-8B OCR model release from Alibaba. Integrated structured data approach looks promising for specialized use cases with complex visual inputs. huggingface.co/Logics-MLLM/...
29.09.2025 12:01
Tomorrow I'll be talking about vector retrieval, continuing Cheat at Search Essentials. Full details on my blog article
softwaredoug.com/blog/2025/07...
📣 This #WebArchiveWednesday, plan your proposal for #iipcWAC26, “Sustainable #WebArchiving,” at KBR, Royal Library of Belgium! netpreserve.org/ga2026/CfP
Deadline for proposals: OCT 15
#webarchives #DigitalPreservation #DigitalHumanities
Recording for BM25 + Lexical Search now up
maven.com/p/e9fbe4/che...
This Wednesday I'll be discussing how to Cheat at Query Understanding using LLMs with Jason Liu. If you want a taste of "Cheat at Search with LLMs", please come hang out!
maven.com/p/eebe98
The annual award ceremony features miniature operas, scientific demos, and 24/7 lectures. www.wired.com/story/say-he...
20.09.2025 10:06
The image shows two LibreOffice documents. A normal document to the left, and a malicious document to the right. The malicious document contains an additional, malicious word/document.xml file. The signature is checked against the original word/document.xml file, while the malicious word/document.xml file is displayed to the user.
Great paper on finding and exploiting parser differentials between ZIP parsers to bypass signature validation, malware detection, or VSCode extension ID validation.
www.usenix.org/conference/u...
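One simple way two ZIP readers can disagree is when an archive holds two entries with the same name, as in the image description above. Here is a minimal Python sketch, assuming a "first entry wins" reader versus a "look up by name" reader. The helper names are hypothetical and none of this is taken from the paper's code.

```python
import io
import warnings
import zipfile

NAME = "word/document.xml"


def build_duplicate_zip() -> bytes:
    """Build an in-memory ZIP with two entries that share one name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names but still writes them.
        warnings.simplefilter("ignore")
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr(NAME, b"<original/>")
            zf.writestr(NAME, b"<malicious/>")
    return buf.getvalue()


def read_first(data: bytes, name: str) -> bytes:
    """Reader A: walk the entry list and stop at the first match."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.filename == name:
                return zf.read(info)
    raise KeyError(name)


def read_by_name(data: bytes, name: str) -> bytes:
    """Reader B: ask the library for the entry by name."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name)
```

With CPython's `zipfile`, `read_first` returns the original bytes while `read_by_name` returns the second entry, so a check done one way and a display done the other see different content.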
Lucene 10.3 is out with 40% faster lexical search, 15% faster dense vector search and 30% faster terms dictionary lookups. lucene.apache.org/core/corenew...
14.09.2025 07:26
🚨 Breaking News from Community Over Code 🚨
Introducing The ASF’s New Logo buff.ly/DzgT82w
#CommunityOverCode #opensource
Bluesky- "If you can't cite peer reviewed literature, your opinion is morally equivalent to fart noises <links to papers>"
Anyhow, I just want to show off my LLM side projects, there really isn't a forum for that anymore.
People on Twitter - "LLMs are gods and I command them so I am a god and people will finally give me the respect I crave"
People on Mastodon - "<frothing> slop <pant shitting> LLMs :( <howler monkey sounds> stochastic parrot <growling noises> by the way, Github is the root of all social evils"
W00t!
06.09.2025 11:44