Tim Allison

@tallison314159.bsky.social

Files, search, crawling, security. #ApacheTika among others...

81 Followers  |  136 Following  |  33 Posts  |  Joined: 06.02.2025  |  2.0613

Latest posts by tallison314159.bsky.social on Bluesky

iPRES 2025 - TUTORIAL 3: A forensic spotlight on PDF/A

If you're attending #iPres2025, make sure to check out @petervwyatt.bsky.social 's tutorial on Monday: "A forensic spotlight on PDF/A"!

twelve.eventsair.com/QuickEventWe...

30.10.2025 18:42 | 👍 0    🔁 0    💬 0    📌 0
The Dead Internet - AI is Building a Fake Internet Just for You How Generative AI is Fueling the "Dead Internet Theory," Creating an Authenticity Crisis, and Why AI Detection Can't Save Us.

Is AI fueling the old 'Dead Internet' conspiracy theory?
Yes! AI is building a fake internet just for you.

#ai #psychology #cybersecurity #society #internet

www.toxsec.com/p/ai-is-buil...

28.10.2025 13:32 | 👍 6    🔁 2    💬 1    📌 0
Apache Tika -- What's New/Office Hours, Thu, Nov 13, 2025, 12:00 PM | Meetup This will be an expansion of my presentation at the Digital Preservation Bake Off (Tools Demonstration) #iPres2025 and a late entry to celebrate World Digital Preservation

In belated celebration of World Digital Preservation Day, I'm throwing a "What's new with Apache Tika/Office hours" meetup: November 13, noon EST.

Everyone interested in files is welcome to join!

#ApacheTika #wdpd2025 #digipres #fileForensics #reverseEngineering

www.meetup.com/apache-tika-...

28.10.2025 13:40 | 👍 3    🔁 3    💬 0    📌 0

We're officially announcing our speakers for DistrictCon Year 1! Check out our incredible lineup: www.districtcon.org/speakers

This also includes our Day 1 & Day 2 Keynotes from Ian Levy and Dan Ridge.

And don't forget, GA tickets go on sale November 16! See you in January! 🪩

27.10.2025 16:41 | 👍 11    🔁 15    💬 0    📌 3
It's your responsibility - but how do you even start fixing search? - Charlie Hull - The Search Juggler How to get started fixing search - looking for zero result searches, low click queries and how to prioritise

It's your responsibility - but how do you even get started fixing search? A blog for Search Product Managers and other search leads thesearchjuggler.com/its-your-res...

22.10.2025 10:12 | 👍 3    🔁 1    💬 0    📌 0

...even if the PDF is embedded in an email that was then added to a zip file.

22.10.2025 10:28 | 👍 0    🔁 0    💬 0    📌 0

We treat PDF's incremental updates as a special type of attachment. This means that with just `java -jar tika-app.jar -Z input-file.zip output_dir`

you'll be able to recover the earlier versions of a PDF if it was saved with incremental updates...

22.10.2025 10:28 | 👍 0    🔁 0    💬 1    📌 0

So, the news for #ApacheTika and #ipres2025: I implemented fully recursive extraction of raw embedded files from the command line.

issues.apache.org/jira/browse/...

22.10.2025 10:28 | 👍 1    🔁 0    💬 1    📌 0
Wikipedia Volunteers Avert Tragedy by Taking Down Gunman at Conference

goddamn is there anything Wikipedia editors can’t do www.nytimes.com/2025/10/17/n...

18.10.2025 04:53 | 👍 2503    🔁 482    💬 4    📌 29

Everyone tests in production. Some people just don’t know it yet

17.10.2025 14:13 | 👍 66    🔁 15    💬 3    📌 1

Looking forward to some baking with #ApacheTika! News soon on some #conferenceDrivenDevelopment.

#ipres2025

15.10.2025 10:35 | 👍 3    🔁 0    💬 1    📌 1

Amazing work, as always, @seeinglogic.bsky.social ! #AIxCC

12.10.2025 11:47 | 👍 2    🔁 0    💬 0    📌 0

And, y, I'm late to the game, but I'm really excited for this course, @softwaredoug.bsky.social !

08.10.2025 14:43 | 👍 1    🔁 0    💬 0    📌 0

F3: The Open-Source Data File Format for the Future

Packaging WASM code to read an evolving file format with the data. Interesting approach and a good idea to test the sandbox abilities of the execution engine. Also mentions a lot of alternatives to Parquet/ORC.

08.10.2025 12:56 | 👍 1    🔁 1    💬 1    📌 0

We have PDF. What else do we need? 🤣🤣🤣😅

08.10.2025 13:24 | 👍 1    🔁 0    💬 0    📌 0
A biological 0-day? Threat-screening tools may miss AI-designed proteins. Ordering DNA for AI-designed toxins doesn’t always raise red flags.

A biological 0-day? Threat-screening tools may miss AI-designed proteins. arstechnica.com/science/2025...

04.10.2025 12:13 | 👍 8    🔁 6    💬 0    📌 0
Apache PDFBox | Download The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract ...

The new bugfix release 2.0.35 of #Apache #PDFBox is available pdfbox.apache.org/download.html

02.10.2025 20:44 | 👍 2    🔁 1    💬 0    📌 0

Anyone in #bugbounty looking to connect?

02.10.2025 19:44 | 👍 2    🔁 1    💬 0    📌 0

New 7-8B OCR model release from Alibaba. Integrated structures data approach looks promising for specialized use cases with complex visual inputs. huggingface.co/Logics-MLLM/...

29.09.2025 12:01 | 👍 29    🔁 4    💬 2    📌 0
Free course: Cheat at Search Essentials A free introductory search course for anyone who wants better search without all the hard work

Tomorrow I'll be talking about vector retrieval, continuing Cheat at Search Essentials. Full details on my blog article

softwaredoug.com/blog/2025/07...

25.09.2025 14:56 | 👍 1    🔁 1    💬 1    📌 0

📣This #WebArchiveWednesday, plan your proposal for #iipcWAC26, “Sustainable #WebArchiving,” at KBR, Royal Library of Belgium! netpreserve.org/ga2026/CfP

πŸ—“οΈ Deadline for proposals: OCT 15

#webarchives #DigitalPreservation #DigitalHumanities

24.09.2025 18:48 | 👍 0    🔁 5    💬 0    📌 0
Cheat at Search Essentials: BM25 + Lexical It's often said with chat interfaces and RAG, search has become the hard problem. Search has a long history and means more than vector databases. Let's learn how BM25 and similar techniques complement...

Recording for BM25 + Lexical Search now up

maven.com/p/e9fbe4/che...

22.09.2025 13:13 | 👍 2    🔁 1    💬 1    📌 1
Cheating at Query Understanding with LLMs LLMs transformed query understanding from months-long NLP projects into simple prompting tasks. Students learn practical skills for modern search, RAG, and e-commerce systems. This positions you for h...

This Wednesday I'll be discussing how to Cheat at Query Understanding using LLMs with Jason Liu. If you want a taste of "Cheat at Search with LLMs", please come hang out!

maven.com/p/eebe98

21.09.2025 15:05 | 👍 1    🔁 1    💬 0    📌 0
Say Hello to the 2025 Ig Nobel Prize Winners The annual award ceremony features miniature operas, scientific demos, and 24/7 lectures.

The annual award ceremony features miniature operas, scientific demos, and 24/7 lectures. www.wired.com/story/say-he...

20.09.2025 10:06 | 👍 59    🔁 11    💬 1    📌 1
The image shows two LibreOffice documents. A normal document to the left, and a malicious document to the right. The malicious document contains an additional, malicious word/document.xml file. The signature is checked against the original word/document.xml file, while the malicious word/document.xml file is displayed to the user.

Great paper on finding and exploiting parser differentials between ZIP parsers to bypass signature validation, malware detection, or VSCode extension ID validation.

www.usenix.org/conference/u...

15.09.2025 10:39 | 👍 15    🔁 4    💬 0    📌 0
Lucene™ Core News Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...

Lucene 10.3 is out with 40% faster lexical search, 15% faster dense vector search and 30% faster terms dictionary lookups. lucene.apache.org/core/corenew...

14.09.2025 07:26 | 👍 3    🔁 3    💬 1    📌 0

🚨 Breaking News from Community Over Code 🚨

Introducing The ASF’s New Logo buff.ly/DzgT82w

#CommunityOverCode #opensource

11.09.2025 15:12 | 👍 26    🔁 18    💬 0    📌 2

Bluesky- "If you can't cite peer reviewed literature, your opinion is morally equivalent to fart noises <links to papers>"

Anyhow, I just want to show off my LLM side projects, there really isn't a forum for that anymore.

09.09.2025 12:03 | 👍 3    🔁 1    💬 0    📌 0

People on Twitter - "LLMs are gods and I command them so I am a god and people will finally give me the respect I crave"

People on Mastodon - "<frothing> slop <pant shitting> LLMs :( <howler monkey sounds> stochastic parrot <growling noises> by the way, Github is the root of all social evils"

09.09.2025 12:03 | 👍 2    🔁 2    💬 1    📌 0

W00t!

06.09.2025 11:44 | 👍 1    🔁 0    💬 0    📌 0

@tallison314159 is following 20 prominent accounts