Cmon man
01.03.2026 02:33 — 👍 10 🔁 0 💬 1 📌 0
This is a pretty important statement about engineering and experimentation speed.
openai.com/index/harnes...
Mossad or not-Mossad, but for model evals: they need to be difficult, but not so difficult that they get written off as not a useful measurement.
13.02.2026 14:04 — 👍 1 🔁 0 💬 0 📌 0
Great idea
10.02.2026 18:30 — 👍 1 🔁 0 💬 0 📌 0
Everybody has a hard eval until gradient descent punches you in the face.
29.01.2026 23:22 — 👍 1 🔁 0 💬 0 📌 0
Accountability diffuses at the deployment layer, but dependency concentrates at the model supply layer.
The dominant risk is not what the models can do, but how fast capability diffuses, how it gets wired, and whether misuse feedback loops are actioned post release.
ok takeaways:
This is a huge unmanaged attack surface: 49% tool exposure and a bunch of residential hosts is a problem waiting to happen.
Prioritizing a release to go far in this ecosystem? Go with 8-14B at 4-bit quant.
22% of hosts have custom system prompts. We pulled and classified over 3k prompts; the breakdown for the top 4 categories was:
1. Default Identity
2. Coding Assistants
3. Roleplay
4. Uncensored
Portable weights travel far in this network.
Probably not a huge surprise, but in this dataset 8-14B parameters is the most prevalent model size, and 72% of models are 4-bit quantized.
49% of hosts enable tools.
The top 10 model families control 85% of the market; the rest sit in the long tail.
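For a sense of where stats like these come from: an exposed Ollama host advertises its installed models (including parameter size and quantization level) via `GET /api/tags`. A minimal sketch of tallying quantization levels from that response — the payload below is a made-up example, though the field names follow the public Ollama API:

```python
import json
from collections import Counter

# Hypothetical sample of what an exposed host returns from GET /api/tags.
# Field names match the Ollama API; the model entries are invented.
sample = json.loads("""
{"models": [
  {"name": "llama3:8b",  "details": {"parameter_size": "8B",  "quantization_level": "Q4_K_M"}},
  {"name": "qwen2:14b",  "details": {"parameter_size": "14B", "quantization_level": "Q4_0"}},
  {"name": "mistral:7b", "details": {"parameter_size": "7B",  "quantization_level": "Q8_0"}}
]}
""")

def tally_quant(payload: dict) -> Counter:
    """Count quantization levels across one host's advertised models."""
    return Counter(
        m.get("details", {}).get("quantization_level", "unknown")
        for m in payload.get("models", [])
    )

counts = tally_quant(sample)
# Q4_* levels cover the "4-bit quantized" bucket in the stats above.
four_bit = sum(v for k, v in counts.items() if k.startswith("Q4"))
print(counts, four_bit)
```

Aggregate the same tally across every responding host and you get the 72%-at-4-bit style breakdown.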
29.01.2026 20:03 — 👍 0 🔁 0 💬 1 📌 0
This is an exposure dataset, which means we are trying to study something by measuring the shadow that it casts. We can’t poll these systems directly, but we can understand the shape of the ecosystem.
29.01.2026 20:03 — 👍 0 🔁 0 💬 1 📌 0
New research from @silascutler.bsky.social and myself.
We tracked 175k exposed Ollama endpoints for nearly a year. Collected and analyzed custom models, sizes, quantizations, system prompts, and more.
*vague posts about upcoming research*
29.01.2026 02:28 — 👍 0 🔁 0 💬 0 📌 0
Love getting malware under TLP:AMBER+S, when the S stands for “spite”. 🫖
28.01.2026 16:13 — 👍 0 🔁 0 💬 0 📌 0
We about to have some Llama Drama :)
27.01.2026 20:05 — 👍 1 🔁 1 💬 1 📌 0
and of course it’s chatgpt slop with the rhetorical flourish of a remedial high school debate club.
“from X to Y — or worse”
“This Isn’t X it’s Y.”
“Replace X with Y and it’s Z.”
“The most sobering part? It’s X.”
“your no longer dealing with X. You’re facing Y”
“Wow this dude has a really strong opinion about code review”
*scans posts*
“Oh that’s his only opinion”
—dangerously-skip-permissions is the only thing keeping claude code installed on my machine.
23.01.2026 04:22 — 👍 0 🔁 0 💬 0 📌 0
As a friend said a while back: “we are fine-tuning the models and they are coarse-tuning us in turn”
22.01.2026 22:01 — 👍 3 🔁 0 💬 0 📌 0
A deeper problem is that nobody has time for anything but LLM-as-a-judge evaluations (often vendor-on-vendor), creating these Ouroboros loops that are easy to overfit and hard to trust.
That’s a huge gap when we’re being asked to rely on them for SOC automations or enterprise security work.
CyberSOCEval (Meta) found models can extract real signal from malware logs & CTI reports, but remain far from reliable.
Most importantly in this domain, reasoning models do not get their usual math/coding uplift, suggesting that general capability ≠ analyst capability... yet.
The best “agentic” benchmark we saw (ExCyTIn-Bench) still shows how far we are. Even in a curated Azure-style environment models struggled with multi-hop investigations over heterogeneous logs (data be confusing like that).
20.01.2026 16:22 — 👍 0 🔁 0 💬 1 📌 0
Most security evals reduce workflows to MCQs/static Q&A. That bakes in unrealistic assumptions: the “right question” is already asked, evidence is pre-packaged, wrong answers are cheap, and there’s no triage/queue pressure or escalation decisions.
20.01.2026 16:22 — 👍 0 🔁 0 💬 1 📌 0
Benchmarks for cybersecurity are everywhere and mostly measuring the wrong thing.
We reviewed evals from Microsoft, Meta and academia and found they don't measure what matters for defenders in real IR situations. 🧵
s1.ai/benchmk1
Reviewing AI cyber benchmarking and evaluations may break me.
Y’all will really LLM-as-a-judge anything
Timely presentation from my colleague Jim on the current landscape of Hacktivism and War.
youtu.be/sNaORI-k-fY?...
✅ #LLM literacy is table stakes for defenders, CTI analysts, and #cybersecurity professionals of all stripes now.
Still looking for a way into this complex field? 🤔
LABS has got you covered!
Start here:
s1.ai/inside-llm-1
@sentinelone.com
Great post from @philofishal.bsky.social on the initial stages of the LLM training pipeline!
www.sentinelone.com/labs/inside-...
"this new chemical process operates at ambient temperature and pressure. It chemically dissolves the glue holding the blade together.
The high-value carbon fiber can be recovered, cleaned, and reused in everything from new turbines to car parts."
interestingengineering.com/energy/china...