@andrescorrada.bsky.social
Scientist, Inventor, author of the NTQR Python package for AI safety through formal verification of unsupervised evaluations. On a mission to eliminate Majority Voting from AI systems. E Pluribus Unum.
By this paper's argument, no machine can be bigger than, say, ants. It reads as one extended philosophical joke by an academic arm of The Onion.
24.10.2025 17:47 — 👍 1 🔁 0 💬 1 📌 0
Sometimes they follow the format, sometimes not. Practical result: automatic parsing of LLM responses is almost impossible using regular expressions. As a consequence, it throttles the number of questions in an experiment to what you can tolerate processing manually!
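To make the parsing problem concrete, here is a minimal sketch of the kind of extraction being attempted; the "ANSWER:" tag and the regex are hypothetical illustrations, not the actual experiment's format:

```python
import re

# Hypothetical convention: the prompt asks the model to end with "ANSWER: <a|b|c>".
ANSWER_RE = re.compile(r"ANSWER:\s*([abc])\b", re.IGNORECASE)

def extract_answer(response: str):
    """Return the final answer label, or None if the model ignored the format."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].lower() if matches else None

print(extract_answer("Reasons first...\nANSWER: b"))  # 'b'
print(extract_answer("I think the answer is (b)."))   # None -> manual processing
```

Even a tolerant pattern like this fails whenever the model paraphrases or drops the tag, which is why the fallback ends up being manual processing.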
17.10.2025 11:27 — 👍 0 🔁 0 💬 0 📌 0
"With the right guardrails", @andytseng.bsky.social, agreed. How do you see this playing out when LLMs currently have a hard time following formatting instructions for their answers? I'm running "AI debate" experiments asking LLMs to put their answer at the end of the reasons for it. LLMs say: Meh.
17.10.2025 11:27 — 👍 0 🔁 0 💬 1 📌 0
assuming the conclusion. At best, you are proposing there is an "inside joke" that is not expressed. Sorry. This is sloppy science by people who have not bothered to study the hard work of psychologists and neuroscientists.
13.10.2025 13:10 — 👍 0 🔁 0 💬 1 📌 0
"Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones." The logical fallacy here is ...
13.10.2025 13:10 — 👍 0 🔁 0 💬 1 📌 0
Not true. Can we stop saying this? Building models of what you call "reasoning" and then showing that LLMs correlate with your ad-hoc definition is not, in any way, proof that these models "reason" as humans do. Computational neuroscience by newbies is not science.
13.10.2025 12:40 — 👍 0 🔁 0 💬 1 📌 0
count given true label. Yes, there are; algebraically, they are in the appendix of my paper. The illustration at the top of the post is the 2nd term in these equations, Q_label_true. So the full visual proof is going to need illustrations for every one of those sums in the very same three squares.
13.10.2025 12:26 — 👍 0 🔁 0 💬 0 📌 0
In general, an R-label classification test would have R^2 squares. Here we have 3^2=9 cells for 3-label classification. The existence of M=2 axioms for any number of labels, R, then comes down to asserting that for any R, there is a set of finite geometric operations that give you the pair correct
13.10.2025 12:26 — 👍 0 🔁 0 💬 1 📌 0
For 3 labels, we can write the count of 'a' items as equal to a sum over all pairs of possible responses given the 'a' label. This is visually demonstrated as three squares, one for each label. We sum squares that are not transparent.
Same as the 'a' label picture but now for the 'b' label.
Same as the 'a' label picture, but now for the 'c' label.
I'm working on visual demonstrations of the logical computations for the M=2 axioms of unsupervised evaluation for classifiers. This example is for 3 labels (a, b, c). Each illustration shows the decomposition that must be true for any pair of classifiers given an assumed number of questions per label.
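Here is a small sketch of that bookkeeping identity in plain Python (not the NTQR package); the pair counts below are made up purely for illustration:

```python
from itertools import product

LABELS = ("a", "b", "c")

# Hypothetical hidden counts: pair_counts_given_a[(r1, r2)] = number of items whose
# true label is 'a', where classifier 1 answered r1 and classifier 2 answered r2.
pair_counts_given_a = {
    ("a", "a"): 5, ("a", "b"): 1, ("a", "c"): 0,
    ("b", "a"): 1, ("b", "b"): 0, ("b", "c"): 0,
    ("c", "a"): 1, ("c", "b"): 0, ("c", "c"): 0,
}

# The decomposition: summing over all 3x3 = 9 response pairs must recover Q_a,
# the count of questions whose true label is 'a'.
Q_a = sum(pair_counts_given_a[pair] for pair in product(LABELS, repeat=2))
print(Q_a)  # 8 for these made-up counts
```

The same identity holds square-by-square for the 'b' and 'c' labels.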
13.10.2025 12:26 — 👍 1 🔁 0 💬 1 📌 0
If we are at (Q_a=8, Q_b=9, Q_c=8) for a 3-label test, and observe a classifier saying (R_a=3, R_b=10, R_c=12), you can start parsing out the logical consequences. For example, there is no way it can be better than 75% correct on its 'c' responses: it said 'c' 12 times but at most Q_c=8 of those can be right. All of those bespoke deductions are the M=1 axioms. arxiv.org/abs/2510.00821
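A minimal sketch of those deductions in plain Python (again not the NTQR API; the variable names are mine):

```python
# Assumed answer-key point and one classifier's observed response counts.
Q = {"a": 8, "b": 9, "c": 8}        # Q_a, Q_b, Q_c
R = {"a": 3, "b": 10, "c": 12}      # R_a, R_b, R_c

for label in Q:
    # At most min(Q_label, R_label) items with this true label can be correct.
    max_correct = min(Q[label], R[label])
    max_label_accuracy = max_correct / Q[label]       # over the Q_label true items
    max_response_precision = max_correct / R[label]   # over the classifier's responses
    print(label, round(max_label_accuracy, 3), round(max_response_precision, 3))

# 'a': at most 3 of the 8 true 'a' items can be right (37.5%).
# 'c': at most 8 of the 12 'c' responses can be right (~66.7%, under the 75% bound).
```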
13.10.2025 11:56 — 👍 0 🔁 0 💬 0 📌 0
Given an assumed Q-point, what are the evaluations possible for each classifier? All we have in an unlabeled setting are their observed response counts. So for every classifier i, we have, if you will, its own estimate of the true Q-point: (R_a_i, R_b_i, R_c_i). Logical consistency now comes in.
13.10.2025 11:56 — 👍 0 🔁 0 💬 1 📌 0
So all computations of possible evaluations given observations occur at the same fixed Q-point. Note that the logic cannot tell us anything about what this Q-point could be. Other knowledge or assumptions must be invoked to do that: science and engineering.
The M=1 axioms then come into play.
The plot also illustrates the "atomic" logical operations in unsupervised evaluation. We cannot know anything about the answer key. Hence, every possible value of a statistic of the answer key must be treated as possible. For three labels, these are the points (Q_a, Q_b, Q_c). Q_a = number of 'a' questions, etc.
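As a rough illustration of that atomic step, here is a plain-Python enumeration (the test size Q=10 is just an example, and this is not the NTQR package's interface):

```python
from itertools import product
from math import comb

Q = 10           # test size (example value)
NUM_LABELS = 3   # R = 3 labels: a, b, c

# Since nothing is known about the answer key, every integer point
# (Q_a, Q_b, Q_c) with Q_a + Q_b + Q_c = Q must be kept as a possibility.
q_points = [
    (q_a, q_b, Q - q_a - q_b)
    for q_a, q_b in product(range(Q + 1), repeat=2)
    if q_a + q_b <= Q
]

print(len(q_points))                             # 66 candidate answer-key summaries
print(comb(Q + NUM_LABELS - 1, NUM_LABELS - 1))  # same count by stars and bars
```

The uncertainty about the answer key is thereby digitized to a finite set of points, which is what keeps the later deductions finite computations.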
13.10.2025 11:56 — 👍 0 🔁 0 💬 1 📌 0
This diagram also shows that the logic is not trivial. It reveals structure about the possible evaluations given observed counts of how classifiers agree/disagree. This is, in effect, a purely logical density of states plot! And there is no probability involved. Only algebra.
13.10.2025 11:56 — 👍 0 🔁 0 💬 1 📌 0
Possible evaluations of being correct on three labels by 2 LLMs-as-Judges. The z-axis shows the multiplicity of evaluations at their joint correct evaluation. For any number of corrects they get, there are many ways to be wrong.
For three or more labels, there are more error modes than correct responses. So there are many ways to be X% correct on any given label. In binary classification, there is only one. This is shown here for three-label classification by three LLMs-as-Judges grading the output of two other LLMs.
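One toy way to see where that multiplicity comes from (my own back-of-the-envelope count, not the exact construction behind the plot):

```python
from math import comb

def error_profiles(q_label: int, k_correct: int, num_labels: int) -> int:
    """Ways to distribute the q_label - k_correct mistakes on items with one
    true label among the other (num_labels - 1) possible wrong responses."""
    wrong = q_label - k_correct
    return comb(wrong + num_labels - 2, num_labels - 2)

# Binary classification: a mistake can only be the one other label,
# so the multiplicity is always 1.
print(error_profiles(q_label=8, k_correct=5, num_labels=2))  # 1
# With three labels, 3 mistakes can be split between two wrong labels in 4 ways.
print(error_profiles(q_label=8, k_correct=5, num_labels=3))  # 4
```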
13.10.2025 11:56 — 👍 1 🔁 0 💬 1 📌 0
I've come to accept that both statements could be true. It is amazing what we can learn when we set out to do it, but our ignorance remains vast after our best efforts.
13.10.2025 11:29 — 👍 0 🔁 0 💬 0 📌 0
I would include Plato's Ship of Fools allegory as an example of the wisdom of the crowd and its critics. But it is also about the principal/agent problem in AI safety, when we delegate to agents tasks that we are ignorant about or don't want to do. medium.com/@andrescorra...
11.10.2025 22:36 — 👍 1 🔁 0 💬 0 📌 0
MYTH: My grandmother was deported to El Salvador. FACT: No one really knows where she was deported to.
ICE Raids: Myth Vs. Fact https://theonion.com/ice-raids-myth-vs-fact/
The possible set of evaluations starts with this space that summarizes the answer key -- (Q_a, Q_b, Q_c, ...) of dimension equal to the number of labels. But the set is finite inside that space. Crucially, all the statistics are integers between zero and some observable count; the same holds for the classifiers' statistics.
10.10.2025 18:00 — 👍 0 🔁 0 💬 0 📌 0
If there is an answer key, it maps to a point in the space defined by (Q_a, Q_b, ...) for the R labels in classification. Nothing in the logic can tell us what that point is just from observing test responses from experts. However, we have digitized our uncertainty of the answer key to a finite set.
10.10.2025 18:00 — 👍 0 🔁 0 💬 1 📌 0
Given counts of how classifiers agree/disagree on a finite test of size Q, you can express those observations as linear transformations from a space of the unknown statistics of the answer key and the correctness/errors of the classifiers.
Let's start with the answer key.
In general, knowing and applying the M=1 axioms for R-label classification gives you a reduction in uncertainty that goes as (1 - 1 / Q^(R-1)). The M=1 axioms are explained in my latest paper arxiv.org/abs/2510.00821
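A sketch of that linear relation for a single classifier, in plain Python; the hidden per-label counts are invented for illustration (chosen to match the (Q_a=8, Q_b=9, Q_c=8) example above):

```python
LABELS = ("a", "b", "c")

# Hypothetical hidden bookkeeping for one classifier on a Q = 25 item test:
# hidden[(true, response)] = number of items with that true label that the
# classifier answered with that response. Unsupervised evaluation never sees these.
hidden = {
    ("a", "a"): 3, ("a", "b"): 3, ("a", "c"): 2,
    ("b", "a"): 0, ("b", "b"): 6, ("b", "c"): 3,
    ("c", "a"): 0, ("c", "b"): 1, ("c", "c"): 7,
}

# The observable response counts are linear sums of those hidden statistics...
observed = {r: sum(hidden[(t, r)] for t in LABELS) for r in LABELS}
# ...and the answer-key statistics are the complementary linear sums.
q_point = {t: sum(hidden[(t, r)] for r in LABELS) for t in LABELS}

print(observed)  # {'a': 3, 'b': 10, 'c': 12}
print(q_point)   # {'a': 8, 'b': 9, 'c': 8}
```

The unsupervised setting only ever sees the `observed` sums; the axioms are the integer linear constraints tying them back to the hidden counts.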
10.10.2025 18:00 — 👍 0 🔁 0 💬 1 📌 0
The central question in any logic of unsupervised evaluation is -- what are the group evaluations logically consistent with how we observe experts agreeing/disagreeing on a test for which we have no answer key? This is what this series of plots for a single binary classifier that labeled Q=10 items shows.
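For that single-binary-classifier case, here is a plain-Python enumeration of what such plots would tabulate; the observed count R_a=6 is made up for the example:

```python
Q, R_a = 10, 6            # test size and an assumed observed count of 'a' responses
R_b = Q - R_a

consistent = []           # (Q_a, correct_on_a, correct_on_b) evaluations
for Q_a in range(Q + 1):  # every possible answer-key summary stays on the table
    Q_b = Q - Q_a
    for C_a in range(min(Q_a, R_a) + 1):
        for C_b in range(min(Q_b, R_b) + 1):
            # Observed 'a' responses = correct 'a's plus 'b' items mislabeled 'a'.
            if C_a + (Q_b - C_b) == R_a:
                consistent.append((Q_a, C_a, C_b))

print(len(consistent))    # 35 logically consistent evaluations for this observation
```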
10.10.2025 18:00 — 👍 1 🔁 0 💬 1 📌 0
If you listen to Carlini's earlier talks on adversarial images, this is not his tone. He adamantly stated that no safety was possible given his work and that of others. He never acknowledges that it isn't applied. His warnings of doom now ring equally hollow to me. youtu.be/umfeF0Dx-r4?...
10.10.2025 12:37 — 👍 0 🔁 0 💬 0 📌 0
In any debate, it would help if the participants had some measure of the reliability of the opinions of others and of their own. In addition, logical consistency can act as a way to warn us that at least one member of the debate is actually violating our accuracy specification.
10.10.2025 12:29 — 👍 0 🔁 0 💬 0 📌 0
Any dialogue framework would benefit from understanding the logic of unsupervised evaluation: what are the group evaluations logically consistent with how we observe experts agreeing/disagreeing in their decisions? Evaluate, then collectively decide.
10.10.2025 12:29 — 👍 0 🔁 0 💬 1 📌 0
Gonna pump my own paper here which is basically saying it’s the wrong problem to focus on for making good decisions in medicine 😅
www.sciencedirect.com/science/arti...
When no one in the room can be trusted, the question of who judges the judges can be partially answered by using the principle of logical consistency between disagreeing experts. The test itself, i.e. its answer key, should also be a suspect in unlabeled settings. arxiv.org/abs/2510.00821
10.10.2025 12:20 — 👍 0 🔁 0 💬 0 📌 0
Thank you. The logic is so basic that I believe it precedes information-theoretic assumptions, but I may be wrong. There is a geometry to the possible set of evaluations. In the space of label accuracies and prevalences, the axioms become polynomials. Algebraic geometry is then the tool to use.
10.10.2025 12:14 — 👍 1 🔁 0 💬 0 📌 0
Someday it will seem strange that seminars on the theoretical foundations of machine learning did not discuss the logic of unsupervised evaluation and its axioms. Simple linear relations that anyone with knowledge of high-school math can comprehend. arxiv.org/abs/2510.00821
09.10.2025 17:57 — 👍 2 🔁 0 💬 1 📌 0