⏱️ 9 days until submission deadline (Dec 11, 23:59 AoE).
Organized by: @yannicnoller.bsky.social, @rohan.padhye.org, @ruijiemeng.bsky.social, and Laszlo (@lszekeres.bsky.social) Szekeres.
Dearly beloved, we are gathered here today to celebrate this thing called ASE 2025 ;) @aseconf.bsky.social @mboehme.bsky.social @llingming.bsky.social
17.11.2025 03:18 — 👍 14 🔁 3 💬 0 📌 0
🎙️ #ASE2025 Keynote Speaker Series (1 of 3)
What do symbolic model checking, path profiling, and quantum simulation have in common? 🤔
Find out from Prof. Reps (University of Wisconsin-Madison) in his ASE2025 Keynote “We Will Publish No Algorithm Before Its Time”!
conf.researchr.org/track/ase-20...
🎙️ ASE 2025 Keynote Speaker Series (3 of 3)
Prof. Taesoo Kim (Georgia Tech)
“Hyperscale Bug Finding and Fixing: DARPA AIxCC”
conf.researchr.org/track/ase-20...
🎙️ #ASE2025 Keynote Speaker Series (2 of 3)
Dr. Cristina Cifuentes, Vice President @ Oracle Software Assurance
“Oracle Parfait – Detecting Application Vulnerabilities at Scale – Past, Present and Future”
Awesome! Also, I'll be happy to catch up in Seoul in the week after next if you are around for ASE :)
09.11.2025 13:29 — 👍 0 🔁 0 💬 1 📌 0
On the negative side, the AI reviewer seems to be worse at setting priorities, i.e., at distinguishing between critical and insubstantial problems w.r.t. the main claims. Moreover, it was convincingly incorrect, whereas a human reviewer might be incorrect and detectably "silent" on the rationale.
2/2
Great question!
On the positive side, I found the AI reviewer *way* more elaborate in eliciting both the positive and the negative points. The review is more objective and less opinionated (if opinionated at all). It is more constructive and makes suggestions for improvement for every weakness.
1/
Exactly. This is our assumption. Also, there can be infinitely many ways to implement that function.
09.11.2025 07:44 — 👍 1 🔁 0 💬 2 📌 0
bsky.app/profile/mboe...
08.11.2025 20:04 — 👍 0 🔁 0 💬 0 📌 0
Overall, the AI reviewer is super impressive! I think it would help me tremendously during the preparation of our submission to identify points to improve before the paper is submitted.
However, it does make errors, and I wouldn't trust it as an actual (co-)reviewer.
12/12
The AI reviewer lists several other items as weaknesses, together with corresponding suggestions for improvement. We deem all of them fixable. Yay!
11/
The fourth weakness is a set of presentation issues. These pointers are helpful, and the issues are easily fixed.
10/
The third weakness is a matter of preference.
Our theorem expresses what (and how efficiently) we can learn about detecting non-zero incoherence given the algorithm's output: "If after n(δ,ε) samples we detect no disagreement, then incoherence is at most ε with probability at least 1-δ".
9/
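To get a feel for the numbers, here is a minimal Python sketch of the textbook one-sided sample bound; the exact n(δ,ε) in our paper may differ, so treat this purely as an illustration of the flavor of the guarantee.

```python
import math

def sample_bound(eps: float, delta: float) -> int:
    """Textbook one-sided bound (illustrative; the paper's n(δ,ε) may differ):
    if the true incoherence were > eps, the chance of observing zero
    disagreements in n independent pair-samples is at most
    (1 - eps)^n <= exp(-eps * n). Requiring exp(-eps * n) <= delta
    yields n >= ln(1/delta) / eps."""
    return math.ceil(math.log(1.0 / delta) / eps)

# e.g., to conclude "incoherence <= 5%" with confidence 95%,
# it suffices to observe no disagreement among:
print(sample_bound(eps=0.05, delta=0.05))  # 60 pair-samples
```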
The second weakness is incorrect.
Our incoherence-based detection indeed reports no false positives: A non-zero incoherence implies a non-zero error, even empirically. To cite the AI reviewer: "If two sampled programs disagree on an input, at least one of them is wrong".
8/
Using our own formalization, the AI reviewer literally proposes a fix where we compute pass@1 as the proportion of tasks with non-zero error. Nice!
7/
The first weakness seems critical. The AI reviewer finds an error in a "key equation".
Nothing reject-worthy: Not a key equation but a remark, and the error is just a typo; e.g., 1-\bar{E}(S,1,1) fixes it.
But YES, the AI reviewer found a bug in our Equation (12). Wow!!
6/
Let's now look at the weaknesses that our AI reviewer has found in our paper. Weaknesses are usually the reasons why a paper is rejected. They better be correct.
5/
The three other strengths are also all items that we would consider important strengths of our paper.
Interestingly, one item (highlighted in blue) is never mentioned in our paper, but something we are now actively pursuing.
4/
Our AI reviewer establishes 4 strengths. Most human reviewers might have a hard time establishing strengths so succinctly and at this level of detail. Great!
Apart from a minor error (no executable semantics needed; only the ability to execute), the first strength looks good.
3/
Let's find out which strengths the AI reviewer establishes for our paper. We'll look at weaknesses (those that get papers rejected) later.
The summary of the review definitely hits the nail on the head. We can see motivation and main contributions. Nice!
2/
🧵 A human review of our AI review at #AAAI26.
📝: arxiv.org/abs/2507.00057
🦋 : bsky.app/profile/did:...
We are off to a good start. While the synopsis misses the motivation (*why* this is interesting), it offers the most important points. Good abstract-length summary.
1/
For any programming task d, there exists only one (unknown) ground-truth function f.
08.11.2025 10:07 — 👍 0 🔁 0 💬 2 📌 0
Thanks Toby! Yes. Our assumption (which resolves both of your cases) is that for any programming task d there exists a correct, deterministic function f that the LLM is meant to implement. That function can be unknown, and there can be many (correct) implementations of f in any programming language.
08.11.2025 10:03 — 👍 1 🔁 0 💬 1 📌 0
What's most interesting is that we linked pass@1, a common measure of LLM performance, to incoherence and found that a ranking of LLMs using our metric is similar to a pass@1-based ranking, which requires some explicit definition of correctness, e.g., manually-provided "golden" programs.
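As a toy illustration of that claim (with made-up numbers, not results from the paper), one can compare the two rankings via a rank correlation:

```python
from scipy.stats import spearmanr  # SciPy's standard rank-correlation helper

# Hypothetical scores for three models, for illustration only.
# pass@1 needs golden programs/tests; incoherence needs neither.
pass_at_1   = {"model_a": 0.62, "model_b": 0.48, "model_c": 0.55}
incoherence = {"model_a": 0.21, "model_b": 0.40, "model_c": 0.30}

models = sorted(pass_at_1)
# Lower incoherence should go with higher pass@1, so ranking agreement
# shows up as a strongly negative rank correlation.
rho, _ = spearmanr([pass_at_1[m] for m in models],
                   [incoherence[m] for m in models])
print(rho)  # -1.0 on this toy data: both metrics induce the same model ranking
```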
08.11.2025 08:00 — 👍 2 🔁 0 💬 0 📌 0
This is easy to see. For instance, if we prompt an LLM twice to write a program that reverses a list and both programs disagree on the output for an input, then at least one of those programs must be incorrect.
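In code, the argument looks roughly like this (the two implementations below are hypothetical stand-ins for two LLM samples):

```python
# Two hypothetical "LLM-sampled" implementations of list reversal.
def reverse_a(xs):
    return xs[::-1]    # correct

def reverse_b(xs):
    return xs[:0:-1]   # buggy: silently drops the first element

# Differential check: no oracle needed, only the ability to execute.
inp = [1, 2, 3]
if reverse_a(inp) != reverse_b(inp):
    # Both cannot match the (unknown) ground-truth output on this input,
    # so at least one of the two programs must be incorrect.
    print("disagreement on", inp, "-> at least one sample is wrong")
```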
08.11.2025 08:00 — 👍 1 🔁 0 💬 2 📌 0
We define *incorrectness* as the probability that the generated program does not implement the correct function.
We define *incoherence* as the probability that any two generated programs implement different functions and prove that incoherence is an upper bound on incorrectness.
The key idea is to think of the program that is generated by an LLM for some natural language description (e.g., "Implement a function that sorts numbers") as a *random variable*. This allows us to introduce a precise probabilistic interpretation of correctness.
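Under this interpretation, incoherence can be estimated empirically by sampling programs and checking pairwise disagreement. A minimal sketch (assuming a hypothetical sample_program() that queries the LLM and returns an executable function; not the paper's exact estimator):

```python
import itertools

def disagree(p, q, test_inputs):
    """True iff the two programs differ on some test input, i.e., they
    observably implement different functions."""
    return any(p(x) != q(x) for x in test_inputs)

def estimate_incoherence(sample_program, test_inputs, k=10):
    """Monte-Carlo estimate: draw k programs (k realizations of the
    'program as random variable') and return the fraction of pairs
    that disagree on at least one test input."""
    programs = [sample_program() for _ in range(k)]
    pairs = list(itertools.combinations(programs, 2))
    return sum(disagree(p, q, test_inputs) for p, q in pairs) / len(pairs)
```

Since incoherence upper-bounds incorrectness, a low estimate is evidence of correctness without any golden program.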
08.11.2025 08:00 — 👍 0 🔁 0 💬 1 📌 0
Just accepted at #AAAI26 in Singapore: Our paper on estimating the *correctness* of LLM-generated code in the absence of oracles (e.g., a ground-truth implementation).
📝 arxiv.org/abs/2507.00057
with Thomas Valentin (ENS Paris-Saclay), Ardi Madadi, and Gaetano Sapia (#MPI_SP).
Thanks Dominik! It's a follow-up to our ICSE'25 paper.
bsky.app/profile/mboe...