
Marcel Böhme

@mboehme.bsky.social

Software Security @MPI, PhD @NUS, Dipl.-Inf. @TUDresden. Research Group: http://mpi-softsec.github.io

839 Followers  |  417 Following  |  146 Posts  |  Joined: 17.11.2024  |  2.4042

Latest posts by mboehme.bsky.social on Bluesky

⏱️ 9 days until submission deadline (Dec 11, 23:59 AoE).

Organized by: @yannicnoller.bsky.social, @rohan.padhye.org, @ruijiemeng.bsky.social, and Laszlo Szekeres (@lszekeres.bsky.social).

03.12.2025 10:59 — 👍 4    🔁 5    💬 0    📌 0

Dearly beloved, we are gathered here today to celebrate this thing called ASE 2025 ;) @aseconf.bsky.social @mboehme.bsky.social @llingming.bsky.social

17.11.2025 03:18 — 👍 14    🔁 3    💬 0    📌 0

🎙️ #ASE2025 Keynote Speaker Series (1 of 3)

What do symbolic model checking, path profiling, and quantum simulation have in common? 🤔

Find out from Prof. Reps (University of Wisconsin-Madison) in his ASE2025 Keynote “We Will Publish No Algorithm Before Its Time”!

conf.researchr.org/track/ase-20...

22.10.2025 11:39 — 👍 10    🔁 3    💬 0    📌 1

🎙️ ASE 2025 Keynote Speaker Series (3 of 3)

Prof. Taesoo Kim (Georgia Tech)
“Hyperscale Bug Finding and Fixing: DARPA AIxCC”

conf.researchr.org/track/ase-20...

28.10.2025 07:44 — 👍 4    🔁 2    💬 1    📌 0

🎙️ #ASE2025 Keynote Speaker Series (2 of 3)

Dr. Cristina Cifuentes, Vice President @ Oracle Software Assurance

“Oracle Parfait – Detecting Application Vulnerabilities at Scale – Past, Present and Future”

26.10.2025 03:19 — 👍 6    🔁 2    💬 1    📌 2

Awesome! Also, I'll be happy to catch up in Seoul in the week after next if you are around for ASE :)

09.11.2025 13:29 — 👍 0    🔁 0    💬 1    📌 0

On the negative side, the AI reviewer seems to be worse at setting priorities, i.e., distinguishing between critical and insubstantial problems w.r.t. the main claims. Moreover, it was convincingly incorrect, whereas a human reviewer might be incorrect and detectably "silent" on the rationale.
2/2

09.11.2025 08:59 — 👍 0    🔁 0    💬 0    📌 0

Great question!

On the positive side, I found the AI reviewer *way* more elaborate in eliciting both the positive and negative points. The review is more objective, less/not opinionated. It is more constructive and for every weakness makes suggestions for improvements.

1/

09.11.2025 08:50 — 👍 0    🔁 0    💬 1    📌 0

Exactly. This is our assumption. Also, there can be infinitely many ways to implement that function.

09.11.2025 07:44 — 👍 1    🔁 0    💬 2    📌 0

bsky.app/profile/mboe...

08.11.2025 20:04 — 👍 0    🔁 0    💬 0    📌 0

Overall, the AI reviewer is super impressive! I think it would help me tremendously during the preparation of our submission to identify points to improve before the paper is submitted.

However, it does make errors, and I wouldn't trust it as an actual (co)-reviewer.

12/12

08.11.2025 19:51 — 👍 1    🔁 0    💬 1    📌 0

The AI reviewer lists several other items as weaknesses and the corresponding suggestions for improvement. These are summarily deemed to be fixable. Yay!

11/

08.11.2025 19:51 — 👍 0    🔁 0    💬 1    📌 0

The fourth weakness is a set of presentation issues. These are helpful but easily fixed.

10/

08.11.2025 19:51 — 👍 0    🔁 0    💬 1    📌 0

The third weakness is a matter of preference.

Our theorem expresses what (and how efficiently) we can learn about detecting non-zero incoherence given the alg. output: "If after n(δ,ε) samples we detect no disagreement, then incoherence is at most ε with prob. at least 1-δ".

9/

08.11.2025 19:51 — 👍 0    🔁 0    💬 1    📌 0
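
The guarantee quoted in post 9/ has the shape of a zero-failure bound. As a rough sketch only: assuming each sample independently shows a disagreement with probability equal to the incoherence, the chance of seeing no disagreement in n samples when incoherence exceeds ε is at most (1-ε)^n, and requiring that to be at most δ gives n ≥ ln(δ)/ln(1-ε). This is a textbook bound, not necessarily the n(δ,ε) from the paper; the function name `samples_needed` is made up for illustration.

```python
import math


def samples_needed(delta: float, epsilon: float) -> int:
    """Smallest n with (1 - epsilon)**n <= delta.

    Assumption (not from the paper): samples are independent and each one
    reveals a disagreement with probability equal to the incoherence.
    """
    if not (0.0 < delta < 1.0 and 0.0 < epsilon < 1.0):
        raise ValueError("delta and epsilon must lie strictly between 0 and 1")
    return math.ceil(math.log(delta) / math.log(1.0 - epsilon))


# Example: confidence 95% (delta = 0.05) that incoherence is at most 5%.
n = samples_needed(delta=0.05, epsilon=0.05)
print(n)
```

Under these assumptions, tightening either δ or ε increases the number of disagreement-free samples required.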

The second weakness is incorrect.

Our incoherence-based detection reports indeed no false positives: A non-zero incoherence implies a non-zero error, even empirically. To cite the AI reviewer: "If two sampled programs disagree on an input, at least one of them is wrong".

8/

08.11.2025 19:51 — 👍 1    🔁 0    💬 1    📌 0

Using our own formalization, the AI reviewer literally proposes a fix where we compute pass@1 as the proportion of tasks with non-zero error. Nice!

7/

08.11.2025 19:51 — 👍 2    🔁 0    💬 1    📌 0

The first weakness seems critical. The AI reviewer finds an error in a "key equation".

Nothing reject-worthy: Not a key equation but a remark, and the error is just a typo; e.g., 1-\bar{E}(S,1,1) fixes it.

But YES, the AI reviewer found a bug in our Equation (12). Wow!!

6/

08.11.2025 19:51 — 👍 2    🔁 0    💬 1    📌 0

Let's now look at the weaknesses that our AI reviewer has found in our paper. Weaknesses are usually the reasons why a paper is rejected. They better be correct.

5/

08.11.2025 19:51 — 👍 0    🔁 0    💬 1    📌 0

The three other strengths are also all items that we would consider as important strengths of our paper.

Interestingly, one item (highlighted in blue) is never mentioned in our paper, but something we are now actively pursuing.

4/

08.11.2025 19:51 — 👍 1    🔁 0    💬 1    📌 0

Our AI reviewer establishes 4 strengths. Most human reviewers might have a hard time establishing strengths so succinctly and at this level of detail. Great!

Apart from a minor error (no executable semantics needed; only ability to execute), the first strength looks good.

3/

08.11.2025 19:51 — 👍 0    🔁 0    💬 1    📌 0

Let's find out which strengths the AI reviewer establishes for our paper. We'll look at weaknesses (those that get papers rejected) later.

The summary of review definitely hits the nail on the head. We can see motivation and main contributions. Nice!

2/

08.11.2025 19:51 — 👍 0    🔁 0    💬 1    📌 0

🧵 A human review of our AI review at #AAAI26.

📝: arxiv.org/abs/2507.00057
🦋 : bsky.app/profile/did:...

We are off to a good start. While the synopsis misses the motivation (*why* this is interesting), it offers the most important points. Good abstract-length summary.

1/

08.11.2025 19:51 — 👍 2    🔁 0    💬 1    📌 1

For any programming task d, there exists only one (unknown) ground-truth function f.

08.11.2025 10:07 — 👍 0    🔁 0    💬 2    📌 0

Thanks Toby! Yes. Our assumption (which resolves both of your cases) is that for any programming task d there exists a correct, deterministic function f that the LLM is meant to implement. That function can be unknown and there can be many (correct) implementations of f in any programming language.

08.11.2025 10:03 — 👍 1    🔁 0    💬 1    📌 0

What's most interesting is that we linked pass@1, a common measure of LLM performance, to incoherence and found that a ranking of LLMs using our metric is similar to a pass@1-based ranking, which requires some explicit definition of correctness, e.g., manually-provided "golden" programs.

08.11.2025 08:00 — 👍 2    🔁 0    💬 0    📌 0

This is easy to see. For instance, if we prompt an LLM twice to write a program that reverses a list and both programs disagree on the output for an input, then at least one of those programs must be incorrect.

08.11.2025 08:00 — 👍 1    🔁 0    💬 2    📌 0
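
The list-reversal example can be sketched as a small differential check. Both programs below are hypothetical stand-ins for two LLM samples (the second has a deliberate bug), and `find_disagreement` is an illustrative helper, not code from the paper.

```python
from typing import Callable, Iterable, List, Optional

Program = Callable[[List[int]], List[int]]


def reverse_a(xs: List[int]) -> List[int]:
    """Hypothetical first sample: reverses the list via slicing."""
    return xs[::-1]


def reverse_b(xs: List[int]) -> List[int]:
    """Hypothetical second sample with a deliberate bug:
    it sorts in descending order instead of reversing."""
    return sorted(xs, reverse=True)


def find_disagreement(
    p: Program, q: Program, inputs: Iterable[List[int]]
) -> Optional[List[int]]:
    """Return the first input on which p and q produce different outputs,
    or None if they agree on every given input."""
    for xs in inputs:
        # Pass copies so a program that mutates its argument cannot affect the other.
        if p(list(xs)) != q(list(xs)):
            return xs
    return None


inputs = [[], [1], [1, 2, 3], [1, 3, 2]]
witness = find_disagreement(reverse_a, reverse_b, inputs)
print(witness)
```

The two programs agree on already-sorted inputs such as `[1, 2, 3]`, so the disagreement only shows up on an input like `[1, 3, 2]`. No reference implementation is consulted: the disagreement alone shows that at least one of the two programs is wrong.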

We define *incorrectness* as the probability that the generated program does not implement the correct function.

We define *incoherence* as the probability that any two generated programs implement different functions and prove that incoherence is a lower bound on incorrectness.

08.11.2025 08:00 — 👍 1    🔁 0    💬 1    📌 0
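
As a minimal sketch of how such an incoherence value might be estimated empirically: draw two programs and one input per trial and count how often the outputs differ. Everything here is an assumption for illustration. The pool `POOL` stands in for sampling from an LLM, the task is "sort numbers", and disagreement on a random input is only a proxy for "implement different functions".

```python
import random
from typing import Callable, List

Program = Callable[[List[int]], List[int]]


def estimate_incoherence(
    sample_program: Callable[[], Program],
    sample_input: Callable[[], List[int]],
    trials: int,
) -> float:
    """Fraction of trials in which two independently sampled programs
    disagree on a freshly sampled input."""
    disagreements = 0
    for _ in range(trials):
        p, q = sample_program(), sample_program()
        xs = sample_input()
        if p(list(xs)) != q(list(xs)):
            disagreements += 1
    return disagreements / trials


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical stand-in for LLM samples: two correct sorters and one buggy
# variant that silently drops duplicates.
POOL: List[Program] = [
    lambda xs: sorted(xs),
    lambda xs: sorted(xs),
    lambda xs: sorted(set(xs)),
]


def sample_program() -> Program:
    return rng.choice(POOL)


def sample_input() -> List[int]:
    # Five values drawn from 0..3 always contain a duplicate.
    return [rng.randint(0, 3) for _ in range(5)]


estimate = estimate_incoherence(sample_program, sample_input, trials=1000)
print(f"estimated incoherence: {estimate:.3f}")
```

Note that the estimate needs no oracle: it only compares sampled programs with each other, never with a ground-truth implementation.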

The key idea is to think of the program that is generated by an LLM for some natural language description (e.g., "Implement a function that sorts numbers") as a *random variable*. This allows us to introduce a precise probabilistic interpretation of correctness.

08.11.2025 08:00 — 👍 0    🔁 0    💬 1    📌 0

Just accepted at #AAAI26 in Singapore: Our paper on estimating the *correctness* of LLM-generated code in the absence of oracles (e.g., a ground-truth implementation).

📝 arxiv.org/abs/2507.00057

with Thomas Valentin (ENS Paris-Saclay), Ardi Madadi, and Gaetano Sapia (#MPI_SP).

08.11.2025 08:00 — 👍 11    🔁 1    💬 1    📌 1

Thanks Dominik! It's a follow-up to our ICSE'25 paper.
bsky.app/profile/mboe...

04.11.2025 07:39 — 👍 1    🔁 0    💬 1    📌 0
