
ACM SURE Workshop

@sureworkshop.bsky.social

The Workshop on Software Understanding and Reverse Engineering (SURE). Co-located at ACM CCS 2025 in Taiwan. https://sure-workshop.org/

30 Followers  |  1 Following  |  67 Posts  |  Joined: 25.04.2025

Latest posts by sureworkshop.bsky.social on Bluesky

Check out the paper:
sure-workshop.org/ac...

13.10.2025 08:11 · 👍 0  🔁 0  💬 0  📌 0

In the sub-area of type inference on binary code, Noriki's work explores the recovery of structs and how different GNN architectures compare in performance.

13.10.2025 08:11 · 👍 0  🔁 0  💬 1  📌 0
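
To make the setup concrete, here is a tiny, untrained sketch of the general idea in Python (my illustration, not Noriki's actual architecture): treat each variable or memory access in a function as a graph node, connect nodes that co-occur in an instruction, and let a small message-passing network predict a type label per node.

```python
# Toy illustration (not the paper's model). All sizes and numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

num_nodes, feat_dim, num_types = 5, 8, 3          # tiny toy sizes
X = rng.normal(size=(num_nodes, feat_dim))        # per-node features (e.g., access width, offset)
A = np.array([[0, 1, 0, 0, 1],                    # adjacency: which nodes co-occur
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(num_nodes)                     # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))          # normalize by node degree

W1 = rng.normal(size=(feat_dim, 16))              # message-passing weights (untrained)
W2 = rng.normal(size=(16, num_types))             # per-node type classifier

H = np.maximum(D_inv @ A_hat @ X @ W1, 0.0)       # one round of neighbor aggregation + ReLU
logits = D_inv @ A_hat @ H @ W2                   # second round, project to type logits
print(logits.argmax(axis=1))                      # predicted type class per node
```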

For our last presented work at SURE, we have Noriki Sakamoto presenting "Toward Inferring Structural Semantics from Binary Code Using Graph Neural Networks".

13.10.2025 08:11 · 👍 0  🔁 0  💬 1  📌 0

Check out the paper:
sure-workshop.org/ac...

13.10.2025 07:52 · 👍 0  🔁 0  💬 0  📌 0

Indeed, LibIHT is more robust. They achieve better results on binaries that attempt to evade their analysis.

13.10.2025 07:52 · 👍 0  🔁 0  💬 1  📌 0
Preview: GitHub - libiht/libiht: Intel Hardware Trace Library - Kernel Space Component

The magic happens at the kernel level. Their new tool, LibIHT (github.com/libiht/li...), is implemented at both the user-space and kernel-space levels.

This is important for speed and robustness against evasion techniques.

13.10.2025 07:52 · 👍 0  🔁 0  💬 1  📌 0
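
For intuition about what the captured data buys you (an invented illustration, not LibIHT's actual interface): once the kernel component has recorded hardware branch events such as LBR records, user space can replay the (source, target) pairs into a coarse trace of executed code ranges.

```python
# Illustration only: addresses and records below are made up.
branch_records = [                      # (branch source, branch target), oldest first
    (0x401010, 0x401200),               # call into a checker function
    (0x401230, 0x401015),               # return back to the caller
    (0x401020, 0x401400),               # call into another helper
]

def blocks_from_branches(records):
    """Each record's target starts a block that runs until the next record's source."""
    blocks = []
    for (_src, dst), nxt in zip(records, records[1:] + [(None, None)]):
        end = nxt[0] if nxt[0] is not None else dst
        blocks.append((dst, end))
    return blocks

for start, end in blocks_from_branches(branch_records):
    print(f"executed block: {start:#x} .. {end:#x}")
```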

Often, when static analysis tools do not work, you need to get down in the weeds of a program and start dynamically analyzing it.

In his work, Thomason explores a way to make dynamic analysis more robust and efficient by utilizing hardware features.

13.10.2025 07:52 · 👍 0  🔁 0  💬 1  📌 0

We're so back, and on to our last session: Applications & Future Work.
Changyu "Thomason" Zhao is presenting "LibIHT: A Hardware-Based Approach to Efficient and Evasion-Resistant Dynamic Binary Analysis".

He is presenting virtually.

13.10.2025 07:51 · 👍 0  🔁 0  💬 1  📌 0

Find the paper here:
sure-workshop.org/ac...

13.10.2025 06:59 · 👍 0  🔁 0  💬 0  📌 0

Now that you've got your crazy code, how do you select which functions in it to obfuscate and evaluate on?

Functions must be "sensitive" and "central". Sensitive: the function handles sensitive information such as a uid, gid, or password. Central: many other functions depend on it (via calls).

13.10.2025 06:59 · 👍 0  🔁 0  💬 1  📌 0
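
A rough sketch of how that selection could be scored (the criteria names come from the talk; the scoring itself is a toy illustration, not DEBRA's implementation):

```python
# Toy call graph and name-based heuristics; all data here is invented.
call_graph = {                       # caller -> callees
    "check_license": [],
    "load_config":   ["check_license"],
    "main":          ["load_config", "check_license", "print_usage"],
    "print_usage":   [],
}
SENSITIVE_HINTS = ("license", "passwd", "password", "uid", "gid", "key")

def in_degree(fn):
    """Centrality proxy: how many other functions call fn."""
    return sum(fn in callees for callees in call_graph.values())

def is_sensitive(fn):
    """Sensitivity proxy: the name suggests it touches secrets or identity."""
    return any(hint in fn.lower() for hint in SENSITIVE_HINTS)

candidates = [fn for fn in call_graph if is_sensitive(fn) and in_degree(fn) >= 2]
print(candidates)   # -> ['check_license']
```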

Real-world programs in their set need:
- unique functionality
- complex code
- ...

Some of the real programs: OpenSSL, QEMU, SQLite, curl, ... all difficult targets that are already hard to analyze even though they are not obfuscated.

13.10.2025 06:59 · 👍 0  🔁 0  💬 1  📌 0

An interesting observation: obfuscation is really expensive on the CPU. Real software doesn't obfuscate the entire program; it obfuscates only critical code locations like license checks.

So they construct their dataset with that in mind.

13.10.2025 06:59 · 👍 0  🔁 0  💬 1  📌 0

Dongpeng argues that many modern works in deobfuscation don't work on large, complex programs. Instead, they are mostly tested on toy programs that are not representative of the real world.

To make a more useful evaluation, they explore how real obfuscation is used.

13.10.2025 06:58 · 👍 0  🔁 0  💬 1  📌 0

We're on the last talk of the session, staying with the obfuscation topic. Dongpeng Xu is presenting "DEBRA: A Real-World Benchmark For Evaluating Deobfuscation Methods" in place of Zheyun Feng.

13.10.2025 06:58 · 👍 0  🔁 0  💬 1  📌 0

Interesting question: do specific features seem to matter more for the models? Example: constants.

So far, the answer is unclear. These models are very much black boxes and need more explainability work.

13.10.2025 06:39 · 👍 0  🔁 0  💬 0  📌 0

Takeaways:
- Training on obfuscation does help models, but it is not a silver bullet. The model does not work well on obfuscation techniques it has never seen before.

Check out the work:
sure-workshop.org/ac...

13.10.2025 06:38 · 👍 0  🔁 0  💬 1  📌 0

Some results: if you train on obfuscation, it turns out the model (BinShot, here) does do better on obfuscated code. However, which types of obfuscation you train on matters. For instance, training on control-flow flattening may not help at all with MBA (mixed boolean-arithmetic).

13.10.2025 06:38 · 👍 0  🔁 0  💬 1  📌 0
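
A hedged sketch of the cross-obfuscation evaluation this implies (my framing, not the paper's code): retrain the similarity model per training transform, then test it against every transform so accuracy drops on unseen transforms become visible. The `train_fn` and `model.predict` hooks are hypothetical placeholders.

```python
# Skeleton of a train-on-one, test-on-all evaluation loop.
TRANSFORMS = ["none", "flattening", "mba"]          # obfuscation settings to cross-test

def evaluate(model, pairs):
    """Fraction of (func_a, func_b, same_source) triples the model gets right."""
    correct = sum(model.predict(a, b) == same for a, b, same in pairs)
    return correct / len(pairs)

def cross_matrix(train_fn, datasets):
    """datasets[t]: labelled function pairs built with transform t;
    train_fn(dataset) returns a model exposing .predict(a, b)."""
    matrix = {}
    for train_t in TRANSFORMS:
        model = train_fn(datasets[train_t])          # retrain once per training transform
        for test_t in TRANSFORMS:
            matrix[(train_t, test_t)] = evaluate(model, datasets[test_t])
    return matrix
```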

The reasoning task: binary code similarity detection. Do these two code snippets come from the same source, and does obfuscation defeat the detection?

13.10.2025 06:38 · 👍 0  🔁 0  💬 1  📌 0
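
A minimal sketch of that task framing (not BinShot itself): embed two binary functions and threshold their cosine similarity. The byte-histogram embedding and the 0.8 threshold below are placeholders for illustration.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embed(code: bytes):
    """Dummy embedding for demonstration: a bag-of-bytes histogram.
    A real system would use a learned model here."""
    hist = np.bincount(np.frombuffer(code, dtype=np.uint8), minlength=256)
    return hist.astype(float)

def same_source(func_a: bytes, func_b: bytes, threshold: float = 0.8) -> bool:
    """Decide 'compiled from the same source?' by thresholding similarity."""
    return cosine(embed(func_a), embed(func_b)) >= threshold

# Two near-identical toy byte sequences (the second has an extra NOP).
print(same_source(b"\x55\x48\x89\xe5\xc3", b"\x55\x48\x89\xe5\x90\xc3"))
```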

They evaluate public obfuscation tools such as an LLVM obfuscator and the classic tool Tigress.

They have a few research questions; one interesting one:
does training on obfuscated code actually make the models better at reasoning about it?

13.10.2025 06:38 · 👍 0  🔁 0  💬 1  📌 0

When reasoning about code, does it matter if it is obfuscated? The answer feels like a strong YES; however, how much does it matter for AI?

Jiyong's work explores this idea in a measurable way.

13.10.2025 06:38 · 👍 0  🔁 0  💬 1  📌 0

Next up is "On the Learnability, Robustness, and Adaptability of Deep Learning Models for Obfuscation-applied Code," presented by Jiyong Uhm of Sungkyunkwan University.

13.10.2025 06:38 · 👍 1  🔁 0  💬 1  📌 0

To run those tests, you can use LLMs! They take the decompiled code as input and make the multiple-choice guess. It's important to measure the choice probabilities along the way.

Check out the paper:
sure-workshop.org/ac...

13.10.2025 06:10 · 👍 0  🔁 0  💬 0  📌 0
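
A tiny sketch of the probability-measuring step (the scoring function below is a placeholder, not a real LLM API): score each answer choice, then softmax the scores into a per-choice probability distribution.

```python
import math

def score_choice(question: str, choice: str) -> float:
    """Placeholder for a model's log-likelihood of `choice` given `question`.
    A real setup would sum the model's token log-probs for the choice text."""
    return -float(len(choice))        # dummy: shorter answers score higher

def choice_distribution(question, choices):
    logps = [score_choice(question, c) for c in choices]
    m = max(logps)
    weights = [math.exp(lp - m) for lp in logps]   # softmax over choice scores
    total = sum(weights)
    return {c: w / total for c, w in zip(choices, weights)}

q = "Which body implements getLength?"
print(choice_distribution(q, ["return *(x + 8)", "return x._length"]))
```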

Decoys are wrong answers on a test that could plausibly be guessed. Instead of just using random wrong answers, use plausible, commonly guessed answers as decoys, and see if the results still hold.

13.10.2025 06:10 · 👍 0  🔁 0  💬 1  📌 0

Something cool:
You get the following choices:
A: return *(x + 8)
B: return x._length

Both give you a similar probability of being chosen as `getLength` on a random test, which is unexpected. This is why you need stronger decoys!

13.10.2025 06:10 · 👍 0  🔁 0  💬 1  📌 0

Florian's work argues that we need to start integrating methodology from the natural sciences, like multiple-choice tests that evaluate how well your data helps people answer the test.

13.10.2025 06:10 · 👍 0  🔁 0  💬 1  📌 0

Let's start with variable/property naming inside of decompilation. Given many different choices for the same name, does one lead an LLM or a human to make a decision that is more correct or predictable?


13.10.2025 06:10 · 👍 1  🔁 0  💬 1  📌 0

Speaking of evaluating things, how do you evaluate whether your decompilation (or other tool) is actually helping you understand software better?

Florian Magin is back to present:
"Towards Scalable Evaluation of Software Understanding: A Methodology Proposal"

13.10.2025 06:10 · 👍 0  🔁 0  💬 1  📌 0

An interesting finding:
- On coverage, decompilers seem to perform better at type inference on O2 than on O0, but, as expected, this does not hold for other metrics.

Find the paper here:
sure-workshop.org/ac...

13.10.2025 05:52 · 👍 0  🔁 0  💬 0  📌 0
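
One plausible reading of the coverage metric (a paraphrase, not necessarily the paper's exact definition): the fraction of ground-truth variables for which the decompiler emitted any concrete type at all.

```python
# Illustrative data only; the variable names and types are invented.
def type_coverage(recovered: dict, ground_truth: dict) -> float:
    """recovered maps variable name -> decompiler type string (or None);
    ground_truth maps variable name -> true type."""
    covered = sum(
        1 for var in ground_truth
        if recovered.get(var) not in (None, "", "undefined")
    )
    return covered / len(ground_truth)

print(type_coverage(
    {"a": "int", "b": "undefined", "c": "QWORD"},
    {"a": "int", "b": "char *", "c": "long", "d": "struct foo"},
))   # -> 0.5
```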

While evaluating, they focus on a few essential types:
Primitives:
-> char, int, long long, ...
-> Pointers

Complex Types:
-> Structs
-> Arrays

They find that complex types are where much work still remains.

13.10.2025 05:52 · 👍 0  🔁 0  💬 1  📌 0

One problem is that not everyone speaks the same language. Some decompilers report types as QWORD, some say int, some say undefined.

Vedant's work normalizes these differences to make the evaluation more fair.

13.10.2025 05:52 · 👍 0  🔁 0  💬 1  📌 0
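
A small sketch of the normalization idea (the mapping is illustrative, not Vedant's exact table): fold each decompiler's spelling into one canonical vocabulary before comparing tools against each other.

```python
# Illustrative canonicalization table; extend per decompiler as needed.
CANONICAL = {
    "qword": "uint64", "dword": "uint32", "word": "uint16", "byte": "uint8",
    "int": "int32", "uint": "uint32", "char": "int8",
    "undefined": "unknown", "undefined8": "unknown",
}

def normalize(type_str: str) -> str:
    t = type_str.strip().lower()
    if t.endswith("*"):
        return "pointer"                      # collapse all pointer spellings
    return CANONICAL.get(t, t)                # fall back to the raw name

print(normalize("QWORD"), normalize("Int"), normalize("undefined"), normalize("char *"))
# -> uint64 int32 unknown pointer
```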
