Discussion
N-Day-Bench
mbbutler: It would be helpful to also add some cases that contain no vulnerabilities, to assess the false-positive rate as well.
cortesoft: Any code that we're certain doesn't have any vulnerabilities is going to be pretty trivial to verify.
Cynddl: > Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.

Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?

It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, the results can be very noisy and meaningless.
rohansood15: I worked in AppSec in the past, made sense to me. Maybe you aren't the target audience?

You don't really need manual verification for these, the CVEs (vulnerabilities) are public and can be programmatically validated.
johnfn: Is this really that hard to parse?

Curator and Finder are the names of the agents. "answer key" - haven't you ever taken a test in high school? It's an explanation of the answer. "shell steps" I presume means it gets to run 24 commands on the shell. "structured report" - do I really need to explain to you what a report is? "sink hints" - I admit I didn't know this one, but a bit of searching indicates that it's a hint at where the vulnerability lies.
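For readers still parsing the quoted description, the three-agent loop can be sketched in a few lines of Python. This is a minimal illustration with stub bodies throughout; the function names, field names, and scoring rule here are assumptions for clarity, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

MAX_SHELL_STEPS = 24  # the Finder's exploration budget, per the description above


@dataclass
class Case:
    advisory: str   # public CVE advisory text
    sink_hint: str  # starting hint handed to the Finder


def curator(case: Case) -> dict:
    """Reads the advisory and builds the hidden answer key (stubbed)."""
    return {"root_cause": f"parsed from: {case.advisory}"}


def finder(case: Case, shell_budget: int = MAX_SHELL_STEPS) -> dict:
    """Model under test: explores the code and files a structured report.
    Here the 'exploration' is stubbed; a real Finder would run shell
    commands until confident or out of budget."""
    steps_used = 0
    for _ in range(shell_budget):
        steps_used += 1
        break  # stub: stop after one step
    return {"claimed_cause": f"traced from: {case.sink_hint}", "steps": steps_used}


def judge(report: dict, answer_key: dict) -> bool:
    """Scores the blinded report against the answer key (stub comparison)."""
    return ("traced" in report["claimed_cause"]
            and "parsed" in answer_key["root_cause"])


case = Case(advisory="CVE advisory details", sink_hint="handler in parse_input()")
key = curator(case)       # Curator never shares the key with the Finder
report = finder(case)     # Finder never sees the patch or the key
print(judge(report, key))
```

The point of the structure is the blinding: the Finder sees only the code and the sink hint, while the Judge sees only the report and the answer key.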
sacrelege: Thanks for putting N-Day-Bench together - really interesting benchmark design and results.I'd love to see how the model we serve, Qwen3.5 122B A10B, stacks up against the rest on this benchmark. AI Router Switzerland (aiRouter.ch) can sponsor free API access for about a month if that helps for adding it to the evaluation set.
StrauXX: Do you plan on adding more models in the future? I would love to see how other OSS models like Gemma, GPT-OSS and Qwen fare.
muldvarp: Manual verification that the "judge" judges correctly.Also, how exactly do you programmatically validate CVEs?
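For what it's worth, one plausible shape for the programmatic validation rohansood15 mentions is to compare the files a Finder's report blames against the files the public fix commit touched. The check, file names, and data below are illustrative assumptions, not the benchmark's actual method:

```python
def validate_report(reported_files: set[str], patched_files: set[str]) -> bool:
    """A report is plausible if it blames at least one file
    that the public patch actually touched."""
    return bool(reported_files & patched_files)


# Hypothetical example data: file paths invented for illustration.
patched = {"src/parser.c", "src/util.c"}  # from the public fix commit
good_report = {"src/parser.c"}            # blames a patched file
bad_report = {"src/render.c"}             # blames an unrelated file

print(validate_report(good_report, patched))  # → True
print(validate_report(bad_report, patched))   # → False
```

A check like this only validates the report against public ground truth; whether the Judge's finer-grained scoring is reliable is a separate question.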