Discussion
Hallucination
shiandow: These seem to be different tests? One has 6 tasks; the other has 30.
yorwba: Yeah, of those 6 tasks, only "halluc-doc-http-handler" isn't within 1% of the previous result. 86.6% is 13/15 rounded down, so if they sampled 15 attempts for that task, the probability of getting 100% when the true success rate was 13/15 would be (13/15)^15 > 0.11, which is not all that unlikely.
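For anyone who wants to check that arithmetic, here is a minimal sketch in Python. It assumes, as the comment does, that the task was sampled 15 times and that attempts are independent with a true success rate of 13/15; the function name `prob_all_succeed` is made up for illustration and is not from the benchmark.

```python
import math
from fractions import Fraction


def prob_all_succeed(true_rate: Fraction, attempts: int) -> float:
    """Chance that every one of `attempts` independent tries succeeds."""
    return float(true_rate) ** attempts


# Assumption: 15 attempts per task, 13 of which succeeded in the earlier run.
rate = Fraction(13, 15)

# 13/15 expressed as a percentage and truncated (rounded down) to one decimal.
truncated_pct = math.floor(float(rate) * 1000) / 10

# Probability of a perfect 15/15 run if the true rate really is 13/15.
p_perfect = prob_all_succeed(rate, 15)

print(f"13/15 rounded down: {truncated_pct}%")
print(f"P(15/15 successes): {p_perfect:.3f}")
```

The result comes out a bit above 0.11, so a jump from 86.6% to 100% on a single 15-sample task is well within what sampling noise alone would produce.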