Discussion
Sup AI: The Most Accurate AI in Existence
siliconc0w: Do you have data for other benchmarks? +7% for HLE isn't nothing but it'd be more compelling if you could show you're consistently doing better with your method across more domains (especially coding, which seems like the primary use-case these days).
wavemode: Is 7 extra percent on HLE benchmark really worth the cost of running an entire ensemble of models?
kelseyfrog: Depends on the use-case and requirements.
Tomjosetj31: Impressive result on HLE if the methodology holds up. One thing I'd want to understand better: how much of the gain comes from the entropy weighting specifically vs. simply having more compute via parallel inference? Would be curious to see an ablation — same models, same budget, but with naive majority voting instead. That would isolate the actual contribution of your confidence-weighting approach.
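To make the ablation Tomjosetj31 is asking for concrete, here is a toy sketch of the two voting schemes being compared. The 1/(1+H) weighting is a made-up placeholder for illustration, not Sup AI's actual formula:

```python
from collections import Counter

def majority_vote(answers):
    """Naive majority vote: every model's answer counts equally."""
    return Counter(answers).most_common(1)[0][0]

def entropy_weighted_vote(answers, entropies):
    """Confidence-weighted vote: lower mean output entropy -> larger weight.
    The 1 / (1 + H) weighting is a placeholder assumption, not the paper's."""
    scores = {}
    for ans, h in zip(answers, entropies):
        scores[ans] = scores.get(ans, 0.0) + 1.0 / (1.0 + h)
    return max(scores, key=scores.get)

answers   = ["A", "B", "B"]        # one confident model vs. two uncertain ones
entropies = [0.1, 3.0, 2.8]        # per-model mean token entropy (toy values)

print(majority_vote(answers))                      # "B" (headcount wins)
print(entropy_weighted_vote(answers, entropies))   # "A" (confidence wins)
```

The point of the ablation would be to check whether cases like this one, where a confident minority overrides an uncertain majority, actually account for the reported gain.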
scottmu: I want to clarify what Ken meant by "entropy in the output token probability distributions." Whenever an LLM outputs a token, it's choosing that token out of all possible tokens. Every possible output token has a probability assigned by the model (typically exposed as the logarithm of the probability). These probabilities form a distribution (the output token probabilities sum to 1). Entropy is a measure of uncertainty and quantifies whether a token probability distribution is certain (one token has a 99.9% probability, and the rest share the leftover 0.1%) or uncertain (every token has roughly the same probability, so it's pretty much random which token is selected). Low entropy is the former case, and high entropy is the latter.

There is interesting research on the correlation of entropy with accuracy and hallucinations:

- https://www.nature.com/articles/s41586-024-07421-0
- https://arxiv.org/abs/2405.19648
- https://arxiv.org/abs/2509.04492 (when only a small number of probabilities are available, which is something we frequently deal with)
- https://arxiv.org/abs/2603.18940
- tons more, happy to chat about it if interested
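The two cases scottmu describes can be sketched in a few lines of Python (a toy illustration over a hypothetical 50,000-token vocabulary, not the actual ensemble code):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

VOCAB = 50000

# Certain case: one token has 99.9% probability, the rest share 0.1%.
confident = [0.999] + [0.001 / (VOCAB - 1)] * (VOCAB - 1)

# Uncertain case: every token is roughly equally likely.
uniform = [1.0 / VOCAB] * VOCAB

print(entropy(confident))  # low, about 0.03 bits
print(entropy(uniform))    # high, log2(50000), about 15.61 bits
```

The uniform distribution hits the maximum possible entropy for the vocabulary size, which is why near-random next-token choices stand out so clearly against confident ones.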
stephantul: Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?
scottmu: You're right! I could've phrased my comment better. Ken actually wanted to edit his post, but it was too late. So he asked me to write a response explaining what he meant. Of course, he could've commented too. I was just trying to be helpful to him and others wanting an explanation.
mememememememo: Wow, if it is that easy to detect hallucinations, are the big models or rigs (agentic scaffolds) building in any self-correcting behaviour? Or possibly switching to an "I don't know" mode so the model can ask the human for help?

Maybe this insight is why I feel hallucinations are much rarer in the last 12 months on top models. Are they being detected before they get sent out?
scottmu: I wouldn't say it's easy to detect hallucinations. Understanding output token probability distributions is only part of a solution, and we still aren't perfect. Just better than individual models.

Hallucinations may seem rarer for a few reasons. First, models are more accurate with certain prompts. Second, models are more convincing when they do hallucinate. They may get the overall idea right but hallucinate the details. Hallucinations are still a major problem and are fundamental to the way modern LLMs work.