Discussion
OpenClaw Arena
skysniper: I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness. The two boards look nothing alike.

Top 3 on performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 on cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance.

Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.

Rankings use relative ordering only (not raw scores), fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena: absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn

I built this as part of OpenClaw Arena: submit any task, pick 2-5 models, and a judge agent evaluates them in a fresh VM. Public benchmarks are free.
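For anyone curious what that fitting step looks like: below is a minimal sketch of a plain (ungrouped) Plackett-Luce fit over per-task orderings, with percentile-bootstrap CIs. The model names, rankings, and the small ridge term are invented for illustration; this is not the Arena's actual pipeline, and the grouped variant they describe is covered at the methodology link above.

```python
# Minimal sketch: plain Plackett-Luce over relative orderings + bootstrap CIs.
# Model names and rankings are hypothetical, not the Arena's real data.
import numpy as np
from scipy.optimize import minimize

models = ["model-a", "model-b", "model-c"]  # hypothetical labels
# Each tuple is one task's ordering of model indices, best to worst.
rankings = [(0, 1, 2), (0, 2, 1), (2, 0, 1), (1, 0, 2), (2, 1, 0)]

def neg_log_likelihood(theta, rankings):
    # P(ordering r) = prod_i exp(theta[r_i]) / sum_{j >= i} exp(theta[r_j])
    nll = 0.0
    for r in rankings:
        s = theta[list(r)]
        for i in range(len(r) - 1):  # the last factor is always 1
            nll -= s[i] - np.log(np.exp(s[i:]).sum())
    # Tiny ridge term keeps degenerate bootstrap resamples from diverging.
    return nll + 0.01 * theta @ theta

def fit(rankings, n_models):
    res = minimize(neg_log_likelihood, np.zeros(n_models), args=(rankings,))
    return res.x - res.x.mean()  # strengths identifiable only up to a shift

theta_hat = fit(rankings, len(models))

# Bootstrap: resample whole tasks with replacement and refit each replicate.
rng = np.random.default_rng(0)
boot = np.array([
    fit([rankings[i] for i in rng.integers(len(rankings), size=len(rankings))],
        len(models))
    for _ in range(200)
])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for name, t, l, h in zip(models, theta_hat, lo, hi):
    print(f"{name}: {t:+.2f}  (95% CI {l:+.2f} .. {h:+.2f})")
```

The key property matches the comment's claim: only the orderings enter the likelihood, so noisy absolute scores never touch the fit.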
refulgentis: Please don’t use AI to write comments; it cuts against HN guidelines.
hadlock: According to openrouter.ai, it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, since StepFun is about 5% of the price of Sonnet.

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F
skysniper: the really surprising part to me is that, despite being the cheapest model on the board, StepFun often scores high on pure performance. Other models in the same price range (e.g. Kimi) fail to do that.
skysniper: sorry, didn't know that. Here is my hand-written tl;dr:

- Gemini is very unreliable at using skills; it often just reads the skill and decides to do nothing.
- StepFun leads the cost-effectiveness leaderboard.
- Rankings really depend on the task, so it's better to try your own.
refulgentis: It’s too late once it’s happened. I was curious, but when I saw the site looked vibecoded and you’re commenting with AI, I stopped trying to reason through the discrepancies between what was claimed and what’s on the site (e.g. 300 battles claimed vs. only a handful in the site data).
skysniper: all 300+ battles are available at https://app.uniclaw.ai/arena/battles. Every single battle is shown with its raw conversational history, produced files, the judge's verdict, and final scores.
rat9988: Too late for what? For you? Maybe. There are many others who are okay with it, and it doesn't diminish the quality of the work. Props to the author.
WhitneyLand: StepFun is an interesting model. If you haven’t heard of it yet, there’s some good discussion here: https://news.ycombinator.com/item?id=47069179
tarruda: Since that discussion, they released the base model and a midtrain checkpoint:

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base
- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...

I'm not aware of other AI labs that have released base checkpoints for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.

They also released the entire training pipeline:

- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...
- https://github.com/stepfun-ai/SteptronOss
NitpickLawyer: > the most popular model

It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast1.
skysniper: another thing from the bench I didn't expect: Gemini 3.1 Pro is very unreliable at using skills. Sometimes it just reads the skill and decides to do nothing, while Opus/Sonnet 4.6 and GPT-5.4 never have this issue.
dmazin: Why do half the comments here read like AI trying to boost some sort of scam?
MaxikCZ: Exactly. When I read the headline, I thought: "Ofc it is, it's free."
skysniper: I should have clarified I didn't use the free version...
johndough: Could you add a column for time or number of tokens? Some models take forever because of their excessive reasoning chains.