Partners in Evaluation
Key Takeaways
- At the top of the with-skills board, GPT-5.5 leads under both agents we ran: 62.55% on Codex and 62.21% on OpenHands.
- Skills helped every model we tested. The average score rose from 35.53% without them to 52.53% with them — a 17.00-point gain.
- The biggest gains came in the middle of the pack. MiniMax-M3 improved the most (+25.43 points), and Qwen 3.7 Plus jumped to third place.
- Cheaper models benefited too. Qwen 3.7 Plus reached 54.30% at just $0.14 per task.
Background
SkillsBench asks a focused question: do agents get better at software tasks when you hand them reusable, task-specific knowledge? To answer it, the benchmark runs each model twice with the same agent — once on its own, and once with relevant skills made available. The gap between the two runs is the value the skills add.
The benchmark was introduced by BenchFlow AI, who maintain an official leaderboard and the original repository. We partnered with them to integrate it into Valkyrie, which let us add it to the site quickly.
You can find that implementation in our new public benchmark registry, where the community can contribute benchmarks that are compatible with Valkyrie.
Results
Skills lifted strong and weak models alike, by an average of 17 points. The largest jumps came from MiniMax-M3 (+25.43 points), DeepSeek V4 (+24.05), Qwen 3.7 Plus (+23.15), Grok 4.3 (+21.33), and Claude Sonnet 4.6 (+21.22).
At the top of the with-skills board, GPT-5.5 leads under both agents we ran: 62.55% on Codex and 62.21% on OpenHands. Qwen 3.7 Plus offers the best balance of score and cost among the top five, hitting 54.30% at $0.14 per task.
Skill usage varies widely across models, and even across agents for the same model. GPT-5.5 on Codex references skill paths far more than any other configuration, about 4.8 times per task, while the same base model on OpenHands is among the lightest at roughly 1.2. After GPT-5.5 on Codex, MiniMax-M3 and Claude Opus 4.8 reference skills most. Claude Sonnet 4.6 references them least, despite a strong final score.
Methodology
Every model runs the same SkillsBench task set in two conditions:
- No Skills — the agent gets its default tools and prompt.
- With Skills — the same agent, plus task-specific skills.
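The comparison between the two conditions comes down to a subtraction in percentage points. The following is a minimal sketch in Python using the average scores reported above; the function name `skill_gain` is an illustrative assumption, not part of SkillsBench or Valkyrie.

```python
def skill_gain(no_skills_pct: float, with_skills_pct: float) -> float:
    """Return the gain from skills in percentage points (not relative percent)."""
    # Round to two decimals to match the precision of the reported scores
    # and to hide binary floating-point noise in the subtraction.
    return round(with_skills_pct - no_skills_pct, 2)


# Average scores reported in this post: 35.53% without skills, 52.53% with.
average_gain = skill_gain(35.53, 52.53)
print(f"{average_gain:+.2f} points")
```

The gain is an absolute difference between two percentages, so it is reported in points rather than as a relative improvement.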
The default agent is OpenHands-CLI, pinned to commit 3ca17446c5d9c1e35e054803478a3501ec251ecf. The Codex row replaces the agent with Codex and keeps the same GPT-5.5 base model.
We run three trials per model in each condition. For each task we average the score across the three trials, then average those task scores into the model’s final accuracy.
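The two-stage averaging described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's own scoring code: the function name `final_accuracy`, the dictionary layout, the task names, and the scores are all hypothetical.

```python
from statistics import mean


def final_accuracy(trial_scores: dict[str, list[float]]) -> float:
    """Average the trials within each task, then average the task means."""
    if not trial_scores:
        raise ValueError("no tasks to score")
    # Stage 1: one mean per task, taken over that task's trials.
    task_means = [mean(scores) for scores in trial_scores.values()]
    # Stage 2: the model's final accuracy is the mean of the task means.
    return mean(task_means)


# Made-up scores for two hypothetical tasks, three trials each.
example = {
    "task-a": [1.0, 0.0, 1.0],
    "task-b": [0.0, 0.0, 0.0],
}
accuracy = final_accuracy(example)
print(f"{accuracy:.2%}")
```

When every task has the same number of trials, this gives the same result as averaging all trial scores at once. Averaging per task first keeps each task's weight equal even if a trial is missing.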
The observed skill reference chart counts the deduplicated agent tool actions in the With Skills traces that reference a skill path, such as /skills/<skill>.
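One way such a count could be computed is sketched below. The trace format is not specified here, so this example assumes each tool action is a dictionary with `tool` and `args` string fields, and that "deduplicated" means identical tool-and-argument pairs count once. The function name, the regular expression, and the skill names in the sample trace are hypothetical.

```python
import re

# Assumed path convention, following the /skills/<skill> example in the text.
SKILL_PATH = re.compile(r"/skills/[\w.-]+")


def count_skill_references(actions: list[dict[str, str]]) -> int:
    """Count distinct tool actions whose arguments mention a skill path."""
    # Assumption: two actions are duplicates when tool and arguments match.
    unique_actions = {(a["tool"], a["args"]) for a in actions}
    return sum(1 for _tool, args in unique_actions if SKILL_PATH.search(args))


# Hypothetical trace for a single task.
trace = [
    {"tool": "bash", "args": "cat /skills/pdf-parsing/SKILL.md"},
    {"tool": "bash", "args": "cat /skills/pdf-parsing/SKILL.md"},  # repeat
    {"tool": "bash", "args": "ls /workspace"},
    {"tool": "read_file", "args": "/skills/csv-tools/README.md"},
]
references = count_skill_references(trace)
print(references)
```

Averaging this count over all tasks for a model would give a per-task figure comparable to the ones quoted in the results.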
All runs use the public SkillsBench task definitions and run protocol.
Citation (BibTeX)
@misc{li2026skillsbench,
title = {SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks},
author = {Xiangyi Li and Wenbo Chen and Yimin Liu and Shenghan Zheng and Xiaokun Chen and Yifeng He and Yubo Li and Bingran You and Haotian Shen and Jiankai Sun and Shuyi Wang and Qunhong Zeng and Di Wang and Xuandong Zhao and Yuanli Wang and Roey Ben Chaim and Zonglin Di and Yipeng Gao and Junwei He and Yizhuo He and Liqiang Jing and Luyang Kong and Xin Lan and Jiachen Li and Songlin Li and Yijiang Li and Yueqian Lin and Xinyi Liu and Xuanqing Liu and Haoran Lyu and Ze Ma and Bowei Wang and Runhui Wang and Tianyu Wang and Wengao Ye and Yue Zhang and Hanwen Xing and Yiqi Xue and Steven Dillmann and Han-chung Lee},
year = {2026},
eprint = {2602.12670},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2602.12670},
}