#15

Codex CLI

🔒 Closed

OpenAI's open-source terminal agent. Lightweight, sandboxed execution, multi-model support via OpenAI-compatible APIs.

💰 Included with OpenAI subscription · cli, api

51.1 Overall Score

Non-Gameable Scoring

Scores are derived from established benchmarks, adjusted for harness-specific performance across four dimensions: Coding, Reasoning, Tool Use, and Autonomy.

Each dimension starts from public benchmark data and applies harness-specific modifiers based on tool integration, context handling, and orchestration quality. The overall score is a weighted composite that penalizes narrow optimization.
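
The exact weights and penalty are not published here, so the following is only a minimal sketch of how a weighted composite that penalizes narrow optimization could work. It assumes equal weights and a penalty proportional to the spread (population standard deviation) of the four dimension scores. The names `WEIGHTS`, `SPREAD_PENALTY`, and `composite_score` are hypothetical, and the sketch is not expected to reproduce the Overall values in the table below.

```python
from statistics import pstdev

# Hypothetical values: the source does not state its weights or penalty.
WEIGHTS = {"coding": 0.25, "reasoning": 0.25, "tool_use": 0.25, "autonomy": 0.25}
SPREAD_PENALTY = 0.5  # assumed strength of the penalty on uneven profiles


def composite_score(scores, weights=WEIGHTS, spread_penalty=SPREAD_PENALTY):
    """Weighted mean of the dimension scores minus a penalty on their spread."""
    total_weight = sum(weights.values())
    weighted_mean = sum(scores[dim] * w for dim, w in weights.items()) / total_weight
    # A large spread means the profile is strong in a few dimensions only.
    spread = pstdev([scores[dim] for dim in weights])
    return max(0.0, weighted_mean - spread_penalty * spread)


# Illustrative inputs, not taken from the leaderboard.
balanced = {"coding": 50.0, "reasoning": 50.0, "tool_use": 50.0, "autonomy": 50.0}
lopsided = {"coding": 90.0, "reasoning": 90.0, "tool_use": 10.0, "autonomy": 10.0}

print(composite_score(balanced))
print(composite_score(lopsided))
```

Both example profiles have the same plain average, but the lopsided one scores lower under this scheme, which is the behavior a penalty on narrow optimization is meant to produce.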

Model	Overall	Coding	Reasoning	Tool Use	Autonomy
GPT-5.4	51.1	46.35	62.53	68.34	57.04
GPT-5.2	35.6	49.04	60.88	33.01	35.19
GPT-5 (high)	33.1	52.72	49.02	26.07	37.89
GPT-oss 120B	31.2	51.6	53.19	11.05	36.19
GPT-5 mini	30.5	36.34	41.02	36.6	38.39
GPT-5 (medium)	28.3	47.76	44.77	18.05	31.01
GPT-5.1	25.3	44.52	47.13	9.9	24.74
o3	24.7	36.35	7.15	33.36	46.46
GPT-5.1 Thinking	21.7	43.56	44.46	0.3	20.38
GPT-5.1 (high)	20.8	46.46	45.49	0.26	11.8