AI Agent Time Horizons: US vs. Chinese Models

The time horizon is the length of task (measured by human-expert completion time) that an AI agent can complete at a given reliability threshold. Data are from METR and are computed using its logistic regression methodology on 228 software engineering tasks.
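To make the definition concrete, here is a minimal sketch of how a time horizon can be read off a logistic fit. It assumes a plain, unweighted logistic regression of success against log2 of human completion time; METR's actual pipeline may differ in details such as task weighting and regularization. The function names `fit_logistic` and `time_horizon` and the toy outcomes are hypothetical, not METR data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(log2_minutes, successes, steps=20000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the
    mean log-likelihood, where x is log2(human completion minutes)."""
    a, b = 0.0, 0.0
    n = len(log2_minutes)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(log2_minutes, successes):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon(a, b, p=0.5):
    """Task length in minutes at which the fitted success probability equals p."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - a) / b)

# Hypothetical toy outcomes: one agent attempt per task, 1 = success.
human_minutes = [1, 2, 4, 8, 16, 32, 64, 128]
successes = [1, 1, 1, 0, 1, 0, 0, 0]

x = [math.log2(m) for m in human_minutes]
a, b = fit_logistic(x, successes)
h50 = time_horizon(a, b, p=0.5)
h80 = time_horizon(a, b, p=0.8)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Because success probability falls as tasks get longer, the fitted slope `b` is negative, so a stricter reliability threshold (80% instead of 50%) yields a shorter time horizon.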

[Interactive chart: time horizon by model, with controls for success rate and scale]

Data: METR Time Horizons. Most US models use TH1.1 (228 tasks); Chinese models and select others (Grok 4, gpt-oss-120b) use earlier METR evaluations (170 tasks). Confidence intervals come from a hierarchical bootstrap where available.
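As an illustration of what a hierarchical bootstrap involves, the sketch below resamples at three nested levels (task families, tasks within a family, runs within a task) and takes percentile bounds of a statistic. The three-level structure, the function names, and the toy data are assumptions for illustration. For brevity the statistic here is a mean success rate; a real analysis would refit the time horizon on each resample.

```python
import random

def hierarchical_resample(families, rng):
    """Resample families with replacement, then tasks within each drawn
    family, then runs within each drawn task."""
    resampled = []
    family_names = list(families)
    for _ in family_names:
        family = families[rng.choice(family_names)]
        task_names = list(family)
        for _ in task_names:
            runs = family[rng.choice(task_names)]
            resampled.append([rng.choice(runs) for _ in runs])
    return resampled

def bootstrap_ci(families, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval of `statistic` over bootstrap resamples."""
    rng = random.Random(seed)
    stats = sorted(
        statistic(hierarchical_resample(families, rng)) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

def mean_success(resampled_tasks):
    """Stand-in statistic: pooled success rate over all resampled runs."""
    runs = [r for task in resampled_tasks for r in task]
    return sum(runs) / len(runs)

# Hypothetical data: family -> task -> list of run outcomes (1 = success).
toy = {
    "family_a": {"task_1": [1, 1, 0], "task_2": [1, 0, 0]},
    "family_b": {"task_3": [0, 0, 1], "task_4": [1, 1, 1]},
    "family_c": {"task_5": [0, 0, 0]},
}

lo, hi = bootstrap_ci(toy, mean_success)
print(f"95% interval for success rate: [{lo:.2f}, {hi:.2f}]")
```

Resampling whole families first keeps correlated tasks together, which tends to give wider and more honest intervals than resampling individual runs as if they were independent.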

Trend lines are exponential fits to frontier-advancing models only. Chinese models evaluated by METR: Qwen2-72B, Qwen2.5-72B, DeepSeek-V3, DeepSeek-V3-0324, DeepSeek-R1, DeepSeek-R1-0528, Kimi K2 Thinking.
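The trend-line construction can be sketched as follows. This assumes "frontier-advancing" means a model whose time horizon exceeds that of every earlier-released model, and that the exponential fit is an ordinary least-squares line through log2(horizon) against release date. Both are assumptions, and the model names and values below are hypothetical placeholders, not METR results.

```python
import math

def frontier_models(models):
    """Keep models whose horizon beats every earlier-released model's horizon."""
    best = float("-inf")
    frontier = []
    for m in sorted(models, key=lambda m: m["release_year"]):
        if m["horizon_minutes"] > best:
            frontier.append(m)
            best = m["horizon_minutes"]
    return frontier

def fit_exponential(models):
    """Least-squares line through (release_year, log2(horizon_minutes)).
    Returns (slope in doublings per year, intercept)."""
    xs = [m["release_year"] for m in models]
    ys = [math.log2(m["horizon_minutes"]) for m in models]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical models; "C" is released after "B" but does not beat it.
models = [
    {"name": "A", "release_year": 2023.0, "horizon_minutes": 4.0},
    {"name": "B", "release_year": 2023.5, "horizon_minutes": 8.0},
    {"name": "C", "release_year": 2023.75, "horizon_minutes": 6.0},
    {"name": "D", "release_year": 2024.0, "horizon_minutes": 16.0},
]

frontier = frontier_models(models)
slope, intercept = fit_exponential(frontier)
doubling_time_months = 12.0 / slope
print([m["name"] for m in frontier], doubling_time_months)
```

Fitting in log space turns the exponential trend into a straight line, so the slope reads directly as doublings per year. In this toy data the frontier is A, B, D and the horizon doubles every 6 months.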
