Industry-Level Benchmark
Mobile-Bench
Evaluating AI coding agents on real-world mobile development tasks. Like SWE-bench, for iOS and Android.
50 Tasks · 450 Test Cases · 9 Agents · 9 Evaluations
Leaderboard
Top-performing agents
| Rank | Agent | Model | Task Pass Rate | Test Case Pass Rate |
|---|---|---|---|---|
| 1 | Cursor | Opus 4.5 | 12.0% | 28.0% |
| 2 | Cursor | Sonnet 4.5 | 12.0% | 27.1% |
| 3 | Codex | GLM 4.6 | 12.0% | 26.0% |
| 4 | Claude Code | GLM 4.6 | 10.0% | 26.7% |
| 5 | Claude Code | Sonnet 4.5 | 8.0% | 24.0% |
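The two score columns are read here as a full-task resolution rate and a pooled test-case pass rate, which matches the 50-task / 450-test-case framing above. Below is a minimal sketch of how such scores could be computed; the names and the exact scoring rule are assumptions, not Mobile-Bench's published harness.

```kotlin
// Hypothetical per-task record: how many of a task's test cases the agent's submission passed.
data class TaskResult(val taskId: String, val testsPassed: Int, val testsTotal: Int)

// Task-level pass rate: share of tasks whose entire test suite passed.
fun taskPassRate(results: List<TaskResult>): Double =
    results.count { it.testsPassed == it.testsTotal }.toDouble() / results.size

// Test-case-level pass rate: share of all test cases passed, pooled across tasks.
fun testCasePassRate(results: List<TaskResult>): Double =
    results.sumOf { it.testsPassed }.toDouble() / results.sumOf { it.testsTotal }
```

Under this reading, a 12.0% task pass rate corresponds to 6 of the 50 tasks being fully resolved.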
Task Categories
50 industry-level mobile development tasks
| Category | Tasks | Avg. Pass Rate |
|---|---|---|
| UI Components | 18 | 24.5% |
| Gesture & Interaction | 8 | 15.2% |
| Data Management | 12 | 32.3% |
| Media & Assets | 6 | 18.8% |
| Networking | 4 | 22.2% |
| Other | 2 | 20.5% |
Task details are private. Contact us for research collaboration.
Real-World PRDs
Tasks derived from actual product requirement documents used in mobile app development.
Automated Testing
Comprehensive test suites that validate functionality, not just syntax correctness.
Reproducible Results
Standardized evaluation pipeline ensures consistent and comparable results.
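To make the Automated Testing point above concrete: a task's suite exercises behavior rather than checking that the code merely compiles. The JUnit sketch below is an invented illustration; the cart-badge rule and formatCartBadge are hypothetical, not an actual Mobile-Bench task.

```kotlin
import org.junit.Assert.assertEquals
import org.junit.Test

// Hypothetical behavior under test: a cart badge label that caps the displayed count at "99+".
// In the benchmark this logic would come from the agent's patch; it is inlined here for illustration.
fun formatCartBadge(count: Int): String =
    if (count > 99) "99+" else count.toString()

class CartBadgeFormatterTest {

    @Test
    fun countsUpToNinetyNineAreShownVerbatim() {
        assertEquals("1", formatCartBadge(1))
        assertEquals("99", formatCartBadge(99))
    }

    @Test
    fun countsAboveNinetyNineAreCapped() {
        assertEquals("99+", formatCartBadge(100))
        assertEquals("99+", formatCartBadge(1200))
    }
}
```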
Interested in Mobile-Bench?
Contact us for research collaboration or to discuss evaluating your AI coding agent.