Industry-Level Benchmark

Mobile-Bench

Evaluating AI coding agents on real-world mobile development tasks. Like SWE-bench, for iOS and Android.

50 Tasks · 450 Test Cases · 9 Agents · 9 Evaluations

Leaderboard

Top-performing agent/model pairs (5 of 9 evaluations shown)

Rank  Agent        Model       Task Pass  Test Pass
1     Cursor       Opus 4.5    12.0%      28.0%
2     Cursor       Sonnet 4.5  12.0%      27.1%
3     Codex        GLM 4.6     12.0%      26.0%
4     Claude Code  GLM 4.6     10.0%      26.7%
5     Claude Code  Sonnet 4.5   8.0%      24.0%

Task Categories

50 industry-level mobile development tasks

Category               Tasks  Avg Pass
UI Components          18     24.5%
Gesture & Interaction   8     15.2%
Data Management        12     32.3%
Media & Assets          6     18.8%
Networking              4     22.2%
Other                   2     20.5%

Task details are private. Contact us for research collaboration.

Real-World PRDs

Tasks derived from actual product requirements documents used in mobile app development.

Automated Testing

Comprehensive test suites that validate functional behavior, not just syntactic correctness.
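
To illustrate the style of check involved, an Android task might be validated with an instrumented UI test along these lines. This is a hedged sketch, not an actual Mobile-Bench fixture: MainActivity, the R.id.* view IDs, and the search flow are hypothetical stand-ins.

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.closeSoftKeyboard
import androidx.test.espresso.action.ViewActions.typeText
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.ext.junit.rules.ActivityScenarioRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class SearchFlowTest {

    // MainActivity and the view IDs below are hypothetical examples.
    @get:Rule
    val activityRule = ActivityScenarioRule(MainActivity::class.java)

    @Test
    fun submittingQuery_showsResultsList() {
        // Drive the UI the way the PRD describes the feature being used...
        onView(withId(R.id.search_input))
            .perform(typeText("coffee"), closeSoftKeyboard())
        onView(withId(R.id.search_button)).perform(click())

        // ...then assert on observable behavior, not on the code merely compiling.
        onView(withId(R.id.results_list)).check(matches(isDisplayed()))
    }
}
```

A submission that compiles but never renders the results list fails this check, which is the distinction between validating functionality and validating syntax.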

Reproducible Results

A standardized evaluation pipeline ensures consistent, comparable results across agents and models.
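
The two leaderboard columns can be read as a task-level pass rate and a test-case-level pass rate. Below is a minimal Kotlin sketch of that scoring, assuming an all-or-nothing rule where a task counts as resolved only if every one of its test cases passes; the actual Mobile-Bench pipeline is private, so both the rule and the names here are assumptions.

```kotlin
// Hypothetical record of one agent/model run on one benchmark task.
data class TaskResult(val taskId: String, val casesPassed: Int, val casesTotal: Int)

// Task pass rate: share of tasks fully resolved. Assumes the all-or-nothing
// rule that a task counts only when every one of its test cases passes.
fun taskPassRate(results: List<TaskResult>): Double =
    results.count { it.casesPassed == it.casesTotal }.toDouble() / results.size

// Test-case pass rate: share of individual test cases that pass, pooled
// across tasks (one plausible reading of the second leaderboard column).
fun testCasePassRate(results: List<TaskResult>): Double =
    results.sumOf { it.casesPassed }.toDouble() / results.sumOf { it.casesTotal }

fun main() {
    val run = listOf(
        TaskResult("ui-001", 9, 9),  // fully resolved
        TaskResult("net-002", 3, 8), // partial: counts toward the test-case rate only
    )
    println("task pass: %.1f%%".format(taskPassRate(run) * 100))  // 50.0%
    println("test pass: %.1f%%".format(testCasePassRate(run) * 100))  // 70.6%
}
```

On this reading, the three-way tie at 12.0% task pass on the leaderboard is broken by the finer-grained test-case rate.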

Interested in Mobile-Bench?

Contact us for research collaboration or to discuss evaluating your AI coding agent.