Computer Use Benchmark (CUB) Performance
Overview of CUB Benchmark
The Computer Use Benchmark (CUB) is an evaluation framework that assesses how well AI agents complete real-world computer tasks. It focuses on economically valuable professional work in domains such as accounting, healthcare, and finance.
Performance Comparison
The table below shows each model's score in the six benchmark domains, along with its overall score:
Model | Business Operations | Construction | Consumer | Finance | Healthcare | Supply Chain | Overall |
---|---|---|---|---|---|---|---|
Manus | 10.59% | 16.00% | 17.00% | 7.06% | 0.00% | 4.10% | 9.23% |
OpenAI CUA | 14.60% | 19.00% | 7.41% | 2.73% | 4.86% | 5.14% | 7.28% |
Claude (Computer Use) | 6.33% | 19.50% | 12.06% | 2.03% | 0.00% | 0.85% | 6.01% |
Claude (Browser Use) | 6.92% | 11.00% | 6.40% | 0.00% | 0.36% | 3.50% | 3.78% |
Gemini 2.5 Pro | 1.41% | 0.00% | 1.50% | 0.20% | 0.00% | 0.00% | 0.56% |
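To make the table easier to query, here is a minimal Python sketch that finds the leading model in each domain and ranks models by overall score. The numbers are copied from the table above; the names `DOMAINS`, `SCORES`, `best_per_domain`, `ranking_by_overall`, and `unweighted_mean` are illustrative and not part of CUB.

```python
# Scores (percent) copied from the table above; list order matches DOMAINS.
DOMAINS = [
    "Business Operations", "Construction", "Consumer",
    "Finance", "Healthcare", "Supply Chain",
]

# model -> (per-domain scores, reported overall score)
SCORES = {
    "Manus": ([10.59, 16.00, 17.00, 7.06, 0.00, 4.10], 9.23),
    "OpenAI CUA": ([14.60, 19.00, 7.41, 2.73, 4.86, 5.14], 7.28),
    "Claude (Computer Use)": ([6.33, 19.50, 12.06, 2.03, 0.00, 0.85], 6.01),
    "Claude (Browser Use)": ([6.92, 11.00, 6.40, 0.00, 0.36, 3.50], 3.78),
    "Gemini 2.5 Pro": ([1.41, 0.00, 1.50, 0.20, 0.00, 0.00], 0.56),
}


def best_per_domain(scores):
    """Return {domain: model with the highest score in that domain}."""
    return {
        domain: max(scores, key=lambda model: scores[model][0][i])
        for i, domain in enumerate(DOMAINS)
    }


def ranking_by_overall(scores):
    """Return model names sorted by reported overall score, best first."""
    return sorted(scores, key=lambda model: scores[model][1], reverse=True)


def unweighted_mean(scores, model):
    """Plain average of the six domain scores for one model."""
    per_domain = scores[model][0]
    return sum(per_domain) / len(per_domain)


if __name__ == "__main__":
    for domain, model in best_per_domain(SCORES).items():
        print(f"{domain}: {model}")
    print("Ranking by overall score:", ranking_by_overall(SCORES))
```

No single model leads every domain: OpenAI CUA is highest in Business Operations, Healthcare, and Supply Chain, Manus in Consumer and Finance, and Claude (Computer Use) in Construction. The Overall column is also not the plain average of the six domain columns (for Manus, `unweighted_mean` gives 9.125 against a reported 9.23); how the overall score is aggregated is not stated here.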
Key Evaluation Areas
The benchmark evaluates critical capabilities required for real-world tasks:
- Long-sequence Memory
  - Following complex multi-step instructions
  - Maintaining context across extended operations
- Multi-application Coordination
  - Seamless switching between different software
  - Data transfer between applications
  - Interface adaptation
- Task Reliability
  - Consistent performance in repetitive tasks
  - Error handling and recovery
  - Maintaining accuracy over time
- Interface Navigation
  - Handling unfamiliar interfaces
  - Working with domain-specific tools
  - Adapting to different UI paradigms
Example Tasks
Construction Domain Example
Task: Property Square Footage Calculation
- Navigate and utilize block maps
- Multimodal reasoning for diagram interpretation
- Long-sequence memory application
- Complex spatial calculations
Healthcare Domain Example
Task: EHR Data Entry
- Parse medical documentation
- Navigate complex EHR interfaces
- Handle hidden functionality
- Medical terminology comprehension
- Data entry in multi-panel interfaces
Technical Infrastructure
The benchmark leverages advanced evaluation infrastructure:
- Parallelized testing environments
- VM snapshotting for efficient evaluation
- Support for both browser and desktop configurations
- Rich action space compatibility
- Black-box agent system support
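The ideas of parallel environments, snapshot restore, and black-box agents can be illustrated with a small Python sketch. This is not CUB's actual harness, whose interfaces are not described here. Every name below (`Task`, `SNAPSHOT`, `run_task`, `evaluate`, `toy_agent`) is hypothetical, and a plain dictionary stands in for a VM snapshot.

```python
import copy
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only.
State = Dict[str, str]
# A black-box agent: the harness only sees (instruction, state) -> final state.
Agent = Callable[[str, State], State]


@dataclass(frozen=True)
class Task:
    name: str
    instruction: str
    check: Callable[[State], bool]  # inspects the final state for success


# Stand-in for a saved VM snapshot: the clean state every task starts from.
SNAPSHOT: State = {"cell_A1": ""}


def run_task(agent: Agent, task: Task) -> bool:
    """Restore a fresh copy of the snapshot, run the agent, check the result."""
    env = copy.deepcopy(SNAPSHOT)  # each task gets its own isolated state
    final_state = agent(task.instruction, env)
    return task.check(final_state)


def evaluate(agent: Agent, tasks: List[Task], workers: int = 4) -> float:
    """Run tasks in parallel and return the success rate in percent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda task: run_task(agent, task), tasks))
    return 100.0 * sum(results) / len(results)


def toy_agent(instruction: str, env: State) -> State:
    """Trivial agent that ignores the instruction and always writes '42'."""
    env["cell_A1"] = "42"
    return env


TASKS = [
    Task("enter-42", "Type 42 into cell A1", lambda s: s["cell_A1"] == "42"),
    Task("enter-7", "Type 7 into cell A1", lambda s: s["cell_A1"] == "7"),
]

if __name__ == "__main__":
    print(f"Success rate: {evaluate(toy_agent, TASKS):.2f}%")
```

Because each task starts from its own copy of the snapshot, tasks cannot interfere with one another, which is what makes running them in parallel safe. Treating the agent as a plain callable means the harness needs no knowledge of the agent's internals.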
TODO
- [ ] Add GenZ performance metrics across all domains
- [ ] Implement automated testing pipeline for CUB
- [ ] Develop domain-specific optimization strategies
- [ ] Create detailed performance analysis dashboards
- [ ] Document best practices for each domain