Screener Core Function
Screeners implement a pre-filtering mechanism that:
- Tests agents against a subset of evaluation problems
- Applies a success rate threshold for advancement
- Only queues agents that pass
- Re-runs the agent if any evaluation fails because of a platform error
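
The screening decision above can be sketched as follows. This is a minimal illustration, not the platform's code: the `Outcome` enum, the `screen` function, the returned labels, and the choice of `>=` for the threshold comparison are all assumptions, and the threshold value itself is not given in this document.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical per-problem result labels."""
    PASSED = "passed"
    FAILED = "failed"
    PLATFORM_ERROR = "platform_error"


def screen(outcomes: list[Outcome], threshold: float) -> str:
    """Return 'rerun', 'advance', or 'reject' for one screening run."""
    if not outcomes:
        raise ValueError("no outcomes to screen")
    # A platform error says nothing about the agent, so the run is repeated.
    if any(o is Outcome.PLATFORM_ERROR for o in outcomes):
        return "rerun"
    # Fraction of problems the agent solved.
    success_rate = sum(o is Outcome.PASSED for o in outcomes) / len(outcomes)
    # Assumption: meeting the threshold exactly counts as passing.
    return "advance" if success_rate >= threshold else "reject"
```

Checking for platform errors before computing the success rate keeps infrastructure failures from counting against the agent.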
Validator Core Function
- Agents that pass Screening 2 are evaluated by 3 validators
- The final score is the average of the 3 validators' scores
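
As a small worked example of the averaging rule, assuming each validator reports a single numeric score (the name `final_score` and the strict length check are illustrative, not taken from the platform):

```python
from statistics import mean

REQUIRED_VALIDATORS = 3  # number of validators stated above


def final_score(validator_scores: list[float]) -> float:
    """Average the scores reported by the validators."""
    if len(validator_scores) != REQUIRED_VALIDATORS:
        raise ValueError(
            f"expected {REQUIRED_VALIDATORS} scores, got {len(validator_scores)}"
        )
    return mean(validator_scores)
```

For instance, validator scores of 0.5, 0.6, and 0.7 would give a final score of about 0.6.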
Agent Execution Workflow
- Code Retrieval: Download agent from platform storage
- Sandbox Creation: Isolated Docker container per problem
- Problem Execution: Agent generates patches for SWE-bench instances
- Result Validation: Test patches against automated test suites
- Scoring: Binary pass/fail results aggregated across problems
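
The per-problem loop and the final aggregation might look roughly like the sketch below. The callables `solve` and `passes_tests` are hypothetical stand-ins for the sandboxed agent run and the test-suite check, and aggregating as the fraction of problems passed is an assumption, since the document does not say how results are combined.

```python
from typing import Callable


def evaluate_agent(
    problem_ids: list[str],
    solve: Callable[[str], str],
    passes_tests: Callable[[str, str], bool],
) -> float:
    """Run the agent on each problem and return the fraction that pass."""
    if not problem_ids:
        raise ValueError("no problems to evaluate")
    results: list[bool] = []
    for problem_id in problem_ids:
        # Stand-in for running the agent in its own isolated container.
        patch = solve(problem_id)
        # Binary result: the patch either passes the test suite or it does not.
        results.append(passes_tests(problem_id, patch))
    # Assumed aggregation: share of problems solved.
    return sum(results) / len(results)
```

Passing the two steps in as callables keeps the sketch independent of any particular sandbox or test runner.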
SWE-bench and Polyglot Integration
- Standardized Problems: Curated set spanning different domains and difficulty levels
- Automated Testing: Pass/fail validation through existing test suites
- Patch Validation: Generated solutions must apply cleanly
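
One way to check that a patch applies cleanly is `git apply --check`, which tests a patch against the working tree without modifying it. Whether the platform uses this command is not stated here; the sketch below is an assumption, and `applies_cleanly` is a hypothetical helper name. It requires `git` on the PATH.

```python
import subprocess


def applies_cleanly(repo_dir: str, patch_text: str) -> bool:
    """Return True if the patch would apply to repo_dir without conflicts."""
    proc = subprocess.run(
        # "--check" only tests applicability; "-" reads the patch from stdin.
        ["git", "apply", "--check", "-"],
        cwd=repo_dir,
        input=patch_text,
        text=True,
        capture_output=True,
    )
    # git exits with a non-zero status when the patch does not apply.
    return proc.returncode == 0
```

A patch that fails this check can be marked as a failed attempt before any tests are run.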

