iOSWorld

About.

iOSWorld runs an iOS simulator with 26 purpose-built SwiftUI apps, populated with data for a single fictional persona, Jordan Avery. Each task is scored against an explicit rubric by a GPT-5.4 Mini judge. The judge was validated against human annotators on 128 Opus 4.6 trajectories, reaching κ = 0.77 (89% accuracy, F1 = 0.86).
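For readers unfamiliar with the agreement statistics quoted above, the following is a minimal sketch of how accuracy, F1, and κ can be computed from paired judge and human verdicts. It assumes binary pass/fail labels and that κ refers to Cohen's kappa; neither is confirmed here. The function name `agreement_metrics` and the toy labels are hypothetical and are not benchmark data.

```python
def agreement_metrics(human, judge):
    """Return (accuracy, f1, kappa) for two equal-length lists of 0/1 labels.

    Human labels are treated as the reference; 1 means "task passed".
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h == 1 and j == 1)
    tn = sum(1 for h, j in pairs if h == 0 and j == 0)
    fp = sum(1 for h, j in pairs if h == 0 and j == 1)
    fn = sum(1 for h, j in pairs if h == 1 and j == 0)

    # Observed agreement between judge and human.
    accuracy = (tp + tn) / n

    # F1 on the positive ("passed") class.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    # Agreement expected by chance, from each rater's marginal label rates.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no

    # Cohen's kappa: agreement beyond chance, normalized.
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return accuracy, f1, kappa


# Toy labels for illustration only (not iOSWorld data).
human = [1, 1, 1, 0, 0, 0, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, f1, kappa = agreement_metrics(human, judge)
print(f"accuracy={accuracy:.2f} f1={f1:.2f} kappa={kappa:.2f}")
```

Kappa is lower than raw accuracy because it discounts the agreement two raters would reach by chance given their label frequencies.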

Authors

  • Lawrence Keunho Jang
  • Mareks Woodside
  • Geronimo Carom
  • Andrew Jang
  • Jing Yu Koh
  • Ruslan Salakhutdinov
Carnegie Mellon University
Equal contribution.

Citation

If you use iOSWorld in your research, please cite the arXiv preprint.

iosworld.bib
@misc{jang2026iosworld,
  title         = {iOSWorld: A Benchmark for Personally Intelligent Phone Agents},
  author        = {Jang, Lawrence Keunho and Woodside, Mareks and Carom, Geronimo
                   and Jang, Andrew and Koh, Jing Yu and Salakhutdinov, Ruslan},
  year          = {2026},
  eprint        = {arXiv:XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}

Ethics

Synthetic data. All data in iOSWorld is entirely synthetic. The Jordan Avery persona is fictional, and no real user data was collected, processed, or used at any stage. Benchmark runs use deterministic seeded data and do not depend on real user accounts, real services, or external databases.
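As an illustration of what deterministic seeded data means in practice, the sketch below generates synthetic records from a fixed seed and checks that two runs produce identical output. It is not the benchmark's actual seeding code: the function `make_seed_records`, the record fields, and the example titles are all hypothetical.

```python
import random


def make_seed_records(seed, count=3):
    """Generate hypothetical synthetic calendar records from a fixed seed."""
    # A local generator keeps the result independent of global random state.
    rng = random.Random(seed)
    titles = ["Dentist appointment", "Team sync", "Grocery run", "Gym session"]
    return [
        {"id": i, "title": rng.choice(titles), "hour": rng.randint(8, 18)}
        for i in range(count)
    ]


# The same seed yields the same records on every run.
run_a = make_seed_records(seed=42)
run_b = make_seed_records(seed=42)
assert run_a == run_b
```

Because no real accounts or external services are involved, a fixed seed is enough to give every run the same starting state.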

Malicious agents. Phone agents capable of operating autonomously on a user's device carry significant dual-use risks. We encourage researchers to develop agents with explicit user consent mechanisms and action confirmation for irreversible operations.

iOS access and reproducibility. iOSWorld requires macOS with Xcode to run the iOS Simulator, which limits reproducibility to researchers with access to Apple hardware. We release all source code, seed data, evaluation scripts, and an MCP server that exposes the 26 apps through a tool-use interface, so researchers can compare tool use against computer use (CU) and study hybrid tool-use + CU modes. Vision-only numbers reflect deployed capability; vision+XML numbers represent an upper bound with privileged access via XCUITest.

Accessibility. Capable phone agents could improve accessibility for users with visual, motor, or cognitive impairments. iOSWorld is a research benchmark for measuring progress in a controlled simulator; results should not be interpreted as indicating readiness for deployment on real devices with real user data.

License

Apache License 2.0. See the LICENSE file in the benchmark repository.