Cua Documentation

Introduction

Overview of benchmarking in the Cua agent framework

The Cua agent framework uses benchmarks to test the performance of supported models and providers at various agentic tasks.

Benchmark Types

Computer-Agent benchmarks evaluate two key capabilities:

  • Plan Generation: Breaking down complex tasks into a sequence of actions
  • Coordinate Generation: Predicting precise click locations on GUI elements
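
To make the split between the two capabilities concrete, here is a minimal toy sketch. It does not use the Cua API: `PlannedAction`, `toy_planner`, `toy_grounder`, and the screen coordinates in `TOY_SCREEN` are all hypothetical stand-ins, with hard-coded values where a real system would call a model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PlannedAction:
    kind: str       # e.g. "click" or "type"
    target: str     # natural-language description of the GUI element
    text: str = ""  # text to type, if the action needs any


# Hypothetical element positions; a real grounding model predicts these
# from a screenshot instead of looking them up.
TOY_SCREEN: Dict[str, Tuple[int, int]] = {
    "Firefox icon": (40, 700),
    "address bar": (512, 60),
}


def toy_planner(task: str) -> List[PlannedAction]:
    """Plan generation stand-in: returns a fixed plan for illustration."""
    return [
        PlannedAction("click", "Firefox icon"),
        PlannedAction("click", "address bar"),
        PlannedAction("type", "address bar", "github.com"),
    ]


def toy_grounder(target: str) -> Tuple[int, int]:
    """Coordinate generation stand-in: maps a description to (x, y)."""
    return TOY_SCREEN[target]


def run_toy_pipeline(task: str) -> List[Tuple[str, int, int, str]]:
    """Plan first, then resolve each planned step to screen coordinates."""
    steps = []
    for action in toy_planner(task):
        x, y = toy_grounder(action.target)
        steps.append((action.kind, x, y, action.text))
    return steps
```

The planner decides what to do and in what order; the grounder only answers where each step happens on screen.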

Using State-of-the-Art Models

Let's see how to use the SOTA vision-language models in the Cua agent framework.

Plan Generation + Coordinate Generation

OS-World - Benchmark for complete computer-use agents

This leaderboard tests models that can understand instructions and automatically perform the full sequence of actions needed to complete tasks.

# UI-TARS-1.5 is a SOTA unified plan generation + coordinate generation VLM
# This makes it suitable for agentic loops for computer-use
# Assumes `ComputerAgent` has been imported from the Cua agent package and
# `computer` is an already-initialized computer instance.
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉

Coordinate Generation Only

GUI Agent Grounding Leaderboard - Benchmark for click prediction accuracy

This leaderboard tests models that specialize in finding exactly where to click on screen elements, but need to be told which specific action to take.

# GTA1-7B is a SOTA coordinate generation VLM
# It can only generate coordinates; it cannot plan:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
agent.predict_click("find the button to open the settings") # (27, 450)
# This will raise an error:
# agent.run("Open Firefox and go to github.com") 
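
Click prediction accuracy is commonly scored by checking whether each predicted point falls inside the target element's bounding box. The sketch below assumes that convention; it is not taken from the leaderboard's own scoring code, and the names `click_hits` and `click_accuracy` and the `(left, top, right, bottom)` box format are illustrative choices.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed format


def click_hits(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted clicks that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(click_hits(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)
```

For example, with one prediction inside its box and one outside, `click_accuracy([(27, 450), (5, 5)], [(20, 440, 40, 460), (100, 100, 200, 200)])` evaluates to `0.5`.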

Composed Agent

The Cua agent framework also supports composed agents, which pair a plan generation model with a coordinate generation model so that each handles the part of the task it is best at. Any LiteLLM model can be used as the plan generation model.

# GTA1-7B can be paired with any LLM to form a composed agent:
# the model after the "+", "gemini/gemini-1.5-pro", is used as the plan generation LLM
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
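
The composed model string above joins the coordinate generation model and the plan generation model with a `+`. As a rough illustration of that naming convention only, the helper below splits such a string into its two parts. `split_composed_model` is a hypothetical function written for this page, not part of the Cua API, and it assumes neither model name contains a `+` of its own.

```python
from typing import Optional, Tuple


def split_composed_model(model: str) -> Tuple[str, Optional[str]]:
    """Split "grounding+planning" into its parts.

    Returns (grounding_model, planning_model); planning_model is None
    when the string names a single model.
    """
    grounding, sep, planning = model.partition("+")
    if not grounding:
        raise ValueError("missing coordinate generation model name")
    if sep and not planning:
        raise ValueError("missing plan generation model name after '+'")
    return grounding, (planning if sep else None)


grounding, planning = split_composed_model(
    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro"
)
print(grounding)  # huggingface-local/HelloKKMe/GTA1-7B
print(planning)   # gemini/gemini-1.5-pro
```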
