電腦使用 (original) (raw)

跳至主要內容

電腦使用

您可以使用 Gemini 2.5 電腦使用搶先體驗版模型和工具,建構可與瀏覽器互動並自動執行工作的瀏覽器控制代理。電腦使用模型可透過螢幕截圖「查看」電腦畫面,並透過產生特定 UI 動作 (例如滑鼠點擊和鍵盤輸入)「執行動作」。與函式呼叫類似,您需要編寫用戶端應用程式程式碼,才能接收及執行電腦使用動作。

透過電腦使用功能,您可以建構可執行下列操作的代理程式:

如要測試 Gemini Computer Use 模型,最簡單的方法是透過參考實作Browserbase 試用環境

電腦使用時間的運作方式

如要使用 Computer Use 模型建構瀏覽器控制代理程式,請實作代理程式迴圈,執行下列操作:

  1. 向模型傳送要求
    • 在 API 要求中新增「電腦使用」工具,並視需要新增任何自訂使用者定義函式或排除的函式。
    • 使用者的要求會做為提示,傳送給電腦使用模型。
  2. 接收模型回應
    • 電腦使用模型會分析使用者要求和螢幕截圖,並生成回應,包括代表 UI 動作的建議 function_call (例如「click at coordinate (x,y)」或「type 'text'」)。如要瞭解電腦使用模型支援的所有 UI 動作,請參閱「支援的動作」。
    • API 回應也可能包含內部安全系統的 safety_decision,用於檢查模型建議的動作。這會將動作分類為:
      safety_decision
      * 一般 / 允許:系統將動作視為安全。這也可能以沒有 safety_decision 的形式呈現。
      * 需要確認 (require_confirmation):模型即將執行可能具有風險的動作 (例如點選「接受 Cookie 通知」)。
  3. 執行收到的動作
    • 您的用戶端程式碼會收到 function_call 和任何隨附的 safety_decision
      * 正常 / 允許:如果 safety_decision 指出正常/允許 (或沒有 safety_decision),用戶端程式碼即可在目標環境 (例如網頁瀏覽器) 中執行指定的 function_call
      * 需要確認:如果 safety_decision 指出需要確認,應用程式必須先提示使用者確認,才能執行 function_call。如果使用者確認,請繼續執行動作。如果使用者拒絕,請勿執行動作。
  4. 擷取新環境狀態
    • 如果動作已執行,用戶端會擷取 GUI 和目前網址的新螢幕截圖,並做為 function_response 的一部分傳回電腦使用模型。
    • 如果安全系統封鎖某項動作,或使用者拒絕確認,應用程式可能會將不同形式的回饋傳送給模型,或終止互動。

這個程序會從步驟 2 開始重複執行,使用新的螢幕截圖和持續進行的目標,透過電腦使用模型建議下一個動作。這個迴圈會持續執行,直到工作完成、發生錯誤或程序終止 (例如,由於「封鎖」安全回應或使用者決定)。

電腦使用概覽

如何實作電腦使用

使用電腦使用模型和工具建構內容前,您需要設定下列項目:

本節的範例使用瀏覽器做為執行環境,並以 Playwright 做為用戶端動作處理常式。如要執行這些範例,您必須安裝必要的依附元件,並初始化 Playwright 瀏覽器執行個體。

安裝 Playwright

pip install google-genai playwright
playwright install chromium

初始化 Playwright 瀏覽器執行個體

from playwright.sync_api import sync_playwright

# 1. Configure screen dimensions for the target environment
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

# 2. Start the Playwright browser
# In production, utilize a sandboxed environment.
playwright = sync_playwright().start()
# Set headless=False to see the actions performed on your screen
browser = playwright.chromium.launch(headless=False)

# 3. Create a context and page with the specified dimensions
context = browser.new_context(
    viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT}
)
page = context.new_page()

# 4. Navigate to an initial page to start the task
page.goto("https://www.google.com")

# The 'page', 'SCREEN_WIDTH', and 'SCREEN_HEIGHT' variables
# will be used in the steps below.

使用自訂使用者定義函式」一節包含擴充至 Android 環境的程式碼範例。

1. 向模型傳送要求

在 API 要求中加入電腦使用工具,並將包含使用者目標的提示傳送至電腦使用模型。您必須使用 Gemini Computer Use 模型,gemini-2.5-computer-use-preview-10-2025。如果嘗試使用其他模型搭配電腦使用工具,系統會顯示錯誤訊息。

您也可以視需要加入下列參數:

請注意,發出要求時不必指定螢幕大小;模型會預測像素座標,並根據螢幕高度和寬度進行縮放。

Python

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]

generate_content_config = genai.types.GenerateContentConfig(
    tools=[
        # 1. Computer Use tool with browser environment
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER,
                # Optional: Exclude specific predefined functions
                excluded_predefined_functions=excluded_functions
                )
              ),
        # 2. Optional: Custom user-defined functions
        #types.Tool(
          # function_declarations=custom_functions
          #   )
          ],
  )

# Create the content with user message
contents=[
    Content(
        role="user",
        parts=[
            Part(text="Search for highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping. Create a bulleted list of the 3 cheapest options in the format of name, description, price in an easy-to-read layout."),
        ],
    )
]

# Generate content with the configured settings
response = client.models.generate_content(
    model='gemini-2.5-computer-use-preview-10-2025',
    contents=contents,
    config=generate_content_config,
)

# Print the response output
print(response)

如需自訂函式的範例,請參閱「使用自訂使用者定義函式」。

2. 接收模型回覆

如果「電腦使用」模型判斷需要 UI 動作才能完成工作,就會傳回一或多個 FunctionCalls。電腦用途支援平行函式呼叫,也就是說,模型可以在單一回合中傳回多個動作。

以下是模型回覆範例。

{
  "content": {
    "parts": [
      {
        "text": "I will type the search query into the search bar. The search bar is in the center of the page."
      },
      {
        "function_call": {
          "name": "type_text_at",
          "args": {
            "x": 371,
            "y": 470,
            "text": "highly rated smart fridges with touchscreen, 2 doors, around 25 cu ft, priced below 4000 dollars on Google Shopping",
            "press_enter": true
          }
        }
      }
    ]
  }
}

3. 執行收到的動作

應用程式程式碼需要剖析模型回應、執行動作,並收集結果。

下方程式碼範例會從電腦使用模型的回應中擷取函式呼叫,並將其轉換為可透過 Playwright 執行的動作。無論輸入圖片的尺寸為何,模型都會輸出正規化座標 (0 到 999),因此轉換步驟的一部分是將這些正規化座標轉換回實際像素值。

建議使用 (1440, 900) 的螢幕尺寸,搭配「電腦使用」模型。模型可處理任何解析度,但結果品質可能會受到影響。

請注意,這個範例只包含 3 項最常見的 UI 動作實作:open_web_browserclick_attype_text_at。如要用於正式版,您必須實作「支援的動作」清單中的所有其他 UI 動作,除非您明確將這些動作新增為 excluded_predefined_functions

Python

from typing import Any, List, Tuple
import time

def denormalize_x(x: int, screen_width: int) -> int:
    """Convert normalized x coordinate (0-1000) to actual pixel coordinate."""
    return int(x / 1000 * screen_width)

def denormalize_y(y: int, screen_height: int) -> int:
    """Convert normalized y coordinate (0-1000) to actual pixel coordinate."""
    return int(y / 1000 * screen_height)

def execute_function_calls(candidate, page, screen_width, screen_height):
    results = []
    function_calls = []
    for part in candidate.content.parts:
        if part.function_call:
            function_calls.append(part.function_call)

    for function_call in function_calls:
        action_result = {}
        fname = function_call.name
        args = function_call.args
        print(f"  -> Executing: {fname}")

        try:
            if fname == "open_web_browser":
                pass # Already open
            elif fname == "click_at":
                actual_x = denormalize_x(args["x"], screen_width)
                actual_y = denormalize_y(args["y"], screen_height)
                page.mouse.click(actual_x, actual_y)
            elif fname == "type_text_at":
                actual_x = denormalize_x(args["x"], screen_width)
                actual_y = denormalize_y(args["y"], screen_height)
                text = args["text"]
                press_enter = args.get("press_enter", False)

                page.mouse.click(actual_x, actual_y)
                # Simple clear (Command+A, Backspace for Mac)
                page.keyboard.press("Meta+A")
                page.keyboard.press("Backspace")
                page.keyboard.type(text)
                if press_enter:
                    page.keyboard.press("Enter")
            else:
                print(f"Warning: Unimplemented or custom function {fname}")

            # Wait for potential navigations/renders
            page.wait_for_load_state(timeout=5000)
            time.sleep(1)

        except Exception as e:
            print(f"Error executing {fname}: {e}")
            action_result = {"error": str(e)}

        results.append((fname, action_result))

    return results

4. 擷取新環境狀態

執行動作後,將函式執行結果傳回模型,模型就能使用這項資訊生成下一個動作。如果執行多個動作 (平行呼叫),您必須在後續使用者回合中,為每個動作傳送 FunctionResponse

Python


def get_function_responses(page, results):
    screenshot_bytes = page.screenshot(type="png")
    current_url = page.url
    function_responses = []
    for name, result in results:
        response_data = {"url": current_url}
        response_data.update(result)
        function_responses.append(
            types.FunctionResponse(
                name=name,
                response=response_data,
                parts=[types.FunctionResponsePart(
                        inline_data=types.FunctionResponseBlob(
                            mime_type="image/png",
                            data=screenshot_bytes))
                ]
            )
        )
    return function_responses

建構代理程式迴圈

如要啟用多步驟互動,請將「如何實作電腦使用」一節中的四個步驟合併成一個迴圈。請記得附加模型回應和函式回應,正確管理對話記錄。

如要執行這個程式碼範例,請完成下列步驟:

Python


import time
from typing import Any, List, Tuple
from playwright.sync_api import sync_playwright

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

# Constants for screen dimensions
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

# Setup Playwright
print("Initializing browser...")
playwright = sync_playwright().start()
browser = playwright.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT})
page = context.new_page()

# Define helper functions. Copy/paste from steps 3 and 4
# def denormalize_x(...)
# def denormalize_y(...)
# def execute_function_calls(...)
# def get_function_responses(...)

try:
    # Go to initial page
    page.goto("https://ai.google.dev/gemini-api/docs")

    # Configure the model (From Step 1)
    config = types.GenerateContentConfig(
        tools=[types.Tool(computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER
        ))],
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    )

    # Initialize history
    initial_screenshot = page.screenshot(type="png")
    USER_PROMPT = "Go to ai.google.dev/gemini-api/docs and search for pricing."
    print(f"Goal: {USER_PROMPT}")

    contents = [
        Content(role="user", parts=[
            Part(text=USER_PROMPT),
            Part.from_bytes(data=initial_screenshot, mime_type='image/png')
        ])
    ]

    # Agent Loop
    turn_limit = 5
    for i in range(turn_limit):
        print(f"\n--- Turn {i+1} ---")
        print("Thinking...")
        response = client.models.generate_content(
            model='gemini-2.5-computer-use-preview-10-2025',
            contents=contents,
            config=config,
        )

        candidate = response.candidates[0]
        contents.append(candidate.content)

        has_function_calls = any(part.function_call for part in candidate.content.parts)
        if not has_function_calls:
            text_response = " ".join([part.text for part in candidate.content.parts if part.text])
            print("Agent finished:", text_response)
            break

        print("Executing actions...")
        results = execute_function_calls(candidate, page, SCREEN_WIDTH, SCREEN_HEIGHT)

        print("Capturing state...")
        function_responses = get_function_responses(page, results)

        contents.append(
            Content(role="user", parts=[Part(function_response=fr) for fr in function_responses])
        )

finally:
    # Cleanup
    print("\nClosing browser...")
    browser.close()
    playwright.stop()

使用自訂使用者定義函式

您也可以視需要將自訂使用者定義函式納入要求,藉此擴充模型的功能。下例會納入使用者自訂動作 (例如 open_applong_press_atgo_home),並排除瀏覽器專屬動作,以調整電腦使用模型和工具,適用於行動裝置用途。模型可以智慧地呼叫這些自訂函式和標準 UI 動作,在非瀏覽器環境中完成工作。

Python

from typing import Optional, Dict, Any

from google import genai
from google.genai import types
from google.genai.types import Content, Part

client = genai.Client()

SYSTEM_PROMPT = """You are operating an Android phone. Today's date is October 15, 2023, so ignore any other date provided.
* To provide an answer to the user, *do not use any tools* and output your answer on a separate line. IMPORTANT: Do not add any formatting or additional punctuation/text, just output the answer by itself after two empty lines.
* Make sure you scroll down to see everything before deciding something isn't available.
* You can open an app from anywhere. The icon doesn't have to currently be on screen.
* Unless explicitly told otherwise, make sure to save any changes you make.
* If text is cut off or incomplete, scroll or click into the element to get the full text before providing an answer.
* IMPORTANT: Complete the given task EXACTLY as stated. DO NOT make any assumptions that completing a similar task is correct.  If you can't find what you're looking for, SCROLL to find it.
* If you want to edit some text, ONLY USE THE `type` tool. Do not use the onscreen keyboard.
* Quick settings shouldn't be used to change settings. Use the Settings app instead.
* The given task may already be completed. If so, there is no need to do anything.
"""

def open_app(app_name: str, intent: Optional[str] = None) -> Dict[str, Any]:
    """Opens an app by name.

    Args:
        app_name: Name of the app to open (any string).
        intent: Optional deep-link or action to pass when launching, if the app supports it.

    Returns:
        JSON payload acknowledging the request (app name and optional intent).
    """
    return {"status": "requested_open", "app_name": app_name, "intent": intent}

def long_press_at(x: int, y: int) -> Dict[str, int]:
    """Long-press at a specific screen coordinate.

    Args:
        x: X coordinate (absolute), scaled to the device screen width (pixels).
        y: Y coordinate (absolute), scaled to the device screen height (pixels).

    Returns:
        Object with the coordinates pressed and the duration used.
    """
    return {"x": x, "y": y}

def go_home() -> Dict[str, str]:
    """Navigates to the device home screen.

    Returns:
        A small acknowledgment payload.
    """
    return {"status": "home_requested"}

#  Build function declarations
CUSTOM_FUNCTION_DECLARATIONS = [
    types.FunctionDeclaration.from_callable(client=client, callable=open_app),
    types.FunctionDeclaration.from_callable(client=client, callable=long_press_at),
    types.FunctionDeclaration.from_callable(client=client, callable=go_home),
]

#Exclude browser functions
EXCLUDED_PREDEFINED_FUNCTIONS = [
    "open_web_browser",
    "search",
    "navigate",
    "hover_at",
    "scroll_document",
    "go_forward",
    "key_combination",
    "drag_and_drop",
]

#Utility function to construct a GenerateContentConfig
def make_generate_content_config() -> genai.types.GenerateContentConfig:
    """Return a fixed GenerateContentConfig with Computer Use + custom functions."""
    return genai.types.GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
        tools=[
            types.Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                    excluded_predefined_functions=EXCLUDED_PREDEFINED_FUNCTIONS,
                )
            ),
            types.Tool(function_declarations=CUSTOM_FUNCTION_DECLARATIONS),
        ],
    )

# Create the content with user message
contents: list[Content] = [
    Content(
        role="user",
        parts=[
            # text instruction
            Part(text="Open Chrome, then long-press at 200,400."),
        ],
    )
]

# Build your fixed config (from helper)
config = make_generate_content_config()

# Generate content with the configured settings
response = client.models.generate_content(
        model='gemini-2.5-computer-use-preview-10-2025',
        contents=contents,
        config=config,
    )

print(response)

支援的 UI 動作

電腦使用模型可以透過 FunctionCall 要求下列 UI 動作。用戶端程式碼必須實作這些動作的執行邏輯。如需範例,請參閱參考實作

指令名稱 說明 引數 (在函式呼叫中) 函式呼叫範例
open_web_browser 開啟網路瀏覽器。 {"name": "open_web_browser", "args": {}}
wait_5_seconds 暫停執行 5 秒,讓動態內容載入或動畫完成。 {"name": "wait_5_seconds", "args": {}}
go_back 前往瀏覽器記錄中的上一頁。 {"name": "go_back", "args": {}}
go_forward 前往瀏覽器記錄中的下一頁。 {"name": "go_forward", "args": {}}
search 前往預設搜尋引擎的首頁 (例如 Google),適合用來開始新的搜尋工作。 {"name": "search", "args": {}}
navigate 直接將瀏覽器導向指定網址。 url:str {"name": "navigate", "args": {"url": "https://www.wikipedia.org"}}
click_at 點選網頁上的特定座標。x 和 y 值是以 1000x1000 格線為準,並會縮放至螢幕尺寸。 y:int (0 到 999),x:int (0 到 999) {"name": "click_at", "args": {"y": 300, "x": 500}}
hover_at 將滑鼠懸停在網頁上的特定座標。可用於顯示子選單。x 和 y 是以 1000x1000 格線為準。 y: int (0-999) x: int (0-999) {"name": "hover_at", "args": {"y": 150, "x": 250}}
type_text_at 在特定座標輸入文字,預設會先清除欄位,然後在輸入完畢後按下 ENTER 鍵,但這些動作可以停用。x 和 y 座標是以 1000x1000 的格線為準。 y:int (0-999)、x:int (0-999)、text:str、press_enter:bool (選用,預設為 True)、clear_before_typing:bool (選用,預設為 True) {"name": "type_text_at", "args": {"y": 250, "x": 400, "text": "search query", "press_enter": false}}
key_combination 按下鍵盤按鍵或組合鍵,例如「Ctrl+C」或「Enter」。可用於觸發動作 (例如使用「Enter」鍵提交表單) 或剪貼簿作業。 keys:字串 (例如「enter」、「control+c」)。 {"name": "key_combination", "args": {"keys": "Control+A"}}
scroll_document 將整個網頁「向上」、「向下」、「向左」或「向右」捲動。 direction:字串 (「up」、「down」、「left」或「right」) {"name": "scroll_document", "args": {"direction": "down"}}
scroll_at 在指定方向上,將特定元素或區域捲動特定幅度,座標為 (x, y)。座標和量值 (預設為 800) 是以 1000x1000 格線為準。 y:int (0-999)、x:int (0-999)、direction:str ("up"、"down"、"left"、"right")、magnitude:int (0-999,選用,預設為 800) {"name": "scroll_at", "args": {"y": 500, "x": 500, "direction": "down", "magnitude": 400}}
drag_and_drop 從起始座標 (x, y) 拖曳元素,並在目的地座標 (destination_x, destination_y) 放開。所有座標都是以 1000x1000 的格線為準。 y:int (0-999)、x:int (0-999)、destination_y:int (0-999)、destination_x:int (0-999) {"name": "drag_and_drop", "args": {"y": 100, "x": 100, "destination_y": 500, "destination_x": 500}}

安全與安全性

確認安全裁決

視動作而定,模型回覆也可能包含來自內部安全系統的 safety_decision,該系統會檢查模型建議的動作。

{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95).",
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}

如果 safety_decisionrequire_confirmation,您必須先請使用者確認,才能繼續執行動作。根據服務條款,您不得略過要求人工確認的步驟。

這個程式碼範例會先提示使用者確認,再執行動作。如果使用者未確認動作,迴圈就會終止。如果使用者確認動作,系統就會執行動作,並將 safety_acknowledgement 欄位標示為 True

Python

import termcolor

def get_safety_confirmation(safety_decision):
    """Prompt user for confirmation when safety check is triggered."""
    termcolor.cprint("Safety service requires explicit confirmation!", color="red")
    print(safety_decision["explanation"])

    decision = ""
    while decision.lower() not in ("y", "n", "ye", "yes", "no"):
        decision = input("Do you wish to proceed? [Y]es/[N]o\n")

    if decision.lower() in ("n", "no"):
        return "TERMINATE"
    return "CONTINUE"

def execute_function_calls(candidate, page, screen_width, screen_height):

    # ... Extract function calls from response ...

    for function_call in function_calls:
        extra_fr_fields = {}

        # Check for safety decision
        if 'safety_decision' in function_call.args:
            decision = get_safety_confirmation(function_call.args['safety_decision'])
            if decision == "TERMINATE":
                print("Terminating agent loop")
                break
            extra_fr_fields["safety_acknowledgement"] = "true" # Safety acknowledgement

        # ... Execute function call and append to results ...

如果使用者確認,您必須在 FunctionResponse 中加入安全確認聲明。

Python

function_response_parts.append(
    FunctionResponse(
        name=name,
        response={"url": current_url,
                  **extra_fr_fields},  # Include safety acknowledgement
        parts=[
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/png", data=screenshot
                )
             )
           ]
         )
       )

安全性最佳做法

電腦使用 API 是新穎的 API,會帶來開發人員應注意的新風險:

為因應這些風險,您可以採取下列安全措施和最佳做法:

  1. 人機迴圈 (HITL):
    • 實作使用者確認:當安全回應指出 require_confirmation 時,您必須先實作使用者確認,才能執行操作。如需範例程式碼,請參閱「確認安全決策」。
    • 提供自訂安全指示:除了內建的使用者確認檢查外,開發人員也可以選擇新增自訂系統指示,強制執行自己的安全政策,禁止特定模型動作,或要求使用者先確認,模型才能執行特定高風險的不可逆動作。以下是與模型互動時可加入的自訂安全系統指令範例。

    安全操作指南範例

    將自訂安全規則設為系統指令:
    ## RULE 1: Seek User Confirmation (USER_CONFIRMATION)
    This is your first and most important check. If the next required action falls
    into any of the following categories, you MUST stop immediately, and seek the
    user's explicit permission.
    Procedure for Seeking Confirmation: * For Consequential Actions:
    Perform all preparatory steps (e.g., navigating, filling out forms, typing a
    message). You will ask for confirmation AFTER all necessary information is
    entered on the screen, but BEFORE you perform the final, irreversible action
    (e.g., before clicking "Send", "Submit", "Confirm Purchase", "Share"). * For
    Prohibited Actions:
    If the action is strictly forbidden (e.g., accepting legal
    terms, solving a CAPTCHA), you must first inform the user about the required
    action and ask for their confirmation to proceed.
    USER_CONFIRMATION Categories:
    * Consent and Agreements: You are FORBIDDEN from accepting, selecting, or
    agreeing to any of the following on the user's behalf. You must ask the
    user to confirm before performing these actions.
    * Terms of Service
    * Privacy Policies
    * Cookie consent banners
    * End User License Agreements (EULAs)
    * Any other legally significant contracts or agreements.
    * Robot Detection: You MUST NEVER attempt to solve or bypass the
    following. You must ask the user to confirm before performing these actions.
    * CAPTCHAs (of any kind)
    * Any other anti-robot or human-verification mechanisms, even if you are
    capable.
    * Financial Transactions:
    * Completing any purchase.
    * Managing or moving money (e.g., transfers, payments).
    * Purchasing regulated goods or participating in gambling.
    * Sending Communications:
    * Sending emails.
    * Sending messages on any platform (e.g., social media, chat apps).
    * Posting content on social media or forums.
    * Accessing or Modifying Sensitive Information:
    * Health, financial, or government records (e.g., medical history, tax
    forms, passport status).
    * Revealing or modifying sensitive personal identifiers (e.g., SSN, bank
    account number, credit card number).
    * User Data Management:
    * Accessing, downloading, or saving files from the web.
    * Sharing or sending files/data to any third party.
    * Transferring user data between systems.
    * Browser Data Usage:
    * Accessing or managing Chrome browsing history, bookmarks, autofill data,
    or saved passwords.
    * Security and Identity:
    * Logging into any user account.
    * Any action that involves misrepresentation or impersonation (e.g.,
    creating a fan account, posting as someone else).
    * Insurmountable Obstacles: If you are technically unable to interact with
    a user interface element or are stuck in a loop you cannot resolve, ask the
    user to take over.
    ---
    ## RULE 2: Default Behavior (ACTUATE)
    If an action does NOT fall under the conditions for USER_CONFIRMATION,
    your default behavior is to Actuate.
    Actuation Means: You MUST proactively perform all necessary steps to move
    the user's request forward. Continue to actuate until you either complete the
    non-consequential task or encounter a condition defined in Rule 1.
    * Example 1: If asked to send money, you will navigate to the payment
    portal, enter the recipient's details, and enter the amount. You will then
    STOP as per Rule 1 and ask for confirmation before clicking the final
    "Send" button.
    * Example 2: If asked to post a message, you will navigate to the site,
    open the post composition window, and write the full message. You will then
    STOP as per Rule 1 and ask for confirmation before clicking the final
    "Post" button.
    After the user has confirmed, remember to get the user's latest screen
    before continuing to perform actions.
    # Final Response Guidelines:
    Write final response to the user in the following cases:
    - User confirmation
    - When the task is complete or you have enough information to respond to the user

  2. 安全執行環境:在安全沙箱環境中執行代理程式,限制潛在影響 (例如沙箱虛擬機器 (VM)、容器 (例如 Docker),或是權限受限的專用瀏覽器設定檔。
  3. 輸入內容清理:清理提示中的所有使用者生成文字,降低出現非預期指令或提示注入的風險。這層安全防護很有幫助,但無法取代安全執行環境。
  4. 內容防護機制:使用防護機制和內容安全 API 評估使用者輸入內容、工具輸入和輸出內容、代理程式回覆是否適當、提示詞注入和越獄偵測。
  5. 許可清單和封鎖清單:導入篩選機制,控管模型可前往的位置和可執行的動作。禁止存取的網站封鎖清單是不錯的起點,而限制更嚴格的許可清單則更加安全。
  6. 可觀測性和記錄:維護詳細記錄,以利偵錯、稽核及事件應變。客戶應記錄提示、螢幕截圖、模型建議的動作 (function_call)、安全回應,以及客戶最終執行的所有動作。
  7. 環境管理:確保 GUI 環境一致。如果出現非預期的彈出式視窗、通知或版面配置變更,模型可能會感到困惑。盡可能從已知且乾淨的狀態開始執行每項新工作。

模型版本

屬性 說明
模型代碼 Gemini API gemini-2.5-computer-use-preview-10-2025
支援的資料類型 輸入功率 圖片、文字 輸出內容 文字
代幣限制[*] 輸入權杖限制 128,000 輸出詞元限制 64,000
個版本 如要瞭解詳情,請參閱模型版本模式。 預覽:gemini-2.5-computer-use-preview-10-2025
最新更新 2025 年 10 月

後續步驟

除非另有註明,否則本頁面中的內容是採用創用 CC 姓名標示 4.0 授權,程式碼範例則為阿帕契 2.0 授權。詳情請參閱《Google Developers 網站政策》。Java 是 Oracle 和/或其關聯企業的註冊商標。

上次更新時間:2025-10-24 (世界標準時間)。