datacurve-pier

Pier is a Harbor-compatible framework for evaluating coding agents in sandboxed environments. It reads Harbor's task format and runs trials against it.

pier run -p path/to/task --agent claude-code --env modal

Why pier

Pier is a fork of Harbor. We wanted a smaller, more opinionated base to build on. On top of Harbor, Pier adds:

What works today

Pier does not currently resolve or download Harbor registry datasets directly.

Install

uv tool install datacurve-pier

or

pip install datacurve-pier

Run

export ANTHROPIC_API_KEY=...
pier run -p path/to/task --agent claude-code --env modal --env-file .env

Run a local dataset, optionally a deterministic random subset:

pier run -p path/to/dataset --agent claude-code --env modal
pier run -p path/to/dataset --n-tasks 10 --sample-seed 0
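The "deterministic random subset" selected by --n-tasks and --sample-seed can be pictured as seeded sampling over a sorted task list. The sketch below is an illustration only, not Pier's actual implementation: sample_tasks and the task names are hypothetical.

```python
import random


def sample_tasks(task_names, n_tasks, seed):
    """Pick a reproducible subset of task names (hypothetical helper)."""
    # Sort first so the order in which tasks were discovered cannot change the result.
    ordered = sorted(task_names)
    if n_tasks >= len(ordered):
        return ordered
    # A private RNG seeded per call; the global random state is left untouched.
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, n_tasks))


tasks = [f"task-{i:03d}" for i in range(50)]
first = sample_tasks(tasks, 10, seed=0)
second = sample_tasks(list(reversed(tasks)), 10, seed=0)
print(first == second)  # True: same seed and same task set give the same subset
```

Under this assumption, rerunning with the same seed and the same dataset picks the same tasks, which is what makes results comparable across agents.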

To use a Harbor registry dataset, download it with Harbor first, then point Pier at it:

uv run --directory ~/code/harbor harbor download swebenchpro -o ~/code/pier/datasets
uv run pier run -p datasets/swebenchpro --n-tasks 10 --sample-seed 0

Trials land under jobs/<timestamp_or_name>/<trial_id>/. See pier run --help, pier job --help, pier critique --help, and pier view --help for everything else.
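To walk the output tree from a script, something like the following may be enough. It assumes only the jobs/<timestamp_or_name>/<trial_id>/ layout described above; list_trials is a hypothetical helper, and the directory names in the demo are placeholders, not real Pier output.

```python
import tempfile
from pathlib import Path


def list_trials(jobs_root):
    """Map each job directory name to the sorted names of its trial directories."""
    trials = {}
    for job_dir in sorted(Path(jobs_root).iterdir()):
        if job_dir.is_dir():
            trials[job_dir.name] = sorted(
                p.name for p in job_dir.iterdir() if p.is_dir()
            )
    return trials


# Demo against a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as root:
    for trial_id in ("trial-a", "trial-b"):
        (Path(root) / "example-job" / trial_id).mkdir(parents=True)
    found = list_trials(root)

print(found)
```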

Agent runtime configuration

Use agent.model_name for trial metadata, agent.env for runtime env vars, and agent-specific kwargs for tool config. Pier's network allowlist also reads URLs out of those configs (Codex config_toml, OpenCode opencode_config, mini-swe config_yaml), so any base URL you set is allowlisted without code changes.
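One way to picture the allowlist behaviour is a scan that pulls every URL out of the config text and keeps its hostname. This is a sketch under assumptions, not Pier's code: hosts_from_configs is hypothetical, the example.invalid hosts are placeholders, and the config snippets are illustrative fragments rather than complete agent configs.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def hosts_from_configs(*config_texts):
    """Collect the unique hostnames of all URLs found in the given config strings."""
    hosts = set()
    for text in config_texts:
        for match in URL_PATTERN.findall(text):
            host = urlparse(match).hostname
            if host:
                hosts.add(host)
    return sorted(hosts)


# Placeholder fragments standing in for Codex config_toml and mini-swe config_yaml.
codex_config_toml = 'base_url = "https://gateway.example.invalid/v1"\n'
mini_swe_config_yaml = "model:\n  api_base: https://llm.example.invalid/api\n"

allowlist = hosts_from_configs(codex_config_toml, mini_swe_config_yaml)
print(allowlist)
```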

A few things we've learned plumbing this through Respan and OpenRouter:

Claude Code routes through the Anthropic face from Respan. Plan mode is disabled by default (--disallowedTools EnterPlanMode).

Codex needs a [model_providers.<name>] block with wire_api = "responses" (not WebSockets, which Codex defaults to and Respan doesn't speak).
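As a rough illustration of the Codex block, the helper below renders a config_toml string with wire_api = "responses". The key names (model_provider, name, base_url, env_key, wire_api) follow Codex's config.toml format as we understand it, so check them against the Codex documentation. The provider id, base URL, and RESPAN_API_KEY variable name are placeholders, and render_codex_config is a hypothetical helper.

```python
def render_codex_config(provider_id, base_url, env_key):
    """Render a minimal Codex config_toml that selects a custom provider."""
    lines = [
        f'model_provider = "{provider_id}"',
        "",
        f"[model_providers.{provider_id}]",
        f'name = "{provider_id}"',
        f'base_url = "{base_url}"',   # placeholder URL, not a real endpoint
        f'env_key = "{env_key}"',     # name of the env var that holds the API key
        'wire_api = "responses"',     # Responses API over HTTP, as described above
    ]
    return "\n".join(lines) + "\n"


config_toml = render_codex_config(
    provider_id="respan",
    base_url="https://gateway.example.invalid/v1",
    env_key="RESPAN_API_KEY",
)
print(config_toml)
```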

Gemini CLI:

Cursor CLI uses the installed cursor-agent binary, so it fits the same inside-the-sandbox path as Claude Code, Codex, Gemini CLI, and OpenCode. Use cursor/composer-2.5 for Composer 2.5 trial metadata and pass CURSOR_API_KEY through your env file.

OpenCode uses opencode_config to add unknown providers or override known ones. To redirect Google to Respan, override just options.baseURL; to add a fully custom provider, use opencode_config.provider.<name> with the npm package, options, and models.
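The two OpenCode cases might look like the following, written as Python dicts that serialize to the JSON shape OpenCode expects. This is a sketch: the base URLs, the my-gateway provider id, and the my-model model id are placeholders, and @ai-sdk/openai-compatible is assumed as the npm package for an OpenAI-compatible endpoint.

```python
import json

# Case 1: keep the built-in Google provider and override only its base URL.
redirect_google = {
    "provider": {
        "google": {
            "options": {"baseURL": "https://gateway.example.invalid/google"},
        }
    }
}

# Case 2: a fully custom provider needs the npm package, options, and models.
custom_provider = {
    "provider": {
        "my-gateway": {
            "npm": "@ai-sdk/openai-compatible",  # assumed adapter package
            "options": {"baseURL": "https://gateway.example.invalid/v1"},
            "models": {"my-model": {}},
        }
    }
}

print(json.dumps(redirect_google, indent=2))
print(json.dumps(custom_provider, indent=2))
```

The override stays small because known providers already carry their package and model list; only unknown providers need the full definition.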

mini-swe-agent picks a native adapter from the model-name prefix: openai/... → litellm_response (OpenAI Responses end-to-end), openrouter/... → openrouter (BYOK costs from cost_details.upstream_inference_cost), everything else → LiteLLM auto.
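The prefix rule for mini-swe-agent amounts to a small dispatch function. The sketch below mirrors the mapping described above. pick_adapter is hypothetical, "litellm_auto" is our own label for the LiteLLM auto fallback, and the model names are placeholders.

```python
def pick_adapter(model_name):
    """Choose an adapter label from the model-name prefix."""
    if model_name.startswith("openai/"):
        return "litellm_response"  # OpenAI Responses end-to-end
    if model_name.startswith("openrouter/"):
        return "openrouter"        # cost read from cost_details.upstream_inference_cost
    return "litellm_auto"          # hypothetical label for the LiteLLM auto fallback


for name in ("openai/some-model", "openrouter/vendor/some-model", "gemini/some-model"):
    print(name, "->", pick_adapter(name))
```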

For Gemini 3 via mini-swe-agent/LiteLLM, omitting reasoning_effort uses the Gemini API default high/dynamic thinking level, but it does not request readable thought summaries. Set kwargs.reasoning_effort: high explicitly when you want LiteLLM to send includeThoughts and preserve returned summaries as reasoning content.