Proposal: standardize _meta.imageCapability extension for backend-advertised image limits + auto-handling flags

## Summary

This proposes standardizing an `_meta.imageCapability` field on the `initialize` response that lets agents advertise their image-handling limits and auto-handling behaviour to clients. Per the [RFD: Meta Field Propagation Conventions](https://agentclientprotocol.com/rfds/meta-propagation.md), `_meta` is the spec-blessed extension point; the proposal promotes a vendor-namespaced extension to a standard sub-key.

## Wire shape (proposed)

```json
{
  "_meta": {
    "imageCapability": {
      "maxBytes": 5242880,
      "maxDimension": 8000,
      "downsampleTargetBytes": 524288,
      "autoHandlesOversized": true,
      "autoHandlesWrongModel": true
    }
  }
}
```

- `maxBytes` (uint): hard upper bound this agent will accept for a single image.
- `maxDimension` (uint): hard upper bound on either edge in pixels.
- `downsampleTargetBytes` (uint): the agent's *recommended* target — clients may pre-downsample to this size, but the agent will still accept up to `maxBytes`.
- `autoHandlesOversized` (bool): when `true`, the agent internally downsamples / describes oversized images; the client should NOT pre-compress beyond its own UX preferences.
- `autoHandlesWrongModel` (bool): when `true`, the agent internally routes images through a VLM if the active model is text-only; the client should NOT surface a 'model does not support images' error.
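
A client might type and read this object roughly as follows. This is a minimal TypeScript sketch, not part of the proposal: `parseImageCapability` and its helpers are hypothetical names, and treating every field as optional is an assumption, since the proposal does not say which fields are required.

```typescript
// Hypothetical client-side type for the proposed `_meta.imageCapability` object.
// All fields are optional here by assumption.
interface ImageCapability {
  maxBytes?: number;
  maxDimension?: number;
  downsampleTargetBytes?: number;
  autoHandlesOversized?: boolean;
  autoHandlesWrongModel?: boolean;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Accepts only non-negative integers, matching the `uint` annotation.
function asUint(value: unknown): number | undefined {
  return typeof value === "number" && isFinite(value) && value >= 0 && Math.floor(value) === value
    ? value
    : undefined;
}

function asBool(value: unknown): boolean | undefined {
  return typeof value === "boolean" ? value : undefined;
}

// Reads `_meta.imageCapability` from an `initialize` response.
// Returns undefined when the agent advertises nothing; ill-typed fields come back undefined.
function parseImageCapability(initializeResponse: unknown): ImageCapability | undefined {
  if (!isRecord(initializeResponse)) return undefined;
  const meta = initializeResponse["_meta"];
  if (!isRecord(meta)) return undefined;
  const raw = meta["imageCapability"];
  if (!isRecord(raw)) return undefined;
  return {
    maxBytes: asUint(raw["maxBytes"]),
    maxDimension: asUint(raw["maxDimension"]),
    downsampleTargetBytes: asUint(raw["downsampleTargetBytes"]),
    autoHandlesOversized: asBool(raw["autoHandlesOversized"]),
    autoHandlesWrongModel: asBool(raw["autoHandlesWrongModel"]),
  };
}
```

Dropping ill-typed fields instead of rejecting the whole object keeps the client tolerant of agents that advertise only part of the shape.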

The existing `agentCapabilities.promptCapabilities.image: boolean` stays as the binary opt-in; this extension is for the richer, per-agent operational details.
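
One way a client could combine the binary opt-in with the `autoHandlesWrongModel` flag is sketched below. The function name, the result labels, and the `activeModelIsTextOnly` input are all hypothetical; in particular, how a client learns that the active model is text-only is outside this proposal and is assumed here.

```typescript
// Only the flag this sketch needs from the proposed `_meta.imageCapability`.
interface WrongModelFlag {
  autoHandlesWrongModel?: boolean;
}

type ImageGate = "send" | "images-unsupported" | "model-lacks-vision";

// `promptImage` mirrors `agentCapabilities.promptCapabilities.image`.
// `activeModelIsTextOnly` is assumed client-side knowledge, not an ACP field.
function gateImagePrompt(
  promptImage: boolean,
  activeModelIsTextOnly: boolean,
  cap: WrongModelFlag | undefined,
): ImageGate {
  // The binary opt-in still decides whether images are accepted at all.
  if (!promptImage) return "images-unsupported";
  if (!activeModelIsTextOnly) return "send";
  // Text-only model: suppress the error only when the agent says it routes via a VLM.
  return cap !== undefined && cap.autoHandlesWrongModel === true
    ? "send"
    : "model-lacks-vision";
}
```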

## Why standardize

- **Two implementations ship this design today**: [Qwen Code](https://github.com/QwenLM/qwen-code)'s `vision-bridge-service.ts` already routes images through a designated vision model when the active model is text-only (with per-turn cap, timeout, and configurable per-provider vision model). [sudocode](https://github.com/sudoprivacy/sudocode/pull/258) just shipped the same shape via `runtime::image_registry::capability()` + `vlm_describe.rs`.
- **Clients currently can't right-size**: without an advertised cap, ACP clients either hardcode conservative defaults (under-utilising frontier vision models) or send raw bytes and let the backend reject (bad UX on the failure side).
- **Audio is the obvious extension**: an `_meta.audioCapability` of the same shape will be needed when audio-multimodal agents proliferate (e.g. transcription proxies). Standardizing the shape NOW gives that work precedent.
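
To illustrate the right-sizing point, here is a minimal sketch of how a client could pick between sending an image untouched and pre-downsampling it. `planImage`, `ImageInfo`, and `ImagePlan` are hypothetical names. The sketch assumes one reading of the field semantics: `maxBytes` and `maxDimension` are always hard limits, while `autoHandlesOversized` only covers images between `downsampleTargetBytes` and `maxBytes`. The proposal does not spell out that interplay.

```typescript
// Hypothetical client-side view of the proposed `_meta.imageCapability` fields.
interface ImageCapability {
  maxBytes?: number;
  maxDimension?: number;
  downsampleTargetBytes?: number;
  autoHandlesOversized?: boolean;
}

interface ImageInfo {
  bytes: number;
  width: number;
  height: number;
}

type ImagePlan =
  | { action: "send" }
  | { action: "downsample"; targetBytes: number | undefined; maxDimension: number | undefined };

function planImage(image: ImageInfo, cap: ImageCapability | undefined): ImagePlan {
  // Nothing advertised: the client keeps whatever defaults it already had.
  if (cap === undefined) return { action: "send" };

  const longestEdge = Math.max(image.width, image.height);
  const overBytes = cap.maxBytes !== undefined && image.bytes > cap.maxBytes;
  const overEdge = cap.maxDimension !== undefined && longestEdge > cap.maxDimension;

  // Above a hard bound the agent would refuse the image, so shrink first (assumed reading).
  if (overBytes || overEdge) {
    return {
      action: "downsample",
      targetBytes: cap.downsampleTargetBytes ?? cap.maxBytes,
      maxDimension: cap.maxDimension,
    };
  }

  // Within the hard bounds but above the recommended target:
  // only pre-downsample when the agent does not handle oversized images itself.
  const overTarget =
    cap.downsampleTargetBytes !== undefined && image.bytes > cap.downsampleTargetBytes;
  if (overTarget && cap.autoHandlesOversized !== true) {
    return {
      action: "downsample",
      targetBytes: cap.downsampleTargetBytes,
      maxDimension: cap.maxDimension,
    };
  }
  return { action: "send" };
}
```

With the example values from the wire shape, a 1 MB image is sent untouched when `autoHandlesOversized` is `true`, and pre-downsampled toward 524288 bytes when it is `false`.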

## Forward-compat: `_meta.audioCapability`

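The same five-field shape would carry over to audio, with image-specific fields swapped for audio equivalents (`maxBytes` stays, `maxDimension` becomes `maxDurationMs`, etc.). It is mentioned here so that the field-naming convention is chosen with that extension in mind.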

## Design rationale

Full design at https://s.shareone.vip/s/image-handling-non-user-facing (sudocode + sudowork's bilateral implementation, including spot-check matrix of Claude Code / Codex / Qwen native behaviour).

## Prior art summary (verified 2026-06-30)

| Agent | Oversized handling | Wrong-model handling | Advertises caps? |
|---|---|---|---|
| Qwen | ✅ caps at 9.9 MB pre-route | ✅ vision-bridge to per-provider model | ❌ not yet |
| Codex | ✅ HIGH_DETAIL_LIMITS downsample | ❌ — passes through, lets API error | ❌ |
| Claude Code | ⚠️ source opaque (bundled JS) | ⚠️ source opaque | ❌ |
| sudocode | ✅ JPEG-quality loop → VLM-describe | ✅ vlm_describe.rs | ✅ this proposal |

Happy to draft the spec PR once shape is roughly agreed.

— filed by sudowork-win-pc-0 on behalf of the sudocode + sudowork integration
