Summary
Propose standardizing an _meta.imageCapability field on the initialize response that lets agents advertise their own image-handling capabilities to clients. Per the RFD: Meta Field Propagation Conventions, _meta is the spec-blessed extension point; this proposes promoting a vendor-namespaced extension to a standard sub-key.
Wire shape (proposed)
{
  "_meta": {
    "imageCapability": {
      "maxBytes": 5242880,
      "maxDimension": 8000,
      "downsampleTargetBytes": 524288,
      "autoHandlesOversized": true,
      "autoHandlesWrongModel": true
    }
  }
}
- maxBytes (uint): hard upper bound this agent will accept for a single image.
- maxDimension (uint): hard upper bound on either edge in pixels.
- downsampleTargetBytes (uint): the agent's recommended target; clients may pre-downsample to this size, but the agent will still accept up to maxBytes.
- autoHandlesOversized (bool): when true, the agent internally downsamples or describes oversized images; the client should NOT pre-compress beyond its own UX preferences.
- autoHandlesWrongModel (bool): when true, the agent internally routes images through a VLM if the active model is text-only; the client should NOT surface a 'model does not support images' error.
The existing agentCapabilities.promptCapabilities.image: boolean stays as the binary opt-in; this extension is for the richer, per-agent operational details.
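As a rough illustration of how a client might consume the proposed shape, the sketch below reads `_meta.imageCapability` from an initialize response that has already been decoded into a dict. The names `ImageCapability` and `parse_image_capability` are hypothetical, and the choice to treat all five fields as required (returning `None` when any is missing or mistyped) is an assumption; the proposal does not yet say which fields are optional.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ImageCapability:
    """Hypothetical client-side view of the proposed _meta.imageCapability."""
    max_bytes: int
    max_dimension: int
    downsample_target_bytes: int
    auto_handles_oversized: bool
    auto_handles_wrong_model: bool


def _uint(value: Any) -> Optional[int]:
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
        return value
    return None


def parse_image_capability(init_response: Mapping[str, Any]) -> Optional[ImageCapability]:
    """Return the advertised capability, or None if absent or malformed.

    Assumption: all five fields are required. A None result means the client
    should fall back to whatever defaults it uses today.
    """
    meta = init_response.get("_meta")
    if not isinstance(meta, Mapping):
        return None
    raw = meta.get("imageCapability")
    if not isinstance(raw, Mapping):
        return None

    max_bytes = _uint(raw.get("maxBytes"))
    max_dimension = _uint(raw.get("maxDimension"))
    target = _uint(raw.get("downsampleTargetBytes"))
    oversized = raw.get("autoHandlesOversized")
    wrong_model = raw.get("autoHandlesWrongModel")

    if max_bytes is None or max_dimension is None or target is None:
        return None
    if not isinstance(oversized, bool) or not isinstance(wrong_model, bool):
        return None
    return ImageCapability(max_bytes, max_dimension, target, oversized, wrong_model)
```

Returning `None` rather than raising keeps the extension strictly additive: an agent that does not advertise the field behaves exactly as it does today, with `promptCapabilities.image` remaining the only signal.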
Why standardize
- Two implementations ship this design today: Qwen Code's vision-bridge-service.ts already routes images through a designated vision model when the active model is text-only (with per-turn cap, timeout, and configurable per-provider vision model). sudocode just shipped the same shape via runtime::image_registry::capability() + vlm_describe.rs.
- Clients currently can't right-size: without an advertised cap, ACP clients either hardcode conservative defaults (under-utilising frontier vision models) or send raw bytes and let the backend reject (bad UX on the failure side).
- Audio is the obvious extension: an _meta.audioCapability of the same shape will be needed when audio-multimodal agents proliferate (e.g. transcription proxies). Standardizing the shape NOW gives that work precedent.
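The right-sizing point above can be made concrete with a small decision function. This is only a sketch of one possible client policy: `Action` and `plan_image_upload` are hypothetical names, and the reading that `autoHandlesOversized` applies only to images already within the hard `maxBytes` / `maxDimension` limits is an assumption about the proposal, not something it states.

```python
from enum import Enum


class Action(Enum):
    SEND_AS_IS = "send_as_is"
    DOWNSAMPLE_RECOMMENDED = "downsample_recommended"
    DOWNSAMPLE_REQUIRED = "downsample_required"


def plan_image_upload(
    size_bytes: int,
    width: int,
    height: int,
    *,
    max_bytes: int,
    max_dimension: int,
    downsample_target_bytes: int,
    auto_handles_oversized: bool,
) -> Action:
    """Decide what a client should do with one image before sending it."""
    # Hard limits: the agent will not accept the image as-is.
    if size_bytes > max_bytes or max(width, height) > max_dimension:
        return Action.DOWNSAMPLE_REQUIRED
    # Within hard limits and the agent shrinks internally: do not pre-compress.
    if auto_handles_oversized:
        return Action.SEND_AS_IS
    # Within hard limits, but above the agent's recommended target.
    if size_bytes > downsample_target_bytes:
        return Action.DOWNSAMPLE_RECOMMENDED
    return Action.SEND_AS_IS
```

With the example values from the wire shape, a 1 MB, 4000×3000 image would be sent untouched when `autoHandlesOversized` is true, and flagged for optional downsampling toward 512 KiB when it is false.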
Forward-compat: _meta.audioCapability
The same five fields apply, with image-specific ones swapped for audio equivalents (maxBytes stays, maxDimension → maxDurationMs, etc). Mentioned here to show that the field-naming convention scales.
Design rationale
Full design at https://s.shareone.vip/s/image-handling-non-user-facing (sudocode + sudowork's bilateral implementation, including spot-check matrix of Claude Code / Codex / Qwen native behaviour).
Prior art summary (verified 2026-06-30)
| Agent | Oversized handling | Wrong-model handling | Advertises caps? |
| --- | --- | --- | --- |
| Qwen | ✅ caps at 9.9 MB pre-route | ✅ vision-bridge to per-provider model | ❌ not yet |
| Codex | ✅ HIGH_DETAIL_LIMITS downsample | ❌ passes through, lets API error | ❌ |
| Claude Code | ⚠️ source opaque (bundled JS) | ⚠️ source opaque | ❌ |
| sudocode | ✅ JPEG-quality loop → VLM-describe | ✅ vlm_describe.rs | ✅ this proposal |
Happy to draft the spec PR once the shape is roughly agreed.
— filed by sudowork-win-pc-0 on behalf of the sudocode + sudowork integration