Proposal: standardize _meta.imageCapability extension for backend-advertised image limits + auto-handling flags #1559

Description

@elfenlieds7

Summary

Propose standardizing an _meta.imageCapability field on the initialize response that lets agents advertise their own image-handling capabilities to clients. Per the RFD: Meta Field Propagation Conventions, _meta is the spec-blessed extension point; this proposes promoting a vendor-namespaced extension to a standard sub-key.

Wire shape (proposed)

{
  "_meta": {
    "imageCapability": {
      "maxBytes": 5242880,
      "maxDimension": 8000,
      "downsampleTargetBytes": 524288,
      "autoHandlesOversized": true,
      "autoHandlesWrongModel": true
    }
  }
}
  • maxBytes (uint): hard upper bound this agent will accept for a single image.
  • maxDimension (uint): hard upper bound on either edge in pixels.
  • downsampleTargetBytes (uint): the agent's recommended target — clients may pre-downsample to this size, but the agent will still accept up to maxBytes.
  • autoHandlesOversized (bool): when true, the agent internally downsamples / describes oversized images; the client should NOT pre-compress beyond its own UX preferences.
  • autoHandlesWrongModel (bool): when true, the agent internally routes images through a VLM if the active model is text-only; the client should NOT surface a 'model does not support images' error.

The existing agentCapabilities.promptCapabilities.image: boolean stays as the binary opt-in; this extension is for the richer, per-agent operational details.
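To illustrate how a client might read this field defensively, here is a minimal TypeScript sketch. The field names come from the wire shape above; the helper names (`parseImageCapability`, `uintOrUndefined`) are hypothetical and not part of any ACP SDK, and treating a missing or malformed flag as `false` is an assumption, not something the proposal specifies.

```typescript
// Sketch only: mirrors the proposed _meta.imageCapability wire shape.
interface ImageCapability {
  maxBytes?: number;              // hard upper bound per image, in bytes
  maxDimension?: number;          // hard upper bound on either edge, in pixels
  downsampleTargetBytes?: number; // recommended pre-downsample target
  autoHandlesOversized: boolean;
  autoHandlesWrongModel: boolean;
}

// Accept only non-negative integers for the uint fields; drop anything else.
function uintOrUndefined(v: unknown): number | undefined {
  return typeof v === "number" && Number.isInteger(v) && v >= 0 ? v : undefined;
}

function asRecord(v: unknown): Record<string, unknown> | undefined {
  return typeof v === "object" && v !== null
    ? (v as Record<string, unknown>)
    : undefined;
}

// Returns undefined when the agent does not advertise the extension at all,
// so callers can fall back to their existing behaviour.
function parseImageCapability(initializeResponse: unknown): ImageCapability | undefined {
  const meta = asRecord(asRecord(initializeResponse)?.["_meta"]);
  const raw = asRecord(meta?.["imageCapability"]);
  if (!raw) return undefined;
  return {
    maxBytes: uintOrUndefined(raw["maxBytes"]),
    maxDimension: uintOrUndefined(raw["maxDimension"]),
    downsampleTargetBytes: uintOrUndefined(raw["downsampleTargetBytes"]),
    // Assumption: an absent or non-boolean flag is treated as false.
    autoHandlesOversized: raw["autoHandlesOversized"] === true,
    autoHandlesWrongModel: raw["autoHandlesWrongModel"] === true,
  };
}
```

Because the parser returns `undefined` for agents that do not send the field, a client can keep relying on `promptCapabilities.image` alone for those agents.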

Why standardize

  • Two implementations ship this design today: Qwen Code's vision-bridge-service.ts already routes images through a designated vision model when the active model is text-only (with per-turn cap, timeout, and configurable per-provider vision model). sudocode just shipped the same shape via runtime::image_registry::capability() + vlm_describe.rs.
  • Clients currently can't right-size: without an advertised cap, ACP clients either hardcode conservative defaults (under-utilising frontier vision models) or send raw bytes and let the backend reject (bad UX on the failure side).
  • Audio is the obvious extension: an _meta.audioCapability of the same shape will be needed when audio-multimodal agents proliferate (e.g. transcription proxies). Standardizing the shape NOW gives that work precedent.
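The right-sizing point can be sketched as a small client-side decision function. This is one possible reading of the proposed semantics, not settled behaviour: `planImageUpload`, `ImagePlan`, and the 1 MiB `FALLBACK_MAX_BYTES` default are hypothetical, only byte size is considered (not `maxDimension`), and the sketch assumes that `autoHandlesOversized: true` means the client may send an image above `maxBytes` unchanged.

```typescript
// Sketch only: a subset of the proposed fields, enough for a size decision.
interface ImageCapability {
  maxBytes?: number;
  downsampleTargetBytes?: number;
  autoHandlesOversized: boolean;
}

type ImagePlan =
  | { action: "send-as-is" }
  | { action: "downsample"; targetBytes: number };

// Hypothetical conservative default used when the agent advertises nothing.
const FALLBACK_MAX_BYTES = 1_048_576;

function planImageUpload(sizeBytes: number, cap?: ImageCapability): ImagePlan {
  // No advertised capability: fall back to the hardcoded conservative cap.
  if (!cap) {
    return sizeBytes <= FALLBACK_MAX_BYTES
      ? { action: "send-as-is" }
      : { action: "downsample", targetBytes: FALLBACK_MAX_BYTES };
  }
  const limit = cap.maxBytes ?? FALLBACK_MAX_BYTES;
  if (sizeBytes <= limit) return { action: "send-as-is" };
  // Assumption: the agent downsamples or describes oversized images itself.
  if (cap.autoHandlesOversized) return { action: "send-as-is" };
  // Otherwise shrink to the recommended target, or to the hard limit.
  return { action: "downsample", targetBytes: cap.downsampleTargetBytes ?? limit };
}
```

With the example values from the wire shape, a 2 MB image that a capability-unaware client would shrink to the fallback cap is sent unchanged once the agent advertises `maxBytes: 5242880`.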

Forward-compat: _meta.audioCapability

The same five fields apply (maxBytes, with maxDimension replaced by maxDurationMs, etc.). Mentioning it here so the field-naming convention scales.

Design rationale

Full design at https://s.shareone.vip/s/image-handling-non-user-facing (sudocode + sudowork's bilateral implementation, including spot-check matrix of Claude Code / Codex / Qwen native behaviour).

Prior art summary (verified 2026-06-30)

| Agent | Oversized handling | Wrong-model handling | Advertises caps? |
|---|---|---|---|
| Qwen | ✅ caps at 9.9 MB pre-route | ✅ vision-bridge to per-provider model | ❌ not yet |
| Codex | ✅ HIGH_DETAIL_LIMITS downsample | ❌ passes through, lets API error | |
| Claude Code | ⚠️ source opaque (bundled JS) | ⚠️ source opaque | |
| sudocode | ✅ JPEG-quality loop → VLM-describe | ✅ vlm_describe.rs | ✅ this proposal |

Happy to draft the spec PR once shape is roughly agreed.

— filed by sudowork-win-pc-0 on behalf of the sudocode + sudowork integration
