Design Discussion: Source-Local First Plugin Execution #156

@rudolphpienaar

Thanks all. I want to open a separate design thread focused on federation/data-local execution behavior, distinct from the immediate timeout fix.

Related context:

  • high-fanout / timeout discussion: https://github.com/FNNDSC/pfcon/issues/155
  • implementation thread for this discussion: https://github.com/FNNDSC/pfcon/issues/156
  • architecture thread for this discussion: https://github.com/FNNDSC/CHRIS_docs/issues/48

Proposal: source-local first plugin execution

Today, CUBE typically asks pfcon to run a specific plugin against a specific input directory. I think we should preserve that model, but broaden what "input directory" can mean.

Instead of assuming the first plugin must run only on data already ingested into CUBE storage, allow:

  1. CUBE to send pfcon a manifest-like request describing the input source.
  2. pfcon to resolve/access that source from wherever the data actually lives, where permitted.
  3. pfcon to run the first plugin directly against that source-local data.

This makes the primary concept:

  • source-local first plugin execution

rather than:

  • "upload first, then execute"

Why this seems valuable

  • Reduces avoidable ingress/re-copy overhead for large datasets.
  • Better fits remote/federated compute use cases, including ATLAS-like collaborations.
  • Keeps the existing CUBE -> pfcon execution model intact: "here is the input, run this plugin."
  • Lets us use the same mechanism for either import/copy plugins or actual analysis plugins.

Important implication

If the first plugin is effectively a copy/import plugin (for example, something in the spirit of pl-rclone-copy-template), then this becomes a feed-scoped ingestion path through normal plugin execution.

If the first plugin is an analysis plugin, then the same mechanism becomes compute-near-data, with only the results being copied back and registered.

So feed-scoped ingestion is one important consequence of this design, but it is not the only purpose and should not be the primary framing.

Clarification on scope

  • This is broader than "a new upload path."
  • It is a way to let pfcon execute the first plugin directly on source-local data.
  • Feed-scoped import is one special case of that model.

Relationship to timeout work

  • Timeout/reliability can still be addressed first with async pre-copy/init staging.
  • This source-local execution path can then build on the same staging mechanism and source abstractions.

How this complements the high-fanout pfcon issue

This proposal should be viewed as complementary to the high-fanout / timeout work in pfcon#155, not as a competing direction.

  • pfcon#155 is primarily about making handoff/staging robust when CUBE asks pfcon to operate on large inputs.
  • This discussion is about broadening what the initial input source can be, while still preserving the same overall execution contract.

In other words:

  • pfcon#155 asks: how does pfcon reliably get the data it needs?
  • this discussion asks: what counts as valid input locality for the first plugin, and how should that be represented?

The same pre-copy/init machinery and cache/source-adapter logic can support both.

How should CUBE communicate "compute on source-local data" to pfcon?

This is the part that needs to stay disciplined.

An out-of-band control signal such as a special env var or hidden runtime switch would be the wrong shape here. It would blur the data/control boundary and likely violate the Data-State DAG model by making important execution intent implicit rather than materialized.

Instead, the better model is:

  • represent the directive as data
  • materialize that directive as a file or manifest-like input artifact
  • have pfcon and the first plugin interpret that artifact explicitly

Concretely, that could mean a source descriptor file that says, in effect:

  • here is the input source
  • here is how it should be accessed
  • here is the first plugin to run against it
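As a rough sketch of what such a descriptor might look like when materialized as a file, the following serializes the three pieces of information above to JSON and reads them back. The class name, field names, and example values are all hypothetical; no schema has been agreed on in this thread.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class SourceDescriptor:
    """Hypothetical descriptor; field names are illustrative, not a pfcon schema."""
    source_uri: str                              # where the input lives
    access: dict = field(default_factory=dict)   # how it should be accessed
    first_plugin: str = ""                       # plugin to run against it


def write_descriptor(desc: SourceDescriptor) -> str:
    """Serialize the directive so it can be stored as an input artifact."""
    return json.dumps(asdict(desc), sort_keys=True, indent=2)


def read_descriptor(text: str) -> SourceDescriptor:
    """Parse the artifact back; unexpected or missing keys raise TypeError."""
    return SourceDescriptor(**json.loads(text))


example = SourceDescriptor(
    source_uri="posix:///data/study42",   # made-up path
    access={"mode": "read-only"},         # made-up access hint
    first_plugin="pl-example",            # made-up plugin name
)
```

Because the descriptor is an ordinary file, it can be checksummed, stored alongside other inputs, and inspected after the fact, which is exactly what an env var cannot offer.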

That keeps the system aligned with ChRIS principles:

  • execution intent is explicit and inspectable
  • provenance can capture the directive as an input artifact
  • the control plane stays closer to a data-state transition than to an imperative side-channel

So the question should not be "what env var tells pfcon to do local compute?" but rather:

  • what materialized input artifact tells pfcon how to resolve and stage source-local input for the first plugin?

Practical architecture shape

  • Keep CUBE -> pfcon manifest-driven control.
  • Implement source adapters in pfcon (cube://, posix://, s3://, etc., as permitted).
  • Allow the first plugin to operate on source-local data when possible.
  • Make cache behavior explicit (hit, miss, warm) to avoid repeated transfers.
  • Preserve idempotency and state transitions (copying -> ready -> execute).
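One possible shape for the adapter and cache pieces is sketched below: adapters are registered per URI scheme, and staging reports its cache status explicitly. The registry, the `stage` function, and the in-memory cache are assumptions made for illustration. Only a `posix` adapter is shown, and the "warm" (pre-populate) case is left out to keep the sketch short.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlparse

# Hypothetical adapter signature: take a source URI, return a path to stage from.
Adapter = Callable[[str], str]

ADAPTERS: Dict[str, Adapter] = {}
_cache: Dict[str, str] = {}  # source URI -> staged path


def register(scheme: str):
    """Decorator that registers an adapter for one URI scheme."""
    def wrap(fn: Adapter) -> Adapter:
        ADAPTERS[scheme] = fn
        return fn
    return wrap


@register("posix")
def posix_adapter(uri: str) -> str:
    # A posix:// source is assumed to be on a mounted filesystem already.
    return urlparse(uri).path


def stage(uri: str) -> Tuple[str, str]:
    """Return (cache_status, path), where cache_status is 'hit' or 'miss'."""
    if uri in _cache:
        return "hit", _cache[uri]
    scheme = urlparse(uri).scheme
    if scheme not in ADAPTERS:
        raise ValueError(f"no adapter registered for scheme {scheme!r}")
    path = ADAPTERS[scheme](uri)
    _cache[uri] = path
    return "miss", path
```

Calling `stage` twice with the same URI yields a miss followed by a hit, so repeated requests are idempotent and the second one does not trigger another transfer.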

Container/runtime question (how does pfcon access non-container files?)

Possible patterns:

  • bind mounts from host paths
  • mounted network filesystems (NFS/SMB/CephFS/CSI)
  • object storage/API pulls (S3-compatible, etc.)
  • optional data-access broker service for policy/credential mediation

The right choice depends on deployment and tenant boundaries.

Guardrails to define explicitly

Even if current ingestion policy is permissive, direct source access still changes the execution boundary and should be explicit:

  • source allowlists/policies
  • scoped credentials (short-lived where possible)
  • provenance capture (source URI, version/checksum, retrieval time)
  • idempotent import/execution behavior
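To illustrate two of these guardrails, here is a minimal sketch of an allowlist check and a provenance record. The policy format, function names, and record fields are hypothetical; a real deployment would load policy from configuration and would likely record more (for example a source version identifier).

```python
import hashlib
import posixpath
from datetime import datetime, timezone
from urllib.parse import urlparse

# Hypothetical policy: permitted (scheme, path-prefix) pairs.
ALLOWED_SOURCES = [("posix", "/data/shared/")]


def is_allowed(uri: str) -> bool:
    """Check a source URI against the allowlist after normalizing its path."""
    parts = urlparse(uri)
    # Normalize so that '..' segments cannot escape an allowed prefix.
    path = posixpath.normpath(parts.path)
    return any(
        parts.scheme == scheme and path.startswith(prefix)
        for scheme, prefix in ALLOWED_SOURCES
    )


def provenance_record(uri: str, payload: bytes) -> dict:
    """Capture source URI, content checksum, and retrieval time (UTC)."""
    return {
        "source_uri": uri,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
```

Checking the allowlist before any adapter touches the source keeps the policy decision separate from the access mechanism, and the checksum gives a basis for detecting whether a later run saw the same input.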

Suggested phased plan

  1. Land async staging (pre-copy/init) for immediate timeout reliability.
  2. Add source abstraction + cache-aware staging in pfcon.
  3. Formalize source-local first plugin execution as the primary model.
  4. Treat feed-scoped import as one documented workflow built on that model.
  5. Add push/event UX only if polling/status is insufficient.
