Design Discussion: Source-Local First Plugin Execution

Thanks all. I want to open a separate design thread focused on federation/data-local execution behavior, distinct from the immediate timeout fix.

Related context:

- high-fanout / timeout discussion: `https://github.com/FNNDSC/pfcon/issues/155`
- implementation thread for this discussion: `https://github.com/FNNDSC/pfcon/issues/156`
- architecture thread for this discussion: `https://github.com/FNNDSC/CHRIS_docs/issues/48`

## Proposal: source-local first plugin execution

Today, CUBE typically asks `pfcon` to run a specific plugin against a specific input directory. I think we should preserve that model, but broaden what "input directory" can mean.

Instead of assuming the first plugin must run only on data already ingested into CUBE storage, allow:

1. CUBE to send `pfcon` a manifest-like request describing the input source.
2. `pfcon` to resolve/access that source from wherever the data actually lives, where permitted.
3. `pfcon` to run the first plugin directly against that source-local data.

This makes the primary concept:

- source-local first plugin execution

rather than:

- "upload first, then execute"

## Why this seems valuable

- Reduces avoidable ingress/re-copy overhead for large datasets.
- Better fits remote/federated compute use cases, including ATLAS-like collaborations.
- Keeps the existing CUBE -> `pfcon` execution model intact: "here is the input, run this plugin."
- Lets us use the same mechanism for either import/copy plugins or actual analysis plugins.

## Important implication

If the first plugin is effectively a copy/import plugin (for example, something in the spirit of `pl-rclone-copy-template`), then this becomes a feed-scoped ingestion path through normal plugin execution.

If the first plugin is an analysis plugin, then the same mechanism becomes compute-near-data, with only the results being copied back and registered.

So feed-scoped ingestion is one important consequence of this design, but it is not the only purpose and should not be the primary framing.

## Clarification on scope

- This is broader than "a new upload path."
- It is a way to let `pfcon` execute the first plugin directly on source-local data.
- Feed-scoped import is one special case of that model.

## Relationship to timeout work

- Timeout/reliability can still be addressed first with async pre-copy/init staging.
- This source-local execution path can then build on the same staging mechanism and source abstractions.

## How this complements the high-fanout pfcon issue

This proposal should be viewed as complementary to the high-fanout / timeout work in `pfcon#155`, not as a competing direction.

- `pfcon#155` is primarily about making handoff/staging robust when CUBE asks `pfcon` to operate on large inputs.
- This discussion is about broadening what the initial input source can be, while still preserving the same overall execution contract.

In other words:

- `pfcon#155` asks: how does `pfcon` reliably get the data it needs?
- This discussion asks: what counts as valid input locality for the first plugin, and how should that be represented?

The same pre-copy/init machinery and cache/source-adapter logic can support both.

## How should CUBE communicate "compute on source-local data" to pfcon?

This is the part of the design where we need to stay disciplined.

An out-of-band control signal such as a special env var or hidden runtime switch would be the wrong shape here. It would blur the data/control boundary and likely violate the Data-State DAG model by making important execution intent implicit rather than materialized.

Instead, the better model is:

- represent the directive as data
- materialize that directive as a file or manifest-like input artifact
- have `pfcon` and the first plugin interpret that artifact explicitly

Concretely, that could mean a source descriptor file that says, in effect:

- here is the input source
- here is how it should be accessed
- here is the first plugin to run against it

That keeps the system aligned with ChRIS principles:

- execution intent is explicit and inspectable
- provenance can capture the directive as an input artifact
- the control plane stays closer to a data-state transition than to an imperative side-channel

So the question should not be "what env var tells pfcon to do local compute?" but rather:

- what materialized input artifact tells `pfcon` how to resolve and stage source-local input for the first plugin?
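To make the "directive as data" idea concrete, here is a minimal sketch of what such a source descriptor might look like when serialized to a file. Everything in it is an assumption for illustration: the `SourceDescriptor` name, the three field names, the `access` values, and the choice of JSON are hypothetical and are not part of any existing `pfcon` or CUBE schema.

```python
import json
from dataclasses import dataclass, asdict
from urllib.parse import urlparse


# Hypothetical descriptor; field names are illustrative, not a pfcon/CUBE schema.
@dataclass(frozen=True)
class SourceDescriptor:
    source_uri: str    # where the input lives, e.g. "s3://bucket/prefix"
    access: str        # how it should be accessed (illustrative free-form label)
    first_plugin: str  # the first plugin to run against the source

    def validate(self) -> None:
        # Require an explicit scheme so the source kind is never implicit.
        if not urlparse(self.source_uri).scheme:
            raise ValueError(f"source_uri has no scheme: {self.source_uri!r}")
        if not self.first_plugin:
            raise ValueError("first_plugin must be non-empty")


def dump_descriptor(descriptor: SourceDescriptor) -> str:
    """Serialize the descriptor to JSON text suitable for writing to a file."""
    descriptor.validate()
    return json.dumps(asdict(descriptor), sort_keys=True, indent=2)


def load_descriptor(text: str) -> SourceDescriptor:
    """Parse and validate a descriptor previously written by dump_descriptor."""
    descriptor = SourceDescriptor(**json.loads(text))
    descriptor.validate()
    return descriptor
```

Because the descriptor round-trips through plain text, it can be stored alongside other inputs and captured by provenance like any other input artifact, which is the property the env-var approach would lose.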

## Practical architecture shape

- Keep CUBE -> `pfcon` manifest-driven control.
- Implement source adapters in `pfcon` (`cube://`, `posix://`, `s3://`, etc., as permitted).
- Allow the first plugin to operate on source-local data when possible.
- Make cache behavior explicit (`hit`, `miss`, `warm`) to avoid repeated transfers.
- Preserve idempotency and state transitions (`copying -> ready -> execute`).
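As a rough sketch of how the adapter, cache, and state-transition points above might fit together: the registry, the `resolve` and `advance` helpers, and the `posix` adapter body below are all hypothetical, not existing `pfcon` code. Only `hit` and `miss` are modeled; `warm` is left out because its exact meaning still needs to be defined in this thread.

```python
from enum import Enum
from typing import Callable, Dict, Tuple
from urllib.parse import urlparse


class StageState(Enum):
    COPYING = "copying"
    READY = "ready"
    EXECUTE = "execute"


# Allowed forward transitions, mirroring copying -> ready -> execute.
_NEXT = {
    StageState.COPYING: StageState.READY,
    StageState.READY: StageState.EXECUTE,
}


def advance(current: StageState, target: StageState) -> StageState:
    """Move to target; re-requesting the current state is a no-op (idempotent)."""
    if target is current:
        return current
    if _NEXT.get(current) is not target:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


# Hypothetical adapter registry keyed by URI scheme.
ADAPTERS: Dict[str, Callable[[str], str]] = {}


def adapter(scheme: str):
    """Decorator that registers a resolver function for one URI scheme."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        ADAPTERS[scheme] = fn
        return fn
    return register


@adapter("posix")
def resolve_posix(uri: str) -> str:
    # Source-local case: hand back the path as-is, with no copy.
    return urlparse(uri).path


def resolve(uri: str, cache: Dict[str, str]) -> Tuple[str, str]:
    """Return (local_path, cache_status) where cache_status is 'hit' or 'miss'."""
    if uri in cache:
        return cache[uri], "hit"
    scheme = urlparse(uri).scheme
    if scheme not in ADAPTERS:
        raise ValueError(f"no adapter registered for scheme {scheme!r}")
    cache[uri] = ADAPTERS[scheme](uri)
    return cache[uri], "miss"
```

Returning the cache status alongside the path keeps cache behavior visible to the caller instead of hiding it inside the adapter, and rejecting skipped transitions keeps retries safe.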

## Container/runtime question (how does `pfcon` access non-container files?)

Possible patterns:

- bind mounts from host paths
- mounted network filesystems (NFS/SMB/CephFS/CSI)
- object storage/API pulls (S3-compatible, etc.)
- optional data-access broker service for policy/credential mediation

The right choice depends on deployment and tenant boundaries.

## Guardrails to define explicitly

Even if current ingestion policy is permissive, direct source access still changes the execution boundary and should be explicit:

- source allowlists/policies
- scoped credentials (short-lived where possible)
- provenance capture (source URI, version/checksum, retrieval time)
- idempotent import/execution behavior
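A small sketch of how the allowlist and provenance guardrails above could be expressed. The `ALLOWLIST` entries, the function names, and the record fields are hypothetical placeholders; a real policy would come from deployment configuration, and credentials are deliberately out of scope here.

```python
import hashlib
import posixpath
from datetime import datetime, timezone
from urllib.parse import urlparse

# Hypothetical policy: (scheme, location prefix) pairs that sources must match.
ALLOWLIST = [("posix", "/data/shared/"), ("s3", "research-bucket/")]


def is_allowed(uri: str, allowlist=ALLOWLIST) -> bool:
    """Check a source URI against the allowlist after normalizing its path."""
    parsed = urlparse(uri)
    # Normalize so "../" segments cannot escape an allowed prefix.
    path = posixpath.normpath(parsed.path) if parsed.path else ""
    location = parsed.netloc + path
    return any(
        parsed.scheme == scheme and location.startswith(prefix)
        for scheme, prefix in allowlist
    )


def provenance_record(uri: str, content: bytes, retrieved_at=None) -> dict:
    """Build a provenance entry: source URI, checksum, and retrieval time."""
    if not is_allowed(uri):
        raise PermissionError(f"source not in allowlist: {uri}")
    when = retrieved_at or datetime.now(timezone.utc)
    return {
        "source_uri": uri,
        "sha256": hashlib.sha256(content).hexdigest(),
        "retrieved_at": when.isoformat(),
    }
```

Keying the record on the content checksum means a repeated import of unchanged data can be recognized as the same input, which is one way to support the idempotency requirement.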

## Suggested phased plan

1. Land async staging (pre-copy/init) for immediate timeout reliability.
2. Add source abstraction + cache-aware staging in `pfcon`.
3. Formalize source-local first plugin execution as the primary model.
4. Treat feed-scoped import as one documented workflow built on that model.
5. Add push/event UX only if polling/status is insufficient.

