Thanks all. I want to open a separate design thread focused on federation/data-local execution behavior, distinct from the immediate timeout fix.
Related context:
- high-fanout / timeout discussion: https://github.com/FNNDSC/pfcon/issues/155
- implementation thread for this discussion: https://github.com/FNNDSC/pfcon/issues/156
- architecture thread for this discussion: https://github.com/FNNDSC/CHRIS_docs/issues/48
Proposal: source-local first plugin execution
Today, CUBE typically asks pfcon to run a specific plugin against a specific input directory. I think we should preserve that model, but broaden what "input directory" can mean.
Instead of assuming the first plugin must run only on data already ingested into CUBE storage, allow:
- CUBE to send pfcon a manifest-like request describing the input source.
- pfcon to resolve/access that source from wherever the data actually lives, where permitted.
- pfcon to run the first plugin directly against that source-local data.
This makes the primary concept:
- source-local first plugin execution
rather than:
- "upload first, then execute"
Why this seems valuable
- Reduces avoidable ingress/re-copy overhead for large datasets.
- Better fits remote/federated compute use cases, including ATLAS-like collaborations.
- Keeps the existing CUBE -> pfcon execution model intact: "here is the input, run this plugin."
- Lets us use the same mechanism for either import/copy plugins or actual analysis plugins.
Important implication
If the first plugin is effectively a copy/import plugin (for example, something in the spirit of pl-rclone-copy-template), then this becomes a feed-scoped ingestion path through normal plugin execution.
If the first plugin is an analysis plugin, then the same mechanism becomes compute-near-data, with only the results being copied back and registered.
So feed-scoped ingestion is one important consequence of this design, but it is not the only purpose and should not be the primary framing.
Clarification on scope
- This is broader than "a new upload path."
- It is a way to let pfcon execute the first plugin directly on source-local data.
- Feed-scoped import is one special case of that model.
Relationship to timeout work
- Timeout/reliability can still be addressed first with async pre-copy/init staging.
- This source-local execution path can then build on the same staging mechanism and source abstractions.
How this complements the high-fanout pfcon issue
This proposal should be viewed as complementary to the high-fanout / timeout work in pfcon#155, not as a competing direction.
- pfcon#155 is primarily about making handoff/staging robust when CUBE asks pfcon to operate on large inputs.
- This discussion is about broadening what the initial input source can be, while still preserving the same overall execution contract.
In other words:
- pfcon#155 asks: how does pfcon reliably get the data it needs?
- this discussion asks: what counts as valid input locality for the first plugin, and how should that be represented?
The same pre-copy/init machinery and cache/source-adapter logic can support both.
How should CUBE communicate "compute on source-local data" to pfcon?
This is the part that needs to stay disciplined.
An out-of-band control signal such as a special env var or hidden runtime switch would be the wrong shape here. It would blur the data/control boundary and likely violate the Data-State DAG model by making important execution intent implicit rather than materialized.
Instead, the better model is:
- represent the directive as data
- materialize that directive as a file or manifest-like input artifact
- have pfcon and the first plugin interpret that artifact explicitly
Concretely, that could mean a source descriptor file that says, in effect:
- here is the input source
- here is how it should be accessed
- here is the first plugin to run against it
That keeps the system aligned with ChRIS principles:
- execution intent is explicit and inspectable
- provenance can capture the directive as an input artifact
- the control plane stays closer to a data-state transition than to an imperative side-channel
So the question should not be "what env var tells pfcon to do local compute?" but rather:
- what materialized input artifact tells pfcon how to resolve and stage source-local input for the first plugin?
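To make the "directive as data" idea concrete, here is a minimal sketch of what a source descriptor could look like if it were serialized as JSON. Everything in it is an assumption for illustration: the class name `SourceDescriptor`, the field names `source_uri`, `access`, and `first_plugin`, and the example values are hypothetical and are not part of any existing pfcon or CUBE schema.

```python
import json
from dataclasses import asdict, dataclass
from urllib.parse import urlparse

# Hypothetical field set; mirrors the three bullets above, nothing more.
REQUIRED_FIELDS = ("source_uri", "access", "first_plugin")


@dataclass(frozen=True)
class SourceDescriptor:
    """Illustrative descriptor; not an existing pfcon/CUBE type."""

    source_uri: str    # here is the input source
    access: str        # here is how it should be accessed (e.g. "mount", "pull")
    first_plugin: str  # here is the first plugin to run against it

    def scheme(self) -> str:
        """Return the URI scheme, which a source adapter could dispatch on."""
        return urlparse(self.source_uri).scheme

    def to_json(self) -> str:
        """Serialize deterministically so the artifact can be stored and hashed."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SourceDescriptor":
        """Parse a descriptor file, rejecting input that omits a required field."""
        data = json.loads(text)
        missing = [name for name in REQUIRED_FIELDS if name not in data]
        if missing:
            raise ValueError(f"descriptor is missing fields: {missing}")
        return cls(**{name: data[name] for name in REQUIRED_FIELDS})
```

Because the descriptor is an ordinary file, it could be recorded as an input artifact like any other, which is the property the env-var approach would lose.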
Practical architecture shape
- Keep CUBE -> pfcon manifest-driven control.
- Implement source adapters in pfcon (cube://, posix://, s3://, etc., as permitted).
- Allow the first plugin to operate on source-local data when possible.
- Make cache behavior explicit (hit, miss, warm) to avoid repeated transfers.
- Preserve idempotency and state transitions (copying -> ready -> execute).
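As a rough illustration of how these pieces might fit together, the sketch below dispatches on URI scheme to a source adapter, reports an explicit cache status, and only allows the copying -> ready -> execute progression. It is a single-process toy, not pfcon code: `Stager`, `local_path_adapter`, and the enum names are hypothetical, and the reading of "warm" as "staging is currently in progress" is my assumption, since the term is not defined above.

```python
import enum
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


class CacheStatus(enum.Enum):
    HIT = "hit"    # a staged copy (or local path) is already available
    MISS = "miss"  # nothing staged yet
    WARM = "warm"  # assumption: staging has started but not finished


class StageState(enum.Enum):
    COPYING = "copying"
    READY = "ready"
    EXECUTE = "execute"


def local_path_adapter(uri: str) -> str:
    """Hypothetical posix:// adapter: data is already local, so no copy is made."""
    return urlparse(uri).path


class Stager:
    """Toy in-memory stager keyed by source URI."""

    def __init__(self, adapters: Dict[str, Callable[[str], str]]) -> None:
        self._adapters = adapters
        self._states: Dict[str, StageState] = {}
        self._staged: Dict[str, str] = {}

    def state(self, uri: str) -> Optional[StageState]:
        return self._states.get(uri)

    def cache_status(self, uri: str) -> CacheStatus:
        state = self._states.get(uri)
        if state is None:
            return CacheStatus.MISS
        if state is StageState.COPYING:
            return CacheStatus.WARM
        return CacheStatus.HIT

    def stage(self, uri: str) -> str:
        """Idempotent: a repeated call for a staged URI does not invoke the adapter."""
        status = self.cache_status(uri)
        if status is CacheStatus.HIT:
            return self._staged[uri]
        if status is CacheStatus.WARM:
            raise RuntimeError(f"staging already in progress for {uri}")
        adapter = self._adapters.get(urlparse(uri).scheme)
        if adapter is None:
            raise ValueError(f"no source adapter registered for {uri}")
        self._states[uri] = StageState.COPYING
        try:
            self._staged[uri] = adapter(uri)
        except Exception:
            # Roll back so a failed transfer reads as a miss and can be retried.
            self._states.pop(uri, None)
            raise
        self._states[uri] = StageState.READY
        return self._staged[uri]

    def start_execution(self, uri: str) -> None:
        """Only the ready -> execute transition is permitted."""
        if self._states.get(uri) is not StageState.READY:
            raise RuntimeError(f"{uri} is not ready for execution")
        self._states[uri] = StageState.EXECUTE
```

With `Stager({"posix": local_path_adapter})`, calling `stage("posix:///data/study-001")` twice returns the same path and invokes the adapter once. A real implementation would need persistent state and locking, which this sketch deliberately leaves out.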
Container/runtime question (how does pfcon access non-container files?)
Possible patterns:
- bind mounts from host paths
- mounted network filesystems (NFS/SMB/CephFS/CSI)
- object storage/API pulls (S3-compatible, etc.)
- optional data-access broker service for policy/credential mediation
The right choice depends on deployment and tenant boundaries.
Guardrails to define explicitly
Even if current ingestion policy is permissive, direct source access still changes the execution boundary and should be explicit:
- source allowlists/policies
- scoped credentials (short-lived where possible)
- provenance capture (source URI, version/checksum, retrieval time)
- idempotent import/execution behavior
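Two of these guardrails are easy to sketch: an allowlist check on the source URI, and a provenance record carrying the source URI, a checksum, and the retrieval time. The code below is only an illustration under stated assumptions: `ALLOWED_PREFIXES`, `is_allowed`, `ProvenanceRecord`, and `record_provenance` are hypothetical names, the prefixes are made-up examples, and prefix matching is just one possible policy shape. Credential scoping is not shown.

```python
import hashlib
import posixpath
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence
from urllib.parse import urlparse

# Hypothetical policy: example prefixes only, not a real deployment's allowlist.
ALLOWED_PREFIXES = ("posix:///data/shared/", "s3://example-bucket/")


def is_allowed(uri: str, allowed: Sequence[str] = ALLOWED_PREFIXES) -> bool:
    """Check a URI against prefix allowlist entries after normalizing its path.

    Normalizing first means "posix:///data/shared/../secret" is evaluated as
    "posix:///data/secret" and is therefore rejected.
    """
    parts = urlparse(uri)
    path = posixpath.normpath(parts.path) if parts.path else ""
    normalized = f"{parts.scheme}://{parts.netloc}{path}"
    return any(
        normalized == prefix.rstrip("/") or normalized.startswith(prefix)
        for prefix in allowed
    )


@dataclass(frozen=True)
class ProvenanceRecord:
    source_uri: str    # where the data came from
    sha256: str        # checksum of the retrieved bytes
    retrieved_at: str  # ISO 8601 timestamp in UTC


def record_provenance(
    uri: str, content: bytes, now: Optional[datetime] = None
) -> ProvenanceRecord:
    """Refuse disallowed sources, otherwise describe what was retrieved."""
    if not is_allowed(uri):
        raise PermissionError(f"source not permitted by policy: {uri}")
    timestamp = now if now is not None else datetime.now(timezone.utc)
    return ProvenanceRecord(
        source_uri=uri,
        sha256=hashlib.sha256(content).hexdigest(),
        retrieved_at=timestamp.isoformat(),
    )
```

Keeping the policy check and the provenance record in the same code path means a source cannot be read without also being recorded, which seems like the behavior we want regardless of how permissive the policy is.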
Suggested phased plan
- Land async staging (pre-copy/init) for immediate timeout reliability.
- Add source abstraction + cache-aware staging in pfcon.
- Formalize source-local first plugin execution as the primary model.
- Treat feed-scoped import as one documented workflow built on that model.
- Add push/event UX only if polling/status is insufficient.