Skip to content

Resume classic remoting reader after failed handoff#3205

Open
He-Pin wants to merge 1 commit into
apache:mainfrom
He-Pin:fix-classic-remoting-read-handoff
Open

Resume classic remoting reader after failed handoff#3205
He-Pin wants to merge 1 commit into
apache:mainfrom
He-Pin:fix-classic-remoting-read-handoff

Conversation

@He-Pin

@He-Pin He-Pin commented Jun 27, 2026

Copy link
Copy Markdown
Member

Motivation

Classic remoting passive read handoff can leave the original outbound reader stuck in notReading if the read-only handoff endpoint fails.

PR validation also exposed an unrelated BindCanonicalAddressSpec failure in the tls-tcp cases, where the spec fell back to the reference.conf default keystore and truststore relative paths.

Modification

Track passive read handoff fallback endpoints and resume the original writable reader when the read-only endpoint terminates while the original endpoint and UID still match.

Carry fallback state across replacement read-only endpoints, clear stale fallback state on UID/address changes, handle ResumeReading in supervisor states, add logging, and add regression coverage. Also correct a stale classic remoting passive-connections config reference.

Also include the TLS test configuration fix from #3210, reusing ArterySpecSupport.tlsConfig in BindCanonicalAddressSpec.commonConfig so the existing PR validation can complete.

Result

Classic remoting can recover future inbound messages on the original association after a failed passive read handoff.

The TLS bind canonical address cases no longer fail with NoSuchFileException: keystore during PR validation.

Tests

  • rtk sbt "remote / Test / testOnly org.apache.pekko.remote.ReliableDeliverySupervisorSpec" / passed (4 tests)
  • rtk sbt "remote / Test / testOnly org.apache.pekko.remote.classic.RemotingSpec -- -z \"resume the outbound reader\"" / passed
  • rtk sbt "remote / Test / testOnly org.apache.pekko.remote.classic.RemotingSpec" / passed (22 tests)
  • JDK 17: rtk bash -lc "export JAVA_HOME=$(/usr/libexec/java_home -v 17); export PATH=\"$JAVA_HOME/bin:$PATH\"; java -version; sbt \"remote / Test / testOnly org.apache.pekko.remote.artery.BindCanonicalAddressSpec\"" / passed (15 tests)
  • rtk scalafmt --list --mode diff-ref=origin/main / passed (JVM Unsafe warnings only)
  • rtk bash -lc "git diff --check" / passed
  • rtk sbt sortImports / failed: Scalafix/scala-meta NoSuchMethodError in local tooling; unrelated generated change was reverted
  • qodercli stdout review / No must-fix findings
  • independent subAgent review / No must-fix findings

References

Fixes #3206
Refs #3210

@He-Pin He-Pin requested a review from Copilot June 27, 2026 17:20
@He-Pin He-Pin force-pushed the fix-classic-remoting-read-handoff branch from fc181bd to 704455e Compare June 27, 2026 17:22

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a classic remoting edge case where a passive read handoff can leave the original outbound reader stuck in notReading if the read-only handoff endpoint fails, preventing future inbound messages on the original association.

Changes:

  • Track per-read-only-endpoint “resumable reader” fallback state and resume the original writable reader when the read-only endpoint terminates (only if remote address + UID still match).
  • Add EndpointWriter.ResumeReading handling across ReliableDeliverySupervisor, EndpointWriter, and EndpointReader to re-enable reads after stop/handoff.
  • Add regression coverage for resume-reading behavior and correct a stale passive-connections config reference in tests/docs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
remote/src/main/scala/org/apache/pekko/remote/Remoting.scala Adds fallback tracking for passive read handoffs and resumes original reader on read-only endpoint termination when safe.
remote/src/main/scala/org/apache/pekko/remote/Endpoint.scala Introduces/handles ResumeReading across supervisor/writer/reader states, including gated/idle transitions.
remote/src/test/scala/org/apache/pekko/remote/ReliableDeliverySupervisorSpec.scala Adds unit-level regression coverage for ResumeReading in idle/gated scenarios.
remote/src/test/scala/org/apache/pekko/remote/classic/RemotingSpec.scala Adds an integration regression test covering failed passive handoff and successful subsequent delivery.
cluster/src/multi-jvm/scala/org/apache/pekko/cluster/RestartNodeSpec.scala Updates commented config key to the correct classic remoting passive-connections setting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread remote/src/main/scala/org/apache/pekko/remote/Remoting.scala Outdated
case s @ Send(_: SystemMessage, _, _, _) => tryBuffer(s.copy(seqOpt = Some(nextSeq())))
case s: Send => context.system.deadLetters ! s
case EndpointWriter.FlushAndStop => context.stop(self)
case EndpointWriter.ResumeReading =>

@He-Pin He-Pin Jun 27, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no such case, so it can never resume.

@He-Pin He-Pin force-pushed the fix-classic-remoting-read-handoff branch 2 times, most recently from a29905f to 3b57554 Compare June 27, 2026 17:57
@He-Pin He-Pin requested a review from pjfanning June 27, 2026 18:00
@He-Pin He-Pin added cluster bug Something isn't working labels Jun 27, 2026
@He-Pin He-Pin added this to the 2.0.0-M4 milestone Jun 27, 2026
He-Pin added a commit to He-Pin/incubator-pekko that referenced this pull request Jun 27, 2026
Motivation:
PR validation failed in BindCanonicalAddressSpec when tls-tcp used the reference.conf default keystore path.

Modification:
Reuse the existing ArterySpecSupport TLS test configuration for the bind canonical address spec so tls-tcp tests resolve the test keystore and truststore resources.

Result:
The TLS bind canonical address cases no longer fail with NoSuchFileException: keystore.

Tests:
- rtk bash -lc 'export JAVA_HOME=/Users/hepin/Library/Java/JavaVirtualMachines/azul-17.0.17/Contents/Home; export PATH="/Users/hepin/.sdkman/candidates/java/current/bin:/Users/hepin/.nvm/versions/node/v22.23.0/lib/node_modules/@openai/codex/node_modules/@openai/codex-darwin-arm64/vendor/aarch64-apple-darwin/codex-path:/Users/hepin/.nvm/versions/node/v22.23.0/lib/node_modules/@openai/codex/node_modules/@openai/codex-darwin-arm64/vendor/aarch64-apple-darwin/codex-path:/Users/hepin/.codex/tmp/arg0/codex-arg02IyHp2:/Applications/Wukong.app/Contents/Resources/resources/dingsync/realdoc.app/Contents/MacOS:/Users/hepin/.local/bin:/Users/hepin/.nvm/versions/node/v22.23.0/bin:/Users/hepin/.mimocode/bin:/Users/hepin/Library/Python/3.10/bin:/Users/hepin/.qoderwork/bin:/Users/hepin/.opencode/bin:/Users/hepin/.antigravity/antigravity/bin:/Users/hepin/bin:/Users/hepin/.sdkman/candidates/java/current/bin:/Users/hepin/Library/Java/JavaVirtualMachines/openjdk-25.0.2/Contents/Home/bin:/Applications/Wukong.app/Contents/Resources/resources/dingsync/realdoc.app/Contents/MacOS:/Users/hepin/.local/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/opt/pmk/env/global/bin:/Library/Apple/usr/bin:/Applications/Wireshark.app/Contents/MacOS:/Users/hepin/.tmo/bin:/Users/hepin/.local/bin:/Users/hepin/.cargo/bin:/Users/hepin/Library/Application Support/JetBrains/Toolbox/scripts:/Users/hepin/Library/Application Support/Coursier/bin:/Users/hepin/.local/bin"; java -version; sbt "remote / Test / testOnly org.apache.pekko.remote.artery.BindCanonicalAddressSpec"' / passed (15 tests)
- rtk scalafmt --mode diff-ref=origin/main --non-interactive / passed
- rtk scalafmt --list --mode diff-ref=origin/main --non-interactive / passed
- rtk bash -lc 'git diff --check' / passed

References:
Refs apache#3205
@He-Pin He-Pin force-pushed the fix-classic-remoting-read-handoff branch from 062e3ef to 3b57554 Compare June 27, 2026 19:04
He-Pin added a commit to He-Pin/incubator-pekko that referenced this pull request Jun 27, 2026
Motivation:
PR validation failed in BindCanonicalAddressSpec when tls-tcp used the reference.conf default keystore path.

Modification:
Reuse the existing ArterySpecSupport TLS test configuration for the bind canonical address spec so tls-tcp tests resolve the test keystore and truststore resources.

Result:
The TLS bind canonical address cases no longer fail with NoSuchFileException: keystore.

Tests:
- rtk bash -lc 'export JAVA_HOME=/Users/hepin/Library/Java/JavaVirtualMachines/azul-17.0.17/Contents/Home; export PATH="/Users/hepin/.sdkman/candidates/java/current/bin:/Users/hepin/.nvm/versions/node/v22.23.0/lib/node_modules/@openai/codex/node_modules/@openai/codex-darwin-arm64/vendor/aarch64-apple-darwin/codex-path:/Users/hepin/.nvm/versions/node/v22.23.0/lib/node_modules/@openai/codex/node_modules/@openai/codex-darwin-arm64/vendor/aarch64-apple-darwin/codex-path:/Users/hepin/.codex/tmp/arg0/codex-arg02IyHp2:/Applications/Wukong.app/Contents/Resources/resources/dingsync/realdoc.app/Contents/MacOS:/Users/hepin/.local/bin:/Users/hepin/.nvm/versions/node/v22.23.0/bin:/Users/hepin/.mimocode/bin:/Users/hepin/Library/Python/3.10/bin:/Users/hepin/.qoderwork/bin:/Users/hepin/.opencode/bin:/Users/hepin/.antigravity/antigravity/bin:/Users/hepin/bin:/Users/hepin/.sdkman/candidates/java/current/bin:/Users/hepin/Library/Java/JavaVirtualMachines/openjdk-25.0.2/Contents/Home/bin:/Applications/Wukong.app/Contents/Resources/resources/dingsync/realdoc.app/Contents/MacOS:/Users/hepin/.local/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/opt/pmk/env/global/bin:/Library/Apple/usr/bin:/Applications/Wireshark.app/Contents/MacOS:/Users/hepin/.tmo/bin:/Users/hepin/.local/bin:/Users/hepin/.cargo/bin:/Users/hepin/Library/Application Support/JetBrains/Toolbox/scripts:/Users/hepin/Library/Application Support/Coursier/bin:/Users/hepin/.local/bin"; java -version; sbt "remote / Test / testOnly org.apache.pekko.remote.artery.BindCanonicalAddressSpec"' / passed (15 tests)
- rtk scalafmt --mode diff-ref=origin/main --non-interactive / passed
- rtk scalafmt --list --mode diff-ref=origin/main --non-interactive / passed
- rtk bash -lc 'git diff --check' / passed

References:
Refs apache#3205
He-Pin added a commit that referenced this pull request Jun 27, 2026
Motivation:
PR validation failed in BindCanonicalAddressSpec when tls-tcp used the reference.conf default keystore path.

Modification:
Reuse the existing ArterySpecSupport TLS test configuration for the bind canonical address spec so tls-tcp tests resolve the test keystore and truststore resources.

Result:
The TLS bind canonical address cases no longer fail with NoSuchFileException: keystore.

Tests:
- rtk bash -lc 'export JAVA_HOME=/Users/hepin/Library/Java/JavaVirtualMachines/azul-17.0.17/Contents/Home; export PATH="/Users/hepin/.sdkman/candidates/java/current/bin:/Users/hepin/.nvm/versions/node/v22.23.0/lib/node_modules/@openai/codex/node_modules/@openai/codex-darwin-arm64/vendor/aarch64-apple-darwin/codex-path:/Users/hepin/.nvm/versions/node/v22.23.0/lib/node_modules/@openai/codex/node_modules/@openai/codex-darwin-arm64/vendor/aarch64-apple-darwin/codex-path:/Users/hepin/.codex/tmp/arg0/codex-arg02IyHp2:/Applications/Wukong.app/Contents/Resources/resources/dingsync/realdoc.app/Contents/MacOS:/Users/hepin/.local/bin:/Users/hepin/.nvm/versions/node/v22.23.0/bin:/Users/hepin/.mimocode/bin:/Users/hepin/Library/Python/3.10/bin:/Users/hepin/.qoderwork/bin:/Users/hepin/.opencode/bin:/Users/hepin/.antigravity/antigravity/bin:/Users/hepin/bin:/Users/hepin/.sdkman/candidates/java/current/bin:/Users/hepin/Library/Java/JavaVirtualMachines/openjdk-25.0.2/Contents/Home/bin:/Applications/Wukong.app/Contents/Resources/resources/dingsync/realdoc.app/Contents/MacOS:/Users/hepin/.local/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/opt/pmk/env/global/bin:/Library/Apple/usr/bin:/Applications/Wireshark.app/Contents/MacOS:/Users/hepin/.tmo/bin:/Users/hepin/.local/bin:/Users/hepin/.cargo/bin:/Users/hepin/Library/Application Support/JetBrains/Toolbox/scripts:/Users/hepin/Library/Application Support/Coursier/bin:/Users/hepin/.local/bin"; java -version; sbt "remote / Test / testOnly org.apache.pekko.remote.artery.BindCanonicalAddressSpec"' / passed (15 tests)
- rtk scalafmt --mode diff-ref=origin/main --non-interactive / passed
- rtk scalafmt --list --mode diff-ref=origin/main --non-interactive / passed
- rtk bash -lc 'git diff --check' / passed

References:
Refs #3205
Motivation:
Classic remoting passive read handoff can leave the original outbound reader stuck in notReading if the read-only handoff endpoint fails.

Modification:
Track passive read handoff fallback endpoints and resume the original writable reader when the read-only endpoint terminates while the original endpoint and UID still match. Carry fallback state across replacement read-only endpoints, clear stale fallback state on UID/address changes, handle ResumeReading in supervisor states, add logging, and add regression coverage. Also correct the stale classic remoting passive-connections config text.

Result:
Classic remoting can recover future inbound messages on the original association after a failed passive read handoff.

Tests:
- rtk sbt "remote / Test / testOnly org.apache.pekko.remote.ReliableDeliverySupervisorSpec" / passed (4 tests)
- rtk sbt "remote / Test / testOnly org.apache.pekko.remote.classic.RemotingSpec -- -z \"resume the outbound reader\"" / passed
- rtk sbt "remote / Test / testOnly org.apache.pekko.remote.classic.RemotingSpec" / passed (22 tests)
- rtk scalafmt --list --mode diff-ref=origin/main / passed (JVM Unsafe warnings only)
- rtk bash -lc 'git diff --check' / passed
- rtk sbt sortImports / failed: Scalafix/scala-meta NoSuchMethodError in local tooling; unrelated generated change was reverted
- qodercli stdout review / No must-fix findings
- independent subAgent review / No must-fix findings

References:
Fixes apache#3206
@He-Pin He-Pin force-pushed the fix-classic-remoting-read-handoff branch from e971975 to 76fbfda Compare June 27, 2026 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cluster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Classic remoting passive read handoff can leave outbound reader stopped

3 participants