
[SPARK-57003][SQL][SS] Widen stateful operator output and state schema nullability #56061

Open
HeartSaVioR wants to merge 2 commits into apache:master from HeartSaVioR:widen-stateful-op-nullability


@HeartSaVioR
Contributor

What changes were proposed in this pull request?

Introduce a three-component fix for stateful-operator nullability drift, gated by spark.sql.streaming.statefulOperator.alwaysNullableOutput.enabled (pinned per-query via the offset log):

  • (a) WidenStatefulOpNullability.widenStateSchema: every stateful physical exec widens its state key/value schema to fully nullable at construction.
  • (b) WidenStatefulOpNullability.widenOutputForStatefulOp: every stateful logical and physical operator widens its declared output to fully nullable.
  • (c) WidenStatefulOperatorAttributeNullability: an optimizer rule that widens AttributeReferences inside stateful ops' internal expressions and propagates upward through ancestor expressions.

With this fix, the state schema is "fully" nullable (top-level columns, nested columns, and collection types) regardless of the input schema, and the output schema of the stateful operator is "fully" nullable as well. The output schema change is necessary because, even if the input schema is non-nullable, state can produce null values, so the output can contain nulls.
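
To make "fully" nullable concrete, here is a minimal sketch of deep widening over a toy type tree. The `DType` hierarchy and `widen` are hypothetical stand-ins written for this illustration; they are not Spark's `DataType` classes or the PR's `WidenStatefulOpNullability` helper, whose actual code is not shown in this conversation.

```scala
object DeepWidenSketch {
  // Minimal stand-ins for a column type tree; these are NOT Spark's classes.
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  final case class ArrayT(element: DType, containsNull: Boolean) extends DType
  final case class MapT(key: DType, value: DType, valueContainsNull: Boolean) extends DType
  final case class Field(name: String, dataType: DType, nullable: Boolean)
  final case class StructT(fields: List[Field]) extends DType

  // Recursively set every nullability flag to true: struct fields,
  // array elements, and map values, at any nesting depth.
  def widen(dt: DType): DType = dt match {
    case ArrayT(e, _)  => ArrayT(widen(e), containsNull = true)
    case MapT(k, v, _) => MapT(widen(k), widen(v), valueContainsNull = true)
    case StructT(fs)   =>
      StructT(fs.map(f => f.copy(dataType = widen(f.dataType), nullable = true)))
    case leaf          => leaf
  }

  // A state key schema as it might be derived from a mostly non-nullable input.
  val narrow: DType = StructT(List(
    Field("id", IntT, nullable = false),
    Field("tags", ArrayT(StringT, containsNull = false), nullable = false),
    Field("info", StructT(List(Field("name", StringT, nullable = false))), nullable = true)))

  val wide: DType = widen(narrow)
}
```

Because the widened form does not depend on the flags of the input, two inputs that differ only in nullability map to the same widened schema, which is the property the fix relies on.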

Why are the changes needed?

This has been a long-standing issue between the streaming engine and the query optimizer.

A streaming query is by nature long-running, and in many cases it spans multiple Spark versions. The logical plan is also not always the same across batches (e.g. there are multiple stream sources and one of them has no new data at batch N). This exposes the streaming query to changes in analyzer and optimizer behavior.

The state schema of a stateful operator is mostly determined by the operator's input schema, and nullability is no exception. If the input schema has a nullable column, the state schema has a nullable column, and likewise for a non-nullable column.

One of the query optimizer's optimizations is to flip nullability, say from nullable to non-nullable, where appropriate. This can happen directly or indirectly, and the most problematic case is when the optimization is applied "selectively".

One easy example is the elimination of a Union branch: for a streaming query that combines multiple streams with Union, batch N could have one stream non-empty while another is empty. In that case, PropagateEmptyRelation can drop empty Union branches, causing a per-column nullability flip that propagates into a stateful operator's state schema across microbatches or restarts. This causes either STATE_STORE_KEY_SCHEMA_NOT_COMPATIBLE on restart or a codegen NPE when state-restored rows carry nulls in columns declared non-nullable.
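
The Union case can be sketched with a toy model. `Column`, `Schema`, and `unionOutput` below are hypothetical names for illustration only, and the rule that a Union column is nullable when any surviving branch is nullable is an assumption of this sketch rather than a quote of Spark's code.

```scala
object UnionNullabilitySketch {
  final case class Column(name: String, nullable: Boolean)
  type Schema = List[Column]

  // Assumed rule: a Union output column is nullable if it is nullable in
  // any branch that is still present in the plan.
  def unionOutput(branches: List[Schema]): Schema =
    branches.transpose.map(cols => Column(cols.head.name, cols.exists(_.nullable)))

  val streamA: Schema = List(Column("key", nullable = false))
  val streamB: Schema = List(Column("key", nullable = true))

  // Batch with data on both streams: both branches survive optimization.
  val bothBranches: Schema = unionOutput(List(streamA, streamB))
  // Batch where streamB is empty and its branch has been dropped.
  val onlyStreamA: Schema = unionOutput(List(streamA))
}
```

The same query therefore presents a nullable `key` in one batch and a non-nullable `key` in another, and a state schema derived from that input drifts with it.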

Does this PR introduce any user-facing change?

No user-visible behavior change for new queries (all stateful operator outputs become nullable, which is semantically correct). Existing queries keep their original behavior via the offset log gate.

How was this patch tested?

New StreamingStatefulOperatorNullabilityDriftSuite covering:

  • New-query path: Union-branch-drop restart scenarios for aggregate,
    dropDuplicates, dropDuplicatesWithinWatermark.
  • Codegen NPE regression with struct grouping keys.
  • Existing-query path: widening forced off still triggers schema mismatch.
  • Rule-level: scope check (non-stateful subtrees skipped).
  • Helper-level: deepWidenAttribute recursion into nested types.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude 4.7 Opus

…o `LogicalPlan`

### What changes were proposed in this pull request?

Introduce two new methods on `LogicalPlan`:

- `def isStateful: Boolean = false` -- per-operator declaration of whether the node
  is a streaming stateful operator (kept across microbatches).
- `def containsStatefulOperator: Boolean` -- subtree-level check, memoized.

Override `isStateful` on the operators that are streaming stateful:
`Aggregate`, `Join` (stream-stream), `GlobalLimit`, `Distinct`,
`Deduplicate`, `DeduplicateWithinWatermark`, `FlatMapGroupsWithState`,
`FlatMapGroupsInPandasWithState`, `TransformWithState`,
`TransformWithStateInPySpark`.
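
A minimal sketch of how the two methods relate, on a toy plan tree. `Plan`, `Source`, `Project`, and `StreamingAggregate` are hypothetical stand-ins, not the real `LogicalPlan` classes, and the real overrides may apply extra conditions (for example, only when the plan is streaming) that this sketch leaves out.

```scala
object StatefulPlanSketch {
  // Toy plan node; a stand-in for LogicalPlan, not the real class.
  sealed trait Plan {
    def children: List[Plan]
    // Per-operator declaration; stateful operators override this.
    def isStateful: Boolean = false
    // Subtree-level check, computed once per node and then cached.
    lazy val containsStatefulOperator: Boolean =
      isStateful || children.exists(_.containsStatefulOperator)
  }
  final case class Source(name: String) extends Plan {
    def children: List[Plan] = Nil
  }
  final case class Project(child: Plan) extends Plan {
    def children: List[Plan] = List(child)
  }
  final case class StreamingAggregate(child: Plan) extends Plan {
    def children: List[Plan] = List(child)
    override def isStateful: Boolean = true
  }

  val statefulQuery: Plan = Project(StreamingAggregate(Source("events")))
  val statelessQuery: Plan = Project(Source("events"))
}
```

Here `statefulQuery.isStateful` is false (the root is a plain projection) while `statefulQuery.containsStatefulOperator` is true, which is the distinction a rule needs in order to skip stateless subtrees.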

### Why are the changes needed?

Several upcoming streaming-side rules (e.g. an optimizer rule that widens
`AttributeReference` nullability around stateful operators) need an
`isStateful` / `containsStatefulOperator` notion on `LogicalPlan` itself
rather than having each rule re-derive the stateful-operator check via
pattern matching.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing `UnsupportedOperationCheckerSuite` and streaming test suites
cover the behavior preservation. No new tests are added in this commit;
subsequent PRs that build on `isStateful` will add targeted tests.

### Was this patch authored or co-authored using generative AI tooling?

Yes.
…lability

### What changes were proposed in this pull request?

Introduce a three-component fix for stateful-operator nullability drift,
gated by `spark.sql.streaming.statefulOperator.alwaysNullableOutput.enabled`
(pinned per-query via the offset log):

- (a) `WidenStatefulOpNullability.widenStateSchema`: every stateful physical
  exec widens its state key/value schema to fully nullable at construction.
- (b) `WidenStatefulOpNullability.widenOutputForStatefulOp`: every stateful
  logical and physical operator widens its declared `output` to fully nullable.
- (c) `WidenStatefulOperatorAttributeNullability`: an optimizer rule that
  widens `AttributeReference`s inside stateful ops' internal expressions and
  propagates upward through ancestor expressions.

### Why are the changes needed?

`PropagateEmptyRelation` can drop empty `Union` branches, causing a
per-column nullability flip that propagates into a stateful operator's
state schema across microbatches or restarts. This causes either
`STATE_STORE_KEY_SCHEMA_NOT_COMPATIBLE` on restart or a codegen NPE
when state-restored rows carry nulls in columns declared non-nullable.

### Does this PR introduce _any_ user-facing change?

No user-visible behavior change for new queries (all stateful operator
outputs become nullable, which is semantically correct). Existing queries
keep their original behavior via the offset log gate.

### How was this patch tested?

New `StreamingStatefulOperatorNullabilityDriftSuite` covering:
- New-query path: Union-branch-drop restart scenarios for aggregate,
  dropDuplicates, dropDuplicatesWithinWatermark.
- Codegen NPE regression with struct grouping keys.
- Existing-query path: widening forced off still triggers schema mismatch.
- Rule-level: scope check (non-stateful subtrees skipped).
- Helper-level: `deepWidenAttribute` recursion into nested types.

### Was this patch authored or co-authored using generative AI tooling?

Yes.
@HeartSaVioR
Contributor Author

cc. @cloud-fan Please take a look, thanks!

Contributor

@cloud-fan left a comment


Summary

Prior state and problem. A stateful operator's state schema is built from its input attributes, and historically the schema (including nullability) gets recorded on the first batch and re-validated against the input schema on every subsequent batch. IncrementalExecution.optimizedPlan is recomputed every microbatch — same analyzed plan, but the optimizer runs fresh each batch and observes per-batch data state. Rules like PropagateEmptyRelation collapse one branch of a Union when that branch is empty in some microbatch: the surviving branch's nullability becomes the Union's output nullability, propagates into the stateful operator above, and the stateful operator's own child.output-derived output flips with it. Two downstream consequences from that drift: (1) state schema drift triggers STATE_STORE_KEY_SCHEMA_NOT_COMPATIBLE on restart, since the existing equalsIgnoreNameAndCompatibleNullability check rejects nullable→non-nullable narrowing; (2) operators above the stateful op see "non-nullable" output for one batch, codegen skips null checks, then state-restored rows from a prior nullable batch carry actual nulls (NPE).
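
The narrowing rule in (1), and why widening both sides sidesteps it, can be sketched as follows. `compatible` and `widen` are hypothetical helpers over a flat toy schema; they model only the nullability part of the check, and which of the stored and new schemas plays `from` or `to` in the real checker is not modeled here.

```scala
object NullabilityCheckSketch {
  final case class Column(name: String, nullable: Boolean)
  type Schema = List[Column]

  // Writing `from` data into a `to` column is rejected only when `from` may
  // hold nulls and `to` is declared non-nullable.
  def compatible(from: Schema, to: Schema): Boolean =
    from.length == to.length &&
      from.zip(to).forall { case (f, t) => t.nullable || !f.nullable }

  // Flat widening: mark every column nullable.
  def widen(schema: Schema): Schema = schema.map(_.copy(nullable = true))

  val nonNullableKey: Schema = List(Column("key", nullable = false))
  val nullableKey: Schema = List(Column("key", nullable = true))
}
```

With these definitions `compatible(nullableKey, nonNullableKey)` is false, while the same pair passes in either order once both sides go through `widen`.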

Design approach. Three independent components, all gated on spark.sql.streaming.statefulOperator.alwaysNullableOutput.enabled (default true, pinned per-query via OffsetSeq at batch 0 so existing queries keep their old behavior):

  • (a) Stateful physical execs widen the state key/value schemas they register via validateAndMaybeEvolveStateSchema and pass to mapPartitionsWith*StateStore to fully nullable (outer + nested asNullable). This stabilizes the on-disk schema.
  • (b) Stateful logical operators (Aggregate, Join, Distinct, Deduplicate, DeduplicateWithinWatermark, GlobalLimit, FlatMapGroupsWithState, TransformWithState, etc.) and their physical execs widen their declared output to fully nullable. Operators above see nullable inputs even if the optimizer would have inferred non-nullable. isStateful / containsStatefulOperator on LogicalPlan (from the previous commit) provide the gating mechanism.
  • (c) New optimizer rule WidenStatefulOperatorAttributeNullability runs after UpdateAttributeNullability in both the main optimizer (via IncrementalExecution.optimizedPlan) and AQE. It bottom-up walks subtrees containing a stateful operator and deep-widens AttributeReferences whose exprId matches p.output ++ p.children.flatMap(_.output). This catches references that the per-op output override on its own would not (e.g. nested-struct nullability inside expression bodies, references in ancestor Project / Filter).

Key design decisions.

  • Conf pinned via OffsetSeq at batch 0 (new entry in OffsetSeqMetadata.relevantSQLConfs + relevantSQLConfDefaultValues with "false" default for pre-existing queries). Restart-safe migration.
  • isStateful lives on LogicalPlan itself rather than a marker trait, so the rule can use a uniform containsStatefulOperator check without re-deriving statefulness via pattern matching against the union of stateful types. Tradeoff: tiny generic-API bloat for a streaming-only concept on LogicalPlan. Defensible.
  • The state schema compatibility check (StateSchemaCompatibilityChecker) is unchanged — the design relies on both stored and new schemas being widened so the existing strict-nullability check trivially passes. (See the inline comment on the SQLConf doc string.)
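
The pinning decision in the first bullet can be sketched as a small lookup. `effectiveValue` and its parameters are hypothetical names for illustration; the real logic lives in `OffsetSeqMetadata` and is not reproduced here. The sketch assumes a new query records the session value at batch 0, and an existing offset log without the key falls back to "false".

```scala
object PinnedConfSketch {
  val Key = "spark.sql.streaming.statefulOperator.alwaysNullableOutput.enabled"
  // Value assumed when an existing offset log has no entry for the key.
  val DefaultForExistingQueries = "false"

  // loggedConfs: confs recorded in the offset log at batch 0,
  // or None for a brand-new query that has no offset log yet.
  def effectiveValue(
      loggedConfs: Option[Map[String, String]],
      sessionValue: String): String =
    loggedConfs match {
      // New query: the session value applies and would be recorded.
      case None => sessionValue
      // Restarted query: the logged value wins; a missing key means the
      // query predates the conf, so the old behavior is kept.
      case Some(confs) => confs.getOrElse(Key, DefaultForExistingQueries)
    }
}
```

Under these assumptions a query started before the change keeps un-widened schemas after an upgrade, even though the session default is true.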

Implementation sketch. New file WidenStatefulOperatorAttributeNullability.scala (catalyst/analysis) holds both the helper object WidenStatefulOpNullability (deep-widen + state-schema widen + output widen) and the rule. Stateful logical operators in basicLogicalOperators.scala, object.scala, pythonLogicalOperators.scala get isStateful + output-widening overrides. Stateful execs in statefulOperators.scala, streamingLimits.scala, StreamingSymmetricHashJoinExec.scala, TransformWithStateExec.scala, TransformWithStateInPySparkExec.scala, FlatMapGroupsWithStateExec.scala, FlatMapGroupsInPandasWithStateExec.scala get output overrides; the agg / dedup / join execs also widen the state schemas they register and open. IncrementalExecution.optimizedPlan and AQEOptimizer get the new rule batch. New regression suite StreamingStatefulOperatorNullabilityDriftSuite covers the Union-branch-drop restart for aggregate / dedup / dedup-within-watermark, plus the codegen-NPE struct-key case and rule-level scope / recursion checks.

General notes

  • Test coverage gaps. The new drift suite covers Aggregate, Deduplicate, DeduplicateWithinWatermark. Missing union-branch-drop restart cases for: stream-stream Join, FlatMapGroupsWithState, TransformWithState. The last two are especially worth adding given the component-(a) gap noted inline on the helper Scaladoc — their grouping-key state schemas can still drift.
  • Import ordering is off in six files. The new import org.apache.spark.sql.catalyst.analysis.WidenStatefulOpNullability is placed after org.apache.spark.sql.types.* / org.apache.spark.sql.streaming.* in TransformWithStateExec.scala, TransformWithStateInPySparkExec.scala, FlatMapGroupsWithStateExec.scala, FlatMapGroupsInPandasWithStateExec.scala, StreamingSymmetricHashJoinExec.scala, streamingLimits.scala. Spark convention is alphabetical within org.apache.spark.*, so it should go among the other org.apache.spark.sql.catalyst.* imports.
  • [SPARK-XXXXX] placeholder. Both commits in this branch still have [SPARK-XXXXX] in the subject; the PR description references SPARK-57003. Needs the JIRA ID before merge.

This is a substantive fix for a long-standing streaming / optimizer interaction issue, and the three-component design is sound. Inline comments below cover the more substantive items.

Comment on lines +32 to +33
* - (a) `widenStateSchema`: explicit `asNullable` at every state-schema construction
* site in each stateful physical exec.

Component (a) is described as applying "at every state-schema construction site in each stateful physical exec," but several execs are missing the explicit widening:

  • FlatMapGroupsWithStateExec.validateAndMaybeEvolveStateSchema (FlatMapGroupsWithStateExec.scala ~L203): groupingAttributes.toStructType is registered un-widened; and the two StateStore.get / mapPartitionsWithStateStore calls in doExecute (~L247-263) open state stores with the un-widened key schema.
  • FlatMapGroupsInPandasWithStateExec inherits the same base, so it has the same gap.
  • TransformWithStateExec: getColFamilySchemas's defaultSchema (~L143-145), validateAndMaybeEvolveStateSchema (via validateAndWriteStateSchema at ~L380), and the StateStore.get / mapPartitionsWithStateStore calls (~L406-417, ~L428-435) all use keyExpressions.toStructType / keyEncoder.schema un-widened.
  • TransformWithStateInPySparkExec: same pattern.

Grouping attributes are input-derived and subject to the same nullability drift the rest of the fix is preventing. Component (c) may incidentally widen the references via the logical-plan rewrite, but having component (a) skip these execs makes the defense-in-depth claim of the design false and leaves a real gap if (c) misses for any reason (rule excluded, unresolved subplan, etc.). Either add widenStateSchema(...) at these sites for consistency with StateStoreSaveExec / BaseStreamingDeduplicateExec / StreamingSymmetricHashJoinExec, or tighten the wording here to describe which execs are intentionally exempt and why.

Comment on lines +85 to +91
* 1. At a stateful operator: rewrite every `AttributeReference` inside the operator's
* internal expressions via [[WidenStatefulOpNullability#deepWidenAttribute]] whenever
* the attribute's `exprId` matches one in the operator's own (already widened via
* component (b)) `output`.
*
* 2. At non-stateful ancestor operators: rewrite `AttributeReference`s whose `exprId` is
* in `children.flatMap(_.output)` (already widened thanks to component (b)).

These two bullets describe a split — (1) "at a stateful operator" matches against "the operator's own ... output", (2) "at non-stateful ancestor operators" matches against children.flatMap(_.output). But the implementation below uses the same union (p.output ++ p.children.flatMap(_.output)) for every node it visits, with no branch on isStateful. Either rewrite this section to describe the actual uniform behavior, or change the code to take different exprId sources for the two cases (the more conservative version would also help with the over-widening concern in the next comment).

case p: LeafNode => p
case p if !p.containsStatefulOperator => p
case p =>
  val widenableExprIds: Set[ExprId] = (p.output ++ p.children.flatMap(_.output))

Because widenableExprIds always pulls from both p.output and all of p.children.flatMap(_.output), when an operator has a mix of stateful and non-stateful children (e.g. a non-stream-stream Join above a streaming aggregate on one side and a batch source on the other), references to the non-stateful sibling's attributes are also deep-widened. The docstring above implies this happens only against attributes "already widened thanks to component (b)" — but the non-stateful sibling's attributes are not. The widening is always correctness-safe (nullable is a valid weakening), so this is a docs / over-widening concern, not a bug. Worth either restricting to children whose subtrees contain a stateful operator (p.children.filter(_.containsStatefulOperator).flatMap(_.output)), or acknowledging the over-widening in the comment so future readers don't expect the narrower behavior the docstring promises.
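
A toy comparison of the two id sets for the mixed-children Join described above. `Node`, `uniformIds`, and `restrictedChildIds` are hypothetical names using plain `Long` ids instead of `ExprId`. The sketch covers only the child-derived part of the restricted set; whether `p.output` should still be included is left open here.

```scala
object WidenableIdsSketch {
  // Toy plan node keyed by numeric expression ids; not Spark's LogicalPlan.
  final case class Node(
      output: Set[Long],
      children: List[Node] = Nil,
      isStateful: Boolean = false) {
    lazy val containsStatefulOperator: Boolean =
      isStateful || children.exists(_.containsStatefulOperator)
  }

  // Uniform behavior: own output plus every child's output.
  def uniformIds(p: Node): Set[Long] =
    p.output ++ p.children.flatMap(_.output)

  // Narrower alternative: only children whose subtree has a stateful operator.
  def restrictedChildIds(p: Node): Set[Long] =
    p.children.filter(_.containsStatefulOperator).flatMap(_.output).toSet

  // Streaming aggregate (id 10) on one side, batch source (id 20) on the other.
  val streamingAgg = Node(Set(10L), List(Node(Set(1L))), isStateful = true)
  val batchSource = Node(Set(20L))
  val join = Node(Set(10L, 20L), List(streamingAgg, batchSource))
}
```

For `join`, the uniform set contains the batch side's id 20 while the restricted child set contains only id 10, which is the over-widening in question.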

.doc("When true, every streaming stateful operator reports its output schema with " +
"nullable=true on all columns (including nested struct fields, array elements, and " +
"map values), the state schema is widened at every construction site, and the state " +
"schema compatibility checker ignores nullability for stateful operator schemas. " +

This claim is not accurate: StateSchemaCompatibilityChecker is not changed by this PR and does not ignore nullability — it still calls DataType.equalsIgnoreNameAndCompatibleNullability which rejects nullable→non-nullable narrowing. What actually happens is that this conf causes the schemas passed to the checker (both stored and new) to be widened beforehand, so the existing strict check trivially passes regardless of input nullability. Suggested rewording:

Suggested change
- "schema compatibility checker ignores nullability for stateful operator schemas. " +
+ "schema is widened at every construction site, so the existing state schema " +
+ "compatibility check trivially passes regardless of input nullability. " +

Comment on lines +448 to 453
class StateSchemaCompatibilityCheckerWithNullabilityWideningDisabledSuite
extends StateSchemaCompatibilityCheckerTestMixin {

private def applyNewSchemaToNestedFieldInValue(newNestedSchema: StructType): StructType = {
applyNewSchemaToNestedField(valueSchema, newNestedSchema, "value3")
override protected def sparkConf: org.apache.spark.SparkConf = {
super.sparkConf.set(SQLConf.STATEFUL_OPERATOR_ALWAYS_NULLABLE_OUTPUT.key, "false")
}

The only thing this suite changes vs the parent is setting STATEFUL_OPERATOR_ALWAYS_NULLABLE_OUTPUT=false. But StateSchemaCompatibilityChecker.validateAndMaybeEvolveStateSchema doesn't read that conf: the schemas it receives are exactly what the test passes in, and no production widening helper is invoked from these tests. The four "storing nullable column into non-nullable column ..." tests and the two "changing the name of nested field ..." tests therefore pass identically with the conf at either value. The conf separation is cosmetic.

Two follow-ups:

  1. The "changing the name of nested field ..." pair is unrelated to nullability, so there's no reason for these tests to live in a "NullabilityWideningDisabled" suite. Move them back to the main suite.
  2. For the four nullability tests, the intent appears to be "these are no longer reachable in production with widening on," but the unit tests don't simulate that path — they exercise the checker in isolation. Either consolidate them back into the main suite (they still validate the unchanged checker behavior, which is worth keeping), or rework the setup so the conf actually has an observable effect (e.g. call through the production widening helpers in the test).

import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LocalRelation, Project}
import org.apache.spark.sql.types.IntegerType

withSQLConf(SQLConf.STATEFUL_OPERATOR_ALWAYS_NULLABLE_OUTPUT.key -> "true") {

STATEFUL_OPERATOR_ALWAYS_NULLABLE_OUTPUT.key -> "true" is redundant — true is the default. Either drop the withSQLConf wrapper, or change to "false" and assert the rule no-ops (which would be a useful additional case).

import org.apache.spark.sql.execution.streaming.state._
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.catalyst.analysis.WidenStatefulOpNullability

Import out of order — org.apache.spark.sql.catalyst.analysis.WidenStatefulOpNullability should sit alongside the other org.apache.spark.sql.catalyst.* imports near the top of the org.apache.spark.* block, not after org.apache.spark.sql.streaming._. Same issue in TransformWithStateInPySparkExec.scala, FlatMapGroupsWithStateExec.scala, FlatMapGroupsInPandasWithStateExec.scala, StreamingSymmetricHashJoinExec.scala, and streamingLimits.scala.
