[SPARK-57027][SQL] SortMergeJoinExec: skip statically-dead branches in codegen#56075

Draft
gengliangwang wants to merge 1 commit into apache:master from gengliangwang:SPARK-57027-smj-dead-branches
@gengliangwang
Member

What changes were proposed in this pull request?

This is a sub-task of SPARK-56908.

This PR removes two statically dead patterns from the Java code that SortMergeJoinExec generates:

  1. genComparison emits

    comp = 0;
    if (comp == 0) { comp = compare(k1); }
    if (comp == 0) { comp = compare(k2); }
    

    The first if (comp == 0) guard is always true because comp was just assigned 0. This PR emits comp = compare(k1); directly and wraps only the subsequent keys in the guard. genComparison is called five times per SMJ stage (twice in genScanner, three times in codegenFullOuter). For single-key joins, the common case, each call collapses to one line.

  2. genScanner and codegenFullOuter emit if (k1IsNull || k2IsNull || ...) { handler }. When every key ExprValue has isNull == FalseLiteral, the disjunction is statically false and the whole block (including its handleStreamedAnyNull / "join with null row" handler) is dead. This PR detects that case and omits the block. It applies, for example, to fact/dimension joins on numeric keys whose non-nullability Spark has already proved.
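
As a rough illustration of the two changes above, here is a minimal Java sketch of a string-based emitter. It is not the Spark implementation: the class SmjCodegenSketch, the method genAnyNullCheck, the handler string, and the use of the literal "false" as a stand-in for FalseLiteral are all hypothetical. Only the name genComparison and the shapes of the emitted code come from the description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not Spark's actual implementation.
final class SmjCodegenSketch {

    // Pattern 1: emit the first key comparison unguarded, because an
    // `if (comp == 0)` guard right after `comp = 0;` is always true.
    // Each element of compareExprs is a Java expression such as "compare(k1)".
    static String genComparison(List<String> compareExprs) {
        if (compareExprs.isEmpty()) {
            return "comp = 0;\n"; // no keys: nothing to compare
        }
        StringBuilder code = new StringBuilder();
        code.append("comp = ").append(compareExprs.get(0)).append(";\n");
        for (String expr : compareExprs.subList(1, compareExprs.size())) {
            code.append("if (comp == 0) { comp = ").append(expr).append("; }\n");
        }
        return code.toString();
    }

    // Pattern 2: isNullExprs holds one null-flag expression per key, with the
    // literal "false" standing in for FalseLiteral. If every flag is "false",
    // the disjunction can never hold, so no block is emitted at all.
    static String genAnyNullCheck(List<String> isNullExprs, String handler) {
        boolean allNonNull = isNullExprs.stream().allMatch("false"::equals);
        if (allNonNull) {
            return "";
        }
        return "if (" + String.join(" || ", isNullExprs) + ") { " + handler + " }\n";
    }

    public static void main(String[] args) {
        System.out.print(genComparison(Arrays.asList("compare(k1)", "compare(k2)")));
        System.out.print(genAnyNullCheck(Arrays.asList("k1IsNull", "false"), "handleNull();"));
        System.out.print(genAnyNullCheck(Arrays.asList("false", "false"), "handleNull();"));
    }
}
```

For a single key, genComparison in this sketch returns just comp = compare(k1);, and the last call in main prints nothing because both null flags are statically false. The sketch keeps the full disjunction whenever at least one key may be null, matching the all-or-nothing check described in item 2.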

Why are the changes needed?

The generated Java for each SMJ stage becomes smaller. The JIT already eliminates this dead code at runtime, so the win is not runtime speed but smaller generated source, more headroom under the 64KB method-size limit, and slightly faster Janino compilation.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing test suites cover both paths with whole-stage codegen on and off:

  • OuterJoinSuite (SMJ full-outer codegen + interpreted scanner).
  • InnerJoinSuite (SMJ codegen and non-codegen paths).
  • ExistenceJoinSuite (SMJ existence path).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

Co-authored-by: Isaac