Add nested-type access and more SQL operators to data-generation inference #608
Open
wmoustafa wants to merge 1 commit into
Extends `coral-data-generation` so the symbolic-constraint solver from PR linkedin#564 covers a wider class of WHERE predicates: more SQL operators, struct and map/array element access, and a predicate-based inference entry point that resolves per-path domains from a DNF query. Also tightens two inference paths whose existing rewrites silently produced wrong results for the new cases.

## New operator coverage

Eight new `DomainTransformer` implementations are wired into `DomainInferenceProgram.withDefaultTransformers()`:

| Transformer | SQL operator |
| --- | --- |
| `AbsIntegerTransformer` | `ABS(x)` |
| `MinusIntegerTransformer` | binary `x - k` and `k - x` |
| `NegateIntegerTransformer` | unary `-x` |
| `UpperRegexTransformer` | `UPPER(x)` |
| `ConcatRegexTransformer` | `CONCAT(x, lit)` / `CONCAT(lit, x)` |
| `TrimRegexTransformer` | `TRIM(x)`; supports both Calcite's 3-operand standard form and Hive's 1-operand form |
| `FieldAccessTransformer` | struct field access (`s.name`) on nested expressions |
| `ItemTransformer` | `ITEM(coll, idx-or-key)` for array indexing and map lookup on nested expressions |

`ConcatRegexTransformer` matches both `SqlStdOperatorTable.CONCAT` (the SQL `||` operator) and the `OTHER_FUNCTION` named `concat` that Hive emits.

Existing transformers (`LowerRegexTransformer`, `PlusIntegerTransformer`, `TimesIntegerTransformer`, `SubstringRegexTransformer`) now accept `RexFieldAccess` as a valid variable operand, so expressions like `LOWER(s.name)`, `s.age + 5`, and `UPPER(sarr[0].name)` flow through. `SubstringRegexTransformer.canHandle` also gained an operand-arity check.

The transformer registration is grouped into string ops → integer ops → cross-domain → structural pass-throughs for readability.

## Nested-type access

New `AccessPath` value type identifies any value reachable from a root column index through a chain of struct fields (`FIELD`), map lookups (`MAP_KEY`), and array indices (`ARRAY_INDEX`).
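As a rough illustration of what such a value type could look like, here is a minimal sketch. The class name `AccessPathSketch`, its builder methods, and the `toString` format are all hypothetical and are not taken from the PR; only the three step kinds (`FIELD`, `MAP_KEY`, `ARRAY_INDEX`) and the root column index come from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: names and layout are illustrative, not the PR's actual AccessPath class.
final class AccessPathSketch {
  enum Kind { FIELD, MAP_KEY, ARRAY_INDEX }

  // One hop in the chain: a struct field name, a map key, or an array index.
  static final class Step {
    final Kind kind;
    final Object key;

    Step(Kind kind, Object key) {
      this.kind = kind;
      this.key = key;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Step)) {
        return false;
      }
      Step other = (Step) o;
      return kind == other.kind && key.equals(other.key);
    }

    @Override
    public int hashCode() {
      return Objects.hash(kind, key);
    }
  }

  private final int rootColumn;
  private final List<Step> steps;

  private AccessPathSketch(int rootColumn, List<Step> steps) {
    this.rootColumn = rootColumn;
    this.steps = steps;
  }

  static AccessPathSketch root(int column) {
    return new AccessPathSketch(column, Collections.<Step>emptyList());
  }

  AccessPathSketch field(String name) { return append(Kind.FIELD, name); }
  AccessPathSketch mapKey(String key) { return append(Kind.MAP_KEY, key); }
  AccessPathSketch arrayIndex(int index) { return append(Kind.ARRAY_INDEX, index); }

  // Paths are immutable: each hop returns a new, longer path.
  private AccessPathSketch append(Kind kind, Object key) {
    List<Step> next = new ArrayList<>(steps);
    next.add(new Step(kind, key));
    return new AccessPathSketch(rootColumn, Collections.unmodifiableList(next));
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof AccessPathSketch)) {
      return false;
    }
    AccessPathSketch other = (AccessPathSketch) o;
    return rootColumn == other.rootColumn && steps.equals(other.steps);
  }

  @Override
  public int hashCode() {
    return Objects.hash(rootColumn, steps);
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder("$").append(rootColumn);
    for (Step s : steps) {
      switch (s.kind) {
        case FIELD: sb.append('.').append(s.key); break;
        case MAP_KEY: sb.append("['").append(s.key).append("']"); break;
        default: sb.append('[').append(s.key).append(']');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<AccessPathSketch, String> resolved = new HashMap<>();
    resolved.put(root(3).field("name"), "domain for $3.name");
    // Value equality lets a separately built, equal path find the same entry.
    System.out.println(resolved.containsKey(root(3).field("name")));
    System.out.println(root(2).arrayIndex(0).field("name"));
    System.out.println(root(4).mapKey("env"));
  }
}
```

Value-based `equals`/`hashCode` is the property that matters here: it is what would let a path serve as a map key when collecting per-path results.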
`AccessPath` is the key type of the new multi-path resolution API (below) and is also used in tests to assert which nested values were resolved.

`DomainInferenceProgram.deriveInputDomain` gained two base cases so inference terminates correctly at nested column references: struct field access on a `RexInputRef` (e.g., `$3.name`) and ITEM access on a `RexInputRef` (e.g., `ITEM($2, 1)` for arrays, `ITEM($4, 'env')` for maps).

## Predicate-based inference: two reductions up the SQL evaluation hierarchy

Master exposed one primitive, `deriveInputDomain(expr, outputDomain) → inputDomain`, which answers the leaf question: given an expression and a constraint on its output, derive the constraint on the input variable. Real callers, though, start higher up the SQL evaluation stack. The PR adds the two reductions that bridge a full WHERE clause down to the primitive:

```
WHERE clause (tree of AND / OR over comparisons)
        │
        │  DnfRewriter (already exists)
        ▼
list of DNF disjuncts                          ── resolveAllPaths (new)
        │
        │  for each disjunct, for each conjunct
        ▼
single comparison predicate (expr OP literal)  ── deriveInputDomainFromPredicate (new)
        │
        │  compute output domain from OP + literal
        ▼
(expression, output domain) pair               ── deriveInputDomain (primitive)
        │
        │  walk expr, refine via transformers
        ▼
domain on the input variable
```

- **`deriveInputDomainFromPredicate(RexCall predicate)`** is one reduction above the primitive. It takes a comparison `expr OP literal` (`=`, `<`, `>`, `<=`, `>=`), computes the output domain from the operator and literal (`> 5` ⇒ `IntegerDomain([6, ∞))`, `= 'abc'` ⇒ `RegexDomain.literal("abc")`), and reduces to `deriveInputDomain(expr, that)`. It also unwraps the `RexCall(UNARY_MINUS, RexLiteral)` shape Calcite uses for negative literals so `age = -5` works the same as `age = 5`.
- **`resolveAllPaths(List<RexNode> disjuncts)`** is one reduction above that.
Given the DNF disjuncts produced by `DnfRewriter`, it walks every conjunct of every disjunct, calls `deriveInputDomainFromPredicate` on each comparison, and combines the per-`AccessPath` results with AND semantics within a disjunct (intersection) and OR semantics across disjuncts (union). Predicates outside the comparison-with-literal shape are silently skipped, notably column-to-column join predicates, which still require per-column literals.

For `WHERE (age > 10 AND name = 'foo') OR (age = 0)` the result is roughly `{ $age → IntegerDomain([11,∞) ∪ {0}), $name → RegexDomain("foo") }`.

Nothing else is added: anything more specific belongs in a transformer, and anything less specific (such as converting a WHERE tree to DNF in the first place) was already the caller's job via `DnfRewriter`.

## Tighten `RegexToIntegerDomainConverter`: accept only canonical decimal regexes

- **Input:** `R = ^[0-9]{3}$`.
- **Master returns:** `IntegerDomain{0..999}`.
- **Should return:** `IntegerDomain{100..999}`. SQL `CAST(integer AS VARCHAR)` produces canonical decimal (`0 → "0"`, never `"000"`), so `0` through `99`, whose canonical forms have fewer than three digits, do not belong.
- **Fix:** narrow the converter's contract to canonical-decimal regexes only. The accept rule changes from "finite + digit-only" to "finite + subset of `^(0|[1-9][0-9]*)$`". Non-canonical inputs (`^[0-9]{3}$`, `^009$`, empty regex, …) are now rejected with `NonConvertibleDomainException`. `CastRegexTransformer`'s `CAST(int AS VARCHAR)` branch keeps calling `convert(outputRegex)` directly and relies on this strict contract.

## ProjectPullUpRewriter: remap the join condition when a left Project changes field count

Concrete scenario: tables `T1(a, b, c)` (3 cols) and `T2(x, y)` (2 cols). Plan before pull-up:

```
Join(condition: b = x)
├── Project(a, b)        keeps 2 of T1's 3 columns
│   └── Scan(T1)
└── Scan(T2)
```

The join's row type is `[Project-output | T2] = [a, b, x, y]`, so inside the condition `b` resolves to `$1` and `x` to `$2`. The condition is `$1 = $2`.
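The index bookkeeping for this scenario can be sketched in plain Java, without Calcite. This is only an illustration of the arithmetic: `JoinRemapSketch` and `remap` are hypothetical names, and the sketch assumes every expression in the lifted `Project` is a plain column reference, which the real rewriter does not require.

```java
// Hypothetical sketch of the index arithmetic only; the real rewriter operates on Calcite RexNodes.
final class JoinRemapSketch {
  /**
   * Maps an input index from the old join frame [Project-output | right]
   * to the new join frame [unprojected-left | right].
   *
   * Assumption: leftProjection[i] is the unprojected-left column that
   * Project output i reads (plain column references only).
   */
  static int remap(int oldIndex, int[] leftProjection, int newLeftCount) {
    int oldLeftCount = leftProjection.length;
    if (oldIndex < oldLeftCount) {
      // Left-side reference: look through the lifted Project.
      return leftProjection[oldIndex];
    }
    // Right-side reference: shift by the change in left-input width.
    return oldIndex + (newLeftCount - oldLeftCount);
  }

  public static void main(String[] args) {
    int[] projection = {0, 1}; // Project(a, b) over T1(a, b, c)
    int newLeftCount = 3;      // width of Scan(T1)
    // Old condition $1 = $2 (b = x) in the new frame:
    System.out.println(
        "$" + remap(1, projection, newLeftCount) + " = $" + remap(2, projection, newLeftCount));
    // prints "$1 = $3"
  }
}
```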
After pull-up, the `Project` moves above the `Join`, and the new join's left input is the raw `Scan(T1)`:

```
Project(...)
└── Join(condition: ???)
    ├── Scan(T1)
    └── Scan(T2)
```

The new join's row type is `[T1 | T2] = [a, b, c, x, y]`. `b` is still `$1`, but `x` is now `$3` because the left input grew from 2 columns back to 3. The rewritten condition must be `$1 = $3`.

Master inlined left-side `InputRef`s through the removed `Project` but left right-side `InputRef`s at their old positions. The rewritten condition came out as `$1 = $2`, which in the new frame points at `T1.c` (`VARCHAR`), not `T2.x` (`INTEGER`). Wrong column, and a type mismatch that breaks join evaluation.

The fix replaces the two side-specific helpers (`inlineLeftSide`, `inlineRightSide`) with a single `remapJoinCondition` pass. For every `InputRef` in the old condition it computes the position in the new frame using `oldLeftCount` (Project-output width) and `newLeftCount` (unprojected-left width): right-side references shift by `newLeftCount - oldLeftCount`; left-side references are remapped through the lifted projection expressions.

## IntegerDomain

- New `negate()` method (returns `multiply(-1)`), used by the new `NegateIntegerTransformer`.
- `Interval.isAdjacent` refactored to make the overflow guard explicit in two named booleans, matching the original behavior.

## Build

`coral-data-generation/build.gradle` now applies the `java-library` plugin so the module exposes proper `api`/`implementation` configurations.

## Tests

`RegexDomainInferenceProgramTest` is the main integration suite and grows substantially: it exercises every new operator individually, every new nested-type access pattern, and combined SQL queries with AND/OR over struct/map/array paths against four test tables (`test.T`, `test.complex`, `test.deep`, `test.interleaved`).
Notable coverage areas:

- single-operator tests for `SUBSTRING`, `LOWER`, `UPPER`, `CAST(int→str)`, `CAST(str→int)`, `CAST(str→date)`, arithmetic, `MINUS`, `ABS`, unary minus, `CONCAT`, `TRIM`, comparison operators with and without arithmetic
- multi-column AND/OR with same-column intersection, disjoint ranges, range-with-equality, contradictory ranges, mixed regex/integer domains
- struct field equality and arithmetic, map-element equality, array of structs, nested struct (`nested_struct.sub.value`), map of structs (`map_of_structs['key'].score`), and interleaved combinations
- CAST cross-domain on struct fields, OR disjunction on struct fields, per-column union semantics

`RegexTransformerTest` is a new dedicated unit-test class for `Concat`: prefix/suffix stripping, prefix/suffix mismatch (empty domain), empty suffix as identity, non-literal output passthrough.

`IntegerTransformerTest` adds rigorous-style cases for `Minus`, `Negate`, and `Abs`: each test constructs the `RexCall` via `RexBuilder` and calls `transformer.refineInputDomain` directly, then asserts containment and boundaries, including the empty case for `ABS` over an all-negative output interval.

`RegexToIntegerDomainConverterTest` is updated to match the new contract: tests that previously passed non-canonical regexes (e.g., `^[0-9]{3}$`, `^009$`, `^[0-9]?$`) now assert the converter rejects them with `NonConvertibleDomainException`. Parallel positive tests use canonical-form inputs (`^[1-9][0-9]{2}$` instead of `^[0-9]{3}$`).

`CastRegexTransformerTest` adds concrete accept/reject probes for the returned regex (e.g., `getAutomaton().run("100")`), pins the canonical behavior of `CAST(int AS VARCHAR)` with a canonical 3-digit output, and documents the non-canonical fallback path.

`ProjectPullUpRewriterTest` asserts row-type field-name and type preservation across pull-ups, and pins the rewritten join condition to `=($1, $3)` for the case described above.
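The canonical-decimal contract that `RegexToIntegerDomainConverterTest` pins can be illustrated with `java.util.regex` alone. This is a brute-force sketch over individual strings, not the converter's implementation, and `CanonicalDecimalSketch`, `isCanonical`, and `integerRange` are hypothetical names. It shows why the integers whose canonical rendering matches `^[0-9]{3}$` are exactly 100 through 999.

```java
import java.util.regex.Pattern;

// Hypothetical brute-force illustration; the real converter reasons about the regex itself.
final class CanonicalDecimalSketch {
  // Canonical decimal: "0", or a non-empty digit string with no leading zero.
  static final Pattern CANONICAL = Pattern.compile("^(0|[1-9][0-9]*)$");

  static boolean isCanonical(String s) {
    return CANONICAL.matcher(s).matches();
  }

  /**
   * Returns {min, max} over all n in [0, limit] whose canonical rendering
   * (Integer.toString) matches the regex, or null if no such n exists.
   */
  static int[] integerRange(String regex, int limit) {
    Pattern p = Pattern.compile(regex);
    int min = -1;
    int max = -1;
    for (int n = 0; n <= limit; n++) {
      if (p.matcher(Integer.toString(n)).matches()) {
        if (min < 0) {
          min = n;
        }
        max = n;
      }
    }
    return min < 0 ? null : new int[] {min, max};
  }

  public static void main(String[] args) {
    System.out.println(isCanonical("000")); // false: leading zeros
    System.out.println(isCanonical("100")); // true
    int[] range = integerRange("^[0-9]{3}$", 9999);
    System.out.println(range[0] + ".." + range[1]); // 100..999
  }
}
```

Note that `^009$` yields no integer at all under this check, since no integer renders as `"009"`.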
## Verification

Full module pipeline (`build`, `javadoc`, `spotlessJavaCheck`) passes; all tests in the module pass.
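To make the AND/OR combination from the predicate-based inference section concrete, here is a minimal sketch over integer intervals for a single column. It is an assumption-laden stand-in, not the PR's code: `PredicateDomainSketch` and its methods are hypothetical, intervals are closed `long` pairs with `Long.MIN_VALUE`/`Long.MAX_VALUE` standing in for infinity, overflow at those bounds is ignored, and the example query `(age > 10 AND age <= 20) OR age = 0` is a variation chosen to exercise intersection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: closed long intervals stand in for IntegerDomain.
final class PredicateDomainSketch {
  static final long NEG_INF = Long.MIN_VALUE;
  static final long POS_INF = Long.MAX_VALUE;

  // Output domain implied by "expr OP literal" (overflow at the bounds is ignored).
  static long[] fromComparison(String op, long literal) {
    switch (op) {
      case "=":  return new long[] {literal, literal};
      case ">":  return new long[] {literal + 1, POS_INF};
      case ">=": return new long[] {literal, POS_INF};
      case "<":  return new long[] {NEG_INF, literal - 1};
      case "<=": return new long[] {NEG_INF, literal};
      default: throw new IllegalArgumentException("unsupported operator: " + op);
    }
  }

  // AND within a disjunct: interval intersection; null means empty.
  static long[] intersect(long[] a, long[] b) {
    long lo = Math.max(a[0], b[0]);
    long hi = Math.min(a[1], b[1]);
    return lo <= hi ? new long[] {lo, hi} : null;
  }

  // OR across disjuncts: collect each disjunct's non-empty intersection.
  static List<long[]> resolve(List<List<long[]>> disjuncts) {
    List<long[]> union = new ArrayList<>();
    for (List<long[]> conjuncts : disjuncts) {
      long[] acc = {NEG_INF, POS_INF};
      for (long[] c : conjuncts) {
        if (acc != null) {
          acc = intersect(acc, c);
        }
      }
      if (acc != null) {
        union.add(acc);
      }
    }
    return union;
  }

  public static void main(String[] args) {
    // (age > 10 AND age <= 20) OR (age = 0)
    List<List<long[]>> dnf = Arrays.asList(
        Arrays.asList(fromComparison(">", 10), fromComparison("<=", 20)),
        Collections.singletonList(fromComparison("=", 0)));
    for (long[] interval : resolve(dnf)) {
      System.out.println("[" + interval[0] + ", " + interval[1] + "]");
    }
    // prints [11, 20] then [0, 0]
  }
}
```

A contradictory disjunct such as `age > 10 AND age < 5` intersects to empty and simply contributes nothing to the union.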