fix(spark): CONCAT/PAD type leaks non-string arg when string args aren't TEXT [CLAUDE]#7661
fix(spark): CONCAT/PAD type leaks non-string arg when string args aren't TEXT [CLAUDE]#7661RichardHughes-amp wants to merge 5 commits into
Conversation
750219e to
bbea140
Compare
…s are VARCHAR/CHAR [CLAUDE] `_annotate_by_similar_args` only short-circuited on an exact `is_type(TEXT)` match. VARCHAR, CHAR, NVARCHAR, and string literals (annotated as VARCHAR) are not `TEXT`, so for `CONCAT(varchar_col, '-', date_col)` the loop fell through every argument and the final `last_datatype = expr.type` left the result as the *last* arg's type (DATE), not a string. Accept a tuple of acceptable match types and use its first element as the canonical result. Spark's CONCAT/PAD now pass `(TEXT, *TEXT_TYPES)`, so any string-family arg yields TEXT. Also stop overwriting `last_datatype` on every non-matching arg — keep the first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bbea140 to
7c31366
Compare
…UDE] Replace varchar_col-based cases (which required a schema addition) with literal-based equivalents — string literals parse as VARCHAR, supplying the same TEXT_TYPES coverage without an extra schema column. Add missing cases: date-first CONCAT, array round-trips, and LPAD with date args. Remove varchar_col from the test schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ations [CLAUDE] Array mixed with string, binary, or a literal should resolve to UNKNOWN since Spark rejects these at analysis time with DATATYPE_MISMATCH. The annotator currently returns the wrong type for all six cases; these tests are expected to fail until the fix lands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e combinations [CLAUDE]" This reverts commit 494c9be.
f824499 to
69139fb
Compare
…annotate_pad [CLAUDE]
The old helper accumulated type info arg-by-arg against a target_type,
which failed to recognize that VARCHAR/CHAR (TEXT_TYPES but not DType.TEXT)
are valid string-concat participants. Replace with two dedicated annotators
whose dispatch matches Spark's actual type rules:
CONCAT: UNKNOWN-in → UNKNOWN; all-binary → BINARY;
all-array of identical type → that ARRAY type; else → TEXT.
PAD: ARRAY arg → UNKNOWN (invalid); else same binary/text dispatch as
CONCAT, but without the array-propagation path.
The ARRAY branch in _annotate_concat uses type.sql() equality so that
ARRAY<STRING> and ARRAY<ARRAY<STRING>> are not treated as the same type.
Also correct a pre-existing fixture expectation: CONCAT(str_col, unknown)
should return UNKNOWN, not STRING — if any arg type is unknown the output
type is unknown.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
69139fb to
080d0de
Compare
| def _annotate_by_similar_args( | ||
| self: TypeAnnotator, expression: E, *args: str, target_type: exp.DataType | exp.DType | ||
| ) -> E: | ||
| def _common_array_element_type(types: list[exp.DataType]) -> exp.DataType | exp.DType: |
There was a problem hiding this comment.
This has turned into something more complicated than anticipated. Can you reiterate the original problem in a few words? My hunch is that we shouldn't do all of this and I'm wondering if some assumptions made here are unnecessary / possibly incorrect.
There was a problem hiding this comment.
Ok, nevermind, I read your updated description. Regarding:
[...] when it swept the arguments of CONCAT (or LPAD) for types, it searched for exp.DType.TEXT, but it didn't have any special handling for other text types like VARCHAR2. So it went with whatever the last type was: in this case, date.
This can be easily fixed by expanding the "target type" to a set of types; in this case all TEXT_TYPES. Regarding:
I determined that there's really only three possible data types that CONCAT can return: a TEXT, a BINARY, or an ARRAY. Admittedly, 'array' is a very broad category - but even more bizarrely, it's possible for Spark to concatenate an Array of strings and an array of ints and emit it as an array of TEXT values.
I'd like to understand the semantics a bit more. What hapens when you get a mix of text and binary, or text and array, stuff like that? I.e., what are the allowed type mixtures and how does spark/databricks coerce/promote values of weaker types?
There was a problem hiding this comment.
Claude, summarize all that stuff we did.
The behavior we verified (Spark docs + experimentation):
CONCAT
- All string-family inputs (
STRING/VARCHAR/CHAR/literals) →STRING.- Mixed string + non-string scalar (
CONCAT('x', date_col),CONCAT('x', int_col)) → Spark implicitly casts the non-string to string and returnsSTRING. No error.- All
BINARYinputs →BINARY. (Spark has a separate overload.)- Mixed
STRING+BINARY→ Spark coerces toSTRING. The known-text arg wins.- All
ARRAY<T>inputs of identicalT→ARRAY<T>(array concat overload).ARRAY<T1>andARRAY<T2>with compatible elements → Spark narrows to the common element type, recursively. SoCONCAT(ARRAY<STRING>, ARRAY<INT>)→ARRAY<STRING>(the int elements coerce to string).CONCAT(ARRAY<ARRAY<STRING>>, ARRAY<ARRAY<INT>>)→ARRAY<ARRAY<STRING>>.ARRAYmixed with non-ARRAY(CONCAT(ARRAY<INT>, INT)) — Spark raises an analysis error. We annotateUNKNOWN.- Any
UNKNOWNarg with no string sibling →UNKNOWN. (If there's a string sibling, the query either coerces the unknown to string or errors out — no valid execution returns non-text, so we settle onTEXT.)LPAD / RPAD
- Same scalar dispatch as CONCAT (string-family →
STRING; all-BINARY→BINARY).- No array overload.
LPAD(ARRAY<…>, …)is invalid →UNKNOWN.LPAD('x', 10, date_col)→ Spark coerces the fill arg to string →STRING.
Thanks, Claude.
tosses the claude underhand back into the void from which it formed
As far as I can tell, all disparate types either collapse to TEXT if they possibly can, or they throw an exception. Honestly, the only thing that requires the complex refactoring that I did is the array-handling, the way that you can concat Array with Array to make another, longer Array, or concat Array with Array to almost always get Array, etc. Concatenating complex arrays is an edge case, and the code is not immediately intuitive, but it is a thing that people do; it's up to you whether you want SQLGLot to spin that particular plate.
There was a problem hiding this comment.
In short, we can make this PR enormously simpler if you want to punt on Arrays. But: then it won't handle Arrays.
SQLGlot and the native spark query analyzer were returning disparate results in testing for a query that had a clause of the form:
...Where SQLGlot parsed this CONCAT as emitting a Date type, and the native parser saw it as emitting a Varchar type.
We tracked down the root cause to a subtle error in the logic of
sqlglot/typing/spark2.py:_annotate_by_similar_args; when it swept the arguments of CONCAT (or LPAD) for types, it searched for exp.DType.TEXT, but it didn't have any special handling for other text types like VARCHAR2. So it went with whatever the last type was: in this case, date.The first draft of this got some pushback, which let to me taking a deep dive into the source code, and also testing the behavior of Spark and Databricks type derivation against a local SparkSession and a Databricks cluster. I determined that there's really only three possible data types that CONCAT can return: a TEXT, a BINARY, or an ARRAY. Admittedly, 'array' is a very broad category - but even more bizarrely, it's possible for Spark to concatenate an Array of strings and an array of ints and emit it as an array of TEXT values.
So, I took the time to spin up something that would handle arrays neatly as well. Most of this was a Claude instance generating something that would handle recursive nesting, then fine-tuning the tests to be tests.
The resultant code is less flexible. But it's only called in two places; it doesn't need to be flexible so much as it needs to be clear and correct, and I'm confident this is both.