feat: fix array_compact for Spark 4.0 and correct return type metadata#3796
Open
andygrove wants to merge 3 commits intoapache:mainfrom
Open
feat: fix array_compact for Spark 4.0 and correct return type metadata#3796andygrove wants to merge 3 commits intoapache:mainfrom
andygrove wants to merge 3 commits intoapache:mainfrom
Conversation
- Handle KnownNotContainsNull wrapper in Spark 4.0 shim so that array_compact runs natively on Spark 4.0 (previously always fell back) - Fix return type passed to DataFusion: use expr.dataType instead of hardcoded ArrayType(elementType) so that containsNull=false is correctly propagated on Spark 4.0 - Mark CometArrayCompact as Compatible() instead of Incompatible(None) - Expand SQL test coverage: add string, double, and nested array types; change spark_answer_only to query to verify native execution - Remove assume(\!isSpark40Plus) skip from Scala test - Update expressions.md: ArrayCompact is now supported
…DataFusion DataFusion's array_remove_all function always returns a list with nullable elements (nullable=true), but array_compact's Spark dataType has containsNull=false. Passing the Spark type as the promised return type caused a runtime type-mismatch assertion in DataFusion's ScalarFunctionExpr. Fix by passing ArrayType(elementType, containsNull=true) in both the Spark 3.x CometArrayCompact serde and the Spark 4.0 KnownNotContainsNull shim.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes several issues with the
array_compactexpression found during an audit:Spark 4.0 support: On Spark 4.0,
array_compactrewrites toKnownNotContainsNull(ArrayFilter(IsNotNull)). BecauseKnownNotContainsNullwas not handled by Comet, the expression always fell back to Spark on 4.0. This PR adds a case in the Spark 4.0 shim that unwrapsKnownNotContainsNull, serializes the innerArrayFilterusing DataFusion'sarray_remove_all, and propagates thecontainsNull=falsereturn type correctly.Return type bug:
CometArrayCompact.convertwas hardcodingArrayType(elementType = elementType)as the return type passed to DataFusion, which always setcontainsNull=true. Changed to useexpr.dataTypeso the correct nullability metadata is emitted — matching Spark's behavior wherecontainsNull=falseafter compaction on Spark 4.0.Incorrect
Incompatibleclassification:CometArrayCompact.getSupportLevelwas returningIncompatible(None)with no explanation. The implementation is semantically correct (DataFusion'sarray_remove_all(arr, null)behaves the same as Spark'sArrayFilter(IsNotNull)), so this is changed toCompatible().SQL test improvements: Added
string,double, andarray<array<int>>(nested) type coverage; changedspark_answer_onlytoqueryto verify native execution.Remove Scala test skip: The
assume(\!isSpark40Plus)guard with a// TODO fix for Spark 4.0.0comment is removed now that Spark 4.0 is supported.Docs: Updated
expressions.mdto markArrayCompactas supported.