Skip to content

feat: fix array_compact for Spark 4.0 and correct return type metadata#3796

Open
andygrove wants to merge 3 commits intoapache:mainfrom
andygrove:fix-array-compact
Open

feat: fix array_compact for Spark 4.0 and correct return type metadata#3796
andygrove wants to merge 3 commits intoapache:mainfrom
andygrove:fix-array-compact

Conversation

@andygrove
Copy link
Copy Markdown
Member

Summary

Fixes several issues with the array_compact expression found during an audit:

  • Spark 4.0 support: On Spark 4.0, array_compact rewrites to KnownNotContainsNull(ArrayFilter(IsNotNull)). Because KnownNotContainsNull was not handled by Comet, the expression always fell back to Spark on 4.0. This PR adds a case in the Spark 4.0 shim that unwraps KnownNotContainsNull, serializes the inner ArrayFilter using DataFusion's array_remove_all, and propagates the containsNull=false return type correctly.

  • Return type bug: CometArrayCompact.convert was hardcoding ArrayType(elementType = elementType) as the return type passed to DataFusion, which always set containsNull=true. Changed to use expr.dataType so the correct nullability metadata is emitted — matching Spark's behavior where containsNull=false after compaction on Spark 4.0.

  • Incorrect Incompatible classification: CometArrayCompact.getSupportLevel was returning Incompatible(None) with no explanation. The implementation is semantically correct (DataFusion's array_remove_all(arr, null) behaves the same as Spark's ArrayFilter(IsNotNull)), so this is changed to Compatible().

  • SQL test improvements: Added string, double, and array<array<int>> (nested) type coverage; changed spark_answer_only to query to verify native execution.

  • Remove Scala test skip: The assume(\!isSpark40Plus) guard with a // TODO fix for Spark 4.0.0 comment is removed now that Spark 4.0 is supported.

  • Docs: Updated expressions.md to mark ArrayCompact as supported.

- Handle KnownNotContainsNull wrapper in Spark 4.0 shim so that array_compact runs natively on Spark 4.0 (previously always fell back) - Fix return type passed to DataFusion: use expr.dataType instead of hardcoded ArrayType(elementType) so that containsNull=false is correctly propagated on Spark 4.0 - Mark CometArrayCompact as Compatible() instead of Incompatible(None) - Expand SQL test coverage: add string, double, and nested array types; change spark_answer_only to query to verify native execution - Remove assume(\!isSpark40Plus) skip from Scala test - Update expressions.md: ArrayCompact is now supported
…DataFusion DataFusion's array_remove_all function always returns a list with nullable elements (nullable=true), but array_compact's Spark dataType has containsNull=false. Passing the Spark type as the promised return type caused a runtime type-mismatch assertion in DataFusion's ScalarFunctionExpr. Fix by passing ArrayType(elementType, containsNull=true) in both the Spark 3.x CometArrayCompact serde and the Spark 4.0 KnownNotContainsNull shim.
@andygrove andygrove marked this pull request as ready for review March 26, 2026 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant