fix: Emit stats for all partitions including empty/deleted and tables without readable_metrics #515
Open
srawat98-dev wants to merge 4 commits into linkedin:main from
… readable_metrics is empty

Previously, `populateStatsForUnpartitionedTable()` returned an empty list when `getColumnNamesFromReadableMetrics()` found no column-level metrics, silently skipping S3 (partition stats) emission entirely. This meant that unpartitioned tables without readable_metrics in their data_files metadata never got rowCount or columnCount published, even though both values are always independently available (rowCount from `sum(record_count)` over data_files, columnCount from `table.schema().columns().size()`).

This caused downstream consumers (e.g., Data Quality unified summary) to permanently have null stats for affected tables, since the first-arrival stats are immutable once set.

The fix:
1. Remove the early return when columnNames is empty: log a warning but continue to aggregate row count and build the stats object.
2. Handle empty columnAggExpressions in the SQL query builder to avoid a trailing comma producing invalid SQL.

Column-level metrics (nullCount, nanCount, min, max, columnSize) will be null when readable_metrics is unavailable, which is correct: the stats object is still emitted with valid rowCount and columnCount.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
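The trailing-comma handling described in this commit can be illustrated with a minimal sketch. This is not the code from `TableStatsCollectorUtil`; the class name `RowCountQuerySketch`, the method `buildAggregationQuery`, and the table name are hypothetical, and the only assumption carried over from the commit message is that the row count comes from `sum(record_count)` and that the list of column aggregation expressions may be empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: not the actual OpenHouse query builder.
class RowCountQuerySketch {

  /**
   * Builds a SELECT over a data_files-style table. The row-count aggregate is
   * always present; column-level aggregates are appended only when available,
   * so an empty list can never leave a dangling comma in the select list.
   */
  static String buildAggregationQuery(String dataFilesTable, List<String> columnAggExpressions) {
    List<String> selectItems = new ArrayList<>();
    selectItems.add("sum(record_count) AS total_row_count");
    selectItems.addAll(columnAggExpressions);
    // String.join inserts separators only between items, never after the last one.
    return "SELECT " + String.join(", ", selectItems) + " FROM " + dataFilesTable;
  }

  public static void main(String[] args) {
    // No readable_metrics: only the row count is selected.
    System.out.println(buildAggregationQuery("db.tbl.data_files", Collections.emptyList()));
    // With column-level aggregates (expression text is illustrative only).
    System.out.println(
        buildAggregationQuery("db.tbl.data_files", Arrays.asList("max(col_a_upper) AS col_a_max")));
  }
}
```

Collecting the select items into one list and joining them once keeps the empty case and the non-empty case on the same code path, which is one way to avoid a special branch for the comma.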
Same bug exists in `populateStatsForPartitionedTable()`: the early return when `getColumnNamesFromReadableMetrics()` is empty skips S3 emission entirely, even though rowCount and columnCount are independently available.

Applied the same fix pattern:
1. Remove early return when columnNames is empty in partitioned path
2. Handle empty columnAggExpressions in aggregatePartitionStats SQL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change `joinStatsWithCommitMetadata` from INNER to LEFT join so that partitions with commits but no current data_files (e.g., fully deleted or truncated partitions) are still emitted with rowCount=0 instead of being silently dropped.

The downstream transform already handles null total_row_count safely: `rowCount != null ? rowCount : 0L`

This ensures DQ consumers can detect "partition went to 0 rows" events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
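The effect of switching from INNER to LEFT join, combined with the null guard, can be sketched in plain Java. This is a stand-in that uses maps rather than Spark datasets, so it does not reflect the real `joinStatsWithCommitMetadata` signature; the class `LeftJoinSketch`, the method `joinStatsWithCommits`, and the sample partition values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: emulates a LEFT join with maps instead of Spark datasets.
class LeftJoinSketch {

  /**
   * Left side: latest commit per partition. Right side: row counts aggregated
   * from current data files. Every partition on the left is emitted; a missing
   * right-side match yields null, which the guard turns into 0.
   */
  static Map<String, Long> joinStatsWithCommits(
      Map<String, String> latestCommitByPartition, Map<String, Long> rowCountByPartition) {
    Map<String, Long> emitted = new LinkedHashMap<>();
    for (String partition : latestCommitByPartition.keySet()) {
      // null when the partition has commits but no current data files
      Long rowCount = rowCountByPartition.get(partition);
      emitted.put(partition, rowCount != null ? rowCount : 0L);
    }
    return emitted;
  }

  public static void main(String[] args) {
    Map<String, String> commits = new LinkedHashMap<>();
    commits.put("ds=2024-01-01", "snapshot-1");
    commits.put("ds=2024-01-02", "snapshot-2");

    Map<String, Long> rowCounts = new LinkedHashMap<>();
    rowCounts.put("ds=2024-01-01", 42L); // ds=2024-01-02 was fully deleted

    System.out.println(joinStatsWithCommits(commits, rowCounts));
  }
}
```

An INNER join would correspond to iterating only over partitions present on both sides, which is exactly how the fully deleted partition would disappear from the output.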
- Add testPartitionStatsForDeletedPartition: creates partitioned table with two partitions, deletes all rows from one, verifies the deleted partition emits stats with rowCount=0 (not silently dropped).
- Enhance warn log messages to include schema column count for debugging.
- Add comment explaining .drop() safety with LEFT JOIN in Spark 3.x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Three fixes to ensure partition stats (S3) are emitted reliably for all tables and partitions:
Root Cause
In `TableStatsCollectorUtil`:
- `populateStatsForPartitionedTable()` and `populateStatsForUnpartitionedTable()` returned `Collections.emptyList()` when `getColumnNamesFromReadableMetrics()` returned empty, skipping stats emission entirely, even though `rowCount` (from `sum(record_count)`) and `columnCount` (from `table.schema()`) are always independently available.
- `joinStatsWithCommitMetadata()` used INNER JOIN between latest commits and `data_files` stats, silently dropping partitions that have commits but no current data files (e.g., after DELETE/truncate operations).

Impact
- Downstream consumers (e.g., Data Quality unified summary) permanently have null `rowCount`/`columnCount` for affected tables (first-arrival stats are immutable once set)

Changes
Fix 1: Empty readable_metrics (partitioned + unpartitioned)
- Remove the early return in `populateStatsForPartitionedTable()` and `populateStatsForUnpartitionedTable()` when `columnNames` is empty
- Handle empty `columnAggExpressions` in SQL query builders to avoid invalid SQL with trailing comma

Fix 2: LEFT JOIN for deleted partitions
- Change `joinStatsWithCommitMetadata()` from INNER to LEFT join
- The downstream transform already handles null `total_row_count` safely: `rowCount != null ? rowCount : 0L`

Test: Deleted partition coverage
- `testPartitionStatsForDeletedPartition`: creates partitioned table with 2 partitions, deletes all rows from one, verifies deleted partition emits stats with `rowCount=0`

Testing Done
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
Some other form of testing like staging or soak time in production. Please explain.
- New test: `testPartitionStatsForDeletedPartition` verifies LEFT JOIN retains deleted partitions with rowCount=0
- Existing integration test `testPartitionStatsForUnpartitionedTable` continues to pass
- Existing unit test `testBuildColumnAggregationExpressions_withEmptyList` confirms empty list produces empty expressions
- `extractColumnMetricsFromAggregatedRow` with empty columnNames returns empty metric lists (no iteration), so it is safe
- `transformRowsToPartitionStatsFromAggregatedSQL` already null-guards `total_row_count` with fallback to 0L
- Production trace: null stats for `u_daliview.non_hive_migration_hdfs_path_to_table_phase_mapping_war` on holdem root-caused to the early return path

Additional Information
No breaking changes. Fully backward compatible: