fix: Emit stats for all partitions including empty/deleted and tables without readable_metrics #515

Open
srawat98-dev wants to merge 4 commits into linkedin:main from srawat98-dev:srawat/fix-unpartitioned-stats-empty-metrics

srawat98-dev commented Mar 26, 2026

Summary

Two fixes to ensure partition stats (S3) are emitted reliably for all tables and partitions:

  1. Emit rowCount/columnCount when readable_metrics is empty — affects both partitioned and unpartitioned tables
  2. Use LEFT JOIN to retain deleted/empty partitions — ensures partitions with 0 rows after DELETE/truncate still emit stats with rowCount=0

Root Cause

In TableStatsCollectorUtil:

  • Both populateStatsForPartitionedTable() and populateStatsForUnpartitionedTable() returned Collections.emptyList() when getColumnNamesFromReadableMetrics() returned empty — skipping stats emission entirely, even though rowCount (from sum(record_count)) and columnCount (from table.schema()) are always independently available.
  • joinStatsWithCommitMetadata() used INNER JOIN between latest commits and data_files stats — silently dropping partitions that have commits but no current data files (e.g., after DELETE/truncate operations).
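
The first root cause can be illustrated with a minimal sketch. The `Stats` class and the two `populate...` methods below are hypothetical stand-ins, not the actual code in TableStatsCollectorUtil. They only model the control flow described above, under the assumption that rowCount and columnCount are computed independently of readable_metrics.

```java
import java.util.Collections;
import java.util.List;

final class EarlyReturnSketch {
  private EarlyReturnSketch() {}

  /** Hypothetical stand-in for one emitted stats record. */
  static final class Stats {
    final long rowCount;
    final int columnCount;
    final boolean hasColumnMetrics; // false models null column-level metrics

    Stats(long rowCount, int columnCount, boolean hasColumnMetrics) {
      this.rowCount = rowCount;
      this.columnCount = columnCount;
      this.hasColumnMetrics = hasColumnMetrics;
    }
  }

  /** Models the old behavior: no column names means nothing is emitted. */
  static List<Stats> populateBefore(List<String> columnNames, long rowCount, int columnCount) {
    if (columnNames.isEmpty()) {
      return Collections.emptyList(); // rowCount and columnCount are discarded
    }
    return Collections.singletonList(new Stats(rowCount, columnCount, true));
  }

  /** Models the fixed behavior: always emit, with column metrics left absent. */
  static List<Stats> populateAfter(List<String> columnNames, long rowCount, int columnCount) {
    boolean hasColumnMetrics = !columnNames.isEmpty();
    return Collections.singletonList(new Stats(rowCount, columnCount, hasColumnMetrics));
  }
}
```

With an empty column-name list, `populateBefore` yields no record at all, while `populateAfter` yields one record that still carries rowCount and columnCount.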

Impact

  • Downstream DQ pipeline permanently had null rowCount/columnCount for affected tables (first-arrival stats are immutable once set)
  • ~34% of commits for certain unpartitioned tables had no S3 entry despite valid S1/S2 events
  • Partition DELETE/truncate events were invisible to downstream consumers

Changes

  • Bug Fixes
  • Tests

Fix 1: Empty readable_metrics (partitioned + unpartitioned)

  • Removed early return in both populateStatsForPartitionedTable() and populateStatsForUnpartitionedTable() when columnNames is empty
  • Handle empty columnAggExpressions in SQL query builders to avoid invalid SQL with trailing comma
  • Column-level metrics (nullCount, nanCount, min, max, columnSize) will be null when readable_metrics is unavailable — correct behavior
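
One way to avoid the trailing comma is to collect every SELECT item in a list and join them, so the always-present row count never depends on whether column aggregates exist. The sketch below is an assumption about the shape of such a builder, not the query builder in this PR: the class name, method name, and the example aggregate expression are hypothetical, while `sum(record_count)` and `total_row_count` are taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class StatsQuerySketch {
  private StatsQuerySketch() {}

  /**
   * Builds an aggregation query over a data_files metadata table.
   * The row count is always selected; column-level aggregates are appended
   * only when present, so an empty list cannot leave a comma before FROM.
   */
  static String buildAggregationQuery(String dataFilesTable, List<String> columnAggExpressions) {
    List<String> selectItems = new ArrayList<>();
    selectItems.add("sum(record_count) AS total_row_count");
    selectItems.addAll(columnAggExpressions);
    return "SELECT " + String.join(", ", selectItems) + " FROM " + dataFilesTable;
  }

  public static void main(String[] args) {
    // No readable_metrics: only the row count is selected.
    System.out.println(
        buildAggregationQuery("db.tbl.data_files", Collections.<String>emptyList()));
    // With one illustrative (hypothetical) column aggregate.
    System.out.println(
        buildAggregationQuery(
            "db.tbl.data_files",
            Arrays.asList("sum(readable_metrics.id.null_value_count) AS id_null_count")));
  }
}
```

Joining a list keeps the empty and non-empty cases on the same code path, which is less error-prone than concatenating a comma conditionally.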

Fix 2: LEFT JOIN for deleted partitions

  • Changed joinStatsWithCommitMetadata() from INNER to LEFT join
  • Partitions with commits but no current data_files now emit stats with rowCount=0
  • Downstream transform already handles this: rowCount != null ? rowCount : 0L
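
The effect of the join change can be modeled in plain Java, without Spark. This is only a sketch of the join semantics: the class, method names, and partition values are hypothetical, and the real change operates on Spark DataFrames inside joinStatsWithCommitMetadata(). The null guard mirrors the `rowCount != null ? rowCount : 0L` expression quoted above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionJoinSketch {
  private PartitionJoinSketch() {}

  /** INNER join semantics: a committed partition with no data_files row is dropped. */
  static Map<String, Long> innerJoin(
      List<String> committedPartitions, Map<String, Long> rowCountByPartition) {
    Map<String, Long> joined = new LinkedHashMap<>();
    for (String partition : committedPartitions) {
      if (rowCountByPartition.containsKey(partition)) {
        joined.put(partition, rowCountByPartition.get(partition));
      }
    }
    return joined;
  }

  /** LEFT join semantics plus the downstream null guard: missing stats become 0. */
  static Map<String, Long> leftJoinWithDefault(
      List<String> committedPartitions, Map<String, Long> rowCountByPartition) {
    Map<String, Long> joined = new LinkedHashMap<>();
    for (String partition : committedPartitions) {
      Long rowCount = rowCountByPartition.get(partition); // null when no data files remain
      joined.put(partition, rowCount != null ? rowCount : 0L);
    }
    return joined;
  }

  public static void main(String[] args) {
    List<String> committed = Arrays.asList("ds=2026-03-25", "ds=2026-03-26");
    Map<String, Long> stats = new HashMap<>();
    stats.put("ds=2026-03-25", 42L); // the second partition was fully deleted

    System.out.println("inner: " + innerJoin(committed, stats));
    System.out.println("left:  " + leftJoinWithDefault(committed, stats));
  }
}
```

In this model the inner join returns one entry and the left join returns two, with the deleted partition mapped to 0.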

Test: Deleted partition coverage

  • Added testPartitionStatsForDeletedPartition: creates partitioned table with 2 partitions, deletes all rows from one, verifies deleted partition emits stats with rowCount=0

Testing Done

  • Added new tests for the changes made.

  • Updated existing tests to reflect the changes made.

  • Some other form of testing like staging or soak time in production. Please explain.

  • New test: testPartitionStatsForDeletedPartition — verifies LEFT JOIN retains deleted partitions with rowCount=0

  • Existing integration test testPartitionStatsForUnpartitionedTable continues to pass

  • Existing unit test testBuildColumnAggregationExpressions_withEmptyList confirms empty list produces empty expressions

  • extractColumnMetricsFromAggregatedRow with empty columnNames returns empty metric lists (no iteration) — safe

  • transformRowsToPartitionStatsFromAggregatedSQL already null-guards total_row_count with fallback to 0L

  • Production trace: null stats for u_daliview.non_hive_migration_hdfs_path_to_table_phase_mapping_war on holdem root-caused to the early return path

Additional Information

No breaking changes. Fully backward compatible:

  • Tables with readable_metrics: unchanged behavior
  • Tables without readable_metrics: now emit stats with valid rowCount/columnCount, null column-level metrics (previously silently skipped)
  • Deleted partitions: now emit stats with rowCount=0 (previously silently dropped by INNER JOIN)
Shantanu Rawat and others added 2 commits March 26, 2026 12:27
… readable_metrics is empty

Previously, `populateStatsForUnpartitionedTable()` returned an empty list when `getColumnNamesFromReadableMetrics()` found no column-level metrics, silently skipping S3 (partition stats) emission entirely.

This meant that unpartitioned tables without readable_metrics in their data_files metadata never got rowCount or columnCount published, even though both values are always independently available (rowCount from `sum(record_count)` over data_files, columnCount from `table.schema().columns().size()`).

This caused downstream consumers (e.g., Data Quality unified summary) to permanently have null stats for affected tables, since the first-arrival stats are immutable once set.

The fix:

  1. Remove the early return when columnNames is empty: log a warning but continue to aggregate row count and build the stats object.
  2. Handle empty columnAggExpressions in the SQL query builder to avoid a trailing comma producing invalid SQL.

Column-level metrics (nullCount, nanCount, min, max, columnSize) will be null when readable_metrics is unavailable, which is correct: the stats object is still emitted with valid rowCount and columnCount.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same bug exists in populateStatsForPartitionedTable(): early return when getColumnNamesFromReadableMetrics() is empty skips S3 emission entirely, even though rowCount and columnCount are independently available.

Applied the same fix pattern:

  1. Remove early return when columnNames is empty in partitioned path
  2. Handle empty columnAggExpressions in aggregatePartitionStats SQL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
srawat98-dev changed the title from "fix: Emit rowCount/columnCount for unpartitioned tables when readable_metrics is empty" to "fix: Emit rowCount/columnCount even when readable_metrics is empty" Mar 27, 2026
Change joinStatsWithCommitMetadata from INNER to LEFT join so that partitions with commits but no current data_files (e.g., fully deleted or truncated partitions) are still emitted with rowCount=0 instead of being silently dropped.

The downstream transform already handles null total_row_count safely: rowCount != null ? rowCount : 0L

This ensures DQ consumers can detect "partition went to 0 rows" events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
srawat98-dev changed the title from "fix: Emit rowCount/columnCount even when readable_metrics is empty" to "fix: Emit stats for all partitions including empty/deleted and tables without readable_metrics" Mar 27, 2026
- Add testPartitionStatsForDeletedPartition: creates partitioned table with two partitions, deletes all rows from one, verifies the deleted partition emits stats with rowCount=0 (not silently dropped).
- Enhance warn log messages to include schema column count for debugging.
- Add comment explaining .drop() safety with LEFT JOIN in Spark 3.x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>