Core: Support appending files with different specs by fqaiser94 · Pull Request #9860 · apache/iceberg

fqaiser94 · 2024-03-03T17:07:35Z

What is the problem?

Currently, the table.newAppend() API expects every DataFile passed to .appendFile() to have the same PartitionSpec.
Otherwise it raises a ValidationException("Invalid data file, expected spec id: %d", dataSpec.specId()).
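
As a rough illustration of the behaviour described above, here is a small library-free sketch. `FileRef` and `validateSingleSpec` are hypothetical stand-ins and not Iceberg classes, and the sketch throws `IllegalArgumentException` where Iceberg raises `ValidationException`. Only the message format is taken from the description.

```java
import java.util.List;

public class SingleSpecAppendSketch {
  // Hypothetical stand-in for a data file: a path plus the ID of the spec it was written with.
  public static final class FileRef {
    final String path;
    final int specId;

    public FileRef(String path, int specId) {
      this.path = path;
      this.specId = specId;
    }
  }

  // Mimics the single-spec restriction: every appended file must carry the expected spec ID.
  public static void validateSingleSpec(int expectedSpecId, List<FileRef> files) {
    for (FileRef file : files) {
      if (file.specId != expectedSpecId) {
        throw new IllegalArgumentException(
            String.format("Invalid data file, expected spec id: %d", expectedSpecId));
      }
    }
  }

  public static void main(String[] args) {
    // Two in-flight files written before and after a partition spec evolution.
    List<FileRef> inflight =
        List.of(new FileRef("a.parquet", 0), new FileRef("b.parquet", 1));
    try {
      validateSingleSpec(0, inflight);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```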

Correct me if I'm wrong, but the Iceberg spec doesn't seem to impose any such restriction.
The only related restriction I could find was in the manifests section which says:

A manifest stores files for a single partition spec.

We can easily work around this by writing multiple manifests, one for each spec for which files are being appended.
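
The one-manifest-per-spec idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FileRef` and `groupBySpec` are hypothetical names, and each resulting group stands for the set of files that would be written to its own manifest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ManifestPerSpecSketch {
  // Hypothetical stand-in for a data file: a path plus the ID of the spec it was written with.
  public static final class FileRef {
    final String path;
    final int specId;

    public FileRef(String path, int specId) {
      this.path = path;
      this.specId = specId;
    }
  }

  // Groups file paths by spec ID so that each group only contains files of a single spec.
  public static Map<Integer, List<String>> groupBySpec(List<FileRef> files) {
    Map<Integer, List<String>> bySpec = new TreeMap<>();
    for (FileRef file : files) {
      bySpec.computeIfAbsent(file.specId, id -> new ArrayList<>()).add(file.path);
    }
    return bySpec;
  }

  public static void main(String[] args) {
    List<FileRef> appended = List.of(
        new FileRef("a.parquet", 0),
        new FileRef("b.parquet", 1),
        new FileRef("c.parquet", 0));
    // One entry per spec ID; each entry corresponds to one manifest in the same snapshot.
    groupBySpec(appended).forEach(
        (specId, paths) -> System.out.println("spec " + specId + " -> " + paths));
  }
}
```

Grouping up front keeps the spec's rule that a manifest stores files for a single partition spec, while still allowing all groups to be committed in one snapshot.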

Why is this change needed/valuable?

In the iceberg-kafka-connect project, we've seen that when users evolve the PartitionSpec of a table, DataFiles with different PartitionSpecs are often in flight at the same time. Committing these DataFiles together as part of the same snapshot is then impossible because of the aforementioned ValidationException.

While we could work around this by committing DataFiles with different PartitionSpecs as separate snapshots, that makes it complex to correctly associate valuable (watermarking) metadata with each snapshot in the snapshot properties. It also makes the table's snapshot history unnecessarily long. It would be better to avoid both issues.

Related work

Core: Support committing delete files with multiple specs #2985

core/src/test/java/org/apache/iceberg/TestMergeAppend.java

danielcweeks

I have some concerns about the validation since we're really looking at possibilities of having a replace/overwrite that crosses partition structures. @szehon-ho and @rdblue will probably need to take a closer look as well.

~~After an initial glance, I'm not sure the validations still hold.~~

After looking a little closer, I think this is ok: we're actually doubling up on the validation and checking for any changes across each partition spec.

core/src/main/java/org/apache/iceberg/BaseReplacePartitions.java

core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java

amogh-jahagirdar

Looks great, just had some nits!

core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java

core/src/test/java/org/apache/iceberg/TestMergeAppend.java

core/src/test/java/org/apache/iceberg/TestOverwrite.java

fqaiser94 · 2024-07-03T18:37:07Z

Simplified the PR to just the multi-partition-append use case.

amogh-jahagirdar

Really sorry for the delayed review on this, @fqaiser94. I see this PR came up in discussion on the kafka commit coordination PR #10351.

I do remember the context of this PR and I just took another pass, at least from me it looks good. I'll leave it open for a bit in case @nastra @danielcweeks are interested, otherwise I'll merge.

fqaiser94 · 2024-07-20T16:48:17Z

No problems, and thanks all for the reviews 😄

github-actions bot added the core label Mar 3, 2024

fqaiser94 force-pushed the append_data_files_with_multiple_specs branch from 8320a0e to beeba63 Compare March 3, 2024 17:10

nastra reviewed Mar 4, 2024

fqaiser94 force-pushed the append_data_files_with_multiple_specs branch 2 times, most recently from 909d37c to 916a7b2 Compare March 4, 2024 16:25

nastra reviewed Mar 4, 2024

danielcweeks reviewed Mar 4, 2024

amogh-jahagirdar reviewed Mar 5, 2024

amogh-jahagirdar reviewed Mar 5, 2024

fqtab mentioned this pull request Mar 5, 2024

Handle partition spec evolutions gracefully databricks/iceberg-kafka-connect#202

Merged

fqaiser94 force-pushed the append_data_files_with_multiple_specs branch 2 times, most recently from 126b4fd to 724a1d8 Compare March 11, 2024 13:20