Core: Support appending files with different specs#9860
Core: Support appending files with different specs#9860amogh-jahagirdar merged 1 commit intoapache:mainfrom
Conversation
8320a0e to beeba63 Compare 909d37c to 916a7b2 Compare There was a problem hiding this comment.
I have some concerns about the validation since we're really looking at possibilities of having a replace/overwrite that crosses partition structures. @szehon-ho and @rdblue will probably need to take a closer look as well.
After an initial glance, I'm not sure the validations still hold.
After looking a little closer, I think this is ok. We're actually doubling up on the validation and checking for any changes across each partition spec, so I think this is ok.
core/src/main/java/org/apache/iceberg/BaseReplacePartitions.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java Outdated Show resolved Hide resolved
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java Outdated Show resolved Hide resolved
amogh-jahagirdar left a comment
There was a problem hiding this comment.
Looks great, just had some nits!
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java Outdated Show resolved Hide resolved
126b4fd to 724a1d8 Compare 660bab4 to 989f622 Compare | Simplified the PR to just multi-partition-append use case. |
amogh-jahagirdar left a comment
There was a problem hiding this comment.
Really sorry for the delayed review on this @fqaiser94 I see this PR came up in discussion on the kafka commit coordination PR #10351 (comment).
I do remember the context of this PR and I just took another pass, at least from me it looks good. I'll leave it open for a bit in case @nastra @danielcweeks are interested, otherwise I'll merge.
No problems, and thanks all for the reviews 😄 |
What is the problem?
Currently the
table.newAppend()API expects users to provide Datafiles with the same PartitionSpec via.appendFile().Failure to do so raises a
ValidationException("Invalid data file, expected spec id: %d", dataSpec.specId()).CMIIW but the Iceberg spec doesn't seem to impose any such restriction.
The only related restriction I could find was in the manifests section which says:
We can easily work around this by writing multiple manifests, one for each spec for which files are being appended.
Why is this change needed/valuable?
In the iceberg-kafka-connect project, we've seen that when users evolve the PartitionSpec of the table, often they'll end up in a situation where Datafiles with different PartitionSpecs might be inflight and committing these DataFiles together as part of the same snapshot becomes impossible due to the aforementioned
ValidationException.While we could work around this by committing DataFiles with different PartitionSpecs as separate snapshots, this makes it complex for us to correctly associate valuable (watermarking) metadata with each snapshot in the snapshot properties. In addition, it makes the table snapshot history unnecessarily longer. It would be more ideal if we could avoid these issues.
Related work