Skip to content

[Kernel][Type Widening] 9/ Implement TypeChange Parsing#4593

Merged
vkorukanti merged 2 commits intodelta-io:masterfrom
emkornfield:stack/add_type_change_parsing
Jun 5, 2025
Merged

[Kernel][Type Widening] 9/ Implement TypeChange Parsing#4593
vkorukanti merged 2 commits intodelta-io:masterfrom
emkornfield:stack/add_type_change_parsing

Conversation

@emkornfield
Copy link
Collaborator

@emkornfield emkornfield commented May 21, 2025

🥞 Stacked PR

Use this link to review incremental changes.


Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR implements parsing of type changes from the JSON schema to Kernel's schema model.

How was this patch tested?

Add tests for parsing as well as updated schema evolution tests to ensure this works end-to-end.

Does this PR introduce any user-facing changes?

Type changes are now appropriately parsed, and can be round-tripped conforming to the delta spec.

@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 11 times, most recently from 8e24e94 to 14e49de Compare May 21, 2025 21:47
@emkornfield emkornfield changed the title Add parsing for type changes. [Kernel][Type Widening] 8/ refactor collation parsing/setting not be intertwined with type parsing May 22, 2025
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 2 times, most recently from 5f89234 to 64652bf Compare May 22, 2025 05:38
@emkornfield emkornfield marked this pull request as ready for review May 22, 2025 05:38
@emkornfield emkornfield changed the title [Kernel][Type Widening] 8/ refactor collation parsing/setting not be intertwined with type parsing [Kernel][Type Widening] 9/ Implement TypeChange Parsing May 22, 2025
@emkornfield emkornfield requested a review from vkorukanti May 22, 2025 07:05
@emkornfield emkornfield requested a review from johanl-db May 22, 2025 17:06
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 2 times, most recently from 583909c to 0377d14 Compare May 23, 2025 17:34
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 2 times, most recently from ddad757 to 538ec9c Compare May 28, 2025 00:22
vkorukanti pushed a commit that referenced this pull request May 28, 2025
…schemas (#4560) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4560/files) to review incremental changes. - [**stack/tw_iterable**](#4560) [[Files changed](https://github.com/delta-io/delta/pull/4560/files)] - [stack/tw_cleanup_signature](#4561) [[Files changed](https://github.com/delta-io/delta/pull/4561/files/d85f1ee710f766b82228f81e0face2518ccaaa0c..96edf0378c81d9a156571fa1fd2f688d0d04ef27)] - [stack/tw_make_update_real](#4566) [[Files changed](https://github.com/delta-io/delta/pull/4566/files/96edf0378c81d9a156571fa1fd2f688d0d04ef27..66fd80cd94099869489ef2fded10b04a2977da35)] - [stack/tw_proper_addressing](#4567) [[Files changed](https://github.com/delta-io/delta/pull/4567/files/66fd80cd94099869489ef2fded10b04a2977da35..9adb29923a50f91785ac1d2527ac8fa05d52d1e2)] - [stack/tw_type_changes](#4568) [[Files changed](https://github.com/delta-io/delta/pull/4568/files/9adb29923a50f91785ac1d2527ac8fa05d52d1e2..ddff25fa0456d1a4022aea7749b47833b315bdaa)] - [stack/tw_type_changes_to_field_metadata](#4588) [[Files changed](https://github.com/delta-io/delta/pull/4588/files/ddff25fa0456d1a4022aea7749b47833b315bdaa..ff43b8e6af15ee95ced406fbd39d13d104fc7a66)] - [stack/tw_type_changes_to_field_metadata_impl](#4589) [[Files changed](https://github.com/delta-io/delta/pull/4589/files/ff43b8e6af15ee95ced406fbd39d13d104fc7a66..3642c6851822446c5be59dc0a0104a09ccbb551d)] - [stack/tw_serde_refactor](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files/3642c6851822446c5be59dc0a0104a09ccbb551d..b819ff738c45f8899f57408107c4fc41de045e25)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/b819ff738c45f8899f57408107c4fc41de045e25..538ec9c5bf6afbf3473715bf94eb324b237138da)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/538ec9c5bf6afbf3473715bf94eb324b237138da..a6fcb20a3c4caf8e74dc0ca254beadc766fd0083)] --------- #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description Adds a utility class for iterating over Schemas and modifying them. This will be used in follow-ups to help with type widing, but can also be used to avoid recursion code which has been duplicated in many places. ## How was this patch tested? Unit tests. ## Does this PR introduce _any_ user-facing changes? No. Internal helper only.
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 2 times, most recently from 0a5fdcb to 985006f Compare May 28, 2025 05:23
vkorukanti pushed a commit that referenced this pull request May 28, 2025
…nstead of map (#4561) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4561/files) to review incremental changes. - [**stack/tw_cleanup_signature**](#4561) [[Files changed](https://github.com/delta-io/delta/pull/4561/files)] - [stack/tw_make_update_real](#4566) [[Files changed](https://github.com/delta-io/delta/pull/4566/files/dc9493f8ecf74cd647c1056ca018017a96b36ee3..eb36c872f0beecf592a30269009b6acf06490bcb)] - [stack/tw_proper_addressing](#4567) [[Files changed](https://github.com/delta-io/delta/pull/4567/files/eb36c872f0beecf592a30269009b6acf06490bcb..7a825be9356dd2cdae2add3d76e890b0f09a403f)] - [stack/tw_type_changes](#4568) [[Files changed](https://github.com/delta-io/delta/pull/4568/files/7a825be9356dd2cdae2add3d76e890b0f09a403f..b67836ee303012f759b8be1be64b8a59477597ad)] - [stack/tw_type_changes_to_field_metadata](#4588) [[Files changed](https://github.com/delta-io/delta/pull/4588/files/b67836ee303012f759b8be1be64b8a59477597ad..952fb08f703c880cbd652fe23fb147e28b37a040)] - [stack/tw_type_changes_to_field_metadata_impl](#4589) [[Files changed](https://github.com/delta-io/delta/pull/4589/files/952fb08f703c880cbd652fe23fb147e28b37a040..a988cd6201ca97c39ef0d7f27700a67922c929d7)] - [stack/tw_serde_refactor](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files/a988cd6201ca97c39ef0d7f27700a67922c929d7..4c587b511087a7852b6eb9098ca329595106d372)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/4c587b511087a7852b6eb9098ca329595106d372..985006f94d99909ea1d5aba9cc26b24f823a8250)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/985006f94d99909ea1d5aba9cc26b24f823a8250..03747ccc7e85a5bf7fff2b62c446a7cb7eb615ac)] --------- #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This changes the signature of computeSchemaChangesById to take schemas instead of Maps. This makes it possible to change the implementation of from not requiring the specific maps in question, and provides better testing by passing in real schemas instead of transformed values. ## How was this patch tested? Updated unit tests. ## Does this PR introduce _any_ user-facing changes? No
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch from 985006f to fc3da02 Compare May 28, 2025 07:17
vkorukanti pushed a commit that referenced this pull request May 28, 2025
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4566/files) to review incremental changes. - [**stack/tw_make_update_real**](#4566) [[Files changed](https://github.com/delta-io/delta/pull/4566/files)] - [stack/tw_proper_addressing](#4567) [[Files changed](https://github.com/delta-io/delta/pull/4567/files/10c63b52905094fab67fffdbed2c89689b56dc10..a6c5a4b29316b9fe7ac04bf15200ce5f647fafb8)] - [stack/tw_type_changes](#4568) [[Files changed](https://github.com/delta-io/delta/pull/4568/files/a6c5a4b29316b9fe7ac04bf15200ce5f647fafb8..f7f67a943ab6e196a5ae70fa50959d0ab1b25654)] - [stack/tw_type_changes_to_field_metadata](#4588) [[Files changed](https://github.com/delta-io/delta/pull/4588/files/f7f67a943ab6e196a5ae70fa50959d0ab1b25654..6046391b6586de29a907bd2150f6d799ecfc236f)] - [stack/tw_type_changes_to_field_metadata_impl](#4589) [[Files changed](https://github.com/delta-io/delta/pull/4589/files/6046391b6586de29a907bd2150f6d799ecfc236f..2e41a183d4ff76203d503fe5857aa77a807e3316)] - [stack/tw_serde_refactor](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files/2e41a183d4ff76203d503fe5857aa77a807e3316..6d1a8a3cd96a14393e7be9342fdf1e5ad2aa6d7d)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/6d1a8a3cd96a14393e7be9342fdf1e5ad2aa6d7d..fc3da02a407bc1d08cd97928e21fda01565fc85e)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/fc3da02a407bc1d08cd97928e21fda01565fc85e..943921b9b1d8ec4f648461d6e617cc8d62083f67)] --------- #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This pull request adds a specific class instead of a Tuple for tracking Schema updates. ## How was this patch tested? Existing unit tests ## Does this PR introduce _any_ user-facing changes? No, this is private API
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 5 times, most recently from c28d9da to 1da4e2f Compare May 29, 2025 22:01
vkorukanti pushed a commit that referenced this pull request May 30, 2025
…#4567) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4567/files) to review incremental changes. - [**stack/tw_proper_addressing**](#4567) [[Files changed](https://github.com/delta-io/delta/pull/4567/files)] - [stack/tw_type_changes](#4568) [[Files changed](https://github.com/delta-io/delta/pull/4568/files/7a07781a45a757873a5dbf519f84c3e25f3af4e3..2c2f71a4a6b775abb4da3b141058ed20196cd099)] - [stack/tw_type_changes_to_field_metadata](#4588) [[Files changed](https://github.com/delta-io/delta/pull/4588/files/2c2f71a4a6b775abb4da3b141058ed20196cd099..d8f0abb884b33b12333f1b53b6d1aed124d3277a)] - [stack/tw_type_changes_to_field_metadata_impl](#4589) [[Files changed](https://github.com/delta-io/delta/pull/4589/files/d8f0abb884b33b12333f1b53b6d1aed124d3277a..1426562ddfbcb0954b481ecf8432785b84059e84)] - [stack/tw_serde_refactor](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files/1426562ddfbcb0954b481ecf8432785b84059e84..9a8462ad2abd1eb920b021fc5de9497223e844a4)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/9a8462ad2abd1eb920b021fc5de9497223e844a4..1da4e2fe660ebd2d4d9eb9aa3702a9ec6d4ed837)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/1da4e2fe660ebd2d4d9eb9aa3702a9ec6d4ed837..6148941b110f389616819cf3e64ec10243f32776)] --------- #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This PR uses the new SchemaIterable class introduced in a prior PR to calculate schema changes. It takes the opportunity to eliminate recursive methods by doing change comparison at every element in a Schema tree (by creating a custom ID object and finding changes at ever level first). This trades off some memory for, IMO, cleaner code. In theory in a future iteration we can also avoid the implicit recursion on `DataType.equivelant` and `StructField.equal` but this can be done as a follow-up. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No.
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 4 times, most recently from bec387c to 582ff88 Compare June 3, 2025 17:12
vkorukanti pushed a commit that referenced this pull request Jun 3, 2025
…4568) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4568/files) to review incremental changes. - [**stack/tw_type_changes**](#4568) [[Files changed](https://github.com/delta-io/delta/pull/4568/files)] - [stack/tw_type_changes_to_field_metadata](#4588) [[Files changed](https://github.com/delta-io/delta/pull/4588/files/3cc7f9fce24cf8d1c3cd13fff9cb9ac8b2aa3f61..cd9d0a002af5ae669ff0f5951053fc9f0f696509)] - [stack/tw_type_changes_to_field_metadata_impl](#4589) [[Files changed](https://github.com/delta-io/delta/pull/4589/files/cd9d0a002af5ae669ff0f5951053fc9f0f696509..61dd9cf9cc4620c2f2123acae907eb3df1b56326)] - [stack/tw_serde_refactor](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files/61dd9cf9cc4620c2f2123acae907eb3df1b56326..071397255cb5b4ef6e1c4521169de5fac4ab3397)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/071397255cb5b4ef6e1c4521169de5fac4ab3397..582ff88d12131a235e9898df84ce8d120011618f)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/582ff88d12131a235e9898df84ce8d120011618f..e321508cfa79786493b3500cab62f36f4581d6fa)] --------- #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This PR updates schema validation to take into account type widening rules. It does the following: 1. Add switch logic to appropriately validate type changes and not throw if they type widening is enabled and the change is allowed under type widening (also accounting for iceberg ability. 2. Updates the new schema by copying over existing type widening values if none are present on the new schema, and adds in the type changes detected as changes. ## How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to test the changes thoroughly including negative and positive cases if possible. If the changes were tested in any way other than unit tests, please clarify how you tested step by step (ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future). If the changes were not tested, please explain why. --> ## Does this PR introduce _any_ user-facing changes? <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. -->
vkorukanti pushed a commit that referenced this pull request Jun 4, 2025
…rable and refactor collation fetching (#4588) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4588/files/3cc7f9fce24cf8d1c3cd13fff9cb9ac8b2aa3f61..cd9d0a002af5ae669ff0f5951053fc9f0f696509) to review incremental changes. - [stack/tw_type_changes](#4568) [[Files changed](https://github.com/delta-io/delta/pull/4568/files)] - [**stack/tw_type_changes_to_field_metadata**](#4588) [[Files changed](https://github.com/delta-io/delta/pull/4588/files/3cc7f9fce24cf8d1c3cd13fff9cb9ac8b2aa3f61..cd9d0a002af5ae669ff0f5951053fc9f0f696509)] - [stack/tw_type_changes_to_field_metadata_impl](#4589) [[Files changed](https://github.com/delta-io/delta/pull/4589/files/cd9d0a002af5ae669ff0f5951053fc9f0f696509..61dd9cf9cc4620c2f2123acae907eb3df1b56326)] - [stack/tw_serde_refactor](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files/61dd9cf9cc4620c2f2123acae907eb3df1b56326..071397255cb5b4ef6e1c4521169de5fac4ab3397)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/071397255cb5b4ef6e1c4521169de5fac4ab3397..582ff88d12131a235e9898df84ce8d120011618f)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/582ff88d12131a235e9898df84ce8d120011618f..e321508cfa79786493b3500cab62f36f4581d6fa)] --------- #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This PR adds the ability for SchemaIterable to avoid recursion on specific nested types by passing in the classes to skip. It uses this updated API to refactor collecting collations using the iterable, this is in preparation for doing the same logic with TypeChanges in the follow-up PR. ## How was this patch tested? Unit tests (both new and existing). ## Does this PR introduce _any_ user-facing changes? No
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch 2 times, most recently from 320751c to cd75097 Compare June 4, 2025 20:20
vkorukanti pushed a commit that referenced this pull request Jun 4, 2025
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4589/files) to review incremental changes. - [**stack/tw_type_changes_to_field_metadata_impl**](#4589) [[Files changed](https://github.com/delta-io/delta/pull/4589/files)] - [stack/tw_serde_refactor](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files/ff20437822a7ab56ef64e739a846de4edb252e89..0f4aad0f937fad59f017c3c150cb721af4bfa566)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/0f4aad0f937fad59f017c3c150cb721af4bfa566..cd75097216be5bf7070e5bd46b7e0ec67f0d1ff7)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/cd75097216be5bf7070e5bd46b7e0ec67f0d1ff7..a0aa5e0af91c89c646de81b944187ef36d78b03c)] --------- #### Which Delta project/connector is this regarding? - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This PR adds logic to collect type changes from leaf fields and appropriately populate FieldMetadata on the nearest StructField, following the pattern already established by collations. Because this is on the StructField it means extra work is performed for nested maps/elements, fixing this is outside the scope of the PR> ## How was this patch tested? New unit tests. ## Does this PR introduce _any_ user-facing changes? <!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Delta Lake versions or within the unreleased branches such as master. If no, write 'No'. -->
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch from cd75097 to 48e303c Compare June 5, 2025 04:49
return structField;
}

private static boolean isNested(DataType dataType) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we should move this under DataType class

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has been sprinkled in a few places, I'll do a cleanup to consolidate in a separate PR to keep things focused.

vkorukanti pushed a commit that referenced this pull request Jun 5, 2025
…intertwined with type parsing (#4592) ## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta/pull/4592/files) to review incremental changes. - [**stack/tw_serde_refactor**](#4592) [[Files changed](https://github.com/delta-io/delta/pull/4592/files)] - [stack/add_type_change_parsing](#4593) [[Files changed](https://github.com/delta-io/delta/pull/4593/files/6e5bbbeb27f0b6d0418c22e0e0b39907cbd2cc4e..48e303c7921df89414e75f9b45d78defcd17847a)] - [stack/tw_upgrade_downgrade](#4603) [[Files changed](https://github.com/delta-io/delta/pull/4603/files/48e303c7921df89414e75f9b45d78defcd17847a..2d6fffbe062fb6aab01d4ce59d0736caf8c13a87)] --------- <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description This refactors the population of collation on StringTypes to be separate from parsing. This simplifies flows/arguments or parsing and consolidates the logic in a single place using StreamIterable. ## How was this patch tested? Existing tests. ## Does this PR introduce _any_ user-facing changes? No.
@emkornfield emkornfield force-pushed the stack/add_type_change_parsing branch from 48e303c to 76e99c5 Compare June 5, 2025 22:15
@vkorukanti vkorukanti merged commit aa6bc12 into delta-io:master Jun 5, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants