
WIP: async stream of Arrow record batches from Parquet file #258

Closed
kylebarron wants to merge 2 commits into main from kyle/async-parquet-stream


@kylebarron commented Nov 13, 2024

The stream takes just 1.1s to start, and fetching the first record batch takes another 1.0s. By comparison, downloading the full file takes more than 60s on my internet connection.

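    # Assumes HTTPStore and read_parquet_async are already imported (the import
    # lines are not shown here) and that top-level await works, e.g. in a notebook.
    from time import time

    t0 = time()
    url = "https://overturemaps-us-west-2.s3.amazonaws.com/release/2024-03-12-alpha.0/theme=buildings/type=building/part-00217-4dfc75cd-2680-4d52-b5e0-f4cc9f36b267-c000.zstd.parquet"
    store = HTTPStore.from_url(url)
    # The URL already points at the file, so the path within the store is empty.
    stream = await read_parquet_async("", store=store)
    t1 = time()
    first = await stream.__anext__()  # fetch only the first record batch
    t2 = time()
    print(t1 - t0)  # 1.1302871704101562
    print(t2 - t1)  # 1.0420188903808594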
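To illustrate why the first batch can arrive long before the whole file would, here is a minimal, self-contained sketch of the same timing pattern. It does not use the code from this PR: `fake_batch_stream` and `measure` are hypothetical stand-ins, and `asyncio.sleep` only simulates the per-batch network fetch.

```python
import asyncio
from time import perf_counter


async def fake_batch_stream(num_batches, delay):
    """Hypothetical stand-in for a record batch stream: one item per simulated fetch."""
    for i in range(num_batches):
        await asyncio.sleep(delay)  # simulate fetching one batch over the network
        yield i


async def measure(num_batches=5, delay=0.01):
    stream = fake_batch_stream(num_batches, delay)
    t0 = perf_counter()
    first = await stream.__anext__()  # only the first batch has been fetched so far
    t_first = perf_counter() - t0
    rest = [batch async for batch in stream]  # drain the remaining batches
    t_total = perf_counter() - t0
    return first, rest, t_first, t_total


first, rest, t_first, t_total = asyncio.run(measure())
print(f"first batch after {t_first:.3f}s, all batches after {t_total:.3f}s")
```

Because the consumer awaits one batch at a time, it can start processing as soon as the first fetch completes instead of waiting for every batch.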
@kylebarron enabled auto-merge (squash) November 13, 2024 22:28
@kylebarron disabled auto-merge November 13, 2024 22:28
@kylebarron marked this pull request as draft November 13, 2024 22:28
@kylebarron

Superseded by #313.

@kylebarron closed this Mar 25, 2025