- Notifications
You must be signed in to change notification settings - Fork 586
RRD footers 2: RRD manifests #12047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RRD footers 2: RRD manifests #12047
Conversation
| Web viewer failed to build. | Result | Commit | Link | Manifest | View image diff on kitdiff. Note: This comment is updated whenever you push a commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces RRD manifest support as the second part of the RRD footers series. The manifest provides an index of chunks in an RRD recording, enabling efficient query planning without loading actual data. Key changes include:
- Extending
RrdFooterand addingRrdManifestprotobuf definitions - Implementing
RrdManifestandRrdManifestBuilderto catalog chunks with their metadata (ID, size, offset, timeline ranges) - Adding transport layer conversions between application-level and protobuf types
- Refactoring schema hash computation to a shared utility method
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
crates/store/re_protos/proto/rerun/v1alpha1/log_msg.proto | Added RrdManifest message definition with schema, store ID, and dataframe fields |
crates/store/re_protos/src/v1alpha1/rerun.log_msg.v1alpha1.rs | Generated Rust code for the new protobuf definitions |
crates/store/re_log_encoding/src/rrd/footer/instances.rs | Implemented RrdManifest with schema validation, column accessors, and sanity checks |
crates/store/re_log_encoding/src/rrd/footer/builders.rs | Added RrdManifestBuilder for constructing manifests from chunks |
crates/store/re_log_encoding/src/transport_to_app.rs | Implemented bidirectional conversions between transport and application types |
crates/store/re_log_encoding/src/rrd/encoder.rs | Updated encoder to initialize empty manifests map |
crates/store/re_server/src/store/layer.rs | Refactored to use shared schema hash computation utility |
crates/store/re_server/Cargo.toml | Removed unused sha2 dependency |
crates/store/re_log_encoding/Cargo.toml | Added required dependencies (re_arrow_util, re_types_core, sha2) |
crates/store/re_log_encoding/src/rrd/footer/mod.rs | Updated exports to include new manifest types |
crates/store/re_log_encoding/src/rrd/mod.rs | Exposed RrdManifest and RrdManifestBuilder publicly |
Comments suppressed due to low confidence (1)
crates/store/re_log_encoding/src/rrd/footer/instances.rs:1
- The
has_static_datashould befalsefor temporal component-level data, nottrue. This contradicts line 147 where temporal chunk-level columns correctly set it tofalse, and the comment indicates this is temporal data.
use std::collections::HashMap; 💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
964c086 to c1e6776 Compare There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very dumbed down version of the implementation we've been using since forever on the PM side. I haven't really changed anything besides removing a lot of columns and adding docs.
| /// can generally be found (Lance, external dataframe libraries, etc). | ||
| /// | ||
| /// If caller doesn't provide any part (i.e. all are `None`), an empty string is returned. | ||
| pub fn compute_column_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a copypasta of the column naming routine that we've been using for ages on the Platform side, so it has been very well battle tested.
| } | ||
| } | ||
| | ||
| // Sanity checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RRD manifests are effectively user input so we need to be at least a little on the defensive so we don't register complete non-sense. All the checks marked cheap basically only look at schema concerns, never the data itself.
There are some extra paranoid layers on the dataplatform side.
zehiko left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know manifest builder and I know it works and this looks as you said, just a slightly simplified version of it, hence - ship it!
LLM summary: > This PR introduces the foundational framing infrastructure for RRD stream footers, which will enable richer metadata and improved stream navigation. It establishes the protocol-level types and state machine changes needed to support footers without yet populating them with content. > > Key changes include: > - Addition of `RrdFooter` message types at both transport (protobuf) and application levels > - Introduction of `StreamFooter` frame for locating and validating RRD footers > - Enhanced decoder state machine to handle the new footer frames > - Updated encoder to emit footers with basic state tracking infrastructure To which I would add the following: This introduces all the framing infrastructure so that RRD streams are now capable of carrying footers in all circumstances (including pipes). Specifically, this introduces a couple new types: * `RrdFooter`, a(n empty) protobuf message that will from now on act as the payload for messages of type `MessageKind::End`. * `MessageKind::End` isn't new, it's something that was always there, but until now was always empty. * As always, this has both a cheap transport-level definition and a less cheap app-level definition. * `StreamFooter`, a simple binary frame that mirrors `StreamHeader`, and whose main job is to keep track of where the `RrdFooter`. * This is used for O(1) access to the `RrdFooter`, e.g. when registering data on a Redap-compliant server. For now, these footers are always an empty protobuf messages. We will be filling them up in the following PRs. --- Part of RRD footers series of PRs: * #12044 * #12047 * #12048 * rerun-io/dataplatform#2060 ## TODO * [x] `@rerun-bot full-check` --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
eae5d66 to fd1e2ef Compare LLM summary: > This PR implements encoding and decoding of RRD manifests in footers for the Rerun RRD file format. The changes enable random access to chunks within RRD files by storing metadata about chunk locations and properties in footer manifests. > > Key changes: > - Adds manifest building during encoding, tracking chunk metadata (offsets, sizes, entity paths, etc.) > - Implements manifest parsing during decoding with transport-to-application conversion > - Adds CLI support for displaying parsed footers (`--footers` flag) and recomputing manifests during routing (`--recompute-manifests` flag) To which I actually don't have all that much to add. This PR is basically all the remaining glue so that, whenever one uses our `Encoder` or one of our `Decoder` variants, RRD footers and manifests will automagically be computed, injected and serialized/deserialized. The most important part of this PR is arguably the addition of a `footer_roundtrip` test, that encodes a recording and then manually decodes all of its chunks directly using the generated RRD manifest, instead of using a `Decoder`. --- Part of RRD footers series of PRs: * #12044 * #12047 * #12048 * rerun-io/dataplatform#2060
LLM summary:
To which I'll add:
This PR introduces the core datastructure that this whole footer business is all about: the
RrdManifest.The
RrdManifestis a protobuf message (which, as always, comes with both a transport-level and application-level implementations) which carries various important metadata about the associated recording.The most important piece of metadata is the actual manifest record-batch, which lists all the chunks in the recording at at fine-grained enough level of detail that relevancy queries become possible without ever having to load the data in a local chunk-store first.
Effectively, you can view record-batch is a dataframe representation of the time panel.
There is a lot of code, but it's all fairly trivial and mechanical, bordering on boilerplate in some cases.
In fact, what you should really focus on is the fact that this code:
rerun/crates/store/re_log_encoding/tests/footers_and_manifests.rs
Lines 116 to 173 in ad754fa
yields this manifest record-batch:
with this schema:
rerun/crates/store/re_log_encoding/tests/snapshots/footers_and_manifests__simple_manifest_batch_schema.snap
Lines 5 to 117 in ad754fa
Part of RRD footers series of PRs: