nssalian/parx

PARX

Early Development: This project is in active development. Format and APIs may change.

Persistent metadata caching for Parquet files.

What It Does

PARX caches Parquet metadata in sidecar files (.parx) to eliminate repeated metadata fetches.

The problem: Parquet stores metadata at the end of files. Reading it requires 3 requests: HEAD for file size, GET_RANGE for footer length, GET_RANGE for footer. When 10 workers read the same files, that's 30 requests per file.

The solution: Cache metadata once in a .parx sidecar. All workers fetch the sidecar (1 request) instead of reading the footer (3 requests).

    file.parquet       (2.7 MB)
    file.parquet.parx  (282 KB)
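To make the three-request path concrete, here is a minimal sketch of how a reader locates the footer without a sidecar. It relies on the standard Parquet tail layout (footer bytes, then a 4-byte little-endian footer length, then the `PAR1` magic). The function names `footer_len_from_tail` and `footer_range` are illustrative only and are not part of PARX or any Parquet library.

```rust
/// A Parquet file ends with: [footer][4-byte little-endian footer length]["PAR1"].
const PARQUET_MAGIC: &[u8] = b"PAR1";
const TAIL_LEN: usize = 8;

/// Request 2: decode the footer length from the last 8 bytes of the file.
/// Returns None when the trailing magic is missing.
fn footer_len_from_tail(tail: &[u8; TAIL_LEN]) -> Option<u32> {
    if tail[4..] != PARQUET_MAGIC[..] {
        return None;
    }
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

/// Request 3: combine the file size (from request 1, the HEAD) with the
/// footer length to get the byte range that holds the footer.
fn footer_range(file_size: u64, footer_len: u32) -> Option<std::ops::Range<u64>> {
    let end = file_size.checked_sub(TAIL_LEN as u64)?;
    let start = end.checked_sub(u64::from(footer_len))?;
    Some(start..end)
}
```

Each step depends on the result of the previous one, so the three requests cannot be issued in parallel. A sidecar whose name is known up front replaces this chain with a single fetch.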

Format

Single-file format:

    ┌──────────────────────────────────────────┐
    │ Header (16 bytes)                        │
    │   - Magic: "PARX"                        │
    │   - Version, Flags                       │
    ├──────────────────────────────────────────┤
    │ Footer Payload (variable, raw/compressed)│
    │   - Raw Parquet footer bytes             │
    ├──────────────────────────────────────────┤
    │ Page Index Payload (optional)            │
    │   - ColumnIndex + OffsetIndex            │
    ├──────────────────────────────────────────┤
    │ Manifest (Protobuf)                      │
    │   - Offsets, lengths, checksums          │
    │   - Source file size                     │
    ├──────────────────────────────────────────┤
    │ Trailer (12 bytes)                       │
    │   - Manifest length, CRC32C              │
    │   - Magic: "PARX"                        │
    └──────────────────────────────────────────┘

Bundle format (for directories):

    ┌──────────────────────────────────────────┐
    │ Bundle Header (24 bytes)                 │
    │   - Magic: "PRXB"                        │
    │   - Version, Flags                       │
    │   - Entry count                          │
    ├──────────────────────────────────────────┤
    │ Entry 0: Footer (+ optional page indexes)│
    ├──────────────────────────────────────────┤
    │ Entry 1: Footer (+ optional page indexes)│
    ├──────────────────────────────────────────┤
    │ ... (N entries)                          │
    ├──────────────────────────────────────────┤
    │ Bundle Manifest (Protobuf)               │
    │   - Path→Entry mapping                   │
    ├──────────────────────────────────────────┤
    │ Trailer (12 bytes)                       │
    │   - Manifest length, CRC32C              │
    │   - Magic: "PRXB"                        │
    └──────────────────────────────────────────┘

See FORMAT_SPEC.md for detailed byte-level layout. Bundle entries can optionally include page-index payloads using policy-controlled caps.
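As a rough illustration of the two layouts, the sketch below tells a single-file sidecar from a bundle by checking the magic at both ends of a buffer. The magic strings and the header and trailer sizes come from the diagrams above. The exact byte position of the magic (first 4 bytes of the header, last 4 bytes of the trailer) is an assumption made for this example, and `ParxKind` and `sniff_kind` are hypothetical names. FORMAT_SPEC.md is the authority for the real layout.

```rust
// Values taken from the layout diagrams in this README.
const SINGLE_MAGIC: &[u8] = b"PARX";
const BUNDLE_MAGIC: &[u8] = b"PRXB";
const SINGLE_HEADER_LEN: usize = 16;
const BUNDLE_HEADER_LEN: usize = 24;
const TRAILER_LEN: usize = 12;

#[derive(Debug, PartialEq, Eq)]
enum ParxKind {
    Single,
    Bundle,
}

/// ASSUMPTION: the magic occupies the first 4 bytes of the header and the
/// last 4 bytes of the trailer. This only sniffs the container type; it does
/// not validate the version, flags, manifest, or CRC32C.
fn sniff_kind(data: &[u8]) -> Option<ParxKind> {
    let (kind, magic, min_len) = if data.starts_with(SINGLE_MAGIC) {
        (ParxKind::Single, SINGLE_MAGIC, SINGLE_HEADER_LEN + TRAILER_LEN)
    } else if data.starts_with(BUNDLE_MAGIC) {
        (ParxKind::Bundle, BUNDLE_MAGIC, BUNDLE_HEADER_LEN + TRAILER_LEN)
    } else {
        return None;
    };
    // A valid file must hold at least a header and a trailer, and the
    // trailing magic must match the leading one.
    if data.len() >= min_len && data.ends_with(magic) {
        Some(kind)
    } else {
        None
    }
}
```

Repeating the magic in the trailer lets a reader that fetched only the tail of the file confirm the container type before parsing the manifest.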

Building

    # Core library
    cd implementations/rust/parx
    cargo build --release

    # CLI tool (install to ~/.cargo/bin)
    cd implementations/rust/parx-cli
    cargo install --path . --locked

    # Benchmarks
    cd benchmarks/parx_benchmarks
    make all

If parx is not found after install, ensure ~/.cargo/bin is on your PATH.

CLI Usage

    # Build sidecar for single file
    parx build file.parquet

    # Verify sidecar
    parx verify file.parquet.parx

    # Inspect contents
    parx inspect file.parquet.parx

    # Bundle directory
    parx bundle build /data/events/
    # Creates: /data/events/_parx_bundle.parx

    # Bundle directory with page indexes (optional, capped)
    parx bundle build /data/events/ \
      --include-page-indexes \
      --max-page-index-bytes-per-file 262144 \
      --max-total-page-index-bytes 16777216

    # Extract bundle
    parx bundle extract /data/events/_parx_bundle.parx --output /output/

For a full local CLI validation walkthrough (install + generated fixtures + end-to-end commands), see: docs/CLI_SMOKE_TEST.md
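The commands above imply a naming convention: a sidecar sits next to its Parquet file with `.parx` appended, and a bundle is written into the directory as `_parx_bundle.parx`. A reader can therefore derive the sidecar location without any lookup. The helpers below are a sketch of that convention using only the standard library. `sidecar_path` and `bundle_path` are hypothetical names, not functions exported by the PARX crate.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Sidecar for a single file: the full file name with ".parx" appended,
/// e.g. file.parquet -> file.parquet.parx (the original extension is kept).
fn sidecar_path(parquet: &Path) -> PathBuf {
    let mut name: OsString = parquet.as_os_str().to_os_string();
    name.push(".parx");
    PathBuf::from(name)
}

/// Bundle for a directory: a fixed file name inside that directory,
/// e.g. /data/events/ -> /data/events/_parx_bundle.parx
fn bundle_path(dir: &Path) -> PathBuf {
    dir.join("_parx_bundle.parx")
}
```

Appending to the whole name rather than calling `Path::with_extension` matters here, because `with_extension` would replace `.parquet` instead of adding a second suffix.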

Library Usage

    use parx_rs::{ParxReader, ParxWriter};

    // Write: build from Parquet file directly
    let mut writer = ParxWriter::from_parquet_file("file.parquet")?;
    let parx_bytes = writer.finish();
    std::fs::write("file.parquet.parx", parx_bytes)?;

    // Read: load cached footer from .parx sidecar
    let parx_data = std::fs::read("file.parquet.parx")?;
    let reader = ParxReader::open(&parx_data)?;
    let footer = reader.footer_bytes(); // Raw Parquet footer, ready to use

Benchmarks

Local tests with 4 schema types (simple, medium, wide, nested):

Arrow async vs PARX:

  • Requests: 3.0 → 1.0 per file (66.7% reduction)
  • Latency: ~428 µs → ~208 µs (~2x faster)
  • Bytes: ~25 KB → ~26 KB (~2% overhead)

Note: this benchmark measures the metadata read path with prebuilt .parx sidecars; the one-time cost of creating the sidecars is excluded.
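The headline figures follow from simple arithmetic on the numbers reported above. The sketch below reproduces the request reduction and the latency speedup, plus the 10-worker request count from the problem statement. The helper names are made up for this example.

```rust
/// Total metadata requests when every worker reads every file independently.
fn total_requests(workers: u64, files: u64, requests_per_file: u64) -> u64 {
    workers * files * requests_per_file
}

/// Relative reduction from `before` to `after`, in percent.
fn reduction_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// How many times faster `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

With the values above, `reduction_percent(3.0, 1.0)` gives the 66.7% request reduction and `speedup(428.0, 208.0)` gives the roughly 2x latency improvement. `total_requests(10, 1, 3)` and `total_requests(10, 1, 1)` give 30 and 10 requests for one file read by 10 workers.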

Run benchmarks:

    cd benchmarks/parx_benchmarks
    make arrow-vs-parx  # Arrow vs PARX comparison
    make prefetch       # Prefetch hint testing

When to Use

Use it:

  • Cloud storage (S3/GCS/Azure)
  • Multiple processes reading same files
  • Immutable or versioned files
  • Parquet V2 with page indexes (including bundle mode with policy caps)

Skip it:

  • Single-process work (in-memory cache is fine)
  • Local SSD (minimal benefit)
  • Delta/Iceberg/Hudi tables (built-in metadata)
  • Frequently updated files
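The advice about immutable versus frequently updated files comes down to staleness: a sidecar is only useful while it still describes the file it was built from. Since the manifest records the source file size, one possible guard is to compare that recorded size with the current size before trusting the cached footer. This is a sketch of the idea for local files only. `sidecar_matches_source` and its `recorded_size` parameter are hypothetical, and this README does not say whether or how the library performs such a check.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns Ok(true) when the Parquet file on disk still has the size that
/// was recorded when the sidecar was built.
///
/// `recorded_size` is a HYPOTHETICAL input standing in for the
/// "Source file size" field of the manifest; how to read it from a real
/// sidecar is not covered here.
fn sidecar_matches_source(parquet: &Path, recorded_size: u64) -> io::Result<bool> {
    Ok(fs::metadata(parquet)?.len() == recorded_size)
}
```

A size match is a weak signal, since a rewrite that keeps the same length goes undetected. That weakness is one reason immutable or versioned files are the better fit.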

Project Structure

    parx/
    ├── implementations/rust/
    │   ├── parx/              # Core library
    │   └── parx-cli/          # CLI tool
    ├── benchmarks/
    │   └── parx_benchmarks/   # Performance tests
    ├── spec/
    │   └── proto/             # Protobuf schema
    └── FORMAT_SPEC.md         # Format specification

Testing

    # Unit tests
    cargo test

    # Integration tests
    cd implementations/rust/parx-cli
    cargo test --test cli_integration

License

Apache 2.0
