PARX

Early Development: This project is in active development. Format and APIs may change.

Persistent metadata caching for Parquet files.

What It Does

PARX caches Parquet metadata in sidecar files (.parx) to eliminate repeated metadata fetches.

The problem: Parquet stores its metadata in a footer at the end of the file. On object storage, reading it takes 3 requests: HEAD for the file size, GET_RANGE for the footer length, and GET_RANGE for the footer itself. When 10 workers read the same files, that is 30 requests per file.

The solution: Cache metadata once in a .parx sidecar. All workers fetch the sidecar (1 request) instead of reading the footer (3 requests).
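To make the three-request cost concrete, here is a minimal sketch of how a reader locates the footer. It relies on the standard Parquet layout, where a file ends with a 4-byte little-endian footer length followed by the magic `PAR1`. The function names `footer_range` and `metadata_requests` are illustrative and not part of PARX.

```rust
use std::ops::Range;

/// Given the total file size (from HEAD) and the last 8 bytes of the file
/// (from the first ranged GET), return the byte range of the footer that
/// the second ranged GET has to fetch.
fn footer_range(file_size: u64, tail: [u8; 8]) -> Option<Range<u64>> {
    // The final 4 bytes must be the Parquet magic.
    if &tail[4..] != b"PAR1" {
        return None;
    }
    // The 4 bytes before the magic hold the footer length, little-endian.
    let footer_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
    let end = file_size.checked_sub(8)?;
    let start = end.checked_sub(footer_len)?;
    Some(start..end)
}

/// Metadata requests issued when `workers` readers each open `files` files.
fn metadata_requests(workers: u64, files: u64, requests_per_file: u64) -> u64 {
    workers * files * requests_per_file
}
```

With 10 workers and one file, `metadata_requests(10, 1, 3)` gives the 30 requests of the footer path, while `metadata_requests(10, 1, 1)` gives 10 for the sidecar path.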

    file.parquet       (2.7 MB)
    file.parquet.parx  (282 KB)

Format

Single-file format:

    ┌──────────────────────────────────────────┐
    │ Header (16 bytes)                        │
    │   - Magic: "PARX"                        │
    │   - Version, Flags                       │
    ├──────────────────────────────────────────┤
    │ Footer Payload (variable, raw/compressed)│
    │   - Raw Parquet footer bytes             │
    ├──────────────────────────────────────────┤
    │ Page Index Payload (optional)            │
    │   - ColumnIndex + OffsetIndex            │
    ├──────────────────────────────────────────┤
    │ Manifest (Protobuf)                      │
    │   - Offsets, lengths, checksums          │
    │   - Source file size                     │
    ├──────────────────────────────────────────┤
    │ Trailer (12 bytes)                       │
    │   - Manifest length, CRC32C              │
    │   - Magic: "PARX"                        │
    └──────────────────────────────────────────┘
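As a rough illustration of the single-file layout above, the sketch below checks the magic at both ends of a buffer and reads the trailer. The exact field widths and byte order are assumptions made for this example (magic in the first 4 header bytes, and a trailer of 4-byte manifest length, 4-byte CRC32C, 4-byte magic, integers little-endian); FORMAT_SPEC.md is the authority, and `read_trailer` is a hypothetical helper, not the library's API.

```rust
const HEADER_LEN: usize = 16;
const TRAILER_LEN: usize = 12;
const MAGIC: &[u8; 4] = b"PARX";

/// Trailer fields as assumed in this sketch (see the lead-in).
#[derive(Debug, PartialEq)]
struct Trailer {
    manifest_len: u32,
    crc32c: u32,
}

/// Validate the outer envelope of a .parx buffer and return its trailer.
fn read_trailer(bytes: &[u8]) -> Result<Trailer, &'static str> {
    if bytes.len() < HEADER_LEN + TRAILER_LEN {
        return Err("too short to be a PARX file");
    }
    // Assumption: the header starts with the magic.
    if &bytes[..4] != MAGIC {
        return Err("missing header magic");
    }
    let t = &bytes[bytes.len() - TRAILER_LEN..];
    // Assumption: the trailer ends with the magic.
    if &t[8..] != MAGIC {
        return Err("missing trailer magic");
    }
    let manifest_len = u32::from_le_bytes([t[0], t[1], t[2], t[3]]);
    let crc32c = u32::from_le_bytes([t[4], t[5], t[6], t[7]]);
    // The manifest has to fit between the header and the trailer.
    if manifest_len as usize > bytes.len() - HEADER_LEN - TRAILER_LEN {
        return Err("manifest length exceeds file size");
    }
    Ok(Trailer { manifest_len, crc32c })
}
```

Reading the trailer first means a reader can find the manifest, and through it the payload offsets, without scanning the file. Verifying the CRC32C itself is left out here because it needs a checksum implementation.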

Bundle format (for directories):

    ┌──────────────────────────────────────────┐
    │ Bundle Header (24 bytes)                 │
    │   - Magic: "PRXB"                        │
    │   - Version, Flags                       │
    │   - Entry count                          │
    ├──────────────────────────────────────────┤
    │ Entry 0: Footer (+ optional page indexes)│
    ├──────────────────────────────────────────┤
    │ Entry 1: Footer (+ optional page indexes)│
    ├──────────────────────────────────────────┤
    │ ... (N entries)                          │
    ├──────────────────────────────────────────┤
    │ Bundle Manifest (Protobuf)               │
    │   - Path→Entry mapping                   │
    ├──────────────────────────────────────────┤
    │ Trailer (12 bytes)                       │
    │   - Manifest length, CRC32C              │
    │   - Magic: "PRXB"                        │
    └──────────────────────────────────────────┘

See FORMAT_SPEC.md for detailed byte-level layout. Bundle entries can optionally include page-index payloads using policy-controlled caps.

Building

    # Core library
    cd implementations/rust/parx
    cargo build --release

    # CLI tool (install to ~/.cargo/bin)
    cd implementations/rust/parx-cli
    cargo install --path . --locked

    # Benchmarks
    cd benchmarks/parx_benchmarks
    make all

If parx is not found after install, ensure ~/.cargo/bin is on your PATH.

CLI Usage

    # Build sidecar for single file
    parx build file.parquet

    # Verify sidecar
    parx verify file.parquet.parx

    # Inspect contents
    parx inspect file.parquet.parx

    # Bundle directory
    parx bundle build /data/events/
    # Creates: /data/events/_parx_bundle.parx

    # Bundle directory with page indexes (optional, capped)
    parx bundle build /data/events/ \
      --include-page-indexes \
      --max-page-index-bytes-per-file 262144 \
      --max-total-page-index-bytes 16777216

    # Extract bundle
    parx bundle extract /data/events/_parx_bundle.parx --output /output/
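The two cap flags in the bundle command suggest a per-file limit combined with a bundle-wide budget. The sketch below shows one way such a policy could behave. It assumes a greedy, input-order rule, which is a guess for illustration and may not match what the CLI actually does; `PageIndexCaps` and `select_page_indexes` are hypothetical names.

```rust
/// Caps mirroring the two CLI flags: per-file and bundle-wide limits, in bytes.
struct PageIndexCaps {
    max_per_file: u64,
    max_total: u64,
}

/// For each file's page-index size, decide whether it is kept in the bundle.
/// Assumed policy: walk files in input order and keep a page index only if it
/// fits the per-file cap and does not push the running total past the budget.
fn select_page_indexes(sizes: &[u64], caps: &PageIndexCaps) -> Vec<bool> {
    let mut used = 0u64;
    sizes
        .iter()
        .map(|&size| {
            let fits = size <= caps.max_per_file
                && used.saturating_add(size) <= caps.max_total;
            if fits {
                used += size;
            }
            fits
        })
        .collect()
}
```

Under this rule, a file whose page index is dropped still gets its footer cached; only the optional page-index payload is skipped.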

For a full local CLI validation walkthrough (install + generated fixtures + end-to-end commands), see: docs/CLI_SMOKE_TEST.md

Library Usage

    use parx_rs::{ParxReader, ParxWriter};

    // Write: build from Parquet file directly
    let mut writer = ParxWriter::from_parquet_file("file.parquet")?;
    let parx_bytes = writer.finish();
    std::fs::write("file.parquet.parx", parx_bytes)?;

    // Read: load cached footer from .parx sidecar
    let parx_data = std::fs::read("file.parquet.parx")?;
    let reader = ParxReader::open(&parx_data)?;
    let footer = reader.footer_bytes(); // Raw Parquet footer, ready to use
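Around the calls above, a reader typically needs two small pieces of glue: deriving the sidecar path and deciding whether a sidecar can still be trusted. The following std-only sketch shows both. The naming rule follows the `file.parquet` to `file.parquet.parx` convention used throughout this README; the freshness check compares against the source file size that the manifest records, but how `parx_rs` exposes that value is not shown here, so it is passed in as a plain number. Both helper names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// `file.parquet` -> `file.parquet.parx`: the suffix is appended and the
/// original extension is kept.
fn sidecar_path(parquet: &Path) -> PathBuf {
    let mut name = parquet.as_os_str().to_os_string();
    name.push(".parx");
    PathBuf::from(name)
}

/// Treat a sidecar as stale when the source size recorded at build time no
/// longer matches the Parquet file. A size match is a cheap sanity check,
/// not proof that the contents are unchanged.
fn sidecar_is_fresh(recorded_source_size: u64, current_source_size: u64) -> bool {
    recorded_source_size == current_source_size
}
```

When the check fails, a caller would fall back to reading the Parquet footer directly and, if appropriate, rebuild the sidecar.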

Benchmarks

Local tests with 4 schema types (simple, medium, wide, nested):

Arrow async vs PARX:

Requests: 3.0 → 1.0 per file (66.7% reduction)
Latency: ~428 µs → ~208 µs (~2x faster)
Bytes: ~25 KB → ~26 KB (~2% overhead)

Note: this benchmark measures metadata read path with prebuilt .parx sidecars; one-time sidecar creation is excluded.
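The request and latency ratios quoted above follow directly from the before and after figures. A small sketch of the arithmetic, with illustrative helper names:

```rust
/// Percentage reduction going from `before` to `after`.
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Speedup factor going from `before` to `after` (larger is faster).
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

`reduction_pct(3.0, 1.0)` corresponds to the 66.7% request reduction, and `speedup(428.0, 208.0)` to the roughly 2x latency improvement.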

Run benchmarks:

    cd benchmarks/parx_benchmarks
    make arrow-vs-parx   # Arrow vs PARX comparison
    make prefetch        # Prefetch hint testing

When to Use

Use it:

Cloud storage (S3/GCS/Azure)
Multiple processes reading same files
Immutable or versioned files
Parquet V2 with page indexes (including bundle mode with policy caps)

Skip it:

Single-process work (in-memory cache is fine)
Local SSD (minimal benefit)
Delta/Iceberg/Hudi tables (built-in metadata)
Frequently updated files

Project Structure

    parx/
    ├── implementations/rust/
    │   ├── parx/              # Core library
    │   └── parx-cli/          # CLI tool
    ├── benchmarks/
    │   └── parx_benchmarks/   # Performance tests
    ├── spec/
    │   └── proto/             # Protobuf schema
    └── FORMAT_SPEC.md         # Format specification

Testing

    # Unit tests
    cargo test

    # Integration tests
    cd implementations/rust/parx-cli
    cargo test --test cli_integration

License

Apache 2.0
