cameronraysmith/xet-core


🤗 xet-core - xet client tech, used in huggingface_hub

Welcome

xet-core enables huggingface_hub to utilize xet storage for uploading and downloading to HF Hub. Xet storage provides chunk-based deduplication, efficient storage/retrieval with local disk caching, and backwards compatibility with Git LFS. This library is not meant to be used directly, and is instead intended to be used from huggingface_hub.

Key features

♻ chunk-based deduplication implementation: avoid transferring and storing chunks that are shared across binary files (models, datasets, etc.).

🤗 Python bindings: bindings for huggingface_hub package.

↔ network communications: concurrent communication to HF Hub Xet backend services (CAS).

🔖 local disk caching: chunk-based cache that sits alongside the existing huggingface_hub disk cache.
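To make the deduplication idea concrete, here is a minimal sketch in Python. It is not xet-core's implementation: the fixed chunk size, the use of SHA-256, and the names split_chunks, store_file, and load_file are all assumptions made for illustration. It only shows the general principle that a file can be described as a list of chunk hashes, so chunks already in the store do not need to be stored or transferred again.

```python
import hashlib

# Hypothetical chunk size, tiny on purpose so the example is easy to follow.
CHUNK_SIZE = 4


def split_chunks(data, size=CHUNK_SIZE):
    """Split bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_file(data, store):
    """Add a file to the chunk store.

    Returns the ordered list of chunk hashes describing the file and the
    number of chunks that actually had to be added to the store.
    """
    recipe = []
    new_chunks = 0
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            # Only chunks the store has never seen are stored.
            store[digest] = chunk
            new_chunks += 1
        recipe.append(digest)
    return recipe, new_chunks


def load_file(recipe, store):
    """Rebuild a file from its ordered list of chunk hashes."""
    return b"".join(store[digest] for digest in recipe)


store = {}
first = b"AAAABBBBCCCC"
second = b"AAAABBBBDDDD"  # shares its first two chunks with `first`
recipe_1, added_1 = store_file(first, store)
recipe_2, added_2 = store_file(second, store)
print(added_1, added_2)
```

The second file adds only the one chunk it does not share with the first, and both files can still be rebuilt from the store.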

Contributions (feature requests, bugs, etc.) are encouraged & appreciated 💙💚💛💜🧡❤️

Please join us in making xet-core better. We value everyone's contributions. Code is not the only way to help. Answering questions, helping each other, improving documentation, and filing issues all help immensely. If you are interested in contributing (please do!), check out the contribution guide for this repository.

Issues, Diagnostics & Debugging

If you encounter an issue with hf-xet, please collect diagnostic information and attach it when creating a new Issue.

The scripts/diag/ directory contains platform-specific scripts that download debug symbols, configure logging, and capture periodic stack traces and core dumps:

OS                  Script
Linux               scripts/diag/hf-xet-diag-linux.sh
macOS               scripts/diag/hf-xet-diag-macos.sh
Windows (Git-Bash)  scripts/diag/hf-xet-diag-windows.sh

# prefix your failing command with the script for your OS, e.g.:
./scripts/diag/hf-xet-diag-macos.sh -- python my-script.py

See scripts/diag/README.md for full usage, output layout, dump analysis instructions, and how to install debug symbols manually.

Quick debugging environment variables:

RUST_BACKTRACE=full           # full Rust backtraces on panic
RUST_LOG=info                 # enable hf-xet logging
HF_XET_LOG_FILE=/tmp/xet.log  # write logs to a file (defaults to stdout)
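If the failing command is launched from Python, the same variables can be passed to the child process. The sketch below uses only the standard library; debug_env and run_with_debug_env are hypothetical helper names, not part of hf-xet or huggingface_hub, and the variable values are the ones listed above.

```python
import os
import subprocess
import sys


def debug_env(log_file=None):
    """Return a copy of the current environment with the debugging variables set."""
    env = dict(os.environ)
    env["RUST_BACKTRACE"] = "full"  # full Rust backtraces on panic
    env["RUST_LOG"] = "info"        # enable hf-xet logging
    if log_file is not None:
        env["HF_XET_LOG_FILE"] = log_file  # write logs to this file
    return env


def run_with_debug_env(command, log_file=None):
    """Run `command` (a list of arguments) with the debugging variables set."""
    return subprocess.run(command, env=debug_env(log_file), check=False)


if __name__ == "__main__":
    # Example: python this_file.py python my-script.py
    if len(sys.argv) > 1:
        result = run_with_debug_env(sys.argv[1:], log_file="/tmp/xet.log")
        sys.exit(result.returncode)
```

Building a copy of the environment keeps the parent process unchanged, so the extra logging applies only to the command being diagnosed.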

Local Development

Repo Organization - Rust Crates

  • cas_client: communication with CAS backend services, which include APIs for Xorbs and Shards.
  • cas_object: CAS object (Xorb) format and associated APIs, including chunks (ranges within Xorbs).
  • cas_types: common types shared across crates in xet-core and xetcas.
  • chunk_cache: local disk cache of Xorb chunks.
  • chunk_cache_bench: benchmarking crate for chunk_cache.
  • data: main driver for client operations. FilePointerTranslator drives hydrating or shrinking files; chunking and deduplication happen here.
  • error_printer: utility for printing errors conveniently.
  • file_utils: SafeFileCreator utility, used by chunk_cache.
  • hf_xet: Python integration with Rust code, uses maturin to build hf-xet Python package. Main integration with HF Hub Python package.
  • mdb_shard: Shard operations, including Shard format, dedupe probing, benchmarks, and utilities.
  • merklehash: MerkleHash type, 256-bit hash, widely used across many crates.
  • progress_reporting: offers ReportedWriter so progress for Writer operations can be displayed.
  • utils: general utilities, including singleflight, progress, serialization_utils and threadpool.
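As a rough illustration of how a single 256-bit hash can identify a whole sequence of chunks, here is a generic Merkle-root sketch in Python. The list above says only that MerkleHash is a 256-bit hash; the choice of SHA-256, the "leaf:" and "node:" prefixes, the pairing rule for odd nodes, and the function names are assumptions made for this example and do not describe the merklehash crate.

```python
import hashlib


def leaf_hash(chunk):
    """Hash one chunk; the prefix keeps leaf and node hashes distinct."""
    return hashlib.sha256(b"leaf:" + chunk).digest()


def node_hash(left, right):
    """Hash two child hashes into their parent hash."""
    return hashlib.sha256(b"node:" + left + right).digest()


def merkle_root(chunks):
    """Combine chunk hashes pairwise until a single 32-byte root remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [leaf_hash(chunk) for chunk in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(node_hash(level[i], level[i + 1]))
            else:
                # An unpaired node is carried up to the next level unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0]


root = merkle_root([b"chunk-a", b"chunk-b", b"chunk-c"])
print(len(root) * 8, root.hex())
```

Changing any chunk changes the root, which is what makes such a hash usable as a content identifier.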

Build, Test & Benchmark

To build xet-core, check the GitHub Actions CI workflow for the required Rust toolchain version. Follow the Rust documentation to install rustup and that version of the toolchain, then use the steps below to build, test, and benchmark.

Many of us on the team use VSCode, so we have checked in some settings in the .vscode directory. Install the rust-analyzer extension.

Build:

cargo build 

Test:

cargo test 

Benchmark:

cargo bench 

Linting:

cargo clippy -r --verbose -- -D warnings 

Formatting (requires nightly toolchain):

cargo +nightly fmt --manifest-path ./Cargo.toml --all 

Building Python package and running locally (on *nix systems):

  1. Create Python3 virtualenv: python3 -m venv ~/venv
  2. Activate virtualenv: source ~/venv/bin/activate
  3. Install maturin and ipython: pip3 install maturin ipython
  4. Go to hf_xet crate: cd hf_xet
  5. Build: maturin develop
  6. Test:
ipython
import hf_xet as hfxet
hfxet.upload_files()
hfxet.download_files()
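Before calling anything, it can help to confirm that maturin develop actually installed the module into the active virtualenv. The sketch below uses only the standard library and checks for the two function names mentioned in the steps above without calling them; describe_module is a hypothetical helper, not part of hf_xet.

```python
import importlib
import importlib.util


def describe_module(name="hf_xet", expected=("upload_files", "download_files")):
    """Report whether a module is importable and which expected names it exposes."""
    if importlib.util.find_spec(name) is None:
        return {"installed": False, "found": [], "missing": list(expected)}
    module = importlib.import_module(name)
    found = [attr for attr in expected if hasattr(module, attr)]
    missing = [attr for attr in expected if not hasattr(module, attr)]
    return {"installed": True, "found": found, "missing": missing}


print(describe_module())
```

If "installed" is False, the most likely causes are that the virtualenv is not activated or that maturin develop was run outside the hf_xet directory.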

Developing with tokio console

The prerequisite is installing tokio-console (cargo install tokio-console). See https://github.com/tokio-rs/console

To use tokio-console with hf-xet, compile hf_xet with the following command:

RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console

Then, while hf_xet is running (via an hf CLI command or huggingface_hub Python code), tokio-console will be able to connect.

Example:

# In one terminal:
pip install huggingface_hub
RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console
hf download openai/gpt-oss-20b

# In another terminal:
cargo install tokio-console
tokio-console

Building a universal wheel for macOS:

From hf_xet directory:

MACOSX_DEPLOYMENT_TARGET=10.9 maturin build --release --target universal2-apple-darwin --features openssl_vendored 

Note: You may need to install the x86_64 target: rustup target add x86_64-apple-darwin

Testing

Unit tests are run with cargo test; benchmarks are run with cargo bench. Some crates have a main.rs that can be run for manual testing.

References & History
