The httpfs extension introduces support for the hf:// protocol to access datasets hosted in Hugging Face repositories. See the announcement blog post for details.
Usage
Hugging Face repositories can be queried using the following URL pattern:
hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩ For example, to read a CSV file, you can use the following query:
SELECT * FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv'; Where:
datasets-examplesis the name of the user/organizationdoc-formats-csv-1is the name of the dataset repositorydata.csvis the file path in the repository
The result of the query is:
| kind | sound |
|---|---|
| dog | woof |
| cat | meow |
| pokemon | pika |
| human | hello |
To read a JSONL file, you can run:
SELECT * FROM 'hf://datasets/datasets-examples/doc-formats-jsonl-1/data.jsonl'; Finally, for reading a Parquet file, use the following query:
SELECT * FROM 'hf://datasets/datasets-examples/doc-formats-parquet-1/data/train-00000-of-00001.parquet'; Each of these commands reads the data from the specified file format and displays it in a structured tabular format. Choose the appropriate command based on the file format you are working with.
Creating a Local Table
To avoid accessing the remote endpoint for every query, you can save the data in a DuckDB table by running a CREATE TABLE ... AS command. For example:
CREATE TABLE data AS SELECT * FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv'; Then, simply query the data table as follows:
SELECT * FROM data; Multiple Files
To query all files under a specific directory, you can use a glob pattern. For example:
SELECT count(*) AS count FROM 'hf://datasets/cais/mmlu/astronomy/*.parquet'; | count |
|---|
| 173 |
By using glob patterns, you can efficiently handle large datasets and perform comprehensive queries across multiple files, simplifying your data inspections and processing tasks. Here, you can see how you can look for questions that contain the word “planet” in astronomy:
SELECT count(*) AS count FROM 'hf://datasets/cais/mmlu/astronomy/*.parquet' WHERE question LIKE '%planet%'; | count |
|---|
| 21 |
Versioning and Revisions
In Hugging Face repositories, dataset versions or revisions are different dataset updates. Each version is a snapshot at a specific time, allowing you to track changes and improvements. In git terms, it can be understood as a branch or specific commit.
You can query different dataset versions/revisions by using the following URL:
hf://datasets/my_username/my_dataset@my_branch/path_to_file For example:
SELECT * FROM 'hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet'; | kind | sound |
|---|---|
| dog | woof |
| cat | meow |
| pokemon | pika |
| human | hello |
The previous query will read all Parquet files under the ~parquet revision. This is a special branch where Hugging Face automatically generates the Parquet files of every dataset to enable efficient scanning.
Authentication
Configure your Hugging Face Token in the DuckDB Secrets Manager to access private or gated datasets. First, visit Hugging Face Settings – Tokens to obtain your access token. Second, set it in your DuckDB session using DuckDB’s Secrets Manager. DuckDB supports two providers for managing secrets:
CONFIG
The user must pass all configuration information into the CREATE SECRET statement. To create a secret using the CONFIG provider, use the following command:
CREATE SECRET hf_token ( TYPE huggingface, TOKEN 'your_hf_token' ); credential_chain
Automatically tries to fetch credentials. For the Hugging Face token, it will try to get it from ~/.cache/huggingface/token. To create a secret using the credential_chain provider, use the following command:
CREATE SECRET hf_token ( TYPE huggingface, PROVIDER credential_chain );