bench: improve Iceberg TPC workflow and plan capture #3783

Open · Shekharrajak wants to merge 2 commits into apache:main from Shekharrajak:iceberg_scan_benchmark

Conversation


@Shekharrajak (Contributor) commented Mar 24, 2026

This PR improves the Iceberg benchmarking workflow for TPC‑H/TPC‑DS in Comet by:

  • adding a Spark JVM Iceberg baseline engine (spark-iceberg.toml) to compare directly against Comet native Iceberg scans
  • making Iceberg table creation accept both table.parquet files and table/ directories
  • adding per‑query physical plan capture in tpcbench.py for Spark vs Comet plan comparison
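The per-query plan capture can be sketched as a small helper that writes each query's physical plan text into a per-engine subdirectory, so Spark and Comet plans for the same query can be diffed side by side. This is an illustrative sketch, not the actual code in `tpcbench.py`; the function name and file layout are assumptions.

```python
import os

def save_plan(plan_text: str, plan_dir: str, engine: str, query: int) -> str:
    """Write one query's physical plan to <plan_dir>/<engine>/q<query>.txt.

    Hypothetical helper; the real --plan-dir implementation in
    tpcbench.py may use different names and layout.
    """
    out_dir = os.path.join(plan_dir, engine)
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"q{query}.txt")
    with open(path, "w") as f:
        f.write(plan_text)
    return path

# In a live Spark session the plan text would come from something like
# df._jdf.queryExecution().executedPlan().toString(); a literal stands in here.
path = save_plan("== Physical Plan ==\nCometScan ...", "/tmp/plans-demo", "comet-iceberg", 1)
print(path)
```

With one directory per `--name` value (`spark-iceberg`, `comet-iceberg`), `diff -r` across the two plan trees highlights exactly which operators Comet replaced.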

Steps to run the benchmarks against TPC-H/TPC-DS

```shell
# Download Spark and the Iceberg Spark runtime
mkdir -p /tmp/comet-bench/{spark,jars}
curl -L -o /tmp/comet-bench/spark/spark-3.5.8-bin-hadoop3.tgz \
  https://dlcdn.apache.org/spark/spark-3.5.8/spark-3.5.8-bin-hadoop3.tgz
tar xzf /tmp/comet-bench/spark/spark-3.5.8-bin-hadoop3.tgz -C /tmp/comet-bench/spark
curl -L -o /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.8.1/iceberg-spark-runtime-3.5_2.12-1.8.1.jar

# Build Comet JAR (includes native library)
cd ~/Documents/apache/datafusion-comet
make release

# Generate TPC-H Parquet (SF1)
cd ~/Documents/apache/datafusion-comet
make benchmark-org.apache.spark.sql.GenTPCHData -- \
  --location /tmp/comet-bench/tpch-data \
  --scaleFactor 1

# Generate TPC-DS using DuckDB
python3 -m venv /tmp/comet-bench/duckdb-venv
/tmp/comet-bench/duckdb-venv/bin/pip install duckdb
/tmp/comet-bench/duckdb-venv/bin/python - <<'PY'
import duckdb, os
out = "/tmp/tpcds/sf1_parquet"
os.makedirs(out, exist_ok=True)
con = duckdb.connect()
con.execute("INSTALL tpcds; LOAD tpcds;")
con.execute("CALL dsdgen(sf=1);")
tables = [r[0] for r in con.execute("SHOW TABLES").fetchall()]
for t in tables:
    con.execute(f"COPY {t} TO '{out}/{t}.parquet' (FORMAT PARQUET);")
PY

# Create Iceberg tables
export SPARK_HOME=/tmp/comet-bench/spark/spark-3.5.8-bin-hadoop3
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.driver.bindAddress=127.0.0.1" \
  --conf "spark.driver.host=127.0.0.1" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/create-iceberg-tables.py \
  --benchmark tpch \
  --parquet-path /tmp/comet-bench/tpch-data/tpch/sf1_parquet \
  --warehouse /tmp/comet-bench/iceberg-warehouse

SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.driver.bindAddress=127.0.0.1" \
  --conf "spark.driver.host=127.0.0.1" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/create-iceberg-tables.py \
  --benchmark tpcds \
  --parquet-path /tmp/tpcds/sf1_parquet \
  --warehouse /tmp/comet-bench/iceberg-warehouse

# Run Spark JVM (Iceberg baseline)
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.local.type=hadoop" \
  --conf "spark.sql.catalog.local.warehouse=/tmp/comet-bench/iceberg-warehouse" \
  --conf "spark.sql.defaultCatalog=local" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/tpcbench.py \
  --benchmark tpch --catalog local --database tpch \
  --iterations 1 --output /tmp/comet-bench/results \
  --name spark-iceberg \
  --plan-dir /tmp/comet-bench/plans/tpch-sf1

SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.local.type=hadoop" \
  --conf "spark.sql.catalog.local.warehouse=/tmp/comet-bench/iceberg-warehouse" \
  --conf "spark.sql.defaultCatalog=local" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/tpcbench.py \
  --benchmark tpcds --catalog local --database tpcds \
  --iterations 1 --output /tmp/comet-bench/results \
  --name spark-iceberg \
  --plan-dir /tmp/comet-bench/plans/tpcds-sf1

# Run Comet native (Iceberg Rust scan)
export COMET_JAR=~/Documents/apache/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.15.0-SNAPSHOT.jar
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path "$COMET_JAR:/tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar" \
  --jars "$COMET_JAR,/tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar" \
  --conf "spark.plugins=org.apache.spark.CometPlugin" \
  --conf "spark.comet.enabled=true" \
  --conf "spark.comet.exec.enabled=true" \
  --conf "spark.comet.scan.icebergNative.enabled=true" \
  --conf "spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager" \
  --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.local.type=hadoop" \
  --conf "spark.sql.catalog.local.warehouse=/tmp/comet-bench/iceberg-warehouse" \
  --conf "spark.sql.defaultCatalog=local" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/tpcbench.py \
  --benchmark tpch --catalog local --database tpch \
  --iterations 1 --output /tmp/comet-bench/results \
  --name comet-iceberg \
  --plan-dir /tmp/comet-bench/plans/tpch-sf1

# Generate comparison graphs
cd ~/Documents/apache/datafusion-comet/benchmarks/tpc

# TPC-H charts
python3 generate-comparison.py \
  --benchmark tpch \
  --labels "Spark Iceberg (JVM)" "Comet Iceberg (Native Rust)" \
  --title "TPC-H SF1: Iceberg JVM Scan vs Native Rust Scan" \
  /tmp/comet-bench/results/spark-iceberg-tpch-*.json \
  /tmp/comet-bench/results/comet-iceberg-tpch-*.json

# TPC-DS charts (Q1–Q22)
python3 generate-comparison.py \
  --benchmark tpcds \
  --labels "Spark Iceberg (JVM)" "Comet Iceberg (Native Rust)" \
  --title "TPC-DS SF1 (Q1–Q22): Iceberg JVM vs Native Rust" \
  /tmp/comet-bench/results/spark-iceberg-tpcds-q1to22-*.json \
  /tmp/comet-bench/results/comet-iceberg-tpcds-*.json
```
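Note that the two generators above produce different layouts: DuckDB's `COPY` writes single files like `lineitem.parquet`, while Spark writers produce a `lineitem/` directory of part files. The table-creation change that accepts both boils down to a path-resolution check; a minimal sketch, with a function name that is illustrative rather than the actual helper in `create-iceberg-tables.py`:

```python
import os

def resolve_table_source(parquet_root: str, table: str) -> str:
    """Return the Parquet source for `table`, accepting either layout:
    a single <table>.parquet file (DuckDB COPY output) or a <table>/
    directory of part files (Spark writer output). Illustrative only.
    """
    file_path = os.path.join(parquet_root, f"{table}.parquet")
    dir_path = os.path.join(parquet_root, table)
    if os.path.isfile(file_path):
        return file_path
    if os.path.isdir(dir_path):
        return dir_path
    raise FileNotFoundError(
        f"neither {table}.parquet nor {table}/ found under {parquet_root}"
    )

# Demo: an empty directory and an empty file stand in for real table data
root = "/tmp/resolve-demo"
os.makedirs(os.path.join(root, "lineitem"), exist_ok=True)
open(os.path.join(root, "orders.parquet"), "w").close()
print(resolve_table_source(root, "lineitem"))
print(resolve_table_source(root, "orders"))
```

Either resolved path can then be passed straight to `spark.read.parquet(...)`, which handles both single files and directories.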
@Shekharrajak (Contributor, Author) commented:

Results generated on an Apple M4 Pro with 48 GB of memory:

[chart: benchmark results]
@Shekharrajak (Contributor, Author) commented:

Improvement results on the TPC-DS dataset:

[chart: tpcds_queries_compare]
@Shekharrajak (Contributor, Author) commented:

Open to suggestions for reducing these commands to a simple CLI/bash script that generates the benchmark graphs.
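One possible shape for such a wrapper is a small argparse driver that runs each stage in order. Everything below (flag names, defaults, stage commands) is a suggestion, not existing Comet tooling; the `echo` placeholders would expand into the `spark-submit` invocations shown above.

```python
import argparse
import shlex
import subprocess

def run(cmd: str, dry_run: bool) -> None:
    """Echo each command; execute it unless --dry-run is set."""
    print(f"+ {cmd}")
    if not dry_run:
        subprocess.run(shlex.split(cmd), check=True)

def main(argv=None):
    p = argparse.ArgumentParser(
        description="One-shot Iceberg TPC benchmark driver (sketch)")
    p.add_argument("--benchmark", choices=["tpch", "tpcds"], default="tpch")
    p.add_argument("--engine", choices=["spark-iceberg", "comet-iceberg"],
                   default="comet-iceberg")
    p.add_argument("--work-dir", default="/tmp/comet-bench")
    p.add_argument("--dry-run", action="store_true")
    args = p.parse_args(argv)
    # Each stage below stands in for the corresponding spark-submit call.
    run(f"echo create-iceberg-tables --benchmark {args.benchmark} "
        f"--warehouse {args.work_dir}/iceberg-warehouse", args.dry_run)
    run(f"echo tpcbench --benchmark {args.benchmark} --name {args.engine} "
        f"--output {args.work_dir}/results", args.dry_run)
    run(f"echo generate-comparison --benchmark {args.benchmark}", args.dry_run)
    return args

args = main(["--benchmark", "tpcds", "--dry-run"])
```

Running it once with `--engine spark-iceberg` and once with `--engine comet-iceberg` would produce both result sets for the comparison step.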
