bench: improve Iceberg TPC workflow and plan capture #3783

Open · Shekharrajak wants to merge 2 commits into apache:main from Shekharrajak:iceberg_scan_benchmark

Conversation


@Shekharrajak (Contributor) commented Mar 24, 2026

This PR improves the Iceberg benchmarking workflow for TPC‑H/TPC‑DS in Comet by:

  • adding a Spark JVM Iceberg baseline engine (spark-iceberg.toml) to compare directly against Comet native Iceberg scans
  • making Iceberg table creation accept both table.parquet files and table/ directories
  • adding per‑query physical plan capture in tpcbench.py for Spark vs Comet plan comparison
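The per-query plan capture can be sketched as a small helper that writes each query's physical plan text into a per-engine subdirectory, so Spark and Comet plans for the same query can be diffed side by side. This is an illustrative sketch, not the actual code in `tpcbench.py`; the function name and file layout are assumptions.

```python
import os

def save_plan(plan_text: str, plan_dir: str, engine: str, query: int) -> str:
    """Write one query's physical plan to <plan_dir>/<engine>/q<query>.txt.

    Hypothetical helper; the real --plan-dir implementation in
    tpcbench.py may use different names and layout.
    """
    out_dir = os.path.join(plan_dir, engine)
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"q{query}.txt")
    with open(path, "w") as f:
        f.write(plan_text)
    return path

# In a live Spark session the plan text would come from something like
# df._jdf.queryExecution().executedPlan().toString(); a literal stands in here.
path = save_plan("== Physical Plan ==\nCometScan ...", "/tmp/plans-demo", "comet-iceberg", 1)
print(path)
```

With one directory per `--name` value (`spark-iceberg`, `comet-iceberg`), `diff -r` across the two plan trees highlights exactly which operators Comet replaced.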

Steps to run the benchmarks against TPC-H/TPC-DS

```shell
# Download Spark and the Iceberg Spark runtime
mkdir -p /tmp/comet-bench/{spark,jars}
curl -L -o /tmp/comet-bench/spark/spark-3.5.8-bin-hadoop3.tgz \
  https://dlcdn.apache.org/spark/spark-3.5.8/spark-3.5.8-bin-hadoop3.tgz
tar xzf /tmp/comet-bench/spark/spark-3.5.8-bin-hadoop3.tgz -C /tmp/comet-bench/spark
curl -L -o /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.8.1/iceberg-spark-runtime-3.5_2.12-1.8.1.jar

# Build Comet JAR (includes native library)
cd ~/Documents/apache/datafusion-comet
make release

# Generate TPC-H Parquet (SF1)
cd ~/Documents/apache/datafusion-comet
make benchmark-org.apache.spark.sql.GenTPCHData -- \
  --location /tmp/comet-bench/tpch-data \
  --scaleFactor 1

# Generate TPC-DS using DuckDB
python3 -m venv /tmp/comet-bench/duckdb-venv
/tmp/comet-bench/duckdb-venv/bin/pip install duckdb
/tmp/comet-bench/duckdb-venv/bin/python - <<'PY'
import duckdb, os
out = "/tmp/tpcds/sf1_parquet"
os.makedirs(out, exist_ok=True)
con = duckdb.connect()
con.execute("INSTALL tpcds; LOAD tpcds;")
con.execute("CALL dsdgen(sf=1);")
tables = [r[0] for r in con.execute("SHOW TABLES").fetchall()]
for t in tables:
    con.execute(f"COPY {t} TO '{out}/{t}.parquet' (FORMAT PARQUET);")
PY

# Create Iceberg tables
export SPARK_HOME=/tmp/comet-bench/spark/spark-3.5.8-bin-hadoop3
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.driver.bindAddress=127.0.0.1" \
  --conf "spark.driver.host=127.0.0.1" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/create-iceberg-tables.py \
  --benchmark tpch \
  --parquet-path /tmp/comet-bench/tpch-data/tpch/sf1_parquet \
  --warehouse /tmp/comet-bench/iceberg-warehouse

SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.driver.bindAddress=127.0.0.1" \
  --conf "spark.driver.host=127.0.0.1" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/create-iceberg-tables.py \
  --benchmark tpcds \
  --parquet-path /tmp/tpcds/sf1_parquet \
  --warehouse /tmp/comet-bench/iceberg-warehouse

# Run Spark JVM (Iceberg baseline)
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.local.type=hadoop" \
  --conf "spark.sql.catalog.local.warehouse=/tmp/comet-bench/iceberg-warehouse" \
  --conf "spark.sql.defaultCatalog=local" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/tpcbench.py \
  --benchmark tpch --catalog local --database tpch \
  --iterations 1 --output /tmp/comet-bench/results \
  --name spark-iceberg \
  --plan-dir /tmp/comet-bench/plans/tpch-sf1

SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path /tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar \
  --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.local.type=hadoop" \
  --conf "spark.sql.catalog.local.warehouse=/tmp/comet-bench/iceberg-warehouse" \
  --conf "spark.sql.defaultCatalog=local" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/tpcbench.py \
  --benchmark tpcds --catalog local --database tpcds \
  --iterations 1 --output /tmp/comet-bench/results \
  --name spark-iceberg \
  --plan-dir /tmp/comet-bench/plans/tpcds-sf1

# Run Comet native (Iceberg Rust scan)
export COMET_JAR=~/Documents/apache/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.15.0-SNAPSHOT.jar
SPARK_LOCAL_IP=127.0.0.1 $SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --driver-class-path "$COMET_JAR:/tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar" \
  --jars "$COMET_JAR,/tmp/comet-bench/jars/iceberg-spark-runtime-3.5_2.12-1.8.1.jar" \
  --conf "spark.plugins=org.apache.spark.CometPlugin" \
  --conf "spark.comet.enabled=true" \
  --conf "spark.comet.exec.enabled=true" \
  --conf "spark.comet.scan.icebergNative.enabled=true" \
  --conf "spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager" \
  --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.local.type=hadoop" \
  --conf "spark.sql.catalog.local.warehouse=/tmp/comet-bench/iceberg-warehouse" \
  --conf "spark.sql.defaultCatalog=local" \
  ~/Documents/apache/datafusion-comet/benchmarks/tpc/tpcbench.py \
  --benchmark tpch --catalog local --database tpch \
  --iterations 1 --output /tmp/comet-bench/results \
  --name comet-iceberg \
  --plan-dir /tmp/comet-bench/plans/tpch-sf1

# Generate comparison graphs
cd ~/Documents/apache/datafusion-comet/benchmarks/tpc

# TPC-H charts
python3 generate-comparison.py \
  --benchmark tpch \
  --labels "Spark Iceberg (JVM)" "Comet Iceberg (Native Rust)" \
  --title "TPC-H SF1: Iceberg JVM Scan vs Native Rust Scan" \
  /tmp/comet-bench/results/spark-iceberg-tpch-*.json \
  /tmp/comet-bench/results/comet-iceberg-tpch-*.json

# TPC-DS charts (Q1–Q22)
python3 generate-comparison.py \
  --benchmark tpcds \
  --labels "Spark Iceberg (JVM)" "Comet Iceberg (Native Rust)" \
  --title "TPC-DS SF1 (Q1–Q22): Iceberg JVM vs Native Rust" \
  /tmp/comet-bench/results/spark-iceberg-tpcds-q1to22-*.json \
  /tmp/comet-bench/results/comet-iceberg-tpcds-*.json
```
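Note that the two generators above produce different layouts: DuckDB's `COPY` writes single files like `lineitem.parquet`, while Spark writers produce a `lineitem/` directory of part files. The table-creation change that accepts both boils down to a path-resolution check; a minimal sketch, with a function name that is illustrative rather than the actual helper in `create-iceberg-tables.py`:

```python
import os

def resolve_table_source(parquet_root: str, table: str) -> str:
    """Return the Parquet source for `table`, accepting either layout:
    a single <table>.parquet file (DuckDB COPY output) or a <table>/
    directory of part files (Spark writer output). Illustrative only.
    """
    file_path = os.path.join(parquet_root, f"{table}.parquet")
    dir_path = os.path.join(parquet_root, table)
    if os.path.isfile(file_path):
        return file_path
    if os.path.isdir(dir_path):
        return dir_path
    raise FileNotFoundError(
        f"neither {table}.parquet nor {table}/ found under {parquet_root}"
    )

# Demo: an empty directory and an empty file stand in for real table data
root = "/tmp/resolve-demo"
os.makedirs(os.path.join(root, "lineitem"), exist_ok=True)
open(os.path.join(root, "orders.parquet"), "w").close()
print(resolve_table_source(root, "lineitem"))
print(resolve_table_source(root, "orders"))
```

Either resolved path can then be passed straight to `spark.read.parquet(...)`, which handles both single files and directories.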
@Shekharrajak (Contributor, Author) commented:

Results generated on an Apple M4 Pro with 48 GB of memory:

[chart: benchmark results]
@Shekharrajak (Contributor, Author) commented:

Improvement results on the TPC-DS dataset:

[chart: tpcds_queries_compare]
@Shekharrajak (Contributor, Author) commented:

Open to suggestions for reducing these commands to a simple CLI/bash script that generates the benchmark graphs.
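One possible shape for such a wrapper is a small argparse driver that runs each stage in order. Everything below (flag names, defaults, stage commands) is a suggestion, not existing Comet tooling; the `echo` placeholders would expand into the `spark-submit` invocations shown above.

```python
import argparse
import shlex
import subprocess

def run(cmd: str, dry_run: bool) -> None:
    """Echo each command; execute it unless --dry-run is set."""
    print(f"+ {cmd}")
    if not dry_run:
        subprocess.run(shlex.split(cmd), check=True)

def main(argv=None):
    p = argparse.ArgumentParser(
        description="One-shot Iceberg TPC benchmark driver (sketch)")
    p.add_argument("--benchmark", choices=["tpch", "tpcds"], default="tpch")
    p.add_argument("--engine", choices=["spark-iceberg", "comet-iceberg"],
                   default="comet-iceberg")
    p.add_argument("--work-dir", default="/tmp/comet-bench")
    p.add_argument("--dry-run", action="store_true")
    args = p.parse_args(argv)
    # Each stage below stands in for the corresponding spark-submit call.
    run(f"echo create-iceberg-tables --benchmark {args.benchmark} "
        f"--warehouse {args.work_dir}/iceberg-warehouse", args.dry_run)
    run(f"echo tpcbench --benchmark {args.benchmark} --name {args.engine} "
        f"--output {args.work_dir}/results", args.dry_run)
    run(f"echo generate-comparison --benchmark {args.benchmark}", args.dry_run)
    return args

args = main(["--benchmark", "tpcds", "--dry-run"])
```

Running it once with `--engine spark-iceberg` and once with `--engine comet-iceberg` would produce both result sets for the comparison step.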
