The Dataproc Docker on YARN feature allows you to create and use a Docker image to customize your Spark job runtime environment. The image can include customizations to Java, Python, and R dependencies, and to your job jar.
Limitations
This feature is not available or supported with:
- Dataproc image versions prior to 2.0.49 (not available in 1.5 images)
- MapReduce jobs (only supported for Spark jobs)
- Spark client mode (only supported with Spark cluster mode)
- Kerberos clusters: cluster creation fails if you create a cluster with Docker on YARN and Kerberos enabled.
- Customizations of JDK, Hadoop and Spark: the host JDK, Hadoop, and Spark are used, not your customizations.
Create a Docker image
The first step to customize your Spark environment is building a Docker image.
Dockerfile
You can use the following Dockerfile as an example, making changes and additions to meet your needs.
```
FROM debian:10-slim

# Suppress interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Required: Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Optional: Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY *.jar "${SPARK_EXTRA_JARS_DIR}"

# Optional: Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PYSPARK_DRIVER_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py39_4.10.3-Linux-x86_64.sh .
RUN bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Optional: Install Conda packages.
#
# The following packages are installed in the default image. It is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
  && ${CONDA_HOME}/bin/mamba install \
    conda \
    cython \
    fastavro \
    fastparquet \
    gcsfs \
    google-cloud-bigquery-storage \
    google-cloud-bigquery[pandas] \
    google-cloud-bigtable \
    google-cloud-container \
    google-cloud-datacatalog \
    google-cloud-dataproc \
    google-cloud-datastore \
    google-cloud-language \
    google-cloud-logging \
    google-cloud-monitoring \
    google-cloud-pubsub \
    google-cloud-redis \
    google-cloud-spanner \
    google-cloud-speech \
    google-cloud-storage \
    google-cloud-texttospeech \
    google-cloud-translate \
    google-cloud-vision \
    koalas \
    matplotlib \
    nltk \
    numba \
    numpy \
    openblas \
    orc \
    pandas \
    pyarrow \
    pysal \
    pytables \
    python \
    regex \
    requests \
    rtree \
    scikit-image \
    scikit-learn \
    scipy \
    seaborn \
    sqlalchemy \
    sympy \
    virtualenv

# Optional: Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
COPY test_util.py "${PYTHONPATH}"

# Required: Create the 'yarn_docker_user' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 yarn_docker_user
RUN useradd -u 1099 -g 1099 -d /home/yarn_docker_user -m yarn_docker_user

USER yarn_docker_user
```
Build and push the image
The following commands build and push the example Docker image; you can modify them to match your customizations.
```
# Increase the version number when there is a change to avoid referencing
# a cached older image. Avoid reusing the version number, including the default
# `latest` version.
IMAGE=gcr.io/my-project/my-image:1.0.1

# Download the BigQuery connector.
gcloud storage cp \
  gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .

# Download the Miniconda3 installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.10.3-Linux-x86_64.sh

# Python module example:
cat >test_util.py <<EOF
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
```
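After building the image, you can optionally sanity-check it locally before relying on it in a cluster. The following sketch assumes Docker is installed on your build machine and reuses the `${IMAGE}` variable set above; it confirms that the required `yarn_docker_user` account and the Miniconda environment from the example Dockerfile are present in the image.

```
# Optional local check (assumes Docker is installed on the build machine).
# Confirm the image runs as the required UID/GID 1099 user.
docker run --rm "${IMAGE}" id

# Confirm the Miniconda interpreter and an installed package are present.
docker run --rm "${IMAGE}" /opt/miniconda3/bin/python -c "import pandas; print(pandas.__version__)"
```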
Create a Dataproc cluster
After creating a Docker image that customizes your Spark environment, create a Dataproc cluster that will use your Docker image when running Spark jobs.
gcloud
```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=DP_IMAGE \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=DOCKER_IMAGE \
    other flags
```
Replace the following:
- CLUSTER_NAME: The cluster name.
- REGION: The cluster region.
- DP_IMAGE: The Dataproc image version, which must be 2.0.49 or later (--image-version=2.0 will use a qualified minor version later than 2.0.49).
- --optional-components=DOCKER: Enables the Docker component on the cluster.
- --properties flag:
  - dataproc:yarn.docker.enable=true: Required property to enable the Dataproc Docker on YARN feature.
  - dataproc:yarn.docker.image: Optional property that you can add to specify your DOCKER_IMAGE using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}. Example:
dataproc:yarn.docker.image=gcr.io/project-id/image:1.0.1
Requirement: You must host your Docker image on Container Registry or Artifact Registry. (Dataproc cannot fetch containers from other registries).
Recommendation: Add this property when you create your cluster to cache your Docker image and avoid YARN timeouts later when you submit a job that uses the image.
When dataproc:yarn.docker.enable is set to true, Dataproc updates Hadoop and Spark configurations to enable the Docker on YARN feature in the cluster. For example, spark.submit.deployMode is set to cluster, and spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS and spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS are set to mount directories from the host into the container.
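For example, a filled-in cluster creation command might look like the following sketch; the cluster name, region, and image URI are placeholder values for illustration only.

```
# Illustrative example only; substitute your own cluster name, region, and image.
gcloud dataproc clusters create my-docker-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=gcr.io/my-project/my-image:1.0.1
```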
Submit a Spark job to the cluster
After creating a Dataproc cluster, submit a Spark job to the cluster that uses your Docker image. The example in this section submits a PySpark job to the cluster.
Set job properties:
```
# Set the Docker image URI.
IMAGE=(e.g., gcr.io/my-project/my-image:1.0.1)

# Required: Use `#` as the delimiter for properties to avoid conflicts.
JOB_PROPERTIES='^#^'

# Required: Set Spark properties with the Docker image.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE}"

# Optional: Add custom jars to Spark classpath. Don't set these properties if
# there are no customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.driver.extraClassPath=/opt/spark/jars/*"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executor.extraClassPath=/opt/spark/jars/*"

# Optional: Set custom PySpark Python path only if there are customizations.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.python=/opt/miniconda3/bin/python"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.driver.python=/opt/miniconda3/bin/python"

# Optional: Set custom Python module path only if there are customizations.
# Since the `PYTHONPATH` environment variable defined in the Dockerfile is
# overridden by Spark, it must be set as a job property.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.PYTHONPATH=/opt/python/packages"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.PYTHONPATH=/opt/python/packages"
```

Notes:
- See Launching Applications Using Docker Containers for information on related properties.
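If you want to review the composed property string before submitting, you can expand the `#` delimiter into one property per line. This is an optional sanity check, not part of the documented flow.

```
# Optional: print the properties one per line for a quick review
# (the leading ^#^ delimiter token also appears in the output).
echo "${JOB_PROPERTIES}" | tr '#' '\n'
```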
gcloud
Submit the job to the cluster.
```
gcloud dataproc jobs submit pyspark PYFILE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=${JOB_PROPERTIES}
```

Replace the following:
- PYFILE: The file path to your PySpark job file. It can be a local file path or the URI of the file in Cloud Storage (gs://BUCKET_NAME/PySpark filename).
- CLUSTER_NAME: The cluster name.
- REGION: The cluster region.
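For example, a complete submission might look like the following sketch. The cluster name, region, and job file name are placeholder values; the test_util module and the /opt/miniconda3 interpreter path come from the example Dockerfile above, and JOB_PROPERTIES is the variable composed in the previous step.

```
# Illustrative example only; substitute your own cluster and region.
# Create a small PySpark job that exercises the customized environment.
cat >docker_job.py <<EOF
import sys
from pyspark.sql import SparkSession

import test_util  # Module copied into the image at /opt/python/packages.

spark = SparkSession.builder.appName("docker-on-yarn-test").getOrCreate()
test_util.hello("world")
print("Python executable: {}".format(sys.executable))  # Expect /opt/miniconda3/bin/python.
print(spark.range(10).count())
spark.stop()
EOF

# Submit the job with the properties composed above.
gcloud dataproc jobs submit pyspark docker_job.py \
    --cluster=my-docker-cluster \
    --region=us-central1 \
    --properties="${JOB_PROPERTIES}"
```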