
I have created a network

docker network create app-tier --driver bridge 

and used this Docker Compose file

networks:
  default:
    external:
      name: app-tier

services:
  minio:
    image: 'bitnami/minio:latest'
    container_name: my-minio-server
    environment:
      - MINIO_ROOT_USER=theroot
      - MINIO_ROOT_PASSWORD=theroot123
    ports:
      - '9000:9000'
      - '9001:9001'
    volumes:
      - ${HOME}/minio/data:/data
  spark:
    image: docker.io/bitnami/spark:3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
  spark-worker1:
    image: docker.io/bitnami/spark:3
    links:
      - "spark:spark"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7181:8081'
    volumes:
      - ./work1:/opt/bitnami/spark/work
      - ./conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
  spark-worker2:
    image: docker.io/bitnami/spark:3
    links:
      - "spark:spark"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7182:8082'
    volumes:
      - ./work2:/opt/bitnami/spark/work
      - ./conf/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf

I connected to MinIO at http://127.0.0.1:9001 with the above credentials and created a service account and an "asiatrip" bucket.

The service account has the following credentials

s3accessKeyAws = "n1Z8USynE2uOBJmc"
s3secretKeyAws = "RjK4uL35tFNTROo2WsPVZhA77AJ5qJEx"

I can successfully connect to it via the MinIO client

docker run -it --rm --name minio-client \
  --env MINIO_SERVER_HOST="my-minio-server" \
  --env MINIO_SERVER_ACCESS_KEY="theroot" \
  --env MINIO_SERVER_SECRET_KEY="theroot123" \
  --network app-tier --volume $HOME/mcconf:/.mc \
  bitnami/minio-client alias set minio http://my-minio-server:9000 n1Z8USynE2uOBJmc RjK4uL35tFNTROo2WsPVZhA77AJ5qJEx --api S3v4

and

docker run -it --rm --name minio-client \
  --env MINIO_SERVER_HOST="my-minio-server" \
  --env MINIO_SERVER_ACCESS_KEY="theroot" \
  --env MINIO_SERVER_SECRET_KEY="theroot123" \
  --network app-tier --volume $HOME/mcconf:/.mc \
  bitnami/minio-client ls minio

I can also use MinIO from a Dockerized Jupyter notebook in that network

docker run -it --network app-tier -p 8888:8888 jupyter/scipy-notebook:latest 

after installing the minio package with

!pip install minio 

and executing this Python script

from minio import Minio
from minio.error import S3Error

client = Minio(
    "my-minio-server:9000",
    access_key="n1Z8USynE2uOBJmc",
    secret_key="RjK4uL35tFNTROo2WsPVZhA77AJ5qJEx",
    secure=False,
)

# Make the 'asiatrip' bucket if it does not exist.
found = client.bucket_exists("asiatrip")
if not found:
    client.make_bucket("asiatrip")
else:
    print("Bucket 'asiatrip' already exists")

list(client.list_objects("asiatrip"))
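As an aside, the addresses.csv test file (its contents are shown further down) can be uploaded to the bucket from the same notebook; this is just an illustrative sketch, and the local file path is an assumption, not part of my original steps

from minio import Minio

# Reuses the same connection details as above; the local path ./addresses.csv is assumed.
client = Minio(
    "my-minio-server:9000",
    access_key="n1Z8USynE2uOBJmc",
    secret_key="RjK4uL35tFNTROo2WsPVZhA77AJ5qJEx",
    secure=False,
)

# fput_object streams a local file into the bucket as a single object.
client.fput_object("asiatrip", "addresses.csv", "./addresses.csv")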

So everything seems to be set up.

I installed hadoop-3.3.2 and spark-3.2.1-bin-without-hadoop

I set up my environment as follows

export HADOOP_HOME=$HOME/Downloads/hadoop-3.3.2
export SPARK_HOME=$HOME/Downloads/spark-3.2.1-bin-without-hadoop
export PATH=$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
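As a sanity check, it can be worth confirming that the S3A connector jars shipped with Hadoop are actually present; this is only a sketch of my own, assuming the standard layout of the Hadoop 3.3.2 tarball (share/hadoop/tools/lib), not part of the original setup

# Quick check that the jars HADOOP_OPTIONAL_TOOLS="hadoop-aws" relies on are present.
import glob
import os

tools_lib = os.path.join(os.environ["HADOOP_HOME"], "share", "hadoop", "tools", "lib")
for pattern in ("hadoop-aws-*.jar", "aws-java-sdk-bundle-*.jar"):
    matches = glob.glob(os.path.join(tools_lib, pattern))
    print(pattern, "->", matches or "NOT FOUND")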

When I run this Python file

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Test json")\
    .getOrCreate()

s3accessKeyAws = "n1Z8USynE2uOBJmc"
s3secretKeyAws = "RjK4uL35tFNTROo2WsPVZhA77AJ5qJEx"
connectionTimeOut = "1000"
s3endPointLoc = "http://127.0.0.1:9000"
sourceBucket = "asiatrip"

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", s3endPointLoc)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", s3accessKeyAws)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", s3secretKeyAws)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.connection.timeout", connectionTimeOut)
spark.sparkContext._jsc.hadoopConfiguration().set("spark.sql.debug.maxToStringFields", "100")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")

inputPath = f"s3a://{sourceBucket}/addresses.csv"
outputPath = f"s3a://{sourceBucket}/output_survey.csv"

df = spark.read.option("header", "true").format("s3selectCSV").csv(inputPath)
df.write.mode("overwrite").parquet(outputPath)

spark.stop()
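For reference, the same S3A settings can also be passed through the Spark configuration itself using the spark.hadoop. prefix, which Spark copies into the Hadoop configuration used by the job; this is only a sketch of an equivalent setup, not the script I actually ran

from pyspark.sql import SparkSession

# Sketch: equivalent S3A settings expressed as Spark config rather than via
# sparkContext._jsc.hadoopConfiguration().
spark = (
    SparkSession.builder
    .appName("Test json")
    .config("spark.hadoop.fs.s3a.endpoint", "http://127.0.0.1:9000")
    .config("spark.hadoop.fs.s3a.access.key", "n1Z8USynE2uOBJmc")
    .config("spark.hadoop.fs.s3a.secret.key", "RjK4uL35tFNTROo2WsPVZhA77AJ5qJEx")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)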

as

spark-submit miniospark.py 

it works fine for the addresses.csv file

a,b
1,2
3,4
6,7
8,9

in the asiatrip bucket.

When I submit as

spark-submit --master spark://127.0.0.1:7077 miniospark.py 

with

s3endPointLoc = "http://my-minio-server:9000"

It gives up after some time because it cannot resolve my-minio-server.

2022-05-18 15:12:32,246 WARN streaming.FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://asiatrip/addresses.csv.
org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://asiatrip/addresses.csv: com.amazonaws.SdkClientException: Unable to execute HTTP request: my-minio-server: nodename nor servname provided, or not known: Unable to execute HTTP request: my-minio-server: nodename nor servname provided, or not known

I am on an x64 Mac with Docker Desktop.

2 Answers


After a lot of trial and error, I have a solution.

Do yourself a favor and change the following line in /etc/hosts

127.0.0.1 localhost 

to

127.0.0.1 localhost my-minio-server 
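The reason this helps: with --master spark://127.0.0.1:7077 the spark-submit driver still runs on the host, outside the app-tier network, so it cannot use Docker's embedded DNS to resolve my-minio-server; the hosts entry gives the host that name locally. A quick way to verify the mapping before re-running the job (just an illustrative check, not part of the original fix)

import socket

# The driver runs on the host, so the host itself must resolve the S3A endpoint hostname.
print(socket.gethostbyname("my-minio-server"))  # expect 127.0.0.1 after editing /etc/hosts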

2 Comments

Thank you for sharing this, it worked for me too. Do you know why we have to do this? Shouldn't the Docker Compose network be able to configure this?
Yes, because the driver also does its DNS lookups outside of Docker. And yes, I know; I have an article on Medium about the whole adventure.

You can move your PySpark code into another container (for example, JupyterLab) in docker-compose.yaml, so it shares the same network as your Spark containers.

Then you can get the MinIO container IP, replace s3endPointLoc = "http://127.0.0.1:9000" with s3endPointLoc = "http://<minio_container_ip>:9000", and then try running the PySpark code again.

To get minio_container_ip (from container_name=my-minio-server)

docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-minio-server 
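If you want to wire that lookup into the PySpark script itself, here is a rough sketch (my own, assuming the docker CLI is available where the script runs, which is not covered by the original answer)

# Illustrative: read the MinIO container's IP on the app-tier network via the docker CLI
# and build the S3A endpoint from it. Assumes docker is on PATH.
import subprocess

minio_ip = subprocess.check_output(
    [
        "docker", "inspect", "-f",
        "{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}",
        "my-minio-server",
    ],
    text=True,
).strip()

s3endPointLoc = f"http://{minio_ip}:9000"
print(s3endPointLoc)

Keep in mind that the container IP can change whenever the container is recreated, so this lookup should be repeated before each run.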

Comments
