
I'm getting the following error:

Traceback (most recent call last):
  File "/opt/application/main.py", line 6, in <module>
    from pyspark import SparkConf, SparkContext
ModuleNotFoundError: No module named 'pyspark'

This happens while running PySpark in Docker.
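For reference, the failing import on line 6 of main.py is the standard pyspark one shown in the traceback; a stripped-down version of the script looks roughly like this (illustrative only, not my exact code):

# main.py -- minimal version; the pyspark import is what raises ModuleNotFoundError
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example").setMaster("local[*]")  # illustrative settings
sc = SparkContext(conf=conf)
print(sc.version)
sc.stop()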

My Dockerfile is as follows:

FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36
RUN yum -y install wget
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN pip3.6 install numpy
RUN pip3.6 install pandas
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
    && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
RUN yum -y install java-1.8.0-openjdk
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]

1 Answer


You forgot to install pyspark in your Dockerfile. Add a pip3.6 install pyspark step:

FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
RUN yum -y install python36
RUN yum -y install wget
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN pip3.6 install numpy
RUN pip3.6 install pandas
RUN pip3.6 install pyspark    # add this line
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
    && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
RUN yum -y install java-1.8.0-openjdk
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]
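To confirm that the module now resolves inside the image, you can copy a tiny check script into it and run that instead of the real job (a minimal sketch; check_pyspark.py is a hypothetical file name, not something your build already has):

# check_pyspark.py -- sanity check that the pip-installed pyspark is importable
import pyspark

print("pyspark version:", pyspark.__version__)  # the version pip installed
print("resolved from:", pyspark.__file__)       # which site-packages it was loaded from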

Edit: Dockerfile improvement:

FROM centos
ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.4.7
ENV HADOOP_VERSION=2.7
WORKDIR /opt/application
# you can install python36, wget and java in a single layer
RUN yum -y install python36 wget java-1.8.0-openjdk
ENV PYSPARK_PYTHON python3.6
ENV PYSPARK_DRIVER_PYTHON python3.6
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
# you should also pin the versions you need; pandas 1.2.x does not support Python 3.6
RUN wget https://bootstrap.pypa.io/get-pip.py \
    && python get-pip.py \
    && pip3.6 install numpy==1.19 pandas==1.1.5 pyspark==3.0.2
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
    && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
ENV SPARK_HOME=/usr/local/bin/spark
ENV JAVA_HOME /usr/lib/jvm/jre
COPY main.py .
RUN chmod +x /opt/application/main.py
CMD ["/opt/application/main.py"]
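Once the image builds, a short smoke test exercises both the pip-installed pyspark and the JVM, so Java or version problems show up before the real job runs (a sketch; smoke_test.py is a hypothetical script you would COPY into the image the same way as main.py):

# smoke_test.py -- run a trivial local job to verify pyspark and Java work together
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("smoke-test").setMaster("local[*]")
sc = SparkContext(conf=conf)
print("Spark version:", sc.version)                          # 3.0.2 with the pinned pip package
print("sum 1..100 =", sc.parallelize(range(1, 101)).sum())   # trivial job exercising the local executor
sc.stop()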

