In Apache Spark there is a special RDD, PipedRDD, which pipes an RDD's elements through an external program (for example a CUDA-based C++ binary) to enable faster calculations.
Here is a small example to explain it.
Shell script: test.sh

#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
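To sanity-check the script outside Spark, you can run it by hand. The /tmp path below is just a scratch location for this sketch, not part of the Spark example:

```shell
# Recreate test.sh in a scratch location
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
EOF
chmod +x /tmp/test.sh

# Feed it two lines on stdin
printf 'hi\nhello\n' | /tmp/test.sh
# prints:
# Running shell script
# hi!
# hello!
```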
Pipe RDD data to the shell script:

val scriptPath = "/home/hadoop/test.sh"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.collect()
Now let's look at how pipe() drives the external program. Internally, Spark spawns the process, reads its stderr on one thread, and writes the partition's elements to its stdin on another, roughly as in this Scala snippet:
import java.io.PrintWriter
import scala.io.Source

val command = "/home/hadoop/test.sh"  // the external program to run
val proc = Runtime.getRuntime.exec(Array(command))

// Read the process's stderr on a separate thread
new Thread("stderr reader for " + command) {
  override def run() {
    for (line <- Source.fromInputStream(proc.getErrorStream).getLines)
      System.err.println(line)
  }
}.start()

// Write the input elements to the process's stdin on another thread
val lineList = List("hello", "how", "are", "you")
new Thread("stdin writer for " + command) {
  override def run() {
    val out = new PrintWriter(proc.getOutputStream)
    for (elem <- lineList) out.println(elem)
    out.close()
  }
}.start()
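The same three roles (the spawned process, the stdin writer, the stdout consumer) can be sketched in shell. The FIFO and the /tmp paths are assumptions of this sketch, not anything Spark itself uses:

```shell
# Recreate the script from the example above
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
EOF
chmod +x /tmp/test.sh

# Spawn the external process with a FIFO as its stdin ("the process")
rm -f /tmp/in.fifo /tmp/out.txt
mkfifo /tmp/in.fifo
/tmp/test.sh < /tmp/in.fifo > /tmp/out.txt &

# Feed it the elements ("stdin writer"), then wait for it to exit
printf '%s\n' hello how are you > /tmp/in.fifo
wait

# Collect its output ("stdout reader")
cat /tmp/out.txt
```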
Spark RDD

val dataRDD = sc.parallelize(List("hi", "hello", "how", "are", "you"))
val scriptPath = "/home/hadoop/test.sh"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.collect()
Results:
Array[String] = Array(Running shell script, hi!, Running shell script, hello!, Running shell script, how!, Running shell script, are!, you!)
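Notice that "Running shell script" appears once per partition, not once per element: pipe() launches one copy of the script for each partition and streams that partition's elements through it (here the five elements apparently landed in four partitions, with "are" and "you" sharing the last one). A quick simulation of two partitions, using a scratch copy of the script:

```shell
# Recreate the script in a scratch location
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
EOF
chmod +x /tmp/test.sh

# One script invocation per "partition" -- the banner repeats per invocation
printf 'hi\nhello\n' | /tmp/test.sh       # partition 1
printf 'how\nare\nyou\n' | /tmp/test.sh   # partition 2
```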