
I'm attempting to run an external C++ program on Apache Spark using rdd.pipe(). I can't find enough information in the documentation, so I'm asking here.

Does the external script need to be available on all nodes in the cluster when using rdd.pipe()?

What if I don't have permission to install anything on the nodes of the cluster? Is there any other way to make the script available to the worker nodes?

2 Answers


In Apache Spark there is a special RDD, PipedRDD, which provides calls to external programs (such as CUDA-based C++ programs) to enable faster calculations.

I am adding a small example to explain.

Shell script: test.sh

#!/bin/sh
echo "Running shell script"
# read each line from stdin and echo it back with "!" appended
while read LINE; do
  echo ${LINE}!
done

Pipe RDD data to the shell script:

// dataRDD is assumed to be an existing RDD[String]
val scriptPath = "/home/hadoop/test.sh"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.collect()

Now create a Scala program that runs an external command the way pipe() does under the hood: it launches the process, writes the data to its stdin on one thread, and echoes its stderr on another:

import java.io.PrintWriter
import scala.io.Source

// command is the external command to run, e.g. the scriptPath above
val proc = Runtime.getRuntime.exec(Array(command))

// Forward the child process's stderr on a separate thread.
new Thread("stderr reader for " + command) {
  override def run() {
    for (line <- Source.fromInputStream(proc.getErrorStream).getLines)
      System.err.println(line)
  }
}.start()

// Feed the input elements to the child process's stdin.
val lineList = List("hello", "how", "are", "you")
new Thread("stdin writer for " + command) {
  override def run() {
    val out = new PrintWriter(proc.getOutputStream)
    for (elem <- lineList)
      out.println(elem)
    out.close()
  }
}.start()
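The snippet above only writes the child's stdin and forwards its stderr; as a small addition (not part of the original answer), the command's stdout, which is what pipe() turns into the elements of the resulting RDD, can be consumed like this:

// Assumes proc from the snippet above; getInputStream is the child's stdout.
val outputLines = Source.fromInputStream(proc.getInputStream).getLines.toList
proc.waitFor()                    // wait for the external command to finish
outputLines.foreach(println)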

Spark RDD

val data = sc.parallelize(List("hi", "hello", "how", "are", "you"))
val scriptPath = "/root/echo.sh"
// pipe each partition of data through the script
val pipeRDD = data.pipe(scriptPath)
pipeRDD.collect()

Results:

Array[String] = Array(Running shell script, hi!, Running shell script, hello!, Running shell script, how!, Running shell script, are!, you!) 
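Note that "Running shell script" appears once per partition, because pipe() starts one instance of the script for each partition of the RDD. Assuming the data RDD from above, the partition count can be checked with:

// presumably 4 here, matching the four "Running shell script" lines in the output
data.getNumPartitions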

1 Comment

Thanks for the input, but it seems that having the external script only on the driver node or in HDFS is not enough. Executors are throwing an error: "Cannot run program "path/to/program": error=2, No such file or directory"

It seems, after all, that the external script needs to be present on all executor nodes. One way to achieve this is to pass the script via spark-submit (e.g. --files script.sh); you should then be able to refer to it as "./script.sh" in rdd.pipe.
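A minimal sketch of that approach (the script name test.sh, the application jar/class, and dataRDD below are placeholders, not taken from the question):

// Ship the script with the application:
//   spark-submit --files /local/path/test.sh --class com.example.MyApp myapp.jar
// --files copies test.sh into each executor's working directory, so every
// task can reach it via the relative path "./test.sh".
val pipeRDD = dataRDD.pipe("./test.sh")
pipeRDD.collect()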

