In Apache Spark there is a special RDD, PipedRDD, which pipes an RDD's elements through an external program (for example a CUDA-based C++ binary) to enable faster calculations.
Here is a small example to explain it.
Shell script: test.sh

#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
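To sanity-check the script outside Spark, you can run it by hand. The /tmp path below is just a scratch location for this sketch, not part of the Spark example:

```shell
# Recreate test.sh in a scratch location
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
EOF
chmod +x /tmp/test.sh

# Feed it two lines on stdin
printf 'hi\nhello\n' | /tmp/test.sh
# prints:
# Running shell script
# hi!
# hello!
```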
Pipe RDD data to the shell script:

val scriptPath = "/home/hadoop/test.sh"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.collect()
Now let's look at how pipe() drives the external program. Internally, Spark spawns the process, reads its stderr on one thread, and writes the partition's elements to its stdin on another, roughly as in this Scala snippet:
import java.io.PrintWriter
import scala.io.Source

val command = "/home/hadoop/test.sh"  // the external program to run
val proc = Runtime.getRuntime.exec(Array(command))

// Read the process's stderr on a separate thread
new Thread("stderr reader for " + command) {
  override def run() {
    for (line <- Source.fromInputStream(proc.getErrorStream).getLines)
      System.err.println(line)
  }
}.start()

// Write the input elements to the process's stdin on another thread
val lineList = List("hello", "how", "are", "you")
new Thread("stdin writer for " + command) {
  override def run() {
    val out = new PrintWriter(proc.getOutputStream)
    for (elem <- lineList) out.println(elem)
    out.close()
  }
}.start()
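The same three roles (the spawned process, the stdin writer, the stdout consumer) can be sketched in shell. The FIFO and the /tmp paths are assumptions of this sketch, not anything Spark itself uses:

```shell
# Recreate the script from the example above
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
EOF
chmod +x /tmp/test.sh

# Spawn the external process with a FIFO as its stdin ("the process")
rm -f /tmp/in.fifo /tmp/out.txt
mkfifo /tmp/in.fifo
/tmp/test.sh < /tmp/in.fifo > /tmp/out.txt &

# Feed it the elements ("stdin writer"), then wait for it to exit
printf '%s\n' hello how are you > /tmp/in.fifo
wait

# Collect its output ("stdout reader")
cat /tmp/out.txt
```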
Spark RDD

val dataRDD = sc.parallelize(List("hi", "hello", "how", "are", "you"))
val scriptPath = "/home/hadoop/test.sh"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.collect()
Results:
Array[String] = Array(Running shell script, hi!, Running shell script, hello!, Running shell script, how!, Running shell script, are!, you!)
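Notice that "Running shell script" appears once per partition, not once per element: pipe() launches one copy of the script for each partition and streams that partition's elements through it (here the five elements apparently landed in four partitions, with "are" and "you" sharing the last one). A quick simulation of two partitions, using a scratch copy of the script:

```shell
# Recreate the script in a scratch location
cat > /tmp/test.sh <<'EOF'
#!/bin/sh
echo "Running shell script"
while read LINE; do
  echo ${LINE}!
done
EOF
chmod +x /tmp/test.sh

# One script invocation per "partition" -- the banner repeats per invocation
printf 'hi\nhello\n' | /tmp/test.sh       # partition 1
printf 'how\nare\nyou\n' | /tmp/test.sh   # partition 2
```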