My task is to write code that reads a big file (one that doesn't fit into memory), reverses it, and outputs the five most frequent words.
I have written the code below and it does the job.
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    object ReverseFile {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Reverse File")
        conf.set("spark.hadoop.validateOutputSpecs", "false")
        val sc = new SparkContext(conf)

        val txtFile = "path/README_mid.md"
        val txtData = sc.textFile(txtFile)
        txtData.cache()

        // Reverse each line, then reverse the order of the lines
        val tmp = txtData.map(l => l.reverse)
          .zipWithIndex()
          .map { case (x, y) => (y, x) }
          .sortByKey(ascending = false)
          .map { case (u, v) => v }
        tmp.coalesce(1, true).saveAsTextFile("path/out.md")

        // Read the reversed file back and count word frequencies
        val txtOut = "path/out.md"
        val txtOutData = sc.textFile(txtOut)
        txtOutData.cache()

        val wcData = txtOutData.flatMap(l => l.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .map(item => item.swap)
          .sortByKey(ascending = false)
        wcData.collect().take(5).foreach(println)
      }
    }

The problem is that I'm new to Spark and Scala, and as you can see in the code, I first read the file, reverse it, and save it, then read the reversed file back and output the five most frequent words.
- Is there a way to tell Spark to save tmp and compute wcData from it directly (without saving and re-opening the file)? Otherwise it's like reading the file twice.
- From now on I'm going to work with Spark a lot, so if there is any part of the code (Spark-specific things, not things like the absolute path names) that you think could be written better, I'd appreciate it.
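To clarify the first point, what I'm imagining is something like the following sketch: keep the reversed lines in one cached RDD and run both the save and the word count against it. This is untested, and the paths are just the placeholders from my code above.

```scala
// Sketch: chain the reversal and the word count without an
// intermediate file. Untested; paths are placeholders.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object ReverseFileChained {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Reverse File Chained")
    conf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)

    val txtData = sc.textFile("path/README_mid.md")

    // Reverse each line and reverse the line order, as before
    val reversed = txtData.map(_.reverse)
      .zipWithIndex()
      .map { case (line, idx) => (idx, line) }
      .sortByKey(ascending = false)
      .map(_._2)
    reversed.cache() // reuse this RDD below instead of re-reading from disk

    // Still save the reversed file as a single partition
    reversed.coalesce(1, shuffle = true).saveAsTextFile("path/out.md")

    // Word count on the same cached RDD -- no second read of the file;
    // take(5) on the RDD avoids collecting all counts to the driver
    reversed.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map(_.swap)
      .sortByKey(ascending = false)
      .take(5)
      .foreach(println)
  }
}
```

Is caching the RDD like this the right way to avoid the double read, or is there a better idiom?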