I apologise for what will probably be a simple question, but I'm struggling to get to grips with parsing RDDs in Scala/Spark. I have an RDD created from a CSV file, read in with
    val partitions: RDD[(String, String, String, String, String)] = withoutHeader.mapPartitions(lines => {
      val parser = new CSVParser(',')
      lines.map(line => {
        val columns = parser.parseLine(line)
        (columns(0), columns(1), columns(2), columns(3), columns(4))
      })
    })

When I output this to a file with
    partitions.saveAsTextFile(file)

I get output with parentheses around each line, and I don't want those parentheses. More generally, I'm struggling to understand what is happening here. My background is in low-level languages, and I'm finding it hard to see through the abstractions to what is actually going on. I understand the mappings, but it's the output that is escaping me. Can someone either explain what is going on in the line

    (columns(0), columns(1), columns(2), columns(3), columns(4))

or point me to a guide that explains it simply?
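To show what I mean, here is a minimal plain-Scala sketch (no Spark) of what I think that line is doing, assuming parseLine returns an Array[String] the way opencsv's parser does:

    // Hypothetical stand-in for the result of parser.parseLine(line)
    val columns = Array("a", "b", "c", "d", "e")

    // The line in question builds a five-element tuple (a Tuple5):
    val row = (columns(0), columns(1), columns(2), columns(3), columns(4))

    // saveAsTextFile appears to call toString on each record, and a
    // tuple's toString includes the parentheses:
    println(row.toString)                       // (a,b,c,d,e)

    // Joining the fields myself gives the bare CSV line I was expecting:
    println(row.productIterator.mkString(","))  // a,b,c,d,e

Is mapping each tuple through something like productIterator.mkString(",") before saving the right way to get rid of the parentheses, or am I misunderstanding the model?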
My ultimate goal is to be able to manipulate files on HDFS in Spark to put them into formats suitable for MLlib. I'm unimpressed with the Spark and Scala guides, as they look like they were produced from poorly annotated Javadocs and don't really explain anything.
Thanks in advance.
Dean