I'm learning Scala and am curious how to optimize this code. I have an RDD loaded in Spark from a tab-delimited dataset. I want to join the first and second columns with a "-" and append the result as a new column at the end of each row.

For example: column1\tcolumn2\tcolumn3

becomes

column1\tcolumn2\tcolumn3\tcolumn1-column2

    val f = sc.textFile("path/to/dataset")
    f.map(line => if (line.split("\t").length > 1) line.split("\t") :+ line.split("\t")(0)+"-"+line.split("\t")(1) else Array[String]())
      .map(a => a.mkString("\t"))
      .saveAsTextFile("output/path")
    I'd start by doing that string split only once. Commented May 18, 2015 at 21:39

1 Answer

Try:

    f.map { line =>
      val cols = line.split("\t")
      if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
      else line
    }
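To try the split-once idea without a cluster, the per-line logic can be pulled into a plain function and checked on an ordinary collection. This is only a sketch: the name appendKey and the sample lines are made up for illustration and are not from the original post. Note that, like the snippet above, it leaves lines with fewer than two columns unchanged, whereas the code in the question turns them into empty lines.

```scala
// Hypothetical helper: the name appendKey is an assumption, not from the post.
def appendKey(line: String): String = {
  val cols = line.split("\t") // split once and reuse the result
  if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
  else line // fewer than two columns: leave the line unchanged
}

// Local check on a plain Seq, no Spark needed (sample data is illustrative).
val sample = Seq("column1\tcolumn2\tcolumn3", "onlyOneColumn")
val result = sample.map(appendKey)
result.foreach(println)

// With the RDD `f` from the question, the same function would plug in as:
// f.map(appendKey).saveAsTextFile("output/path")
```

Keeping the logic in a named function also makes it easy to test separately from the Spark job.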
1 Comment

Thanks. Changing it from map( to map{ allowed me to define a val inside.
