
In the Spark shell, I'm reading an input file and trimming the field values, then saving the final RDD using the saveAsTextFile() method. The field separator in the input file is '|', but in the output file I'm getting ',' as the field separator.

Input format: abc | def | xyz
Default output format: abc,def,xyz

Required output format: abc|def|xyz

Is there any way to change the default output delimiter to '|'? If yes, please suggest how.


1 Answer


For an RDD, you just need to build a pipe-separated string from each tuple's product iterator:

scala> val rdd = sc.parallelize(Seq(("a", 1, 3), ("b", 2, 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:27

scala> rdd.map { x => x.productIterator.toSeq.mkString("|") }.collect
res9: Array[String] = Array(a|1|3, b|2|10)

scala> rdd.map { x => x.productIterator.toSeq.mkString("|") }.saveAsTextFile("test")
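The same trick works for any Product, not just tuples, so records modeled as case classes can be flattened identically. A minimal sketch (the Record class and its fields are illustrative, not from the question):

// Hypothetical case class standing in for the records.
case class Record(a: String, b: Int, c: Int)

val recs = sc.parallelize(Seq(Record("a", 1, 3), Record("b", 2, 10)))

// Case classes implement Product, so productIterator yields their
// fields in declaration order, exactly as it does for tuples.
recs.map(_.productIterator.mkString("|")).saveAsTextFile("recs_out")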

Now let's check the contents of the output files:

$ cat test/part-0000*
a|1|3
b|2|10
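
To tie this back to the original question (read a '|'-delimited file, trim each field, and write it back out with '|'), the same mkString step closes the loop. A minimal sketch, assuming the hypothetical paths input.txt and output and one record per line:

// Read lines like "abc | def | xyz", trim each field,
// and re-join with '|' before saving.
sc.textFile("input.txt")
  .map(_.split('|').map(_.trim).mkString("|"))
  .saveAsTextFile("output")   // part files contain "abc|def|xyz"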