The second version returns the Option itself for the bar field of the case class, so you do not get the plain (unwrapped) value the way the first version does. You can use the following code to get the plain values:

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map { field =>
    field.setAccessible(true)
    val returnValue = field.get(record)
    if (returnValue.isInstanceOf[Option[_]]) {
      // unwrap the Option so the Row holds the plain value (null when the field is None)
      returnValue.asInstanceOf[Option[Any]].orNull
    } else {
      returnValue
    }
  }
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
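For reference, here is a minimal sketch of how this could be exercised. The exact definition of MyCaseClass is not shown in your question, so the one below is only an assumption:

// Assumed shape of the case class; adjust it to match your actual definition.
case class MyCaseClass(foo: String, bar: Option[String])

val row = buildRowWithSchemaV2(MyCaseClass("test1", Some("value1")))
println(row)        // the bar column now holds the plain value, e.g. [test1,value1]
println(row.schema) // the StructType derived from MyCaseClass via Encoders.product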

That said, I would suggest using a DataFrame or Dataset instead.

A DataFrame or Dataset is itself a collection of Rows with a schema.
So once you have a case class defined, you just need to encode your input data into that case class. For example, let's say your input data is:

val data = Seq(("test1", "value1"),("test2", "value2"),("test3", "value3"),("test4", null)) 

If your input is a text file, you can read it with sparkContext.textFile and split each line as needed.
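As a rough sketch (the file name and the comma delimiter here are just assumptions about your input):

// Hypothetical input file "input.txt" with lines like "test1,value1".
val rdd = sparkContext.textFile("input.txt")
  .map(_.split(",", -1))
  .map(parts => (parts(0), if (parts.length > 1 && parts(1).nonEmpty) parts(1) else null))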
Once your data is in a Seq or an RDD like that, converting it to a DataFrame or Dataset takes two lines of code:

import sqlContext.implicits._
val dataFrame = data.map(d => MyCaseClass(d._1, Option(d._2))).toDF

Using .toDS instead would give you a Dataset. Either way, you now have a collection of Rows with a schema.
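For example, the Dataset variant of the same pipeline would be just:

val dataSet = data.map(d => MyCaseClass(d._1, Option(d._2))).toDS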
To verify, you can do the following:

println(dataFrame.schema)                   // check that there is a schema
println(dataFrame.take(1).getClass.getName) // check that it is a collection of Rows
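If you prefer a more readable check, printSchema and show (both standard DataFrame methods) do the same job:

dataFrame.printSchema() // pretty-prints the schema tree
dataFrame.show()        // displays the first rows in tabular form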

Hope this answers your question.
