The second version returns the Option itself for the bar field of the case class, so you do not get the plain (unwrapped) value the way the first version does. You can use the following code to get the plain values:

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map { field =>
    field.setAccessible(true)
    val returnValue = field.get(record)
    if (returnValue.isInstanceOf[Option[_]]) {
      // unwrap the Option so the Row holds the plain value (null when the field is None)
      returnValue.asInstanceOf[Option[Any]].orNull
    } else {
      returnValue
    }
  }
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
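For reference, here is a minimal sketch of how this could be exercised. The exact definition of MyCaseClass is not shown in your question, so the one below is only an assumption:

// Assumed shape of the case class; adjust it to match your actual definition.
case class MyCaseClass(foo: String, bar: Option[String])

val row = buildRowWithSchemaV2(MyCaseClass("test1", Some("value1")))
println(row)        // the bar column now holds the plain value, e.g. [test1,value1]
println(row.schema) // the StructType derived from MyCaseClass via Encoders.product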

That said, I would suggest using a DataFrame or Dataset instead.

A DataFrame or Dataset is itself a collection of Rows with a schema.
So once you have a case class defined, you just need to encode your input data into that case class. For example, let's say your input data is:

val data = Seq(("test1", "value1"),("test2", "value2"),("test3", "value3"),("test4", null)) 

If your input is a text file, you can read it with sparkContext.textFile and split each line as needed.
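As a rough sketch (the file name and the comma delimiter here are just assumptions about your input):

// Hypothetical input file "input.txt" with lines like "test1,value1".
val rdd = sparkContext.textFile("input.txt")
  .map(_.split(",", -1))
  .map(parts => (parts(0), if (parts.length > 1 && parts(1).nonEmpty) parts(1) else null))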
Once your data is in a Seq or an RDD like that, converting it to a DataFrame or Dataset takes two lines of code:

import sqlContext.implicits._
val dataFrame = data.map(d => MyCaseClass(d._1, Option(d._2))).toDF

Using .toDS instead would give you a Dataset. Either way, you now have a collection of Rows with a schema.
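For example, the Dataset variant of the same pipeline would be just:

val dataSet = data.map(d => MyCaseClass(d._1, Option(d._2))).toDS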
To verify, you can do the following:

println(dataFrame.schema)                   // check that there is a schema
println(dataFrame.take(1).getClass.getName) // check that it is a collection of Rows
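If you prefer a more readable check, printSchema and show (both standard DataFrame methods) do the same job:

dataFrame.printSchema() // pretty-prints the schema tree
dataFrame.show()        // displays the first rows in tabular form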

Hope this answers your question.
