I'd like to create a Row with a schema from a case class to test one of my map functions. The most straightforward way I can think of doing this is:
```scala
import org.apache.spark.sql.Row

case class MyCaseClass(foo: String, bar: Option[String])

def buildRowWithSchema(record: MyCaseClass): Row = {
  sparkSession.createDataFrame(Seq(record)).collect.head
}
```

However, this seemed like a lot of overhead just to get a single Row, so I looked into how I could create a Row with a schema directly. This led me to:
```scala
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.{Encoders, Row}

def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map { field =>
    field.setAccessible(true)
    field.get(record)
  }
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
```

Unfortunately, the Row returned by the second version differs from the one returned by the first: Option fields are unwrapped to their underlying values in the first version, but remain wrapped in Option in the second. The second version is also quite unwieldy.
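To make the mismatch concrete, here is roughly what the two versions produce for a sample record, reusing the definitions above (the output in the comments is illustrative, based on Row's default toString):

```scala
val record = MyCaseClass("foo", Some("bar"))

// Version 1: Spark's encoder unwraps the Option, so the cell holds the raw value.
buildRowWithSchema(record)   // => [foo,bar]

// Version 2: the reflected field value is stored as-is, Option wrapper and all,
// so e.g. row.getString(1) would fail here even though it works on version 1.
buildRowWithSchemaV2(record) // => [foo,Some(bar)]
```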
Is there a better way to do this?