
I am running into a problem trying to convert one of the columns of a spark dataframe from a hexadecimal string to a double. I have the following code:

import java.math.BigInteger
import spark.implicits._

case class MsgRow(block_number: Long, to: String, from: String, value: Double)

def hex2int(hex: String): Double =
  new BigInteger(hex.substring(2), 16).doubleValue

txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)

I can't share the content of my txs dataframe but here is the metadata:

>txs org.apache.spark.sql.DataFrame = [blockNumber: bigint, to: string ... 4 more fields] 

but when I run this I get the error:

error: type mismatch;
 found   : MsgRow
 required: org.apache.spark.sql.Row
       MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
       ^

I don't understand -- why is spark/scala expecting a row object? None of the examples I have seen involve an explicit conversion to a row, and in fact most of them involve an anonymous function returning a case class object, as I have above. And for some reason, googling "required: org.apache.spark.sql.Row" returns only five results, none of which pertains to my situation. Which is why I made the title so non-specific since there is little chance of a false positive. Thanks in advance!

2 Answers


Your error occurs because you are storing the output back into the same variable: txs is declared as a DataFrame (i.e., a Dataset[Row]), while your map returns MsgRow objects. So changing

txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)

to

val newTxs = txs.map(row =>
  MsgRow(
    row.getLong(0),
    row.getString(1),
    row.getString(2),
    new BigInteger(row.getString(3).substring(2), 16).doubleValue
  )
)

should solve your issue.

I have excluded the hex2int function because it was causing a serialization error.
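The serialization error usually means the function is defined as a method on a non-serializable enclosing class, so Spark tries to ship that whole outer object to the executors. A minimal sketch of one common workaround, assuming you still want a named function (the object name HexUtils is made up here, not from the question):

```scala
import java.math.BigInteger

// Hypothetical helper: a standalone object holds no captured state, so
// referencing HexUtils.hex2double inside a Spark closure serializes only
// the function reference, not a non-serializable enclosing class.
object HexUtils extends Serializable {
  def hex2double(hex: String): Double =
    new BigInteger(hex.substring(2), 16).doubleValue  // drop the "0x" prefix
}
```

Calling HexUtils.hex2double(row.getString(3)) inside the map should then behave the same as the inlined expression above.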


1 Comment

Ahh, I see. I did not realize that the actual rows needed to be the same type in order to overwrite a dataframe var. I think the more fundamental problem (see solution) is that the result returned from map is a dataset, not a dataframe.

Thank you @Ramesh for pointing out the bug in my code. His solution works, though it does not mention the problem that pertains more directly to my OP, which is that the result returned from map is not a DataFrame but rather a Dataset. Rather than creating a new variable, all I needed to do was change

txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
)

to

txs = txs.map(row =>
  MsgRow(row.getLong(0), row.getString(1), row.getString(2), hex2int(row.getString(3)))
).toDF

This would probably be the easy answer for most errors containing my title. While @Ramesh's answer got rid of that error, I ran into another error later related to the same fundamental issue when I tried to join this result to another DataFrame.
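To illustrate the Dataset-vs-DataFrame distinction concretely, here is a self-contained sketch (not the OP's actual data; it assumes a local SparkSession and made-up sample rows): map on a DataFrame yields a typed Dataset[MsgRow], and .toDF converts it back so it can be joined with plain DataFrames.

```scala
import org.apache.spark.sql.SparkSession

object ToDfSketch {
  case class MsgRow(block_number: Long, to: String, from: String, value: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("toDF-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for txs.
    val txs = Seq((1L, "addr1", "addr2", "0xff")).toDF("blockNumber", "to", "from", "value")

    // map returns a Dataset[MsgRow], not a DataFrame.
    val typed = txs.map(row =>
      MsgRow(row.getLong(0), row.getString(1), row.getString(2),
        new java.math.BigInteger(row.getString(3).substring(2), 16).doubleValue))

    // .toDF brings it back to an untyped DataFrame, e.g. for joins.
    typed.toDF.show()
    spark.stop()
  }
}
```

The sketch requires a Spark runtime on the classpath; it only demonstrates the type transition, not the OP's downstream join.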
