
What's the easy way to combine two columns in SparkR? Consider the following Spark DataFrame:

salary_from  salary_to  position
1500         null       a
null         1300       b
800          1000       c

I would like to get a combined salary column with the following logic: from salary_from and salary_to take the one that is not null, and if both are present, take the value in the middle (their average).

salary_from  salary_to  position  salary
1500         null       a         1500
null         1300       b         1300
800          1000       c         900

Is there a way to walk through every row and apply my logic, like I would with the apply function in R?
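For reference, a minimal sketch of how this example frame could be built in SparkR (the Spark 2.x-style session call is an assumption; on 1.x you would pass a sqlContext to createDataFrame):

library(SparkR)
sparkR.session()  # assumption: Spark 2.x-style entry point

# NA in a local data.frame becomes null once the data is in Spark
local_df <- data.frame(
  salary_from = c(1500, NA, 800),
  salary_to   = c(NA, 1300, 1000),
  position    = c("a", "b", "c")
)
sdf <- createDataFrame(local_df)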

  • I heard about a package combining SparkR and dplyr, SparkRext, but I didn't use it yet: github.com/hoxo-m/SparkRext. Maybe it could help you. Commented Apr 4, 2016 at 15:18

1 Answer


You can use the coalesce function:

withColumn(
  sdf, "salary",
  expr("coalesce((salary_from + salary_to) / 2, salary_from, salary_to)")
)

which returns the first non-null expression.
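Applied to the example frame from the question (sdf as sketched there; the expected values follow from the question's target table):

sdf <- withColumn(
  sdf, "salary",
  expr("coalesce((salary_from + salary_to) / 2, salary_from, salary_to)")
)
head(sdf)
#   salary_from salary_to position salary
#          1500        NA        a   1500
#            NA      1300        b   1300
#           800      1000        c    900

When either operand of + is null, the sum (and hence the average) is null, so coalesce falls through to the single non-null column.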


Comments

I can't find coalesce in the SparkR API reference spark.apache.org/docs/latest/api/R/index.html. Could you please point me to where I can find more about it?
You cannot, because it is not there yet. That's why you need expr. Otherwise it is just plain SQL coalesce, so any SQL reference will do. For example w3schools.com/sql/sql_isnull.asp. Or the PySpark docstrings: github.com/apache/spark/blob/master/python/pyspark/sql/…
Is there a way to loop through the rows of a Spark DataFrame?
Other than fetching the complete structure to the driver? No (see the sketch below).
This all sounds to me like SparkR is incomplete and inefficient. Don't you think so? Or maybe I just misunderstand its intended use.
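For completeness, a sketch of the "fetch to the driver" route mentioned above, the only way to get true row-by-row apply semantics (feasible only when the data fits in driver memory; sdf is the example frame from the question):

local_df <- collect(sdf)  # pull everything to the driver as a plain R data.frame
local_df$salary <- apply(
  local_df[, c("salary_from", "salary_to")], 1,
  function(row) {
    present <- row[!is.na(row)]
    if (length(present) == 0) NA else mean(present)  # midpoint if both, else the one value
  }
)

The distributed coalesce/expr approach in the answer is preferable for anything large, since collect() moves the entire dataset through the driver.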
