
We have a specific need where I have to drop from a DataFrame every column that contains only one unique value. The following is what we are doing:

val rawdata = spark.read.format("csv").option("header","true").option("inferSchema","true").load(filename) 

Subsequently, to find the number of unique values in each column, we use the HyperLogLog++ algorithm supported in Spark (approxCountDistinct):

import org.apache.spark.sql.functions.{approxCountDistinct, col}

val cd_cols = rawdata.select(rawdata.columns.map(column => approxCountDistinct(col(column)).alias(column)): _*)

The output is

scala> cd_cols.show
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|  ID|First Name|Last Name|Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|1491|       172|      154| 43|        2|             3|      913|         3|              30|        1|             6|            1|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+

Notice that there are two columns which have 1 as their distinct count. I want to create another DataFrame that has all columns except those two (Education and EmployeeCount).

I tried using a for loop, but was not very happy with it. I also tried

cd_cols.columns.filter(colName => cd_cols.filter(colName) <= 1) 

but that is not working either.

Is there a smarter way to do this, please?

Thanks

Bala

  • If you don't want those columns, then just select the rest of the columns. Commented Jun 6, 2017 at 9:40
  • @RameshMaharjan, I won't know them beforehand. It all depends on the unique-value counts that the previous statement returns. I cannot 'hard-code' those columns. Commented Jun 6, 2017 at 9:41
  • So it means that you want to drop those columns which have only 1 distinct value? Commented Jun 6, 2017 at 9:44
  • @RameshMaharjan Yes, dropping columns which have only ONE unique value. Commented Jun 6, 2017 at 9:47
  • Is it true that by looking at one row we can know which columns to drop? I mean, is it guaranteed that if a column's value is 1 in one row, then it will be 1 in all the other rows of that column as well? Commented Jun 6, 2017 at 10:00

2 Answers


You can try the following command:

df.selectExpr(df.first().getValuesMap[Int](df.columns).filter(_._2 != 1).keys.toSeq: _*).show 

Here we first take the first row of the DataFrame, convert it into a map of column name to value using getValuesMap with the column names, and keep only the columns whose value is not 1; those column names are then passed to selectExpr.
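
For clarity, here is how the same idea maps onto the variables from the question. This is only a sketch, assuming rawdata and cd_cols are defined as in the question (so cd_cols holds a single row of approximate distinct counts, which come back as Long); the names countsByColumn, keepCols and trimmed are purely illustrative.

import org.apache.spark.sql.functions.col

// The single row of per-column approximate distinct counts, as a Map of column name -> count.
val countsByColumn: Map[String, Long] =
  cd_cols.first().getValuesMap[Long](cd_cols.columns)

// Keep only the columns whose distinct count is greater than 1.
val keepCols: Seq[String] = countsByColumn.filter(_._2 > 1).keys.toSeq

// Select those columns from the original data.
val trimmed = rawdata.select(keepCols.map(col): _*)
trimmed.show()

Using select with Column objects rather than selectExpr with raw strings can also be a little safer when column names contain spaces, as First Name and Last Name do here.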


3 Comments

That worked like a charm. A small correction I had to make was to change [Int] to [Long]: val select_cols = df.selectExpr(df.first().getValuesMap[Long](df.columns).filter(_._2 != 1).keys.toSeq: _*) Thank you so much for the help.
What is _._2? Is it the column name?
_._2 is the second element of each (column name, value) tuple, i.e. the value; _._1 would be the column name.
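
To illustrate with a standalone snippet (not from the answer): getValuesMap returns a Map of column name to value, and filter sees each entry as a (name, value) tuple, so _._1 is the name and _._2 is the value.

// Only entries whose value is not 1 survive the filter.
val counts = Map("Education" -> 1L, "Age" -> 43L)
counts.filter(_._2 != 1)   // Map(Age -> 43)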

If you want to go on from what you were originally trying, the following should also work. Also, note that from Spark 2.0 onwards you can pass a list of column names into drop and remove the columns that way, which may make it clearer what you are doing.

At worst, it is another way of doing it.

This should work with most Spark versions.

val keptCols: Seq[Column] = df.columns
  .map(c => (c, df.select(c).first.getLong(0)))
  .filter { case (c, v) => v != 1 }
  .map { case (c, v) => df(c) }
  .toSeq

df.select(keptCols: _*).show

For Spark 2.0 onwards:

val droppedCols: Seq[String] = df.columns
  .map(c => (c, df.select(c).first.getLong(0)))
  .filter { case (c, v) => v == 1 }
  .map { case (c, v) => c }
  .toSeq

df.drop(droppedCols: _*).show

Both should give the same result.
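
Applied to the question's setup, df in these snippets would have to be the counts DataFrame cd_cols (its single row holds the Long counts, which is why getLong(0) works), and the resulting list of names would then be dropped from rawdata. A minimal sketch under that assumption:

// Sketch assuming cd_cols and rawdata from the question.
val droppedCols: Seq[String] = cd_cols.columns
  .filter(c => cd_cols.select(c).first.getLong(0) == 1)
  .toSeq

rawdata.drop(droppedCols: _*).show()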

