
We have a specific need where I have to drop from a DataFrame every column that contains only one unique value. The following is what we are doing:

val rawdata = spark.read.format("csv").option("header","true").option("inferSchema","true").load(filename) 

Subsequently, to find the number of unique values in each column, we use the HyperLogLog++ algorithm supported in Spark (approxCountDistinct):

import org.apache.spark.sql.functions.{approxCountDistinct, col}

val cd_cols = rawdata.select(rawdata.columns.map(column => approxCountDistinct(col(column)).alias(column)): _*)

The output is

scala> cd_cols.show
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|  ID|First Name|Last Name|Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|1491|       172|      154| 43|        2|             3|      913|         3|              30|        1|             6|            1|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+

Notice that there are two columns which have 1 as their distinct count. I want to create another DataFrame that has all columns except those two (Education and EmployeeCount).

I tried using a for loop, but was not very happy with it. I also tried

cd_cols.columns.filter(colName => cd_cols.filter(colName) <= 1) 

but that is not working either.

Is there a smarter way to do this, please?

Thanks

Bala

  • If you don't want those columns, then just select the rest of the columns. Commented Jun 6, 2017 at 9:40
  • @RameshMaharjan, I won't know them beforehand. It all depends on the unique-value counts that the previous statement returns. I cannot 'hard-code' those columns. Commented Jun 6, 2017 at 9:41
  • So it means that you want to drop those columns which have only 1 distinct value? Commented Jun 6, 2017 at 9:44
  • @RameshMaharjan Yes, dropping columns which have only ONE unique value. Commented Jun 6, 2017 at 9:47
  • Is it true that by looking at one row we can know which columns to drop? I mean, is it guaranteed that if a column's value is 1 in one row, then it will be 1 in all the other rows of that column as well? Commented Jun 6, 2017 at 10:00

2 Answers


You can try the following command:

df.selectExpr(df.first().getValuesMap[Int](df.columns).filter(_._2 != 1).keys.toSeq: _*).show 

Here we first take the first row of the DataFrame, convert it into a map of column name to value using getValuesMap with the column names, and keep only the columns whose value is not 1; those column names are then passed to selectExpr.
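
For clarity, here is how the same idea maps onto the variables from the question. This is only a sketch, assuming rawdata and cd_cols are defined as in the question (so cd_cols holds a single row of approximate distinct counts, which come back as Long); the names countsByColumn, keepCols and trimmed are purely illustrative.

import org.apache.spark.sql.functions.col

// The single row of per-column approximate distinct counts, as a Map of column name -> count.
val countsByColumn: Map[String, Long] =
  cd_cols.first().getValuesMap[Long](cd_cols.columns)

// Keep only the columns whose distinct count is greater than 1.
val keepCols: Seq[String] = countsByColumn.filter(_._2 > 1).keys.toSeq

// Select those columns from the original data.
val trimmed = rawdata.select(keepCols.map(col): _*)
trimmed.show()

Using select with Column objects rather than selectExpr with raw strings can also be a little safer when column names contain spaces, as First Name and Last Name do here.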


3 Comments

That worked like a charm. A small correction I had to make was to change [Int] to [Long]: val select_cols = df.selectExpr(df.first().getValuesMap[Long](df.columns).filter(_._2 != 1).keys.toSeq: _*) Thank you so much for the help.
What is _._2? Is it the column name?
_._2 is the second element of each (column name, value) tuple, i.e. the value; _._1 would be the column name.
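
To illustrate with a standalone snippet (not from the answer): getValuesMap returns a Map of column name to value, and filter sees each entry as a (name, value) tuple, so _._1 is the name and _._2 is the value.

// Only entries whose value is not 1 survive the filter.
val counts = Map("Education" -> 1L, "Age" -> 43L)
counts.filter(_._2 != 1)   // Map(Age -> 43)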

If you want to go on from what you were originally trying, the following should also work. Also, note that from Spark 2.0 onwards you can pass a list of column names into drop and remove the columns that way, which may make it clearer what you are doing.

At worst, it is another way of doing it.

This should work with most Spark versions.

val keptCols: Seq[Column] = df.columns
  .map(c => (c, df.select(c).first.getLong(0)))
  .filter { case (c, v) => v != 1 }
  .map { case (c, v) => df(c) }
  .toSeq

df.select(keptCols: _*).show

For Spark 2.0 onwards:

val droppedCols: Seq[String] = df.columns
  .map(c => (c, df.select(c).first.getLong(0)))
  .filter { case (c, v) => v == 1 }
  .map { case (c, v) => c }
  .toSeq

df.drop(droppedCols: _*).show

Both should give the same result.
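
Applied to the question's setup, df in these snippets would have to be the counts DataFrame cd_cols (its single row holds the Long counts, which is why getLong(0) works), and the resulting list of names would then be dropped from rawdata. A minimal sketch under that assumption:

// Sketch assuming cd_cols and rawdata from the question.
val droppedCols: Seq[String] = cd_cols.columns
  .filter(c => cd_cols.select(c).first.getLong(0) == 1)
  .toSeq

rawdata.drop(droppedCols: _*).show()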

