I have a specific requirement to drop columns from a DataFrame that contain only one unique value. This is what I am doing:
val rawdata = spark.read.format("csv").option("header","true").option("inferSchema","true").load(filename)

To find the number of unique values in each column, I am using the HyperLogLog++ algorithm supported in Spark:
val cd_cols = rawdata.select(rawdata.columns.map(column => approxCountDistinct(col(column)).alias(column)): _*)

The output is:
scala> cd_cols.show
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|  ID|First Name|Last Name|Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+
|1491|       172|      154| 43|        2|             3|      913|         3|              30|        1|             6|            1|
+----+----------+---------+---+---------+--------------+---------+----------+----------------+---------+--------------+-------------+

Notice that two columns have 1 as their distinct count. I want to create another DataFrame that has all columns except those two (Education and EmployeeCount).
I tried using a for loop, but was not happy with that approach. I also tried:
cd_cols.columns.filter(colName => cd_cols.filter(colName) <= 1)

but that does not work either.
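Roughly, the effect I am after is something like the following (an untested sketch, assuming the approximate counts come back as longs in a single collected row):

import org.apache.spark.sql.functions.col

// Untested sketch: collect the single row of approximate distinct counts,
// find the column names whose count is 1, and drop them from the raw data.
val counts = cd_cols.first()
val singleValueCols = cd_cols.columns.filter(c => counts.getAs[Long](c) <= 1)
val trimmed = rawdata.drop(singleValueCols: _*)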
Is there a smarter way to do this, please?
Thanks
Bala