8

Using PySpark, how can I select/keep all columns of a DataFrame which contain at least one non-null value, or equivalently, remove all columns which contain no data?

Edited: as per Suresh's request,

for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])

Here I assumed that if the distinct count is one, the column contains only NaN. But I want to check whether that single value really is NaN. If there's any other built-in Spark function for this, let me know.
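For example, the kind of check I have in mind would look roughly like this (just a sketch reusing the media dataframe from above; I don't know of a single built-in function for it):

import math

# Sketch: keep the "single distinct value" idea from above, but verify that the
# lone value really is null/NaN before dropping the column.
for column in media.columns:
    distinct_vals = media.select(column).distinct().limit(2).collect()
    if len(distinct_vals) == 1:
        value = distinct_vals[0][0]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            media = media.drop(column)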

3
  • Possible duplicate of Difference between na().drop() and filter(col.isNotNull) (Apache Spark) Commented Aug 11, 2017 at 8:00
  • 1
    This is about removing columns, not rows. Commented Aug 11, 2017 at 8:18
  • 1
So, do you want to remove a column even if it has a single null value, or only if all of its values are null? Can you post what you have tried, along with input and output samples? Commented Aug 11, 2017 at 9:57

9 Answers

11

I tried it my way. Say I have a dataframe like the one below:

from pyspark.sql import functions as F

>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+

>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+

>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+

2 Comments

I think it should work. If all values of a column are null, I believe the datatype won't matter. Just try it and let us know.
What is F...?
7

Here's a much more efficient solution that doesn't involve looping over the columns. It is much faster when you have many columns. I tested the other methods here on a dataframe with 800 columns, which took 17 mins to run. The following method takes only 1 min in my tests on the same dataset.

import pyspark.sql.functions as F

def drop_fully_null_columns(df, but_keep_these=[]):
    """Drops DataFrame columns that are fully null (i.e. the maximum value is null)

    Arguments:
        df {spark DataFrame} -- spark dataframe
        but_keep_these {list} -- list of columns to keep without checking for nulls

    Returns:
        spark DataFrame -- dataframe with fully null columns removed
    """
    # skip checking some columns
    cols_to_check = [col for col in df.columns if col not in but_keep_these]
    if len(cols_to_check) > 0:
        # drop columns for which the max is None
        rows_with_data = (
            df.select(*cols_to_check)
              .groupby()
              .agg(*[F.max(c).alias(c) for c in cols_to_check])
              .take(1)[0]
        )
        cols_to_drop = [c for c, const in rows_with_data.asDict().items() if const is None]
        new_df = df.drop(*cols_to_drop)
        return new_df
    else:
        return df
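A quick usage sketch (the DataFrame below is hypothetical and assumes an active spark session; it is not from the original answer):

# hypothetical example: 'empty' and 'maybe_keep' are fully null
df = spark.createDataFrame(
    [(1, None, None), (2, None, None)],
    "id: int, empty: string, maybe_keep: string"
)

cleaned = drop_fully_null_columns(df)                                   # drops 'empty' and 'maybe_keep'
cleaned2 = drop_fully_null_columns(df, but_keep_these=["maybe_keep"])   # drops only 'empty'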

1 Comment

Very nice solution! Got my processing time reduced a lot.
4

For me it worked in a slightly different way than @Suresh's answer:

import pyspark.sql.functions as func

nonNull_cols = [c for c in original_df.columns
                if original_df.filter(func.col(c).isNotNull()).count() > 0]
new_df = original_df.select(*nonNull_cols)

Comments

3

One of the indirect ways to do so is:

import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col)) == True).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)

Update:
The above code drops columns in which every value is NaN. If you are looking for all nulls, then:

import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
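If a column should also count as empty when it only holds a mix of nulls and NaNs, a possible combined check (just a sketch; note that isnan is only defined for float/double columns) would be:

import pyspark.sql.functions as func

# Sketch: treat a column as empty when every value is either null or NaN.
total = sdf.count()
for col, dtype in sdf.dtypes:
    cond = func.col(col).isNull()
    if dtype in ("float", "double"):
        # isnan only applies to float/double columns
        cond = cond | func.isnan(func.col(col))
    if sdf.filter(cond).count() == total:
        sdf = sdf.drop(col)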

I will update my answer if I find a more optimal way :-)

Comments

1

This is a function I have in my pipeline to remove null columns. Hope it helps!

# Function to drop the empty columns of a DF
def dropNullColumns(df):
    # A set of all the null values you can encounter
    null_set = {"none", "null", "nan"}
    # Iterate over each column in the DF
    for col in df.columns:
        # Get the distinct values of the column
        unique_val = df.select(col).distinct().collect()[0][0]
        # See whether the unique value is only none/nan or null
        if str(unique_val).lower() in null_set:
            print("Dropping " + col + " because of all null values.")
            df = df.drop(col)
    return df

1 Comment

@Abhisek Your function also drops columns that are not fully null. Try it with the following example data: data_2 = {'furniture': [np.NaN, np.NaN, True], 'myid': ['1-12', '0-11', '2-12'], 'clothing': ['pants', 'shoes', 'socks']}; df_1 = pd.DataFrame(data_2); ddf_1 = spark.createDataFrame(df_1). You will see that the furniture column is dropped although in fact it should not be.
1

Or just

from pyspark.sql.functions import col

for c in df.columns:
    if df.filter(col(c).isNotNull()).count() == 0:
        df = df.drop(c)

2 Comments

This code still leaves columns containing all zeros.
In Scala, one can do something similar with an extension method:
def dropAllNulColsUsedByTest: DataFrame = {
  val cols = df.columns
  cols.foldLeft(df) { (df, c) =>
    if (df.filter(col(c).isNotNull).count == 0) df.drop(c) else df
  }
}
0

This is a robust solution that takes into consideration all possible combinations of nulls that could appear in a column. First, all fully-null columns are found, and then they are dropped. It looks lengthy, but only one loop is used to find the null columns, and no memory-intensive function such as collect() is applied to the data itself, which should keep this solution fast and efficient.

from pyspark.sql import functions as F

rows = [(None, 18, None, None),
        (1, None, None, None),
        (1, 9, 4.0, None),
        (None, 0, 0., None)]
schema = "a: int, b: int, c: float, d: int"
df = spark.createDataFrame(data=rows, schema=schema)

def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names

null_columns = get_null_column_names(df)

def drop_column(null_columns, df):
    for column_ in null_columns:
        df = df.drop(column_)
    return df

df = drop_column(null_columns, df)
df.show()

Output: the resulting DataFrame with columns a, b and c (column d, which contains only nulls, has been dropped).
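A possible variation on the same min/max idea (a sketch, not part of the original answer) computes all per-column minima and maxima in a single aggregation instead of running two jobs per column:

from pyspark.sql import functions as F

def get_null_column_names_single_pass(df):
    # one aggregation that computes min and max for every column at once
    aggs = ([F.min(c).alias("min_" + c) for c in df.columns] +
            [F.max(c).alias("max_" + c) for c in df.columns])
    stats = df.agg(*aggs).first().asDict()
    return [c for c in df.columns
            if stats["min_" + c] is None and stats["max_" + c] is None]

# null_columns = get_null_column_names_single_pass(df)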

Comments

0

Just picking up pieces from the answers above, I wrote my own solution for my use case.

What I was essentially trying to do is remove all columns from my PySpark dataframe which had 100% null values.

# identify and remove all columns having 100% null values
df_summary_count = your_df.summary("count")
null_cols = [c for c in df_summary_count.columns
             if df_summary_count.select(c).first()[c] == '0']
filtered_df = your_df.drop(*null_cols)
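For example, with a small hypothetical your_df (names purely illustrative), the snippet above would drop only the fully-null column:

from pyspark.sql.functions import lit

# hypothetical input where 'empty' is 100% null
your_df = (
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
         .withColumn("empty", lit(None).cast("string"))
)

# after running the snippet above:
# null_cols   == ['empty']
# filtered_df keeps only 'id' and 'val'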

Comments

0

Create a dataframe:

df = spark.createDataFrame(
    [
        (1, 'baz'),
        (2, 'baz')
    ],
    ['foo', 'bar']
)

Add a null column:

from pyspark.sql.functions import lit

df = df.withColumn('foobar', lit(None))

Make a list of non-null columns:

non_null_columns = df.summary('count').drop('summary').columns 

Use a list comprehension to pick the columns in your df that also exist in non_null_columns, and select those columns from your df:

df.select([col for col in df.columns if col in non_null_columns]).show() 

which prints:

+---+---+
|foo|bar|
+---+---+
|  1|baz|
|  2|baz|
+---+---+

Comments
