8

Using PySpark, how can I select/keep all columns of a DataFrame which contain at least one non-null value, or equivalently, remove all columns which contain no data?

Edited: as per Suresh's request,

for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])

Here I assumed that if the distinct count is one, the column contains only NaN. But I want to check whether that single value really is NaN. If there's any other built-in Spark function for this, let me know.
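For example, the kind of check I have in mind would look roughly like this (just a sketch reusing the media dataframe from above; I don't know of a single built-in function for it):

import math

# Sketch: keep the "single distinct value" idea from above, but verify that the
# lone value really is null/NaN before dropping the column.
for column in media.columns:
    distinct_vals = media.select(column).distinct().limit(2).collect()
    if len(distinct_vals) == 1:
        value = distinct_vals[0][0]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            media = media.drop(column)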

3
  • Possible duplicate of Difference between na().drop() and filter(col.isNotNull) (Apache Spark) Commented Aug 11, 2017 at 8:00
  • 1
    This is about removing columns, not rows. Commented Aug 11, 2017 at 8:18
  • 1
So, do you want to remove a column even if it has a single null value, or only if all of its values are null? Can you post what you have tried, along with input and output samples? Commented Aug 11, 2017 at 9:57

9 Answers

11

I tried it my way. Say I have a dataframe like the one below:

from pyspark.sql import functions as F

>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+

>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+

>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+

2 Comments

I think it should work. If all values of a column are null, I believe the datatype won't matter. Just try it and let us know.
What is F...?
7

Here's a much more efficient solution that doesn't involve looping over the columns. It is much faster when you have many columns. I tested the other methods here on a dataframe with 800 columns, which took 17 mins to run. The following method takes only 1 min in my tests on the same dataset.

import pyspark.sql.functions as F

def drop_fully_null_columns(df, but_keep_these=[]):
    """Drops DataFrame columns that are fully null (i.e. the maximum value is null)

    Arguments:
        df {spark DataFrame} -- spark dataframe
        but_keep_these {list} -- list of columns to keep without checking for nulls

    Returns:
        spark DataFrame -- dataframe with fully null columns removed
    """
    # skip checking some columns
    cols_to_check = [col for col in df.columns if col not in but_keep_these]
    if len(cols_to_check) > 0:
        # drop columns for which the max is None
        rows_with_data = (
            df.select(*cols_to_check)
              .groupby()
              .agg(*[F.max(c).alias(c) for c in cols_to_check])
              .take(1)[0]
        )
        cols_to_drop = [c for c, const in rows_with_data.asDict().items() if const is None]
        new_df = df.drop(*cols_to_drop)
        return new_df
    else:
        return df
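A quick usage sketch (the DataFrame below is hypothetical and assumes an active spark session; it is not from the original answer):

# hypothetical example: 'empty' and 'maybe_keep' are fully null
df = spark.createDataFrame(
    [(1, None, None), (2, None, None)],
    "id: int, empty: string, maybe_keep: string"
)

cleaned = drop_fully_null_columns(df)                                   # drops 'empty' and 'maybe_keep'
cleaned2 = drop_fully_null_columns(df, but_keep_these=["maybe_keep"])   # drops only 'empty'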

1 Comment

Very nice solution! Got my processing time reduced a lot.
4

For me it worked in a slightly different way than @Suresh's answer:

import pyspark.sql.functions as func

nonNull_cols = [c for c in original_df.columns
                if original_df.filter(func.col(c).isNotNull()).count() > 0]
new_df = original_df.select(*nonNull_cols)

Comments

3

One of the indirect ways to do so is:

import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.isnan(func.col(col)) == True).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)

Update:
The above code drops columns in which every value is NaN. If you are looking for all nulls, then:

import pyspark.sql.functions as func

for col in sdf.columns:
    if sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count():
        sdf = sdf.drop(col)
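If a column should also count as empty when it only holds a mix of nulls and NaNs, a possible combined check (just a sketch; note that isnan is only defined for float/double columns) would be:

import pyspark.sql.functions as func

# Sketch: treat a column as empty when every value is either null or NaN.
total = sdf.count()
for col, dtype in sdf.dtypes:
    cond = func.col(col).isNull()
    if dtype in ("float", "double"):
        # isnan only applies to float/double columns
        cond = cond | func.isnan(func.col(col))
    if sdf.filter(cond).count() == total:
        sdf = sdf.drop(col)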

I will update my answer if I find a more optimal way :-)

Comments

1

This is a function I have in my pipeline to remove null columns. Hope it helps!

# Function to drop the empty columns of a DF
def dropNullColumns(df):
    # A set of all the null values you can encounter
    null_set = {"none", "null", "nan"}
    # Iterate over each column in the DF
    for col in df.columns:
        # Get the distinct values of the column
        unique_val = df.select(col).distinct().collect()[0][0]
        # See whether the unique value is only none/nan or null
        if str(unique_val).lower() in null_set:
            print("Dropping " + col + " because of all null values.")
            df = df.drop(col)
    return df

1 Comment

@Abhisek Your function also drops columns that are not fully null. Try it with the following example data: data_2 = {'furniture': [np.NaN, np.NaN, True], 'myid': ['1-12', '0-11', '2-12'], 'clothing': ['pants', 'shoes', 'socks']}; df_1 = pd.DataFrame(data_2); ddf_1 = spark.createDataFrame(df_1). You will see that the furniture column is dropped although in fact it should not be.
1

Or just

from pyspark.sql.functions import col

for c in df.columns:
    if df.filter(col(c).isNotNull()).count() == 0:
        df = df.drop(c)

2 Comments

This code still leaves columns containing all zeros.
In Scala, one can do something similar with an extension method:
def dropAllNulColsUsedByTest: DataFrame = {
  val cols = df.columns
  cols.foldLeft(df) { (df, c) =>
    if (df.filter(col(c).isNotNull).count == 0) df.drop(c) else df
  }
}
0

This is a robust solution that takes into consideration all possible combinations of nulls that could appear in a column. First, all fully-null columns are found, and then they are dropped. It looks lengthy, but only one loop is used to find the null columns, and no memory-intensive function such as collect() is applied to the data itself, which should keep this solution fast and efficient.

from pyspark.sql import functions as F

rows = [(None, 18, None, None),
        (1, None, None, None),
        (1, 9, 4.0, None),
        (None, 0, 0., None)]
schema = "a: int, b: int, c: float, d: int"
df = spark.createDataFrame(data=rows, schema=schema)

def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names

null_columns = get_null_column_names(df)

def drop_column(null_columns, df):
    for column_ in null_columns:
        df = df.drop(column_)
    return df

df = drop_column(null_columns, df)
df.show()

Output: the resulting DataFrame with columns a, b and c (column d, which contains only nulls, has been dropped).
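A possible variation on the same min/max idea (a sketch, not part of the original answer) computes all per-column minima and maxima in a single aggregation instead of running two jobs per column:

from pyspark.sql import functions as F

def get_null_column_names_single_pass(df):
    # one aggregation that computes min and max for every column at once
    aggs = ([F.min(c).alias("min_" + c) for c in df.columns] +
            [F.max(c).alias("max_" + c) for c in df.columns])
    stats = df.agg(*aggs).first().asDict()
    return [c for c in df.columns
            if stats["min_" + c] is None and stats["max_" + c] is None]

# null_columns = get_null_column_names_single_pass(df)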

Comments

0

Just picking up pieces from the answers above, I wrote my own solution for my use case.

What I was essentially trying to do is remove all columns from my PySpark dataframe which had 100% null values.

# identify and remove all columns having 100% null values
df_summary_count = your_df.summary("count")
null_cols = [c for c in df_summary_count.columns
             if df_summary_count.select(c).first()[c] == '0']
filtered_df = your_df.drop(*null_cols)
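For example, with a small hypothetical your_df (names purely illustrative), the snippet above would drop only the fully-null column:

from pyspark.sql.functions import lit

# hypothetical input where 'empty' is 100% null
your_df = (
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
         .withColumn("empty", lit(None).cast("string"))
)

# after running the snippet above:
# null_cols   == ['empty']
# filtered_df keeps only 'id' and 'val'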

Comments

0

Create a dataframe:

df = spark.createDataFrame(
    [
        (1, 'baz'),
        (2, 'baz')
    ],
    ['foo', 'bar']
)

Add a null column:

from pyspark.sql.functions import lit

df = df.withColumn('foobar', lit(None))

Make a list of non-null columns:

non_null_columns = df.summary('count').drop('summary').columns 

Use a list comprehension to pick the columns in your df that also exist in non_null_columns, and select those columns from your df:

df.select([col for col in df.columns if col in non_null_columns]).show() 

which prints:

+---+---+
|foo|bar|
+---+---+
|  1|baz|
|  2|baz|
+---+---+

Comments
