Linked Questions
13 questions linked to/from How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?
44 votes
5 answers
46k views
Count number of non-NaN entries in each column of Spark dataframe in PySpark
I have a very large dataset that is loaded in Hive (about 1.9 million rows and 1450 columns). I need to determine the "coverage" of each of the columns, meaning, the fraction of rows that ...
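A common pattern for this is a single select over all columns, as in the sketch below (assuming the data is already loaded into a DataFrame named df; note that isnan only applies to floating-point columns, so string columns would need the isNotNull check alone):

```python
from pyspark.sql import functions as F

# total row count, computed once up front
total = df.count()

# per column, count entries that are neither null nor NaN, in one aggregation pass
counts = df.select([
    F.count(F.when(F.col(c).isNotNull() & ~F.isnan(c), c)).alias(c)
    for c in df.columns
]).first().asDict()

# "coverage" = fraction of rows with a usable value in each column
coverage = {c: counts[c] / total for c in df.columns}
```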
14 votes
5 answers
29k views
Pyspark: Is there an equivalent method to pandas info()?
Is there an equivalent method to pandas info() method in PySpark? I am trying to gain basic statistics about a dataframe in PySpark, such as: Number of columns and rows Number of nulls Size of ...
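PySpark DataFrames have no direct info() equivalent, but a rough stand-in can be assembled from printSchema(), count(), and a per-column null count; a minimal sketch (the helper name spark_info is illustrative):

```python
from pyspark.sql import functions as F

def spark_info(df):
    """Rough stand-in for pandas DataFrame.info() on a PySpark DataFrame."""
    print(f"rows: {df.count()}, columns: {len(df.columns)}")
    df.printSchema()  # column names and dtypes
    # null count per column in a single aggregation
    nulls = df.select([
        F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns
    ]).first().asDict()
    for c, n in nulls.items():
        print(f"{c}: {n} nulls")
```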
11 votes
2 answers
26k views
How to drop all columns with null values in a PySpark DataFrame?
I have a large dataset of which I would like to drop columns that contain null values and return a new dataframe. How can I do that? The following only drops a single column or rows containing null. ...
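One way to do this is to count nulls per column in one pass and then drop the offenders; a sketch assuming the DataFrame is named df:

```python
from pyspark.sql import functions as F

# null count per column, computed in a single job
null_counts = df.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns
]).first().asDict()

# drop every column that contains at least one null
to_drop = [c for c, n in null_counts.items() if n > 0]
df_clean = df.drop(*to_drop)
```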
3 votes
3 answers
4k views
Efficient way to find columns that contain ANY null values
I want to return a list of all columns that contain at least 1 null value. All of the other similar questions I have seen on StackOverflow are filtering the column where the value is null, but this is ...
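A sketch of one way to get that list without filtering column by column (df stands in for the questioner's DataFrame, which is not shown here):

```python
from pyspark.sql import functions as F

# per column: 1 if at least one null is present, 0 otherwise, in one Spark job
has_null = df.select([
    F.max(F.col(c).isNull().cast("int")).alias(c) for c in df.columns
]).first().asDict()

cols_with_nulls = [c for c, flag in has_null.items() if flag == 1]
```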
1 vote
2 answers
9k views
Check if value greater than zero exists in all columns of dataframe using pyspark
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show() This is the code I was using to get the count of the NaN values. I want to write an if-else condition where if a specific ...
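Building on the snippet in the excerpt, one possible way to phrase the check is a per-column max over a when condition; a sketch assuming the columns are numeric and the DataFrame is still named data:

```python
from pyspark.sql import functions as F

# per column: 1 if any value is greater than zero, else 0, in one aggregation
positive_flags = data.select([
    F.max(F.when(F.col(c) > 0, 1).otherwise(0)).alias(c) for c in data.columns
]).first().asDict()

if all(v == 1 for v in positive_flags.values()):
    print("every column contains at least one value greater than zero")
else:
    print("no positive value in:", [c for c, v in positive_flags.items() if v == 0])
```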
2 votes
3 answers
2k views
pyspark counting number of nulls per group
I have a dataframe that has time series data in it and some categorical data:

| cat | TS1  | TS2  | ... |
| A   | 1    | null | ... |
| A   | 1    | 20   | ... |
| B   | null | null | ... |
| A   | ...
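A groupBy aggregation covers this case; a sketch assuming the categorical column is cat and the remaining columns are the time-series columns:

```python
from pyspark.sql import functions as F

ts_cols = [c for c in df.columns if c != "cat"]  # the time-series columns

# per category, count the nulls in each time-series column
nulls_per_group = df.groupBy("cat").agg(
    *[F.sum(F.col(c).isNull().cast("int")).alias(c) for c in ts_cols]
)
nulls_per_group.show()
```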
3 votes
1 answer
8k views
Design a data quality check application in Python
I am developing an application that performs data quality checks over input files and captures counts based on reported DQ failures in the data. Does the approach I use make sense, or would you recommend ...
0 votes
1 answer
4k views
Pyspark: Need to show a count of null/empty values per each column in a dataframe
I have a Spark dataframe and need to do a count of null/empty values for each column. I need to show ALL columns in the output. I have looked online and found a few "similar questions" but ...
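A sketch of one way to cover both nulls and empty strings while keeping every column in the output (df is illustrative; the cast to string lets the empty-string check run on non-string columns too):

```python
from pyspark.sql import functions as F

# treat real nulls and empty/blank strings both as "missing"
missing = df.select([
    F.count(
        F.when(F.col(c).isNull() | (F.trim(F.col(c).cast("string")) == ""), c)
    ).alias(c)
    for c in df.columns
])
missing.show(truncate=False)  # one output column per input column
```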
0 votes
1 answer
650 views
Check if a column is all empty
I have a column name and a dataframe. I want to check if all values in that column are empty and, if so, drop the column from the dataframe. What I did was check the count of the column with ...
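A small helper along these lines is one option; drop_if_all_empty is an illustrative name, not an existing API:

```python
from pyspark.sql import functions as F

def drop_if_all_empty(df, col_name):
    """Drop col_name when every value is null or an empty string (illustrative helper)."""
    has_value = df.filter(
        F.col(col_name).isNotNull() & (F.col(col_name) != "")
    ).limit(1).count()  # limit(1) lets Spark stop as soon as one non-empty value is found
    return df.drop(col_name) if has_value == 0 else df
```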
0 votes
1 answer
543 views
Counting nulls or zeros in PySpark data frame with struct column types
I have a PySpark data frame that has a mix of integer columns, string columns, and also struct columns. A struct column's value could be a populated struct, but it could also just be null. For example: id | mystring |...
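The usual per-column pattern still works if isnan is restricted to float/double columns, since isNull is valid for any type, including structs; a sketch with an illustrative df:

```python
from pyspark.sql import functions as F

# isNull handles any type, including structs; isnan is only valid on float/double columns
float_cols = [c for c, t in df.dtypes if t in ("float", "double")]

null_counts = df.select([
    F.count(F.when(
        F.col(c).isNull() | (F.isnan(c) if c in float_cols else F.lit(False)),
        c
    )).alias(c)
    for c in df.columns
])
null_counts.show()
```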
0 votes
1 answer
466 views
Sorting pyspark dataframe according to column values
I am a beginner in Spark and I am looking for a solution for my issue. I'm trying to sort a dataframe according to the number of null values each column contains in ascending order. For example: data: ...
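One reading of the question is reordering the columns by their null counts, which can be done by computing the counts once and re-selecting; a sketch under that assumption:

```python
from pyspark.sql import functions as F

# null count per column in a single pass
null_counts = df.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns
]).first().asDict()

# re-select the columns in ascending order of their null counts
ordered_cols = sorted(df.columns, key=lambda c: null_counts[c])
df_reordered = df.select(ordered_cols)
```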
0 votes
0 answers
260 views
How to improve performance of Pyspark method to get ratio of rows with missing data
I am trying to do data validation on a large data set, ~26 million rows by 36 columns, to identify the ratio of missing data in columns. The current solution, however, is really slow, and I was wondering if ...
0 votes
0 answers
38 views
Get all the columns in a PySpark dataframe whose value is not 0.0
I am trying to learn PySpark. I am using a dummy dataset and practicing some basic data pre-processing techniques such as dealing with NaN values. The problem I am encountering is I cannot seem for ...
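If the goal is the set of numeric columns that never contain 0.0 (one possible reading of the title), a per-column flag works; df and the dtype list below are assumptions:

```python
from pyspark.sql import functions as F

numeric_cols = [c for c, t in df.dtypes if t in ("double", "float", "int", "bigint")]

# per numeric column: 1 if any row equals 0.0, else 0
zero_flags = df.select([
    F.max((F.col(c) == 0.0).cast("int")).alias(c) for c in numeric_cols
]).first().asDict()

cols_without_zero = [c for c, flag in zero_flags.items() if flag == 0]
```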