Linked Questions

44 votes · 5 answers · 46k views

I have a very large dataset that is loaded in Hive (about 1.9 million rows and 1450 columns). I need to determine the "coverage" of each column, meaning the fraction of rows that ...
asked by RKD314
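
A minimal sketch of one way to get that coverage in a single aggregation pass, assuming the table name my_hive_table stands in for the asker's Hive table:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("my_hive_table")  # hypothetical table name

total = df.count()
# F.count(col) counts only non-null values, so dividing by the row count
# gives each column's coverage; all 1450 counts come from one select.
counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
coverage = {c: counts[c] / total for c in df.columns}
```
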
14 votes · 5 answers · 29k views

Is there an equivalent of the pandas info() method in PySpark? I am trying to gain basic statistics about a dataframe in PySpark, such as: number of columns and rows, number of nulls, size of ...
asked by Brian Waters
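
PySpark has no built-in info(), but a rough equivalent can be pieced together. A sketch, assuming a DataFrame named df:

```python
from pyspark.sql import functions as F

print((df.count(), len(df.columns)))  # rows and columns, like pandas df.shape
df.printSchema()                      # column names and dtypes
# Null count per column in one pass; when() with no otherwise() yields
# null for non-matching rows, which count() skips.
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()
```
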
11 votes · 2 answers · 26k views

I have a large dataset from which I would like to drop the columns that contain null values and return a new dataframe. How can I do that? The following only drops a single column, or rows containing null. ...
asked by PolarBear10
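
A minimal sketch, assuming a DataFrame named df: find the columns with at least one null in one aggregation, then drop them all at once.

```python
from pyspark.sql import functions as F

null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]
df_clean = df.drop(*cols_with_nulls)  # drop() accepts any number of column names
```
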
3 votes · 3 answers · 4k views

I want to return a list of all columns that contain at least 1 null value. All of the other similar questions I have seen on Stack Overflow are filtering the column where the value is null, but this is ...
asked by KOB
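
One way to sketch this, assuming a DataFrame named df: reduce each column to a boolean "contains a null" flag, then keep the names that are true.

```python
from pyspark.sql import functions as F

flags = df.select(
    [(F.sum(F.col(c).isNull().cast("int")) > 0).alias(c) for c in df.columns]
).first().asDict()
cols_with_null = [c for c, has_null in flags.items() if has_null]
```
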
1 vote · 2 answers · 9k views

data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show() is the code I was using to get the count of NaN values. I want to write an if-else condition where, if a specific ...
asked by 4212extra
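
A sketch of how the branch might look on top of that snippet; the column name some_column and the fill value are placeholders, since the question is cut off.

```python
from pyspark.sql import functions as F

# Note: isnan() only applies to float/double columns.
nan_counts = data.select(
    [F.count(F.when(F.isnan(c), c)).alias(c) for c in data.columns]
).first().asDict()

col = "some_column"  # hypothetical column name, not from the question
if nan_counts[col] > 0:
    data = data.fillna(0, subset=[col])  # e.g. impute that column's NaNs with 0
```
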
2 votes · 3 answers · 2k views

I have a dataframe that has time series data in it and some categorical data:
| cat | TS1  | TS2  | ... |
| A   | 1    | null | ... |
| A   | 1    | 20   | ... |
| B   | null | null | ... |
| A   | ...
asked by Aesir
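
The excerpt is truncated, so as one plausible reading, a sketch that rebuilds the sample frame and counts nulls in each time-series column per category:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 1, None), ("A", 1, 20), ("B", None, None)],
    "cat string, TS1 int, TS2 int",
)
# Null count for each time-series column within each category.
df.groupBy("cat").agg(
    *[F.count(F.when(F.col(c).isNull(), c)).alias(c + "_nulls")
      for c in ("TS1", "TS2")]
).show()
```
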
3 votes · 1 answer · 8k views

I am developing an application that performs data quality checks over input files and captures counts based on reported DQ failures in the data. Does the approach I use make sense, or would you recommend ...
asked by Hardy
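
One common pattern for capturing DQ failure counts in a single pass is to express each check as a boolean condition; the rule and column names below are illustrative, not the asker's.

```python
from pyspark.sql import functions as F

dq_rules = {
    "id_is_null": F.col("id").isNull(),      # hypothetical rules
    "amount_negative": F.col("amount") < 0,
}
failure_counts = df.select(
    [F.count(F.when(cond, True)).alias(name) for name, cond in dq_rules.items()]
).first().asDict()
```
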
0 votes · 1 answer · 4k views

I have a Spark dataframe and need to do a count of null/empty values for each column. I need to show ALL columns in the output. I have looked online and found a few "similar questions" but ...
asked by wally
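
A sketch that counts values that are null or the empty string for every column, assuming a DataFrame named df:

```python
from pyspark.sql import functions as F

# Comparing a non-string column to "" yields null, which count() skips,
# so the same expression is safe across mixed column types.
df.select(
    [F.count(F.when(F.col(c).isNull() | (F.col(c) == ""), c)).alias(c)
     for c in df.columns]
).show(truncate=False)
```
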
0 votes · 1 answer · 650 views

I have a column name and a dataframe. I want to check if all values in that column are empty and, if so, drop the column from the dataframe. What I did was check the count of the column with ...
asked by sks27
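
A minimal sketch, assuming a DataFrame df and a hypothetical column name some_col:

```python
from pyspark.sql import functions as F

colname = "some_col"  # hypothetical column name
# F.count() counts non-null values, so 0 means the column is entirely null/empty.
if df.select(F.count(F.col(colname))).first()[0] == 0:
    df = df.drop(colname)
```
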
0 votes · 1 answer · 543 views

I have a PySpark data frame that has a mix of integer columns, string columns, and also struct columns. A value in a struct column could be a struct, but it could also just be null. For example: id | mystring |...
asked by formicaman
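
The excerpt is cut off, but one plausible reading is counting nulls across the mixed column types. isNull() is type-agnostic, so a sketch like this covers integer, string, and struct columns alike (whereas isnan() would fail on non-numeric ones):

```python
from pyspark.sql import functions as F

df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()
```
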
0 votes · 1 answer · 466 views

I am a beginner in Spark and I am looking for a solution to my issue. I'm trying to sort a dataframe's columns according to the number of null values each contains, in ascending order. For example: data: ...
asked by Mus
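
A sketch, assuming a DataFrame named df: compute each column's null count once, then reselect the columns in ascending order of that count.

```python
from pyspark.sql import functions as F

null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()
df_sorted = df.select(sorted(df.columns, key=lambda c: null_counts[c]))
```
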
0 votes · 0 answers · 260 views

I am trying to do data validation on a large dataset (~26 million rows by 36 columns) to identify the ratio of missing data in each column. The current solution, however, is really slow, and I was wondering if ...
asked by AJR
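
One way to keep this fast is to compute all 36 ratios in a single job instead of one count() per column; averaging the isNull flag cast to an integer gives the missing-data ratio directly. A sketch, assuming a DataFrame named df:

```python
from pyspark.sql import functions as F

ratios = df.select(
    [F.avg(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).first().asDict()
```
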
0 votes · 0 answers · 38 views

I am trying to learn PySpark. I am using a dummy dataset and practicing some basic data pre-processing techniques, such as dealing with NaN values. The problem I am encountering is that I cannot seem for ...
asked by sj6266
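
The excerpt is cut off, but for basic NaN handling the usual tools are dropna() and fillna(). A minimal sketch, with feature as a hypothetical column name:

```python
df_dropped = df.dropna(subset=["feature"])  # drop rows where it is null/NaN
df_filled = df.fillna({"feature": 0.0})     # or impute a default instead
```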