Linked Questions

44 votes · 5 answers · 46k views

I have a very large dataset that is loaded in Hive (about 1.9 million rows and 1450 columns). I need to determine the "coverage" of each column, meaning the fraction of rows that ...
asked by RKD314
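
A minimal sketch of one way to get that coverage in a single aggregation pass, assuming the table name my_hive_table stands in for the asker's Hive table:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("my_hive_table")  # hypothetical table name

total = df.count()
# F.count(col) counts only non-null values, so dividing by the row count
# gives each column's coverage; all 1450 counts come from one select.
counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
coverage = {c: counts[c] / total for c in df.columns}
```
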
14 votes · 5 answers · 29k views

Is there an equivalent of the pandas info() method in PySpark? I am trying to gain basic statistics about a dataframe in PySpark, such as: number of columns and rows, number of nulls, size of ...
asked by Brian Waters
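
PySpark has no built-in info(), but a rough equivalent can be pieced together. A sketch, assuming a DataFrame named df:

```python
from pyspark.sql import functions as F

print((df.count(), len(df.columns)))  # rows and columns, like pandas df.shape
df.printSchema()                      # column names and dtypes
# Null count per column in one pass; when() with no otherwise() yields
# null for non-matching rows, which count() skips.
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()
```
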
11 votes · 2 answers · 26k views

I have a large dataset from which I would like to drop the columns that contain null values and return a new dataframe. How can I do that? The following only drops a single column, or rows containing null. ...
asked by PolarBear10
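
A minimal sketch, assuming a DataFrame named df: find the columns with at least one null in one aggregation, then drop them all at once.

```python
from pyspark.sql import functions as F

null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]
df_clean = df.drop(*cols_with_nulls)  # drop() accepts any number of column names
```
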
3 votes · 3 answers · 4k views

I want to return a list of all columns that contain at least 1 null value. All of the other similar questions I have seen on Stack Overflow are filtering the column where the value is null, but this is ...
asked by KOB
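
One way to sketch this, assuming a DataFrame named df: reduce each column to a boolean "contains a null" flag, then keep the names that are true.

```python
from pyspark.sql import functions as F

flags = df.select(
    [(F.sum(F.col(c).isNull().cast("int")) > 0).alias(c) for c in df.columns]
).first().asDict()
cols_with_null = [c for c, has_null in flags.items() if has_null]
```
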
1 vote · 2 answers · 9k views

data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show() is the code I was using to get the count of NaN values. I want to write an if-else condition where, if a specific ...
asked by 4212extra
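
A sketch of how the branch might look on top of that snippet; the column name some_column and the fill value are placeholders, since the question is cut off.

```python
from pyspark.sql import functions as F

# Note: isnan() only applies to float/double columns.
nan_counts = data.select(
    [F.count(F.when(F.isnan(c), c)).alias(c) for c in data.columns]
).first().asDict()

col = "some_column"  # hypothetical column name, not from the question
if nan_counts[col] > 0:
    data = data.fillna(0, subset=[col])  # e.g. impute that column's NaNs with 0
```
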
2 votes · 3 answers · 2k views

I have a dataframe that has time series data in it and some categorical data:
| cat | TS1  | TS2  | ... |
| A   | 1    | null | ... |
| A   | 1    | 20   | ... |
| B   | null | null | ... |
| A   | ...
asked by Aesir
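
The excerpt is truncated, so as one plausible reading, a sketch that rebuilds the sample frame and counts nulls in each time-series column per category:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 1, None), ("A", 1, 20), ("B", None, None)],
    "cat string, TS1 int, TS2 int",
)
# Null count for each time-series column within each category.
df.groupBy("cat").agg(
    *[F.count(F.when(F.col(c).isNull(), c)).alias(c + "_nulls")
      for c in ("TS1", "TS2")]
).show()
```
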
3 votes · 1 answer · 8k views

I am developing an application that performs data quality checks over input files and captures counts based on reported DQ failures in the data. Does the approach I use make sense, or would you recommend ...
asked by Hardy
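
One common pattern for capturing DQ failure counts in a single pass is to express each check as a boolean condition; the rule and column names below are illustrative, not the asker's.

```python
from pyspark.sql import functions as F

dq_rules = {
    "id_is_null": F.col("id").isNull(),      # hypothetical rules
    "amount_negative": F.col("amount") < 0,
}
failure_counts = df.select(
    [F.count(F.when(cond, True)).alias(name) for name, cond in dq_rules.items()]
).first().asDict()
```
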
0 votes · 1 answer · 4k views

I have a Spark dataframe and need to do a count of null/empty values for each column. I need to show ALL columns in the output. I have looked online and found a few "similar questions" but ...
asked by wally
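
A sketch that counts values that are null or the empty string for every column, assuming a DataFrame named df:

```python
from pyspark.sql import functions as F

# Comparing a non-string column to "" yields null, which count() skips,
# so the same expression is safe across mixed column types.
df.select(
    [F.count(F.when(F.col(c).isNull() | (F.col(c) == ""), c)).alias(c)
     for c in df.columns]
).show(truncate=False)
```
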
0 votes · 1 answer · 650 views

I have a column name and a dataframe. I want to check if all values in that column are empty and, if so, drop the column from the dataframe. What I did was check the count of the column with ...
asked by sks27
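
A minimal sketch, assuming a DataFrame df and a hypothetical column name some_col:

```python
from pyspark.sql import functions as F

colname = "some_col"  # hypothetical column name
# F.count() counts non-null values, so 0 means the column is entirely null/empty.
if df.select(F.count(F.col(colname))).first()[0] == 0:
    df = df.drop(colname)
```
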
0 votes · 1 answer · 543 views

I have a PySpark data frame that has a mix of integer columns, string columns, and also struct columns. A value in a struct column could be a struct, but it could also just be null. For example: id | mystring |...
asked by formicaman
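
The excerpt is cut off, but one plausible reading is counting nulls across the mixed column types. isNull() is type-agnostic, so a sketch like this covers integer, string, and struct columns alike (whereas isnan() would fail on non-numeric ones):

```python
from pyspark.sql import functions as F

df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()
```
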
0 votes · 1 answer · 466 views

I am a beginner in Spark and I am looking for a solution to my issue. I'm trying to sort a dataframe's columns according to the number of null values each contains, in ascending order. For example: data: ...
asked by Mus
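
A sketch, assuming a DataFrame named df: compute each column's null count once, then reselect the columns in ascending order of that count.

```python
from pyspark.sql import functions as F

null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()
df_sorted = df.select(sorted(df.columns, key=lambda c: null_counts[c]))
```
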
0 votes · 0 answers · 260 views

I am trying to do data validation on a large dataset (~26 million rows by 36 columns) to identify the ratio of missing data in each column. The current solution, however, is really slow, and I was wondering if ...
asked by AJR
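
One way to keep this fast is to compute all 36 ratios in a single job instead of one count() per column; averaging the isNull flag cast to an integer gives the missing-data ratio directly. A sketch, assuming a DataFrame named df:

```python
from pyspark.sql import functions as F

ratios = df.select(
    [F.avg(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).first().asDict()
```
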
0 votes · 0 answers · 38 views

I am trying to learn PySpark. I am using a dummy dataset and practicing some basic data pre-processing techniques, such as dealing with NaN values. The problem I am encountering is that I cannot seem for ...
asked by sj6266
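
The excerpt is cut off, but for basic NaN handling the usual tools are dropna() and fillna(). A minimal sketch, with feature as a hypothetical column name:

```python
df_dropped = df.dropna(subset=["feature"])  # drop rows where it is null/NaN
df_filled = df.fillna({"feature": 0.0})     # or impute a default instead
```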