To calculate the number of NAs in the entire data.frame, I can use sum(is.na(df)). However, how can I count the number of NAs in each column of a big data.frame? I tried apply(df, 2, function(x) sum(is.na(df$x))), but that didn't seem to work.
12 Answers
Since dplyr::summarise_all has been superseded by across() used inside summarise(), and dplyr::funs has been deprecated, the current tidyverse approach would probably be something like:
    library(dplyr)
    df %>% summarise(across(everything(), ~ sum(is.na(.x))))
You could try the following functions:
Using colSums()
    colSums(is.na(df))
Using apply()
    apply(df, 2, function(x) {sum(is.na(x))})
Using a function
    sum.na <- function (x) {sum(is.na(x))}
    # called on the whole data.frame this gives the total NA count;
    # use sapply(df, sum.na) for per-column counts
    print(sum.na(df))
Using lapply()
    lapply(df, function(x) sum(is.na(x)))
Using sapply()
    sapply(df, function(x) sum(is.na(x)))
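To see how these variants differ in what they return, here is a minimal sketch on a small made-up data.frame (the data and the variable names are assumptions for illustration, not from the question):

```r
# Hypothetical example data, not from the original post
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 2), c = c(1, 2, 3))

by_colsums <- colSums(is.na(df))                       # named numeric vector
by_apply   <- apply(df, 2, function(x) sum(is.na(x)))  # named integer vector
by_sapply  <- sapply(df, function(x) sum(is.na(x)))    # named integer vector
by_lapply  <- lapply(df, function(x) sum(is.na(x)))    # named list, one element per column

print(by_colsums)
print(unlist(by_lapply))  # flatten the list into a named vector
```

The vector-returning variants agree on the counts; lapply() only differs in wrapping them in a list. Note that apply() converts the data.frame to a matrix first, so colSums(is.na(df)) or sapply() are usually the lighter choices on a big data.frame.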
We can also use dplyr functions to achieve this outcome:
    df %>% select(everything()) %>% summarise_all(funs(sum(is.na(.))))
The above solution lets you restrict the count to specific columns by replacing everything() with the columns you are interested in analysing. If you want to read further, you can check this page: https://sebastiansauer.github.io/sum-isna/.
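As a rough sketch of the column-selection idea, the snippet below counts NAs in two chosen columns only. It swaps summarise_all(funs(...)) for summarise(across(...)), since funs is deprecated; the data.frame and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical example data, not from the original post
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 2), c = c(1, 2, 3))

# Count NAs in columns a and b only; column c is dropped by select()
na_counts <- df %>%
  select(a, b) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

print(na_counts)
```

The result is a one-row data.frame with one count per selected column.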
You can use
    apply(is.na(df), 2, sum)
This will return the total number of NAs in each column.
Example:
    df <- data.frame(x = as.numeric(c(1,2,3,4,5,6,6,'fg',8,8,3,4,2)),
                     y = as.numeric(c(1,2,3,4,5,'as',7,8,9,9,1,4,2)),
                     z = as.numeric(c(1,4,6,7,'a',12,45,7,'as',1,23,12,'la')))
    apply(is.na(df), 2, sum)
Output:
    x y z
    1 1 3
A possible data.table approach:
    library(data.table)
    egdf = data.frame(x=c(1, 10, NA, NA, 2), y=c(2.4, NA, 2, 3.5, NA))
    setDT(egdf)         # make it a data.table
    egdf[, z := x+y]    # add another column
    # use .SDcols after 2nd ',' to specify only some columns
    egdf[, lapply(.SD, function(x) {return(sum(is.na(x)))}) ,]
    ###
    #    x y z
    # 1: 2 2 4
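The .SDcols hint in the comment can be sketched as follows, reusing the same toy data; the choice of columns x and z and the name na_xz are just for illustration.

```r
library(data.table)

# Same toy data as in the answer above
egdf <- data.table(x = c(1, 10, NA, NA, 2), y = c(2.4, NA, 2, 3.5, NA))
egdf[, z := x + y]  # z is NA wherever x or y is NA

# Restrict .SD to columns x and z, so only those are counted
na_xz <- egdf[, lapply(.SD, function(col) sum(is.na(col))), .SDcols = c("x", "z")]
print(na_xz)
```

This returns a one-row data.table with the NA counts for x and z and leaves y out.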
Another, very simple, approach is just to use the summary() function on the dataframe. It will give at a glance the NA counts for each column.
From this website (intro2r) on the summary() function, "If a variable contains missing data then the number of NA values is also reported." (Note, I could not find confirmation in the official docs.)
A downside to summary() is that it gets more involved if you want to extract the NA counts for use in another context, though there are ways to parse a summary() object (for example, this link).
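One possible way to pull the counts out of summary(), sketched here under the assumption that all columns are numeric: summary() on a numeric vector adds an element named "NA's" only when the vector contains missing values, so its absence has to be treated as zero. This is not necessarily the approach the linked page takes; the data and the name na_from_summary are made up for illustration.

```r
# Hypothetical example data with numeric columns only
x <- data.frame(a = c(1, 2, NA, NA, 1),
                b = c(1, 1, 1, 1, NA),
                c = c(1, 2, 3, 4, 5))

na_from_summary <- sapply(x, function(col) {
  s <- unclass(summary(col))  # plain named numeric vector
  # "NA's" is only present when the column has at least one NA
  if ("NA's" %in% names(s)) s[["NA's"]] else 0
})

print(na_from_summary)
```

This shows why summary() is awkward for this purpose: the missing "NA's" entry needs special handling, whereas colSums(is.na(x)) gives the counts directly.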
    x <- data.frame(a = c(1, 2, NA, NA, 1), b = c(1, 1, 1, 1, NA))
    apply(x, 2, function(z) sum(is.na(z)))