
I have a file with filename = 'fn', which I am reading as follows:

    age      CALCIUM  CREATININE  GLUCOSE
    64.3573           1.1         488
    69.9043  8.1      1.1         472
    65.6633  8.6      0.8         461
    50.3693  8.1      1.3         418
    57.0334  8.7      0.8         NEG
    81.4939           1.1         NEG
    56.954   9.8      1
    76.9298  9.1      0.8         NEG

    > tmpData = read.table(fn, header = TRUE, sep = "\t", na.strings = c('', 'NA', '<NA>'), blank.lines.skip = TRUE)
    > tmpData
          age CALCIUM CREATININE GLUCOSE
    1 64.3573      NA        1.1     488
    2 69.9043     8.1        1.1     472
    3 65.6633     8.6        0.8     461
    4 50.3693     8.1        1.3     418
    5 57.0334     8.7        0.8     NEG
    6 81.4939      NA        1.1     NEG
    7 56.9540     9.8        1.0    <NA>
    8 76.9298     9.1        0.8     NEG

The file is read as shown above, with missing values displayed as NA and <NA>. I guess that the GLUCOSE column is being treated as a factor. Is there an easy way to interpret <NA> as a real NA and to convert any non-numeric values into NA (in this example, NEG into NA)?

  • What happens if you add "NEG" to na.strings? Commented Feb 15, 2013 at 15:30
  • Works if NEG is included. But for a general string, where it can be any character sequence, is there any read method that deals with such a situation automatically? Commented Feb 15, 2013 at 15:36

1 Answer


You can take advantage of the fact that as.numeric will coerce non-numeric values to NA. In other words, try something like this:

Here's your data:

    temp <- structure(list(
      age = c(64.3573, 69.9043, 65.6633, 50.3693, 57.0334, 81.4939, 56.954, 76.9298),
      CALCIUM = c(1.1, 8.1, 8.6, 8.1, 8.7, 1.1, 9.8, 9.1),
      CREATININE = c(NA, 1.1, 0.8, 1.3, 0.8, NA, 1, 0.8),
      GLUCOSE = structure(c(5L, 4L, 3L, 2L, 6L, 6L, 1L, 6L),
                          .Label = c("", "418", "461", "472", "488", "NEG"),
                          class = "factor")),
      .Names = c("age", "CALCIUM", "CREATININE", "GLUCOSE"),
      class = "data.frame", row.names = c(NA, -8L))

And its current structure:

    str(temp)
    # 'data.frame': 8 obs. of  4 variables:
    #  $ age       : num  64.4 69.9 65.7 50.4 57 ...
    #  $ CALCIUM   : num  1.1 8.1 8.6 8.1 8.7 1.1 9.8 9.1
    #  $ CREATININE: num  NA 1.1 0.8 1.3 0.8 NA 1 0.8
    #  $ GLUCOSE   : Factor w/ 6 levels "","418","461",..: 5 4 3 2 6 6 1 6

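Convert that last column to numeric. Because it's a factor, we need to convert it to character first; calling as.numeric directly on a factor would return the internal level codes instead of the values. Note the warning. We're actually happy about that, since it tells us the non-numeric entries became NA.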

    temp$GLUCOSE <- as.numeric(as.character(temp$GLUCOSE))
    # Warning message:
    # NAs introduced by coercion

The result:

    temp
    #       age CALCIUM CREATININE GLUCOSE
    # 1 64.3573     1.1         NA     488
    # 2 69.9043     8.1        1.1     472
    # 3 65.6633     8.6        0.8     461
    # 4 50.3693     8.1        1.3     418
    # 5 57.0334     8.7        0.8      NA
    # 6 81.4939     1.1         NA      NA
    # 7 56.9540     9.8        1.0      NA
    # 8 76.9298     9.1        0.8      NA
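To apply the same idea to every column that was read as text, rather than one column at a time, something along these lines could work. This is only a sketch: `to_numeric_na` is a hypothetical helper name (not a base R function), the `demo` data is made up, and it assumes that every factor or character column is really meant to hold numbers.

```r
# Hypothetical helper (not part of base R): coerce every factor or character
# column of a data frame to numeric, turning anything unparseable into NA.
to_numeric_na <- function(df) {
  df[] <- lapply(df, function(x) {
    if (is.factor(x) || is.character(x)) {
      # as.character() first, so factors yield their labels, not level codes;
      # suppressWarnings() hides the expected "NAs introduced by coercion"
      suppressWarnings(as.numeric(as.character(x)))
    } else {
      x  # leave columns that are already numeric untouched
    }
  })
  df
}

# Small made-up example (assumption: not the asker's real file)
demo <- data.frame(
  age = c(64.3573, 57.0334, 56.954),
  GLUCOSE = factor(c("488", "NEG", "")),
  stringsAsFactors = FALSE
)

cleaned <- to_numeric_na(demo)
str(cleaned)
```

Suppressing the warning is a design choice: it keeps the output quiet, but it also hides coercions you did not expect, so it may be worth checking the NA counts afterwards.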

For fun, here's a little function I put together that provides an alternative approach:

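    makemeNA <- function (mydf, NAStrings, fixed = TRUE) {
      if (!isTRUE(fixed)) {
        # Treat NAStrings as a regex: strip the matching text from every value,
        # then let the empty string stand in for NA
        mydf[] <- lapply(mydf, function(x) gsub(NAStrings, "", x))
        NAStrings <- ""
      }
      # Re-detect each column's type, mapping NAStrings to NA
      mydf[] <- lapply(mydf, function(x) type.convert(
        as.character(x), na.strings = NAStrings))
      mydf
    }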

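With fixed = FALSE, this function lets you specify a regular expression to identify what should be an NA value. Note that gsub removes only the matching part of each value, so a value that matches only partially is shortened rather than set to NA. I haven't really tested it much, so use the regex feature at your own risk!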

Using the same "temp" object as above, try these out to see what the function does:

    # Change anything that is just text to NA
    makemeNA(temp, "[A-Za-z]", fixed = FALSE)

    # Change any exact matches with "NEG" to NA
    makemeNA(temp, "NEG")

    # Change any matches with 3-digit integers to NA
    makemeNA(temp, "^[0-9]{3}$", fixed = FALSE)
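One thing to watch with the regex mode is that `gsub` only strips the matching characters, so a value such as "12mg" would become 12 rather than NA. If whole-value replacement is what you want, a variant along these lines might be closer. It is a rough sketch only: `make_na_whole` is a made-up name, the `demo` data is invented for illustration, and passing `as.is = TRUE` to `type.convert` (to keep text columns as character instead of factor) is my own choice, not part of the function above.

```r
# Hypothetical variant of makemeNA (the name is made up): any value that
# matches the pattern anywhere is replaced by NA as a whole.
make_na_whole <- function(mydf, pattern) {
  mydf[] <- lapply(mydf, function(x) {
    x <- as.character(x)
    x[grepl(pattern, x)] <- NA       # drop the entire value on a match
    type.convert(x, as.is = TRUE)    # re-detect the column type
  })
  mydf
}

# Made-up data to show the difference
demo <- data.frame(dose = c("12", "12mg", "NEG"), stringsAsFactors = FALSE)

# Stripping only the matched characters leaves digits behind for "12mg"
stripped <- gsub("[A-Za-z]", "", demo$dose)
print(stripped)

# Matching on whole values turns both "12mg" and "NEG" into NA
whole <- make_na_whole(demo, "[A-Za-z]")$dose
print(whole)
```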