R/Traps and pitfalls

From Wiki1



Some of these observations may change as R develops. It would be a good idea to add the version of R in which each behaviour was observed.


Many of the tricky silent traps are encountered in the use of factors.

Transformation of factors to characters or codes

In its raw form, a factor is a vector of integers that provides indices into a vector of 'levels' for the factor. The levels are attached as an attribute to the factor.

> fac <- factor( c('c','a','a','b',NA,'c'))

> unclass(fac)
[1]  3  1  1  2 NA  3
attr(,"levels")
[1] "a" "b" "c"

A factor vector can be coerced to its character form or to its numerical indices:

> as.character(fac)
[1] "c" "a" "a" "b" NA  "c"
> as.numeric(fac)
[1]  3  1  1  2 NA  3
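The relation between the two forms can be checked directly: indexing the levels by the integer codes reproduces the character form. A minimal sketch using only base R (the names 'codes', 'labs' and 'rebuilt' are illustrative):

```r
fac <- factor(c('c', 'a', 'a', 'b', NA, 'c'))

codes <- as.integer(fac)   # integer indices into the levels
labs  <- levels(fac)       # the 'levels' attribute: "a" "b" "c"

# Indexing the levels by the codes gives back the character form
rebuilt <- labs[codes]
identical(rebuilt, as.character(fac))
```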

Most functions operating on factors use either the factor's character form or its numerical form. In most cases, the form used is the only sensible one and there are no surprises. Sometimes the result is not what the user expected and mysterious bugs or, worse, silent errors are produced.

Factors transformed to character

The following functions use the character form of the factor:

> matrix(fac,3)
     [,1] [,2]
[1,] "c"  "b" 
[2,] "a"  NA  
[3,] "a"  "c" 

> sub('a','A',fac)
[1] "c" "A" "A" "b" NA  "c"

> grep("a",fac, value = TRUE)
[1] "a" "a"

Factors transformed to numeric

The following functions use the numeric form. For indexing, that might seem to be the only sensible interpretation. However, since R also allows indexing by name, a user could intend the character values of a factor to select elements by name and instead get an entirely different result. For 'rbind', the use of numeric codes seems contrary to expectation, considering the behaviour of 'matrix' above.

> c('one','two','three')[fac]
[1] "three" "one"   "one"   "two"   NA      "three"
> rbind(fac)
    [,1] [,2] [,3] [,4] [,5] [,6]
fac    3    1    1    2   NA    3
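The difference between indexing by code and indexing by name is easiest to see on a named vector. The sketch below assumes a hypothetical named vector 'x' whose names are deliberately not in alphabetical order, so that the two interpretations disagree:

```r
fac <- factor(c('c', 'a', 'a', 'b', NA, 'c'))
x <- c(c = 10, a = 20, b = 30)    # hypothetical named vector

by_code <- x[fac]                 # uses the codes 3, 1, 1, 2, NA, 3
by_name <- x[as.character(fac)]   # uses the names "c", "a", "a", "b", NA, "c"

unname(by_code)   # 30 10 10 20 NA 30
unname(by_name)   # 10 20 20 30 NA 10
```

Converting the factor with 'as.character' before indexing is the usual way to get the by-name behaviour.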

When using 'rbind' with a factor and a character, the coercion of the factor to character occurs after extracting the numeric codes.

> rbind(fac, 'a')
    [,1] [,2] [,3] [,4] [,5] [,6]
fac "3"  "1"  "1"  "2"  NA   "3" 
    "a"  "a"  "a"  "a"  "a"  "a"

When using 'ifelse', a factor passed as the 'yes' or 'no' argument is reduced to its numeric codes.
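A small illustration of the 'ifelse' behaviour, as a sketch in base R (the 'test' vector and the result names are made up for the example). Converting the factor with 'as.character' first keeps the labels:

```r
fac  <- factor(c('c', 'a', 'a', 'b', NA, 'c'))
test <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)   # hypothetical condition

codes_out <- ifelse(test, fac, NA)                # integer codes, not labels
chars_out <- ifelse(test, as.character(fac), NA)  # labels preserved

codes_out   # 3 NA 1 2 NA 3
chars_out   # "c" NA "a" "b" NA "c"
```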


When a factor is used for indexing, its numeric instead of its character value is used. This can create surprises. For example, a factor containing variable names cannot be used directly to select variables in a data frame.

> df <- data.frame(c = 1:3, a = 11:13, b = 21:23)
> fac <- factor(c('a','b','c'))
> df[[fac[1]]]                  # code 1 selects the first column, 'c'
[1] 1 2 3
> df[[as.character(fac[1])]]    # the name "a" selects column 'a'
[1] 11 12 13

Factor operations that return a factor

Some operators on factors return a factor:

> fac:fac
[1] c:c  a:a  a:a  b:b  <NA> c:c 
Levels: a:a a:b a:c b:a b:b b:c c:a c:b c:c

Special pitfalls

A special pitfall can occur when attempting to transform a factor whose levels are character representations of numbers into a numeric object:

> facn <- factor( c(1,10,2))
> facn
[1] 1  10 2 
Levels: 1 2 10

Note in passing that the levels have been ordered numerically instead of lexicographically, as would have been the case if the argument to 'factor' had been c('1','10','2'). Thus the 'factor' function is 'numeric-smart'.

'facn' almost seems numeric but it is not:

> facn + 1
[1] NA NA NA
Warning message:
In Ops.factor(facn, 1) : + not meaningful for factors

Neither 'as.character' nor 'as.numeric' returns the original numeric vector:

> as.character(facn)
[1] "1"  "10" "2" 
> as.numeric(facn)
[1] 1 3 2

To get the original numeric vector, one must compose both:

> as.numeric(as.character(facn))
[1]  1 10  2

or, one can define a function:

> num <- function(x) as.numeric(as.character(x))
> num(facn)
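The help page for 'factor' also mentions an equivalent form that converts only the levels and then indexes them by the codes, which it describes as slightly more efficient. A short sketch comparing the two (the result names are illustrative):

```r
facn <- factor(c(1, 10, 2))

via_character <- as.numeric(as.character(facn))   # convert every element
via_levels    <- as.numeric(levels(facn))[facn]   # convert the levels, then index by code

via_character   # 1 10 2
via_levels      # 1 10 2
```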

'drop' doesn't work with subset

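zz <- subset( dd, !(id %in% c('A','B')), drop = TRUE)

doesn't drop the unused levels of 'id'. The 'drop' argument of 'subset' is passed on to the '[' indexing operator, where it controls whether dimensions are dropped, not factor levels. Instead, use:

zz <- droplevels(subset( dd, !(id %in% c('A','B'))))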
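The effect can be checked on a small example. The data frame 'dd' below is hypothetical and only serves to show which levels survive each call:

```r
dd <- data.frame(id = factor(c('A', 'B', 'C', 'D')), x = 1:4)   # hypothetical data

zz <- subset(dd, !(id %in% c('A', 'B')), drop = TRUE)
levels(zz$id)     # "A" "B" "C" "D": the unused levels are still present

zz2 <- droplevels(zz)
levels(zz2$id)    # "C" "D"
```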


Many algorithms using eigenvalue or singular value decompositions (with 'eigen' or 'svd') form a diagonal matrix with the vector of eigen/singular values using the 'diag' function, e.g.

  > X <- matrix(rnorm(30),10)
  > sv <- svd(X)
  > d.inv <- 1/(sv$d[sv$d>0])
  > rk <- length(d.inv)
  > Xginv <- sv$v[,1:rk] %*% diag(d.inv) %*% t(sv$u[,1:rk])

This will fail if the rank of X is equal to 1 since, in that case, 'd.inv' has length 1 and 'diag(d.inv)' will be an identity matrix of dimension 'floor(d.inv)', while what is needed is a 1 x 1 matrix with a single element 'd.inv'. One solution is to use:

  > Xginv <- sv$v[,1:rk, drop = FALSE] %*% diag(d.inv, nrow = length(d.inv)) %*% t(sv$u[,1:rk, drop = FALSE])

Another is to use the fact that premultiplying a matrix by a diagonal matrix is the same as multiplying it elementwise by the vector of diagonal elements, which R recycles down the columns so that row i is scaled by the i-th element:

  > Xginv <- sv$v[,1:rk, drop = FALSE] %*% (d.inv * t(sv$u[,1:rk, drop = FALSE]))

Note that the extra parentheses are needed because '%*%' has higher precedence than '*': without them, the expression would be evaluated as '(sv$v[,1:rk, drop = FALSE] %*% d.inv) * t(sv$u[,1:rk, drop = FALSE])'.
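The behaviour of 'diag' with a length-one argument, and both remedies, can be tried on a matrix of rank 1. This is a sketch under stated assumptions: the matrix built with 'outer' and the relative tolerance of 1e-8 used to count non-zero singular values are choices made for the example, since computed singular values of a rank-deficient matrix are rarely exactly zero.

```r
# diag() with a single length-one argument builds an identity matrix
dim(diag(2.7))         # 2 2
dim(diag(0.5))         # 0 0
diag(0.5, nrow = 1)    # 1 x 1 matrix holding 0.5

# Rank-1 example (hypothetical data)
X  <- outer(1:4, 1:3)                  # 4 x 3 matrix of rank 1
sv <- svd(X)
rk <- sum(sv$d > 1e-8 * max(sv$d))     # assumed tolerance for "non-zero"
d.inv <- 1 / sv$d[seq_len(rk)]

V <- sv$v[, seq_len(rk), drop = FALSE] # keep matrix shape even when rk is 1
U <- sv$u[, seq_len(rk), drop = FALSE]

Xginv  <- V %*% diag(d.inv, nrow = rk) %*% t(U)   # explicit 1 x 1 diagonal
Xginv2 <- V %*% (d.inv * t(U))                    # elementwise scaling of rows

all.equal(X %*% Xginv %*% X, X)   # generalized-inverse property
all.equal(Xginv, Xginv2)
```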

Reading and Writing Data Files

NA as a valid value (the Namibia problem)

Many commands that read data files, e.g. read.csv and read.xls in the package gdata, will, by default, treat the string 'NA' as a missing value whether it occurs in a character or a numeric variable. In numeric variables, blanks are also turned into missing values. If 'NA' occurs as a valid value, for example the two-character ISO country code for Namibia, then you may use the argument 'na.strings = NULL' to ensure that 'NA' is not turned into a missing value. However, NA's used to indicate missing numeric values will now be interpreted as valid character values, and numeric variables with NA's will be read as factors (or, from R 4.0.0 on, where 'stringsAsFactors' defaults to FALSE, as character vectors).
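The problem can be reproduced without a file by passing the data through the 'text' argument of 'read.csv'. In this sketch the CSV content is made up, and 'na.strings = ""' is used as an assumed alternative spelling: it treats only empty fields as missing, so the string 'NA' is kept as a value.

```r
csv <- "country,value\nNA,1\nZA,2\n"   # hypothetical data; NA is Namibia

lost <- read.csv(text = csv)                    # default na.strings = "NA"
kept <- read.csv(text = csv, na.strings = "")   # only empty fields are missing

is.na(lost$country[1])       # TRUE: Namibia has become a missing value
as.character(kept$country)   # "NA" "ZA"
```

'as.character' is used in the last line so that the result is the same whether the column was read as a factor or as a character vector.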


Prediction with nlme

To get

   fit <- lme( y ~ x , data = dd, random = ~ 1 |id, na.action = na.omit)

   pp <- predict(fit, newdata = dd, level = 0)

to produce 'pp' of length equal to 'nrow(dd)' when 'dd' contains missing values, you can use the following combination of 'na.action's:

   fit <- lme( y ~ x , data = dd, random = ~ 1 |id, na.action = na.exclude)

   pp <- predict(fit, newdata = dd, level = 0, na.action = na.pass)

Note that the data argument of 'predict' for 'lme' objects is named 'newdata', not 'data'.