R/Traps and pitfalls
Many of the tricky silent traps are encountered in the use of factors.
Transformation of factors to characters or codes
In its raw form, a factor is a vector of integers that provides indices into a vector of 'levels' for the factor. The levels are attached as an attribute to the factor.
> fac <- factor( c('c','a','a','b',NA,'c'))
> unclass(fac)
[1]  3  1  1  2 NA  3
attr(,"levels")
[1] "a" "b" "c"
A factor vector can be coerced to its character form or to its numerical indices:
> as.character(fac)
[1] "c" "a" "a" "b" NA  "c"
> as.numeric(fac)
[1]  3  1  1  2 NA  3
Most functions operating on factors use either the factor's character form or its numerical form. In most cases, the form used is the only sensible one and there are no surprises. Sometimes the result is not what the user expected and mysterious bugs or outright errors can be produced.
Factors transformed to character
The following functions use the character form of the factor:
> matrix(fac,3)
     [,1] [,2]
[1,] "c"  "b"
[2,] "a"  NA
[3,] "a"  "c"
> sub('a','A',fac)
[1] "c" "A" "A" "b" NA  "c"
> grep("a",fac, value = TRUE)
[1] "a" "a"
Factors transformed to numeric
The following functions use the numeric form. In the first case (indexing), that might seem to be the only sensible interpretation. However, since R also allows indexing by name, a user could intend the character values of a factor to select elements by name and instead get an entirely different result. In the second case ('rbind'), the use of the numeric codes seems contrary to expectation, considering the behaviour of 'matrix' above.
> c('one','two','three')[fac]
[1] "three" "one"   "one"   "two"   NA      "three"
> rbind(fac)
    [,1] [,2] [,3] [,4] [,5] [,6]
fac    3    1    1    2   NA    3
When using 'rbind' with a factor and a character, the coercion of the factor to character occurs after extracting the numeric codes.
> rbind(fac, 'a')
    [,1] [,2] [,3] [,4] [,5] [,6]
fac "3"  "1"  "1"  "2"  NA   "3"
    "a"  "a"  "a"  "a"  "a"  "a"
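The indexing trap is easiest to see with a named vector whose names are not in level order. The following is a minimal sketch: the vector 'lookup' and its values are made up for illustration and do not come from the text above.

```r
fac <- factor(c('c', 'a', 'a', 'b', NA, 'c'))

# Hypothetical lookup table, deliberately NOT in level order
lookup <- c(c = 'gamma', b = 'beta', a = 'alpha')

# Indexing with the factor itself uses the integer codes 3 1 1 2 NA 3
by_code <- unname(lookup[fac])

# Indexing with the character form selects elements by name
by_name <- unname(lookup[as.character(fac)])

print(by_code)
print(by_name)
```

Wrapping the factor in 'as.character' whenever name-based lookup is intended avoids the silent mismatch.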
Factor operations that return a factor
Some operators on factors return a factor:
> fac:fac
[1] c:c  a:a  a:a  b:b  <NA> c:c
Levels: a:a a:b a:c b:a b:b b:c c:a c:b c:c
A special pitfall can occur when attempting to transform a factor whose levels are character representations of numbers into a numeric object:
> facn <- factor( c(1,10,2))
> facn
[1] 1  10 2
Levels: 1 2 10
Note in passing that the levels have been ordered numerically instead of lexicographically, as would have been the case if the argument to 'factor' had been c('1','10','2').
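The difference in level ordering can be checked directly. This short sketch compares the two calls; the variable names 'num_levels' and 'chr_levels' are introduced here only for illustration.

```r
# Numeric input: levels are sorted as numbers before being converted to strings
num_levels <- levels(factor(c(1, 10, 2)))

# Character input: levels are sorted as strings
chr_levels <- levels(factor(c('1', '10', '2')))

print(num_levels)
print(chr_levels)
```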
'facn' almost seems numeric but it is not:
> facn + 1
[1] NA NA NA
Warning message:
In Ops.factor(facn, 1) : + not meaningful for factors
Neither 'as.character' nor 'as.numeric' returns the original numeric vector:
> as.character(facn)
[1] "1"  "10" "2"
> as.numeric(facn)
[1] 1 3 2
To get the original numeric vector, one must compose both:
> as.numeric(as.character(facn))
[1]  1 10  2
or, one can define a function:
> num <- function(x) as.numeric(as.character(x))
> num(facn)
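An alternative that converts each level only once, rather than every element, is to index the numeric levels by the factor. The R help page for 'factor' mentions this form as slightly more efficient. The sketch below compares it with the composition used above; the names 'via_character' and 'via_levels' are illustrative.

```r
facn <- factor(c(1, 10, 2))

# Convert every element: factor -> character -> numeric
via_character <- as.numeric(as.character(facn))

# Convert the levels once, then index them by the factor's integer codes
via_levels <- as.numeric(levels(facn))[facn]

print(via_character)
print(via_levels)
```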
'drop' doesn't work with subset
zz <- subset( dd, !(id %in% c('A','B')), drop = TRUE)
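doesn't drop the unused levels in 'id', as one might expect: 'drop' is passed on to the indexing operator '[' and controls whether dimensions are dropped, not factor levels.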
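One way to remove the unused levels is to call 'droplevels' on the result. The following is a minimal sketch with a made-up data frame 'dd'; the column 'y' and the object 'zz2' are assumptions introduced for illustration.

```r
# Hypothetical data frame with a factor 'id'
dd <- data.frame(id = factor(c('A', 'B', 'C', 'D')), y = 1:4)

# Rows for 'A' and 'B' are removed, but all four levels are kept
zz <- subset(dd, !(id %in% c('A', 'B')), drop = TRUE)
print(levels(zz$id))

# droplevels() removes the levels that no longer occur
zz2 <- droplevels(zz)
print(levels(zz2$id))
```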
Many algorithms using eigenvalue or singular value decompositions (with 'eigen' or 'svd') form a diagonal matrix with the vector of eigen/singular values using the 'diag' function, e.g.
> X <- matrix(rnorm(30),10)
> sv <- svd(X)
> d.inv <- 1/(sv$d[sv$d>0])
> rk <- length(d.inv)
> Xginv <- sv$v[,1:rk] %*% diag(d.inv) %*% t(sv$u[,1:rk])
This will fail if the rank of X is equal to 1 since, in that case, 'diag(d.inv)' will be an identity matrix of dimension 'floor(d.inv)', while what is needed is a 1 x 1 matrix with a single element 'd.inv'. One solution is to use:
> Xginv <- sv$v[,1:rk] %*% diag(d.inv, nrow = length(d.inv)) %*% t(sv$u[,1:rk])
Another is to use the fact that premultiplying a matrix by a diagonal matrix is the same as multiplying it elementwise by the vector of diagonal elements, which R recycles down the columns:
> Xginv <- sv$v[,1:rk] %*% (d.inv * t(sv$u[,1:rk]))
Note that the extra parentheses are needed: '%*%' takes precedence over '*', and the mixed product is not associative.
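The behaviour of 'diag' on a vector of length one can be checked in isolation, and the fix can be wrapped in a small function. This is a sketch under stated assumptions, not a replacement for a tested routine: the function name 'ginv_sketch' and the relative tolerance 'tol' are introduced here (the text above keeps all singular values greater than 0), and the columns of 'u' and 'v' are selected with 'drop = FALSE' so that they remain matrices when the rank is 1.

```r
# A length-one vector is treated as a dimension, not as a diagonal
d1 <- 2.7
dim_trap  <- dim(diag(d1))                     # identity matrix of order 2
dim_fixed <- dim(diag(d1, nrow = length(d1)))  # 1 x 1 matrix holding 2.7

# Hypothetical helper: pseudo-inverse from the SVD
ginv_sketch <- function(X, tol = 1e-8) {
  sv <- svd(X)
  # Assumption: treat singular values below tol * largest as zero
  keep <- sv$d > tol * max(sv$d)
  d.inv <- 1 / sv$d[keep]
  sv$v[, keep, drop = FALSE] %*%
    diag(d.inv, nrow = length(d.inv)) %*%
    t(sv$u[, keep, drop = FALSE])
}

# Rank-1 example: every column is a multiple of 1:4
X1 <- outer(1:4, c(1, 2))
G1 <- ginv_sketch(X1)
print(dim(G1))
```

A quick sanity check is the defining property of a generalized inverse, that 'X1 %*% G1 %*% X1' reproduces 'X1' up to rounding error.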
Reading and Writing Data Files
NA as a valid value (the Namibia problem)
Many commands that read data files, e.g. 'read.csv' and 'read.xls' (the latter in the package gdata), will, by default, treat the string 'NA' as a missing value whether it occurs in a character or a numeric variable. In numeric variables, blanks are also turned into missing values. If 'NA' occurs as a valid value, for example the two-character ISO country code for Namibia, then you may use the argument 'na.strings = NULL' to ensure that 'NA' is not turned into a missing value. However, NA's used to indicate missing numeric values will now be interpreted as valid character values, and numeric variables with NA's will be read as factors (or as character vectors in R 4.0.0 and later, where 'stringsAsFactors' defaults to FALSE).
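The effect can be reproduced without a file by passing the data through the 'text' argument of 'read.csv'. The following is a minimal sketch; the column names and the numbers in 'csv_text' are made up for illustration and are not real data.

```r
# Hypothetical two-row file: 'NA' is Namibia's ISO code, not a missing value
csv_text <- "iso,value\nNA,2.5\nCA,38.0"

# Default: the string 'NA' is read as a missing value
default_read <- read.csv(text = csv_text)
print(is.na(default_read$iso))

# With na.strings = NULL the string 'NA' is kept as data
literal_read <- read.csv(text = csv_text, na.strings = NULL)
print(as.character(literal_read$iso))
```

The 'as.character' call is only there so that the output looks the same whether 'iso' is read as a factor or as a character vector.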
Prediction with nlme
fit <- lme( y ~ x , data = dd, random = ~ 1 |id, na.action = na.omit)
pp <- predict(fit, data = dd, level = 0)
When 'dd' contains missing values, 'pp' above is shorter than 'nrow(dd)'. To produce 'pp' of length equal to 'nrow(dd)', you can use the following combination of 'na.action's:
fit <- lme( y ~ x , data = dd, random = ~ 1 |id, na.action = na.exclude)
pp <- predict(fit, data = dd, level = 0, na.action = na.pass)