# R/Predict with expand.grid

### From Wiki1

The R command 'expand.grid' creates a data frame that is the cartesian product of its arguments. It is a very useful function to create prediction data frames which can be used, for example, to plot predicted responses over various combinations of values for predictor variables. It is particularly effective for models using fitting formulas that include expressions (e.g. log(X)) of predictor variables. The 'predict' method for many fitting functions (e.g. 'lm', 'glm', 'lme') only needs values of the underlying predictor variables, not their transformations.

This convenience comes at a cost. The 'predict' methods for 'lm' and for 'glm' have undergone a great deal of refinement so that they work 'safely' when factors in a prediction data frame do not have a levels attribute that is not identical to that of the same factor in the original data frame.

For other fitting methods, such as 'lme', where developers have focused on perfecting a specialized methodology and did not devote resources to perfecting a safe predict method, it is important for users to generate prediction data frames so that predictions will be correct.

This can be done by following two simple rules:

1. All factors in the prediction data frame must be generated by using the levels of the factor in the original data frame:

> library(car) > summary(Prestige) > fit <- lm(prestige ~ income * type, Prestige) > summary(fit) > pred <- expand.grid(income = c(1000, 20000), type = levels(Prestige $ type)) > pred $ prestige.pred <- predict(fit, newdata = pred) > pred

2. To get a subset of levels of a factor in a prediction data frame, generate a data frame with all levels, as above, then use 'subset' to retain only the desired levels:

> pred.sub <- subset( pred, type %in% c('bc', 'wc')) > pred.sub

The following examples show how 'predict' in 'lm' works correctly even when the preceding rules are violated. However, with 'nlme', the rules are vital.

## Factors with expand.grid

> de <- expand.grid(x = 1:2, a = c('male','female','unknown')) > de x a 1 1 male 2 2 male 3 1 female 4 2 female 5 1 unknown 6 2 unknown

creates a data frame with every combination of values for 'x' and for 'a'. Note that the 'a' variable is actually created as a factor:

> class(de$a) [1] "factor" > de$a [1] male male female female unknown unknown Levels: male female unknown > unclass(de$a) [1] 1 1 2 2 3 3 attr(,"levels") [1] "male" "female" "unknown"

with levels in the same order as in the 'a' argument to 'expand.grid'.

In other words, the command:

dother <- expand.grid(x = 1:2, a = c('female','male','unknown'))

does not merely create a data frame in which the rows are in a different order, but it creates the 'a' factor in a way that is entirely incompatible for purposes of fitting models in R with the factor created in the data frame 'de' above. Note:

> dother x a 1 1 female 2 2 female 3 1 male 4 2 male 5 1 unknown 6 2 unknown > dother$a [1] female female male male unknown unknown Levels: female male unknown > unclass(dother$a) [1] 1 1 2 2 3 3 attr(,"levels") [1] "female" "male" "unknown"

Using 'dother' where 'de' should have been used may produce wrong results because fitting functions use the numerical codes or the factor, not their labels. In 'dother$a', 'female' has code 1 while in 'de$a' it has code 2.

For expand.grid to produce a correct factor for prediction, it must be given all factor levels in exactly the same order as in the original data frame. The only safe way to do this is to explicitly use the levels from the parent data frame.

## lm is 'safe' with bad factors

The following example uses the data frame 'Prestige' in the 'car' package:

> library(car) > summary(Prestige) > fit <- lm( prestige ~ income*type, Prestige ) > > summary(fit) Call: lm(formula = prestige ~ income * type, data = Prestige)

Residuals: Min 1Q Median 3Q Max -12.2669 -5.2956 0.3125 4.3392 25.0200 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 13.9045168 3.1671787 4.390 3.02e-05 *** income 0.0040235 0.0005530 7.276 1.12e-10 *** typeprof 45.0190221 4.2907398 10.492 < 2e-16 *** typewc 18.9807386 5.3421020 3.553 0.000603 *** income:typeprof -0.0031783 0.0006047 -5.256 9.48e-07 *** income:typewc -0.0021712 0.0009700 -2.238 0.027603 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 7.268 on 92 degrees of freedom (4 observations deleted due to missingness) Multiple R-squared: 0.8286, Adjusted R-squared: 0.8193 F-statistic: 88.94 on 5 and 92 DF, p-value: < 2.2e-16 > pred <- expand.grid(income = c(10000,20000), type = levels(Prestige$type)) > pred$prestige.pred <- predict( fit, newdata = pred) > pred income type prestige.pred 1 10000 bc 54.13936 2 20000 bc 94.37421 3 10000 prof 67.37554 4 20000 prof 75.82754 5 10000 wc 51.40794 6 20000 wc 69.93062

Suppose we create 'pred' manually and put the levels in the 'wrong' order:

> pred2 <- expand.grid( income = c(10000,20000), type = c('bc','wc','prof')) > pred2$prestige.pred <- predict(fit, newdata = pred2) > pred2 income type prestige.pred 1 10000 bc 54.13936 2 20000 bc 94.37421 3 10000 wc 51.40794 4 20000 wc 69.93062 5 10000 prof 67.37554 6 20000 prof 75.82754

Magically, every thing turns out correct. However, this is for a reason that is true of prediction with 'lm' objects but not necessarily for other types of objects, for example 'nlme'. The reason things work out correctly with 'predict.lm' is that a great deal of effort has gone into making 'predict.lm' *safe*.

## Behaviour of 'predict' in 'nlme'

'Safe prediction' has not been built into 'nlme' as it has into 'lm'. The consequence is that generating the 'pred' data frame so factors have identical levels to those of the original data frame is critical. The following is an example using the 'Milk' data frame in 'nlme':

> library(nlme) > data() > summary(Milk) protein Time Cow Diet Min. :2.450 Min. : 1.000 B02 : 19 barley :425 1st Qu.:3.200 1st Qu.: 5.000 B17 : 19 barley+lupins:459 Median :3.410 Median : 9.000 B24 : 19 lupins :453 Mean :3.422 Mean : 9.185 B09 : 19 3rd Qu.:3.630 3rd Qu.:13.000 B11 : 19 Max. :4.590 Max. :19.000 B05 : 19 (Other):1223 > fit <- lme( protein ~ Diet*Time, Milk, random = ~ 1 | Cow ) > summary(fit) Linear mixed-effects model fit by REML Data: Milk AIC BIC logLik 534.1415 575.691 -259.0707 Random effects: Formula: ~1 | Cow (Intercept) Residual StdDev: 0.1661141 0.273052 Fixed effects: protein ~ Diet * Time Value Std.Error DF t-value p-value (Intercept) 3.550293 0.04295985 1255 82.64212 0.0000 Dietbarley+lupins -0.081598 0.05959248 76 -1.36927 0.1749 Dietlupins -0.123677 0.05965636 76 -2.07316 0.0415 Time -0.002633 0.00260675 1255 -1.01001 0.3127 Dietbarley+lupins:Time -0.001605 0.00362262 1255 -0.44310 0.6578 Dietlupins:Time -0.008913 0.00363821 1255 -2.44986 0.0144

Number of Observations: 1337 Number of Groups: 79 > pred <- expand.grid( Time = c(1,10), Diet =levels(Milk$Diet)) > pred$protein.pred <- predict( fit, newdata = pred, level = 0) > pred Time Diet protein.pred 1 1 barley 3.547660 2 10 barley 3.523964 3 1 barley+lupins 3.464456 4 10 barley+lupins 3.426314 5 1 lupins 3.415070 6 10 lupins 3.311156 > > pred2 <- expand.grid( Time = c(1,10), Diet =c('lupins','barley','barley+lupins')) > pred2$protein.pred <- predict( fit, newdata = pred, level = 0) > pred2 Time Diet protein.pred 1 1 lupins 3.547660 2 10 lupins 3.523964 3 1 barley 3.464456 4 10 barley 3.426314 5 1 barley+lupins 3.415070 6 10 barley+lupins 3.311156

Note that predicted values do not match the correct labels for 'diet'

> > pred3 <- expand.grid( Time = c(1,10), Diet = 'lupins') > pred3$protein.pred <- predict( fit, newdata = pred, level = 0) Error in `$<-.data.frame`(`*tmp*`, "protein.pred", value = c(3.54765985573679, : replacement has 6 rows, data has 2 > pred3 Time Diet 1 1 lupins 2 10 lupins