R/Predict with expand.grid

From Wiki1

< R
Jump to: navigation, search

The R command 'expand.grid' creates a data frame that is the cartesian product of its arguments. It is a very useful function to create prediction data frames which can be used, for example, to plot predicted responses over various combinations of values for predictor variables. It is particularly effective for models using fitting formulas that include expressions (e.g. log(X)) of predictor variables. The 'predict' method for many fitting functions (e.g. 'lm', 'glm', 'lme') only needs values of the underlying predictor variables, not their transformations.

This convenience comes at a cost. The 'predict' methods for 'lm' and for 'glm' have undergone a great deal of refinement so that they work 'safely' when factors in a prediction data frame do not have a levels attribute that is not identical to that of the same factor in the original data frame.

For other fitting methods, such as 'lme', where developers have focused on perfecting a specialized methodology and did not devote resources to perfecting a safe predict method, it is important for users to generate prediction data frames so that predictions will be correct.

This can be done by following two simple rules:

1. All factors in the prediction data frame must be generated by using the levels of the factor in the original data frame:

> library(car)
> summary(Prestige)
> fit <- lm(prestige ~ income * type, Prestige)
> summary(fit)
> pred <- expand.grid(income = c(1000, 20000), type = levels(Prestige $ type))
> pred $ prestige.pred <- predict(fit, newdata = pred)
> pred

2. To get a subset of levels of a factor in a prediction data frame, generate a data frame with all levels, as above, then use 'subset' to retain only the desired levels:

> pred.sub <- subset( pred, type %in% c('bc', 'wc'))
> pred.sub 
 

The following examples show how 'predict' in 'lm' works correctly even when the preceding rules are violated. However, with 'nlme', the rules are vital.

Factors with expand.grid

> de <- expand.grid(x = 1:2, a = c('male','female','unknown'))
> de
  x       a
1 1    male
2 2    male
3 1  female
4 2  female
5 1 unknown
6 2 unknown

creates a data frame with every combination of values for 'x' and for 'a'. Note that the 'a' variable is actually created as a factor:

> class(de$a)
[1] "factor"

> de$a
[1] male    male    female  female  unknown unknown
Levels: male female unknown

> unclass(de$a)
[1] 1 1 2 2 3 3
attr(,"levels")
[1] "male"    "female"  "unknown"

with levels in the same order as in the 'a' argument to 'expand.grid'.

In other words, the command:

 dother <- expand.grid(x = 1:2, a = c('female','male','unknown'))

does not merely create a data frame in which the rows are in a different order, but it creates the 'a' factor in a way that is entirely incompatible for purposes of fitting models in R with the factor created in the data frame 'de' above. Note:

> dother
  x       a
1 1  female
2 2  female
3 1    male
4 2    male
5 1 unknown
6 2 unknown
> dother$a
[1] female  female  male    male    unknown unknown
Levels: female male unknown
> unclass(dother$a)
[1] 1 1 2 2 3 3
attr(,"levels")
[1] "female"  "male"    "unknown"

Using 'dother' where 'de' should have been used may produce wrong results because fitting functions use the numerical codes or the factor, not their labels. In 'dother$a', 'female' has code 1 while in 'de$a' it has code 2.

For expand.grid to produce a correct factor for prediction, it must be given all factor levels in exactly the same order as in the original data frame. The only safe way to do this is to explicitly use the levels from the parent data frame.

lm is 'safe' with bad factors

The following example uses the data frame 'Prestige' in the 'car' package:

> library(car)
> summary(Prestige)
> fit <- lm( prestige ~ income*type, Prestige )
> 
> summary(fit)

Call:
lm(formula = prestige ~ income * type, data = Prestige) 
Residuals:
     Min       1Q   Median       3Q      Max 
-12.2669  -5.2956   0.3125   4.3392  25.0200  

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     13.9045168  3.1671787   4.390 3.02e-05 ***
income           0.0040235  0.0005530   7.276 1.12e-10 ***
typeprof        45.0190221  4.2907398  10.492  < 2e-16 ***
typewc          18.9807386  5.3421020   3.553 0.000603 ***
income:typeprof -0.0031783  0.0006047  -5.256 9.48e-07 ***
income:typewc   -0.0021712  0.0009700  -2.238 0.027603 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 7.268 on 92 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.8286,	Adjusted R-squared: 0.8193 
F-statistic: 88.94 on 5 and 92 DF,  p-value: < 2.2e-16 

> pred <- expand.grid(income = c(10000,20000), type = levels(Prestige$type))
> pred$prestige.pred <- predict( fit, newdata = pred)
> pred
  income type prestige.pred
1  10000   bc      54.13936
2  20000   bc      94.37421
3  10000 prof      67.37554
4  20000 prof      75.82754
5  10000   wc      51.40794
6  20000   wc      69.93062

Suppose we create 'pred' manually and put the levels in the 'wrong' order:

> pred2 <- expand.grid( income  = c(10000,20000), type = c('bc','wc','prof'))
> pred2$prestige.pred <- predict(fit, newdata = pred2)
> pred2
  income type prestige.pred
1  10000   bc      54.13936
2  20000   bc      94.37421
3  10000   wc      51.40794
4  20000   wc      69.93062
5  10000 prof      67.37554
6  20000 prof      75.82754

Magically, every thing turns out correct. However, this is for a reason that is true of prediction with 'lm' objects but not necessarily for other types of objects, for example 'nlme'. The reason things work out correctly with 'predict.lm' is that a great deal of effort has gone into making 'predict.lm' safe.

Behaviour of 'predict' in 'nlme'

'Safe prediction' has not been built into 'nlme' as it has into 'lm'. The consequence is that generating the 'pred' data frame so factors have identical levels to those of the original data frame is critical. The following is an example using the 'Milk' data frame in 'nlme':

> library(nlme)
> data()
> summary(Milk)
    protein           Time             Cow                  Diet    
 Min.   :2.450   Min.   : 1.000   B02    :  19   barley       :425  
 1st Qu.:3.200   1st Qu.: 5.000   B17    :  19   barley+lupins:459  
 Median :3.410   Median : 9.000   B24    :  19   lupins       :453  
 Mean   :3.422   Mean   : 9.185   B09    :  19                      
 3rd Qu.:3.630   3rd Qu.:13.000   B11    :  19                      
 Max.   :4.590   Max.   :19.000   B05    :  19                      
                                  (Other):1223                      
> fit <- lme( protein ~ Diet*Time, Milk, random = ~ 1 | Cow )
> summary(fit)
Linear mixed-effects model fit by REML
 Data: Milk 
       AIC     BIC    logLik
  534.1415 575.691 -259.0707   

Random effects:
 Formula: ~1 | Cow
        (Intercept) Residual
StdDev:   0.1661141 0.273052  

Fixed effects: protein ~ Diet * Time 
                           Value  Std.Error   DF  t-value p-value
(Intercept)             3.550293 0.04295985 1255 82.64212  0.0000
Dietbarley+lupins      -0.081598 0.05959248   76 -1.36927  0.1749
Dietlupins             -0.123677 0.05965636   76 -2.07316  0.0415 
Time                   -0.002633 0.00260675 1255 -1.01001  0.3127
Dietbarley+lupins:Time -0.001605 0.00362262 1255 -0.44310  0.6578
Dietlupins:Time        -0.008913 0.00363821 1255 -2.44986  0.0144
Number of Observations: 1337
Number of Groups: 79 
> pred <- expand.grid( Time = c(1,10), Diet =levels(Milk$Diet))
> pred$protein.pred <- predict( fit, newdata = pred, level = 0)
> pred
  Time          Diet protein.pred
1    1        barley     3.547660
2   10        barley     3.523964
3    1 barley+lupins     3.464456
4   10 barley+lupins     3.426314
5    1        lupins     3.415070
6   10        lupins     3.311156
> 
> pred2 <- expand.grid( Time = c(1,10), Diet =c('lupins','barley','barley+lupins'))
> pred2$protein.pred <- predict( fit, newdata = pred, level = 0)
> pred2
  Time          Diet protein.pred
1    1        lupins     3.547660
2   10        lupins     3.523964
3    1        barley     3.464456
4   10        barley     3.426314
5    1 barley+lupins     3.415070
6   10 barley+lupins     3.311156

Note that predicted values do not match the correct labels for 'diet'

> 
> pred3 <- expand.grid( Time = c(1,10), Diet = 'lupins')
> pred3$protein.pred <- predict( fit, newdata = pred, level = 0)
Error in `$<-.data.frame`(`*tmp*`, "protein.pred", value = c(3.54765985573679,  : 
  replacement has 6 rows, data has 2
> pred3
  Time   Diet
1    1 lupins
2   10 lupins
Personal tools