# MATH 6643 Summer 2012 Applications of Mixed Models/Snijders and Bosker: Discussion Questions

## Chapter 5: The Hierarchical Linear Model

### Matthew

At the bottom of page 83, Snijders and Bosker outline the process for probing interactions between two level one variables, and how there can be four possibilities for how to model it. If a researcher was to include all four, discuss how each would be interpreted. What might a good selection strategy be if our model had substantially more than two variables?

### Qiong

If we do not have any information about the data set, how to choose a level - two variable to predict the group dependent regression coefficients? After we choose the level - two variable z, how to explain the cross - level interaction term.

### Carrie

A client arrives with a random slope and intercept model using IQ as a predictor. IQ was measured on the traditional scale with a mean of 100 and standard deviation of 15. What should the client keep in mind about the interpretation of the variance of the intercept and covariance of the slope-intercept?

This raises the interesting question of how the variance of the random intercept and the covariance of the random intercept with the random slope are changed under a recentering of IQ. Let
$Var\left[ \begin{matrix} {{u}_{0j}} \\ {{u}_{1j}} \\ \end{matrix} \right]=\left[ \begin{matrix} \tau _{0}^{2} & {{\tau }_{01}} \\ {{\tau }_{10}} & \tau _{1}^{2} \\ \end{matrix} \right]$ for raw IQ.
If we recenter IQ with: $\tilde{\text{IQ}}=\text{IQ}-c$ then:
$\left[ \begin{matrix} {{{\tilde{u}}}_{0j}} \\ {{{\tilde{u}}}_{1j}} \\ \end{matrix} \right]=\left[ \begin{matrix} 1 & c \\ 0 & 1 \\ \end{matrix} \right]\left[ \begin{matrix} {{u}_{0j}} \\ {{u}_{1j}} \\ \end{matrix} \right]$
and
$Var\left[ \begin{matrix} {{{\tilde{u}}}_{0j}} \\ {{{\tilde{u}}}_{1j}} \\ \end{matrix} \right]=\left[ \begin{matrix} 1 & c \\ 0 & 1 \\ \end{matrix} \right]\left[ \begin{matrix} \tau _{0}^{2} & {{\tau }_{01}} \\ {{\tau }_{10}} & \tau _{1}^{2} \\ \end{matrix} \right]\left[ \begin{matrix} 1 & 0 \\ c & 1 \\ \end{matrix} \right]=\left[ \begin{matrix} \tau _{0}^{2}+2c{{\tau }_{01}}+{{c}^{2}}\tau _{1}^{2} & {{\tau }_{01}}+c\tau _{1}^{2} \\ {{\tau }_{10}}+c\tau _{1}^{2} & \tau _{1}^{2} \\ \end{matrix} \right]$
If $c=-{{\tau }_{01}}/\tau _{1}^{2}$, then the variance of the intercept is minimized:
$Var\left[ \begin{matrix} {{{\tilde{u}}}_{0j}} \\ {{{\tilde{u}}}_{1j}} \\ \end{matrix} \right]=\left[ \begin{matrix} \tau _{0}^{2}-\tau _{01}^{2}/\tau _{1}^{2} & 0 \\ 0 & \tau _{1}^{2} \\ \end{matrix} \right]=\left[ \begin{matrix} \tau _{0}^{2}\left( 1-\rho _{01}^{2} \right) & 0 \\ 0 & \tau _{1}^{2} \\ \end{matrix} \right]$

### Gilbert

In chapter 5, they talk about hierarchical linear model where fixed effects and random effects are taken into consideration. Discuss a clear simple example in class which shows both effects and give interpretations of each of the coefficients and their use in real life.

### Daniela

In chapter 5, they talked about mostly about the two-level nesting structure. Can we have a bigger example with at least 4 levels that includes the random intercept and slope and how to apply this into R coding?

In 'lme', multilevel nesting is handled with. e.g.
        fit <- lme( Y ~ X * W, dd, random = ~ 1 | idtop/idmiddle/idsmall)

Contextual variables present an ambiguity. Assuming that the id variables 'idsmall' and 'idmiddle' are coded uniquely overall, then the the higher level, say 'idmiddle', contextual variables could be coded as either:
        cvar(X,idmiddle)

or
        capply( dd , ~ id, with, mean( c(tapply( X, idsmall, mean))))

Here is a table prepared for SPIDA showing how to handle multilevel nesting and crossed structures in a selection of R functions:
Function Notes
lme

in package nlme

Linear mixed effects: normal response

G side and R side modelling
Model syntax:

 Y ~ X * W, random = ~ 1 + X | id


For nested effect:

 Y ~ X * W, random = ~ 1 + X | higher/lower


or, to have different models at different levels:

 Y ~ X * W, random = list(higher = ~ 1, lower = ~ 1 + X )

lmer

in package lme4

Linear mixed models for gaussian response with Laplace approximation

G side modeling only, R = σ2I
Model syntax:

 Y ~ X * W +(1+X|id)


For nested effect:

 Y ~ X * W +(1+X|higher) + (1+X|higher:lower)


For crossed effect:

 Y ~ X * W +(1+X|id1) + (1+X|id2)

glmer

in package lme4

• family: binomial, Gamma, inverse.gaussian, poisson, gaussian

G side only, no R side
Model syntax:

 Y ~ X * W +(1+X|id)


For nested effect:

 Y ~ X * W +(1+X|higher) + (1+X|higher:lower)


For crossed effect:

 Y ~ X * W +(1+X|id1) + (1+X|id2)

glmmPQL

in packages MASS/nlme

Generalized linear mixed models with Penalized Quasi Likelihood
• family: binomial, Gamma, inverse.gaussian, poisson, gaussian

G side and R side as in lme
Model syntax:

 Y ~ X * W, random = ~ 1 + X | id


For nested effect:

 Y ~ X * W, random = ~ 1 + X | higher/lower

MCMCglmm

in package MCMCglmm

Generalized linear mixed models with MCMC
• family: poisson, categorical, multinomial, ordinal, exponential, geometric, cengaussian, cenpoisson,

cenexponential, zipoisson, zapoisson, ztpoisson, hupoisson, zibinomial (cen=censored, zi=zero-inflated, za=zero-altered, hu=hurdle
G side and R side, R side different from 'lme': no autocorrelation but can be used for multivariate response
Note: 'poisson' potentially overdispersed by default (good), 'binomial' variance for binary variables is unidentified.
Model syntax:

 Y ~ X * W, random = ~ us(1 + X):id  [Note: id should be a factor, us=unstructured]


For nested effect:

 Y ~ X * W, random =  ~us(1 + X):higher + us(1 + X):higher:lower


For crossed effect:

 Y ~ X * W, random =  ~us(1 + X):id1+ us(1 + X):id2


### Phil

Exluding fixed effects that are non-significant is common practice in regression analyses, and Snijders and Bosker follow this practice when simplifying the model in Table 5.3 to the model found in Table 5.4. While this practice is used to help make the model more parsimonious it can ignore the joint effect that these variables have on the model as a whole. Discuss alternative criteria that one should explore when determining whether a predictor should be excluded from the model.

### Ryan

When using random slopes it is generally the case that the level 1 model contains a fixed effect for what will also be the level 2 random effect. The random effect is then an estimate of the group/cluster/individual departure from the fixed effect. However, non-significant level 1 variables are not determinable as different from zero. Are there cases where a non-significant fixed effect can be excluded from the model while retaining the random effect at level 2. What would be the consequence of this and what might it reveal about the level 1 variable? Would this help control for the error of excluding a non-significant but confounding variable?

This is a very interesting question. It would be interesting to create a simulated data set illustrating the issue so we could consider the consequences of having random effects for a confounding factors whose within cluster effect changes sign from cluster to cluster. Can we think of a confounding factor that would do that?

## Chapter 6: Testing and Model Specification

### What is random?

At the beginning of the chapter, S&J present two models (Table 6.1). They note that "The variable with the random slope is in both models the grand-mean centered variable IQ". In R: would the random side look like: random = ~ (IQ - IQbar)|school, even though (IQ - IQbar) isn't in the fixed part of the second model? How is this different in interpretation from: random = ~ (group centered IQ - IQbar)|school)?

• NOTE: This question is not necessarily about how to specify a model using nlme, but rather about the terms included in the random part of the model. As a test, I ran two models:
IQ_dev <- mlbook_red$IQ_verb - mlbook_red$sch_iqv

mlb612a <- nlme:::lme(langPOST ~ IQ_dev + sch_iqv + ses, random =~ IQ_verb|schoolnr, data = mlbook_red, method="ML")
summary(mlb612a)

mlb612b <- nlme:::lme(langPOST ~ IQ_dev + sch_iqv + ses, random =~ IQ_dev|schoolnr, data = mlbook_red, method="ML")
summary(mlb612b)


Note that the basic IQ_verb variable has been grand mean centered. According to the chapter, it sounds like using IQ_verb in the random portion is OK since it is a linear combination of IQ_dev and sch_iqv (the school means of IQ_verb). However, if I compare it against a second model using IQ_dev in the random part, pretty much all of the coefficients change. Is this expected?

--Msigal 12:07, 13 June 2012 (EDT)

### Qiong

The author introduce the t - test to test fixed parameters. We can use summary(model) to get the p - value directly in R. In practice, we use wald test to test fixed parameter. On page 95, they mentioned that "for 30 or more groups, the Wald test of fixed effects using REML standard errors have reliable type I error rates." Is it the only reason we use the wald test in practice?

### Carrie

On page 105 in their discussion of modeling within-group variability the authors warn to "keep in mind that including a random slope normally implies inclusion of the fixed effect!". What is an example of a model where one might include random slope without a fixed effect??

### Gilbert

On page 104 , the author discuss different approaches for model selection including working upward from level one, joint consideration of both level one and two. Are there other methods to be used If both methods are not providing a satisfactory model?

### Daniela

On page 97 and 99, the authors showed us how the tests for random intercept and random slope independently. What test would you use if you wanted to test for both random slope and intercept in the same model? and what would you test it against? ... the linear model or a model with just a random slope or a random intercept?

### Phil

Multi-parameter tests are possible for fixed effects, but can they also be applied to predicted random effects? If so what would be the analog to $\hat{\gamma}^' \hat{\Sigma}^{-1}_\gamma \hat{\gamma}$, used to find the Wald statistic, and how do we find an appropriate df denominator term for an F test?

Question: Just to be sure, I presume you mean a BLUP? Answer: Yes I do :)

### Ryan

In Chapter 6 Snijders and Bosker suggest that using deviance tests to evaluate fixed effects in a multilevel model is inappropriate if the estimation method is REML. What is the characteristic of REML vs ML that makes this type of model evaluation incorrect?

Comment: Likelihood has the form L(data|theta) and is used for inference on theta, for example, by comparing the altitude of the log-likelihood at theta.hat (the summit) compared with the altitude at a null hypothetical value theta.0. This is the basis of deviance tests: Deviance = 2*(logLik(data|theta.hat) - logLik(data|theta.0)).
With ML the log-likelihood is logLik( y | beta, G, R) and we can use the likelihood for inference on all parameters. With REML, the data is not the original y, but the residual of y on X, say, e. And the likelihood is only a function of G and R: logLik( e | G, R). 'beta' does not appear in the likelihood, thus the likelihood cannot be used to answer any questions about 'beta' since is does not appear in the likelihood.

## Chapter 7: How Much Does the Model Explain?

### Explained Variance in Random Slope Models

Looking at the proportion of variance explained by a model in a traditional ANOVA/multiple regression framework is something clients are often extremely interested in. In Chapter 7, Snijders and Bosker discuss how we might approach the issue in MLM. Near the end of the chapter, the authors insinuate that getting an estimate of the amount of explained variance in a random effects (intercept and slope) model is a somewhat tedious endeavour.

The claim is that random slopes don't change prediction very much so if we re-estimate the model using only random intercepts (no random slopes), this will "normally yield [predicted] values that are very close to values for the random slope models" (p. 114). This statement doesn't quite ring true for me, as in our examples the differences in slope between schools has been fairly striking/substantial.

Is the authors' statement justifiable? Is obtaining an R2 as important/interesting in MLM as it is in other models?

### Explained variance in three-level models

In example 7.1, we know that how to calculate the explained variance of a level one variable when this variable has a fixed effect only. I want to know how to calculate the explained variance of a level one variable when this variable has a fixed and random effect in the model?

### Interpreting R2 as an Effect Size

A client fits a multilevel model and comes up with several significant predictors. The client is pleased with themselves, but remembers learning that significance alone isn't good enough these days, and needs help producing a measure of effect size. You compute the Level 1 R^2 and come up with a very small value, say 0.01. Is the model then worthless, even if the magnitude of the predicted change in the outcome is substantively meaningful?

### Explained variance

In the example provided on page 110, it show that the residual variance at level two increases as within-group deviation is added as an explanatory variable to the model in balanced as well as in the unbalanced case. Is this always the case or it is only for this particular example?

### Estimates of R2

On page 113, "it is observed that an estimated value for R2 becomes smaller by the addition of a predictor variable, or larger by the deletion of a predictor variable, there are two possibilities: either this is a chance fluctuation, or the larger model is misspecified." The authors then say that whether the first or second possibility is more likely depends on the size of the change in R2. Can you give an example of when this occurs based on the size of change in R2?

### Predicted R2

After predicting values for random intercepts and slopes using Bayesian methods it is possible to form composite values, $\hat{Y}_{ij}$, to predict the observed dependent values, Yij. Obviously these estimates will be sub-optimal as they will suffer from 'shrinkage' effects, but they may be useful for computing a '$R^2_{Predicted}$'. Discuss situations where knowledge of the predicted slopes and intercepts could be important, and whether an $R^2_{Predicted}$ could be a useful description.

### The Size and Direction of R2 Change As a Diagnostic Criteria

The suggestion has been made that changes in R2 where the addition or deletion of a variable creates an unexpected and opposing directional change, can serve as a diagnostic toward determining where the flaw in the model resides. However, the authors do not actually indicate which scenarios the size and increase/decrease information obtained from the R2 estimate determines the source of the flaw. 'Wrong' directions provide evidence of model misspecification, but what then of the magnitude component mentioned just prior? (p. 113)

## Chapter 8: Heteroscedasticity

You can sign your contributions with --~~~~ --Georges 07:50, 14 June 2012 (EDT)

### "Correlates of diversity"

Provide an example illustrating how level-two variables are considered being associated with level-one heteroscedasticity. --Gilbert8 11:26, 16 June 2012 (EDT)

### "Modeling Heteroscedasticity"

When Snijders and Bosker say they are "modeling heteroscedasticity", is this simply incorporating more random slopes into the model? For instance, on page 127, they added a fixed effect for SA-SES (the school average of SES) and a random slope for it. What kind of plots would let us see if these inclusions are necessary? --Msigal 11:26, 18 June 2012 (EDT)

### Linear or quadratic variance functions

The level-one residual variance can be expressed by a linear or quadratic function of some variables. How to decide the function form? Can we say that if the variables have a random effect, then we use a quadratic form. Otherwise, we use a linear form? Is it the same thing for the intercept residual variance? --~~~~ --Qiong Li 12:25, 18 June 2012 (EDT)

### On a practical note - How is this done??

It appears that in order to fit a model with a linear/quadratic function for the variance the authors had to use MLwiN. Are there other ways to accomplish this? Could we talk a little about what their demo code is accomplishing? | S&B ExCode --Smithce 14:15, 19 June 2012 (EDT)

### Variable centering

Since fixed effects variables can be included in the R matrix to model systematic heteroscedasticity discuss the effects of centering variables in this context. Does centering affect the estimation results or numerical stability? --Rphilip2004 13:42, 19 June 2012 (EDT)

What happens to the variance function when there are more than two levels? Do we still only have to choose from linear and quadratic forms or dose it become more complicated? --Dusvat 18:43, 18 June 2012 (EDT)

### Generic regression question: Treating a factor as continuous for the interaction

On page 126 Models 3 (described on page 124) the authors treat SES as a factor for main effects, but then to keep the number of interactions around they treat it as numeric in the interaction with other variables. This seems like it could come in useful, are there any caveats we should be aware of in using this technique? --Smithce 14:15, 19 June 2012 (EDT)

### Generalized Linear Models and Heteroscedasticiy

Given a linear model that has level - 1 heteroscedasticity related to multiple level - 1 predictors, does this not mean that the heteroscedasticity can be thought of as related to the overall mean response. The residuals of a heteroscedastic model would then become functions of the mean response. Generalized linear models often model the variance as a function of the mean response (ex., Poisson, Gamma, Negative Binomial, etc.). When might it be appropriate to abandon a direct linear relationship in favour of a generalized linear model (which at times retains the additive linear properties desired (in the linear predictor)) to deal with heteroscedastic issues? Is this even possible? --Rbarnhar 14:24, 19 June 2012 (EDT)

## Chapter 9: Missing Data

### "Imputation"

1.The chapter discuss imputation as a way of filling out missing data to form a complete data set. Are there any other method which can be used to achieve the same goal? Provide few examples.

--Gilbert8 11:38, 18 June 2012 (EDT)


### Patterns of Missingness

Snijders and Bosker go to some lengths to explain the difference between MCAR, MAR, and MNAR. However, I felt they somewhat glossed over a definition of monotone missingness. What is monotone missingness? How would one check for it, especially in terms of a multilevel model? --Msigal 08:47, 20 June 2012 (EDT)

If we use a missingness indicator and predict this using a logistic regression model, does this mean that significant predictors should be kept in the imputation model and non-significant predictors can be omitted from the imputation model? --Smithce 09:58, 21 June 2012 (EDT)

### Missingness Assumption

Rubin defined three types of missingness in 1976. When we use methods to handling incomplete data, what information can help us to make a reasonable assumption? What is the key point whether missingness is MCAR or MAR? --~~~~ --Qiong Li 12:31, 20 June 2012 (EDT)

### Full Maximum Likelihood vs. Imputation

Could you give us an example using both maximum likelihood and imputation methods in R and then compare? How are the methods similar/ different? Is one method computationally better than the other?--Dusvat 17:27, 20 June 2012 (EDT)

### Dancing on the Bay(esian)

Imputation seems to be a very Bayesian practice, and the authors mention the intimate connection to Gibbs sampling when imputing data in the univariate case. I wonder, however, that if we are willing to impute data in this Bayesian manner why we don't just jump ship and move to a more complete Bayesian methodology? What are the benefits/downsides of initially dancing with the idea of being Bayesian to get our complete data, then being frequentists to fit our models? --Rphilip2004 09:18, 21 June 2012 (EDT)

### Don't Throw the Kitchen Sink - ICE

While Snijders and Bosker promote the idea of a complex model for imputation using, as many feasible variables as possible, is this always true? Should we not consider parsimony as we would in any other form of regression? Should it also possibly reflect the amount of missing data we are trying to impute? --Rbarnhar 09:59, 21 June 2012 (EDT)

## Chapter 10: Assumptions of the Hierarchical Linear Model

### Influence of level-two units

Provide a detailed explanation of how deletion diagnostics is performed and provide a practical example to illustrate it.--Gilbert8 18:22, 24 June 2012 (EDT)

### Incorporating Descriptive Statistics

One aside that Snijders and Bosker make in this chapter is about the inclusion of the standard deviation for each group of a relevant level one variable as a fixed effect in the model. This was mentioned within the section on adding contextual variables (p. 155). This strikes me as an interesting prospect. Does modeling the standard deviation have any interpretative benefits over simply using group size? Are there other descriptive statistics pertaining to groups that would be meaningful to add to a model? What would their interpretation be? --Msigal 10:12, 25 June 2012 (EDT)

### Orders of the model checks

In Chapter 10, the auther introduced a number of things we need to do when we build a mixed model, such as "include contextual effects", "check random effects", "specification of the fixed part", "specification of the random part" and "check the distributional assumption". When I deal with real data, I always confused that which things I should do first, which things I should do next. No rules or there is a better order to do these things? --~~~~ --Qiong Li 2:33, 25 June 2012 (EDT)

### Assumption violations

In Ch. 10, the authors talk about having the random slope and random intercept uncorrelated with all explanatory variables. If this assumption is incorrect you can just add relevant explanatory variables to the model. What happens if you have a real world example where there are not quantifiable relevant explanatory variables to be added to the model? How would you go about fixing the incorrect assumption? What happens if more than one assumption is violated and you cannot just include other 'descriptive' variables to the model?--Dusvat 17:45, 25 June 2012 (EDT)

### Checking for Random Slopes

In this chapter (page 155-156) S&B advocate checking for a random slope for each level-one variable in the fixed part. Because the process can be time consuming, they further suggest using one-step methods to obtain provisional estimates and then checking the t-ratio. We have learned that t-tests of this kind are problematic for variance parameters and not to be trusted. Should we then use LRTs to test all possible random slopes, possibly with simulation?

### Cluster Size and Model Assumptions

In the practical nuts and bolts of application, one at times encounters situations where the number of clusters in a requested multilevel model does not support the testable or readily examinable assumption that the random intercepts are normally distributed. How important is it that the assumption of normally distributed intercepts holds?--Rbarnhar 01:22, 26 June 2012 (EDT)

### Model Residuals

The authors emphasize the importance of using OLS estimation for determining unbiased diagnostics. However, it may be useful to use model implied residuals such as $r = y - X \hat{\beta}$ and $r.c = y - X \hat{\beta} - Z \hat{\gamma}$. Describe how these model implied residuals can be used to evaluate influential observations at different design levels. --Rphilip2004 08:33, 26 June 2012 (EDT)

## Chapter 11: Designing Multilevel Studies

### Unequal cluster sample sizes

Usually, we choose the same number n as the sample size of the micro-units and the same number N as the sample size of the macro-units. I want to know whether we can improve the power for testing or small standard errors through choosing the different number as the sample size in different groups? --~~~~ --Qiong Li 12:11, 27 June 2012 (EDT)

Comment: The main situation where it seems obvious to me that one would consider unequal cluster sizes by design would be to estimate a correlation parameter in the R matrix. --Georges 18:31, 28 June 2012 (EDT)

### Allocating treatment to groups or individuals

The author mentions that multisite trial are difficult to implement as the reseercher has to make sure that they will be no contamination between the two treatment conditions within the same site or cluster randomized control may lead to selection bias. The author propose the pseudo-cluster randomized trial.Explain how this method is performed. Is this used often in practice? If this technique fails, are there other methods to resolve those situations? --Gilbert8 15:04, 27 June 2012 (EDT)

### Treatment Allocation, Continued

Building upon Gilbert's question, discuss the differences between cluster randomized trials, multisite trials, and pseudo-cluster randomized trials. Do any of these strategies match the Schizophrenia dataset that we have been working with during Lab 2? --Msigal 18:51, 27 June 2012 (EDT)

Continuing Matt's question, about the differences between cluster randomized trials, multisite trials, and pseudo-cluster randomized trials, can you give us an example of when to pick one over the other along with the cost of each trial is broken down by the cost function. Which trial is more costly? --Dusvat 23:17, 27 June 2012 (EDT)

### Power Expecting Missingness/Drop Out

I think an interesting addition to the power analyses presented would be if we could work in estimates of drop-out/non-response within cluster. For example, if I estimated that there was only a 80% chance that I would actually get useful/complete data from each level-1 unit (lower if using the URPP ;) How might this be built into these analyses? Are there authors who have done work in this with mixed models? Is it really worth bothering with, given that power analysis is hard enough (e.g. coming up 'guesses' for estimates) as it is? --Smithce 09:40, 28 June 2012 (EDT) ...now that I think about it, if you approximate using constant loss across all clusters rather than trying to fool around with unbalance this is pretty easy. So, never mind.

Comment: Perhaps an easy answer but nevertheless a very good point.

### An Unknown Value of the Intraclass Correlation Coefficient

The Authors acknowledge that the ICC is an unknown quantity, but suggest that for the Social Sciences the value tends to lie within 0.0 to 0.4. These two values have very different properties and this is made clear in the plot on the page following (p.189). The question is, not as easily answered as plotting them all as we can see from the graph the values follow different patterns of divergence. An assumed value can lead to very different optimal estimates, especially if one is wrong at the extremes. Are there any better ways to estimate the ICC a priori in order to avoid issues related to optimizing the sample size. --Rbarnhar 09:55, 28 June 2012 (EDT)

### Normal Distribution

Relating to a similar question in the past, how important is it that the various levels are normally distributed when computing power estimates? --Rphilip2004 10:08, 28 June 2012 (EDT)

Comment: Have a look at this attempt to simulate normal errors and errors with a t distribution with 5 degrees of freedom. One conceptual problem is the concept of effect size. The t distribution with 5 dfs has a standard deviation of about 1.22. The problem is that, with its high kurtosis, your estimate of the standard deviation will tend to be lower, not in the sense of 'expectation' but in the sense of the 'typical' standard deviation. The question, then, is whether to define 'effect size' in terms of the standard deviation for the t or in the original metric. This script uses the standard deviation of the t. A quick look suggests that there isn't much change, just a slight drop in power at higher effect sizes. --Georges 18:24, 28 June 2012 (EDT)
Reply: Thanks! Also here is the same code utilizing multiple cores, in this case 4.

## Chapter 12: Other Methods and Models

### An Application of the BIC

As mentioned in this chapter, the BIC is a good indicator of model fit, which sets a penalty based upon the number of parameters in the model. It also seems easy to calculate based upon the typical summary objects from nlme and lme4. We learned in the last chapter about dealing with influential observations.

What I would like to know is: might it be appropriate to compare BICs from two models where the only difference between them is the removal of a set of observations deemed influential or problematic? Since there is no difference in the number of parameters, other measures of model fit seem inappropriate. In a practical application (working with a client), what would be the best way of approaching this situation? --Msigal 14:59, 30 June 2012 (EDT)

### Mixtures to Normal

Latent class mixture models are a non-parametric way to avoid, or lessen, the assumption of normality for the random coefficients, and can approximate any distribution as the number of classes is increased. How effectively can arbitrary distributions be modeled, and should this modeling technique be used verify that the normality of the random coefficients assumptions hold? --Rphilip2004 19:14, 1 July 2012 (EDT)

### Sandwich estimators for standard errors

It is mentioned that the researcher works with a misspecified model and a few reasons are given for why they do so. The Sandwich method is used to estimate standard error. Is the method always applicable for misspecified model? If the model is not misspecified, is the Sandwich estimator still used?--Gilbert8 14:02, 2 July 2012 (EDT)

In the sense that 'all models are wrong, but some are useful', are we not always using 'misspecified models'? When S&B talk about using a misspecified model intentionally, are they referring to cases in which either through statistical tests or diagnostics, we have evidence that the model fails in some regard? --Smithce 22:15, 2 July 2012 (EDT)

### BIC vs. AIC

The authors talk about how BIC is a good indicator of model fit. However we have seen in the R code the AIC is also very close to the BIC method and similar in numbers. Though the authors don't talk about AIC,I would like to know if there is a difference when comparing models with random parts, as opposed to the simple models. Which method is better AIC or BIC? Why?--Dusvat 21:37, 2 July 2012 (EDT)

### Alternative method GEE

The authors mentioned that GEE is an alternative method to handle the multilevel data. Some people prefer GEEs because they like to have a procedure that estimating parameters in the absence of assumptions for how the coefficients vary. However, from the GLM course, I know that GEEs can used to estimate the parameters of generalized linear model with a possible unknown correlation between outcomes. Thus, can we use GEEs to handle a wild class of dataset including multilevel data? --~~~~ --Qiong Li 21:52, 2 July 2012 (EDT)

### MCMC Convergence

The authors suggest that MCMC converges without the use of convergence diagnostics. The only related issue for consideration is time given possible dependence in the data. This would suggest that given enough time all MCMC models will converge. Is this true, and if so, does it in anyway indicate anything arbitrary/trivial or critically dependent upon prior (no pun intended) assumptions. --Rbarnhar 9:55, 3 July 2012 (EDT)

## Chapter 13: Imperfect Hierarchies

### The ICC in Multiple Membership Models

In the section on a two-level model with a crossed random factor, Snijders and Bosker discuss the various formulae for the ICC due to level (p. 209). However, in the section that builds upon this framework, where we have a multiple membership multiple classification model, there is no remark about the ICC at all. Does the calculation of the ICC change when we incorporate a multiple membership aspect to the hierarchy? --Msigal 18:24, 3 July 2012 (EDT)

### Multiple membership multiple classification models

In section 13.3, multiple membership, the author discuss about students who attends multiple school and living also in different neighborhood. A student is assigned a membership in each school attended and it is a membership weight. In the section 13.4, the author talks about when not only school membership is taken into consideration but also neighborhood membership. How will the weights be distributed in this case? Is it for school and neighborhood membership respectively or there will be a combined weight for both if applicable? --Gilbert8 20:39, 3 July 2012 (EDT)

What happens if a student is misclassified in a multiple membership multiple classification model? For example, suppose the student moves to a new neighborhood and a new school, but the new neighborhood and/or is reported incorrectly. What are the implications to the model? What happens to the weights?--Dusvat 20:35, 4 July 2012 (EDT)

### Example 13.1 Sustained primary school effects

In example 13.1 on page 207, we compare the results of a model without (model 1) and a model with the inclusion of primary school effects (model 2). We found that the average examination grade remains the same, but there are some changes in the variance components. The primary school has a variance 0.006. It is a very small value. However, the variance is significnatly different from 0. In this situation, can we say model 2 is a better model than model 1? --~~~~ --Qiong Li 11:29, 4 July 2012 (EDT)

### Crossed Random Effects and the Problem of Snowball Sampling

Snowball Sampling was mentioned last class and I suggested that this could be modeled as a crossed random effects model. Here now is my explanation and some added material to go with the chapter.

The example given was for Aids patients. Let us retain this idea, but more elaborately expand certain assumptions and hypotheticals to make this example understandable in our context.

Let us assume we are interested in following the health of certain Aids sufferers by tracking their CD4 cell counts. We are in need of a sample, but are not easily capable of obtaining one.

As a group of 3 medical researchers we have access to a subset of existing Aids patients. For ease let us say we have three patients with ascending health difficulties good, fair and poor health (the assumptions here is that people of similar health status will likely have more connections than those of other stages of disease). We provide these patients with a set of tickets that are traceable back to the original owner and inform our three patients that they are to go out and recruit others for the study from wherever they can.

In the meantime, we (the 3 doctors) return to our 3 treatment sites and prepare to undertake our study.

Once our recruiters have returned with enough patients we randomly assign each participant to a treatment site. That means we have 3 samples that are directly nested within treatment site, but within the sample we also have people who are nested within each respective snowball. However, the distribution of snowball members is not one-to-one within the treatment sites. Given the randomization, the snowball should be unrelated to the treatment site and this means that we have one direct simple nesting structure and a second nesting structure that is randomly distributed along with the first. This process of randomization is critical. If this randomization is unrelated to the snowball then no direct connection between the 2 level 2 structures needs to be accounted for.

Snijder's and Bosker have an image of exactly what I have just described on page 206. The only change here is to substitute school for treatment center and snowball for neighbourhood.

There is a very clever way to deal with this problem very effectively. A trick proposed by Goldstein (1987) is to do the following:

1. Consider the entire dataset as a pseudo level 3 unit where both the snowball and the treatment centers are nested. We will need a level 3 identifier for this purpose.
2. Treat either the treatment center or snowballs as the level 2 units and specify a random intercept.
3. For the factor not chosen at step 2, specify a level 3 random intercept for each level of the factor. This requires estimating a random coefficient for a dummy coded variable representing each level.

The result is that we wind up with two residual intraclass correlation coefficients.

For the Treatment Center We have:

$\rho _{treatment} = \tau _{t}^{2} / (\tau _{t}^2 + \tau _{s}^2 + \sigma ^2)$

For the Snowball We have:

$\rho _{snowball} = \tau _{s}^{2} / (\tau _{s}^2 + \tau _{t}^2 + \sigma ^2)$

The solution to evaluating the similarity of the Snowballs is then to look at the ICC for Snowballs and plot and evaluate the random effects to observe whether any are unusually large.

An example of how this is declared in STATA using the xtmixed procedure is the following.
xtmixed CD4 (predictors/covariates) || _all: R.Snowball || TreatmentSite: , (options)
The _all command declares the full dataset as level 3 identified, R.Snowball automatically creates the dummy codes for Snowball
An example of how this is declared in R using the lmer procedure is the following
(Do not quote me on this though as I am still shaky with R)
lmer(CD4~(predictors/covariates)+(1|TreatmentSite)+(1|Snowball))
Though I do not see a declaration for the level 3 identifier so this may not be exactly right.

Goldstein, H. (1987) Multilevel Covariance Components Models. Biometrika 74:430-431.

--Rbarnhar 9:53, 4 July 2012 (EDT)

### MCMC estimation

As we move through the book and discover new topics the authors are constantly noting that advanced models can be 'easily' estimated by MCMC methods, yet they provide no examples of this type of programming --- which I immediately interpret as being 'not that easy at all' (they provide code options for R, Stata, HLM, MLwiN, Mplus, and SAS, but nothing for MCMC estimation). How easy would it be to generate this type of code, and could we construct a moderately simple example that could be run in Bugs or WinBugs that couldn't be run in lme or lmer that could be used as an introduction to Bayesian estimation? --Rphilip2004 22:45, 4 July 2012 (EDT)

## Chapter 14: Survey Weights

### Model Accuracy and Design Weights

I found the idea presented on page 226-227 interesting. It is recommended that checks are made against the design weights to see if the model differs based upon them. Snijders and Bosker then recommend divvying up the level 2 units into two or three groups, and then split the level 1 units into two or three groups. The combination of the two divisions splits the data set into 4 to nine parts, each of which can be analyzed with the hierarchical linear model.

I have two concerns about this procedure: 1) How large can the discrepancies be between these parts be before we start becoming worried? and 2) At which point in the analysis would you want to do this? Does this procedure assume that we already have selected the "correct" or "true" model? --Msigal 08:40, 9 July 2012 (EDT)

### Exploring the informativeness of the sampling design

If the residuals are correlated with the design variables, then the sampling design is informative. In a informative sampling, the estimate of the parameters may be biased and inconsistent. The authors mentioned that (page 222)if we can be confident of working with a well-specified hierarchical linear model and the sample design is unrelated to the residuals, we can do the analysis and proceed as usual with estimating the hierarchical linear model. How to know a hierachical linear model is a well specificed model? --~~~~ --Qiong Li 12:20, 9 July 2012 (EDT)

On page 226, the author mentions that adding design variables to model of interest allows much clear and more satistifactory analysis than only knowing the weights when the design variables are available to the researcher. Does it mean that the analysis will be wrong if they are not available? --Gilbert8 18:54, 9 July 2012 (EDT)

Longitudinal data analysis Italic text

BLUE is best for resampling from the same school over and over again. The BLUP is best on average for resampling from the population of schools and students. How about the EBLUP? How the researcher chooses to use any of these? --Gilbert8 14:36, 9 July 2012 (EDT)

What is the difference between EBLUP and BLUP? Are they graphically the same? In the graphs on the slides, I didn't find any difference between BLUP and EBLUP. When do you choose one over the other? --Dusvat 22:15, 9 July 2012 (EDT)

### Weights, weights, weights .... What ones to use? .... What are the effects of my choice?

The use of survey weights for complex data has become entrenched in many social science branches. Much like the discussion provided by Snijder's and Bosker, there remains a belief that there are really two types of weights sampling or survey weights and precision weights. There are many other weights that one should consider with HLM and in some cases combinations of them may be more revealing. The use of them however, becomes extremely complicated, numerically intensive and with no real ability to know if they are being applied correctly the issue becomes murky. The problem that lies at the heart of the issue is that THERE IS NO WAY OF KNOWING WHICH MODEL IS MORE CORRECT - WEIGHTED OR UNWEIGHTED.

The choice of weights plays an extremely important role in multilevel models. In particular, the use of weights directly impacts the estimating equations depending upon the type of Likelihood inference being used. Two in particular are of interest as they dominate most software. Maximum Pseudo Maximum Likelihood (MPML) and Probability Weighted Iterative Generalized Least Squares (PWIGLS).

For my simple post here I am just highlighting a problem and not fully unpacking it.

For MPML we have the following:

${L(y)} = \Sigma _{i=1} ^{n(L)} \omega _{i}^{L} {L} _{i}^{L} ({y}_{i})$ - here L refers to the specific Level of the model

where the Likelihood of the observation is proportional to the weighted Likelihood by the scaled weight. This enters like a frequency weight enumerating/replicating the unit to its new value. Given this we should be aware then that the use of weights requires perfectly scaled weights for each level of representation. That means in longitudinal modelling the inverse probability weights should not be applied at anything other than the person specific level or cluster. This too is a problem though as it is not the correctly scaled weight. The weight is generally provided or deduced from a series of calculations each one assuming a specific relationship. That means most supplied weights and design schemes provide you with inappropriate weights for analysis. The effect of an incorrect weight is to create undesired bias in an undetectable and random direction.

For PWIGLS we have the following method: (Sorry but it is a bit too complicated for me to put the equation in here)

After taking the partial derivatives, the population quantities in the Fisher score function are replaced by the weighted sample statistics.

This procedure has shown to have fairly good properties in estimating the fixed effects parameters, but often fails to estimate the variance components effectively especially where the weights are informative. The flaw here then is that if the weights are informative we would want to use them, but we would not get good estimates of our hierarchical model. If the weights were not informative, then we would not want to bother with them anyway and just stick to regular Maximum Likelihood estimation.

The use of weights is terribly complex and far from being resolved any time soon. For the time being I suggest that the use of weights is not the most effective means of modelling in the presence of complex sampling. The best use for weight is in a diagnostic role.

Consider this one point, if the weights are uninformative and the model is correctly or even close to correctly specified the use of weight should change nothing except the Likelihood. Weights can help us better understand how well specified our model of interest is. Introducing weights can open insight to errors. However, one must be very careful to not let the weights themselves be the error. Know your weights and use them with caution.

One last thing about weights in the longitudinal context. The weight that should be given for a level 1 observation, level 2 observation and ascending level weights, likely need to be rescaled at each and every time point, especially when missing data is present. Attrition can be in part controlled for with weights or model parameters using a propensity for dropout weight or score, that helps model the changing inclusion probabilities. Current practices suggest that weights are generally scaled for the sampling method and are not modified across time. This raises many questions regarding the results of weighted longitudinal multilevel data analysis where inverse probability weights are the norm in the social sciences.

--Rbarnhar 20:11, 9 July 2012 (EDT)

### Example 14.1

In the example they say that "the observational and cross-sectional nature of the data precludes causal interpretations of the conclusions" Does this mean that the main interpretations are incorrect? if so, which ones and how are they affected by this cross-sectional nature exactly?--Dusvat 22:40, 9 July 2012 (EDT)

## Chapter 15: Longitudinal Data

### Variable occasion designs

It makes sense to consider models with function of time t when we analyze a longitudinal dataset. If the response is continuous, we can get some information of the function from the scatterplot between time t and the response. However, if the response is dummy variable, how to get some ideas of the function of time t in the model? --~~~~ --Qiong Li 16:13, 11 July 2012 (EDT)

To fit a random part of the model, one can use polynomial, splines or other functions where the covariates are fixed.If the covariates are not fixed (changing covariates), What will be the implications on our original model? Can we still use those functions to fit the random part in this case?--Gilbert8 18:31, 11 July 2012 (EDT)

### Contextual Variables in Longitudinal Designs

On the bottom of page 258, there is a note about adding a level 2 contextual variable to a longitudinal design. Up until now, this has meant taking the mean of the level 1 observations for a particular cluster (e.g. mean SES for the different schools). However, "in the longitudinal case, ... including this in the model would, ..., imply an effect from events that occur in the future because, at each moment except the last, the person mean of a changing explanatory variable depends on measurements to be made at a future point in time".

I have two questions about this. First, can we rephrase this to make it somewhat more meaningful? I'm not entirely clear on what it means as it is presently stated. Second, their answer to this is to include "not... the person mean but, rather, the employment situation at the start of the study period". Can we talk about what the data actually looks like for this design and how it could be modeled in R? Also, how would the model change if the covariate of job status had been continuous instead of categorical?

[For reference, this model has fixed effects for: age (55 through 60), birth year, birth year x age, job status at 55 (three levels, categorical), current job status (three levels, categorical).] --Msigal 17:51, 11 July 2012 (EDT)

### More on Contextual Variables

The use of polynomial or non-linear terms like time2 are often used to create or estimate curvature in linear models. When a longitudinal analysis has both time and other monotone increasing functions, their interactions, if theoretically interpretable and meaningful, can substitute for these non-linear terms.

An example would be an age by time interaction. When used with a contextual variable like mean age and a raw variable age, taken together, different aspects of the aging trajectory can be the dominant component at various times/ages.

So referring to Matt's point above from S & B about the mean of a variable across time being an event in the future, I suggest that on this point S & B are not entirely accurate, unless one is moving beyond the scope of the already observed data. Yet, even then only possibly accurate.

I argue there are time when S & B are not correct. If we are interested in the relations between numbers changing with meaning, then we need to understand their meaning, but also how numbers change without meaning. We need to consider a variety of interpretations and interrogate them with all extreme prejudice.

In the example I supplied, the mean age can be thought of as the hinge point around which the linear and non-linear components move together. This hinge then operates as an indicator of where in the lifecourse one is rather than simply one's mean age. The other two variables are contextualized and the mean age is not an element necessarily of the future, but of one's relative point in the lifecourse. Not an as of yet unobserved variable at the start of the study.

--Rbarnhar 22:22, 11 July 2012 (EDT)

### Autocorrelated Residuals

The the end of the chapter they talk about the fixed occasion design and how the assumption that the level-one residuals are independent can be relaxed and replaced by the assumption of first-order autocorrelation. Can you use this in variable occasion designs? How does this affect model? They also say that other covariance and correlation patterns are possible. Can you give other examples?--Dusvat 22:00, 11 July 2012 (EDT)

### That Darned Random Part

In Example 15.5 S&B demonstrate that the fully multivariate model has a significant deviance difference over the random slope model, but since the covariance matrix is visually similar they put it down to sample size. They prefer the random-slope model due to clearer interpretation of the random part of the model. Truth is most of the time I fit the random part to the best of my ability then otherwise ignore it! Though not particular to this chapter, if we have opportunity to discuss interpretation of the random part of HLM models a little more I'd appreciate it! --Smithce 09:29, 12 July 2012 (EDT)