# MATH 6627 2010-11 Practicum in Statistical Consulting/Assignment Teams/Diaconis

### From Wiki1

## Contents |

## Assignment 1

- GoogleVis
- examplecode

library(googleVis)

Motion=gvisMotionChart(Fruits, idvar="Fruit", timevar="Year", options=list(height=350, width=400))

- Display chart

plot(Motion)

- Create Google Gadget

cat(createGoogleGadget(Motion), file="motionchart.xml")

- Powerpoint Presentation
- File:GoogleVis presention.pdf GoogleVis presentation.

- Simpson's Paradox

- Example 1
- File:Assignment1 MATH6627.pdf Example 1 of Simpson's paradox with 3 categorical variables.

- Example 2
- File:Paradoxexample2.pdf Example 2 of Simpson's paradox with longitudinal data.

## Assignment 2

- Presentation Slides File:Assignment2 MATH6627.pdf

### Article Discussion ‘Purfect explanation for everything‘

The following questions about the article are discussed as follow:

#### 1

- Question
- whether the article suggest a causal relationship between two variables? If so which? Are the data observational or experimental?

- Discussion

Constance: Yes. The article suggests a causal relationship between a parasite called toxoplasma and specific behaviours in men versus women. It also suggests this parasite is responsible for a million traffic accidents worldwide. The data must be observational. The only way this data could be experimental would be to expose some people to the parasite and not others. This could be the case, but the article is not specific about it, and leads me to believe that the data is observational.

Bin Sun: This data is observational since the experimenter could not make some people infected by the parasite and other not.

#### 2

- Question
- Can you think of alternative explanations to causality? Confounding factors? Or explanations consistent with causality? Mediating factors?

- Discussion

Bin Sun: There could be a mediate factor that causes men and women different reaction to the parasite, like after infected by the parasite, men are more likely to be restless than women, and the factor of restless could cause withdrawn, suspicious, jealous, morose and dogmatic to men. Moreover, the factor of restless causes the traffic accident. What was said in the article is not that convincing at all to me. The cause-reaction relation between this particular parasite and the death on the road worldwide is ridiculous. Maybe this parasite would cause other diseases like mental illnesses which is the real causality to the prolonged the reaction time. Or there is a confounding factor which is the environmental problem. People that live in a urban drive more and could have more chance of car accident. And there are more chances that people infected by the parasite in the urban area.

Crystal: There maybe another confounding factors that the author didn’t take into account, the personality difference. To be affected by the parasite, the people should have close contact with the pets like cat. The women who have reckless, friendly personality are more likely to have cats as pet. The men who have the personality of quiet, withdrawn, suspicious, jealous, morose and dogmatic prefer cats more that dog.

Crystal: As bin explained, there maybe a mediate factor like some hormone level which is different between man and woman. Being affected by the parasite cause the hormone level react differently, and that cause different behaviours between man and woman.

#### 3

- Question
- Have any confounding factors been accounted for in the analysis?

- Discussion

Constance: Very little information about the actual research design is disclosed in the article, and they do not mention how the variables were measured, and nothing about other variables measured is discussed.

Bin Sun: If they consider the confounding factors, I doubted they would still have such conclusions in the article.

#### 4

- Question
- Have any mediating factors been controlled for in a way that vitiates a causal interpretation of the relationship?

- Discussion

Bin Sun: It is not mentioned in the article that they controlled the mediate factor. And they did not mentioned how they matched the control group and case group. In this case, there might be a lot lurking variables in the study.

#### 5

- Question
- What is your personal assessment of the evidence for causality in the study that is the subject of the article?

- Discussion

Bin Sun: As I see, the statement made in the article is not a professional statistical conclusion. But you can read it to have fun. And keep in mind that avoiding to be infected which could be serious in some sense.

Hangjing: After I read the article, I have to say I can't believe the statement of the article, because the statement is lack of scientific and specific proof.

Crystal: Although this is not an academic article, I still think the author should explain more on how he got the conclusion. The article may cause misleading within the readers.

### Paradoxes and Fallacies

#### 1.

- Question
- Suppose you are studying how some measure of health is related to weight. You are looking at a regression of health on height and weight but you observe that what you are really interested in is the relationship between health and excess weight relative to height. What you should do is to compute the residuals of weight on height and replace weight in the model with this new variable. The resulting coefficient of 'excess weight' will give a better estimate of the effect of excess weight.

- Discussion

Crystal: False. The traditional residual plot against a new variable does not show a strong pattern and we might be tempted to stick with our model showing no significant relationship between Height and Weight. If we explore all possible plots of residuals against linear combinations of weight and height, the one that produces maximum correlation is the added variable plot. In this case weight has a significant negative coefficient.

#### 4.

- Question
- A survey of Canadian families yielded average 'equity' (i.e. total owned in real estate, bonds, stocks, etc. minus total owed) of $48,000. Aggregate government data of the total equity in the Canadian population shows that this figure must be much larger, in fact more than three times as large. This shows that respondents much tend to dramatically underreport their equity.

- Discussion

Bin Sun: False. Consider the huge gap between these two number of 'equity', the sampling in this survey is questionable and biased. The reason why the 'equity' is so low could be that it samples the population who tends to have low equity due to some factors. Or the survey was done in a special time like the economic depression period during which people sold their properties and stocks.

#### 7.

- Question
- In a multiple regression, if you drop a predictor whose effect is not significant, the coefficients of the other predictors should not change very much.

- Discussion

Constance: It depends. If the non-significant predictor isn't correlated with the other predictors in the regression model, then it is true. However, if it correlated with the other predictors, the effects of the predictors left in the model will change.

#### 10.

- Question
- In a regression model with two predictors X1 and X2, and an interaction term between the two predictors, we know that it is dangerous to interpret the `main' effects of X1 and X2 when the model includes an interaction term but it is safe to do so provided the interaction term is not significant.

- Discussion

Constance: True. The main effects are often "confounded" by the presence of a significant interaction. However, if X1 and X2 do not interact, then it is appropriate to discuss the main effects.

#### 13.

- Question
- In general, a variable cannot be a ‘confounding factor’ for the effect of another variable unless they are correlated with each other.

- Discussion

Constance: I think this is true. A variable confounds the effects of another variable if they share variability, and thus are correlated.

## Assignment 3

1) As we saw, 'Sector' appears to be an important predictor. Consider models using ses and Sector. Aim to estimate the between Sector gap as a function of ses # if there is an interaction between Sector and ses # Check for and provide for a possible contextual effect of ses. # Plot expected math achievement in each sector. # Plot the gap with SEs. # Consider the possibility that the apparently flatter effect of # ses in Catholic school could be due to a non-linear effect of ses. # How would you test whether this is a reasonable alternative # explanation? #

Bin Sun: /R code The model I built for measuring the effects of ses and sector is fit4<-lme( mathach ~ ses * Sector+cvar(ses,id), dd, random = ~ 1+Sector | id), the interaction term between Sector and ses is significant.

- Plot expected math achievement in each sector.Plot the gap with SEs.

- I use the scatter plot to see the nonlinear effect of ses and I have the plot as follows:

2) Take the example further by incorporating Sex. Consider the # the 'contextual effect' of Sex which is school sex composition. # Note that there are three types of schools: Girls, Boys and Coed # schools. If you consider an interaction between Sector and # school gender composition, you will see that the Public Sector # only has Coed schools. What is the consequence of this fact for # modelling sex composition and Sector effects. # #

Bin Sun:

dd$sex.p <- with(dd, capply( Sex == "Female", school, mean)) fit6<- lme( mathach ~ sex.p*Sector+ses, dd, random = ~ 1+ses | id ) plot( fit6 )

- Create a contextual variable from sex which indicated the proportion of female in each of the school, the contextual variable is a level 2 variable.

- The diagnosis plot shows that the residual is not normally distributed.

- I guess the consequence of the fact that the Public Sector only has Coed schools is that the sex.p for public school does not have extreme values, but catholic school does, the model can not predict the math achievement for the only female public school or only male public school.

3) Does it appear that boys are better off in a boy's school and girls # in a girl's school or are they better off in coed schools? How would # you qualify your findings so parents don't misinterpret them in making # decisions for their children?

Crystal: Firstly, let us consider the math achievement for girls at all-girls school and coed school.The codes 'boy','coboy','cogirl' and 'girl' stands for math achievement of boys at all-boy school, boys at coed school, girls at coed school and girls at all-girl school. And the following table gives the means and ranges for the four types stated above.

boy coboy cogirl girl MathAch 15.34438 13.04391 11.69392 12.68712 range [-1.853,24.993] [-2.832,24.993] [-2.832,24.993] [-2.832,24.993]

Then take a look at 'xyplot' for those four types of situations which gives us an idea intuitively.

From the above result, it seems girls at all-girl school are doing better than girls who are at coed school according to the average, and obviously they have the same range, moreover the above plot tells us the proportion of low achievement girls at all-girl schools is smaller than that at coed schools. Moreover, this assertion also applies to boy and coboy. Then we need a more accurate procedure to illustrate this.

Then we fit a mixed model which is conditional on school.

Therefore we get the following results:

Linear mixed-effects model fit by REML

Data: dd AIC BIC logLik 47050.23 47091.5 -23519.11 Random effects: Formula: ~1 | School (Intercept) Residual StdDev: 2.731431 6.232728 Fixed effects: MathAch ~ 1 + Type Value Std.Error DF t-value p-value (Intercept) 15.645749 0.6638950 7023 23.566602 0.0000 Typecoboy -2.782810 0.7195557 7023 -3.867400 0.0001 Typecogirl -4.083225 0.7183162 7023 -5.684439 0.0000 Typegirl -3.026572 0.9467605 158 -3.196766 0.0017 Correlation: (Intr) Typcby Typcgr Typecoboy -0.923 Typecogirl -0.924 0.971 Typegirl -0.701 0.647 0.648 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -3.17340475 -0.74779623 0.03514232 0.76758192 2.62788617 Number of Observations: 7185 Number of Groups: 160

Coding boy= 0, all those coe±cients are strongly significant since all those p-values are less than 0.01 conditional on 'School'. Then we believe there must be some di®erence between girls at all-girl school and at coed school and all those four groups have different achievements. Moreover, girls at all-girl school are significantly coef(fit2)[1,4]-coef(fit2)[1,3] = 4.083225-3.026572 = 1.056653 higher than that at coed school and boys at all-boy school are signiffcantly 2.782810 higher than that at coed school. The following plot gives us an idea directly the achievements of the four groups where the baseline is boy achievement at all-boy school:

Then if we conditional on 'Sector', proceed a similar procedure, we can easily get:

Linear mixed-effects model fit by REML

Data: dd AIC BIC logLik 47685.76 47727.04 -23836.88 Random effects: Formula: ~1 | Sector (Intercept) Residual StdDev: 2.162514 6.672983 Fixed effects: MathAch ~ 1 + Type Value Std.Error DF t-value p-value (Intercept) 13.818392 1.5486162 7180 8.923058 0.0000 Typecoboy -0.193560 0.2940287 7180 -0.658304 0.5104 Typecogirl -1.585568 0.2893400 7180 -5.479946 0.0000 Typegirl -2.657254 0.3103628 7180 -8.561766 0.0000 Correlation: (Intr) Typcby Typcgr Typecoboy -0.140 Typecogirl -0.141 0.803 Typegirl -0.105 0.552 0.561 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -2.69486910 -0.77833388 0.02954596 0.80236262 2.14089549 Number of Observations: 7185 Number of Groups: 2

Again coding boy= 0, all coe±cients except 'coboy'are strongly signifcant since all those p-values are less than 0.01 conditional on 'Sector'. Then we believe there must be some difference between girls at all-girl school and at coed school and all those four groups have di®erent achievements. However, girls at all-girl school are significantly coef(fit3)[1,3]-coef(fit3)[1,4] = 2.657254- 1.585568 = 1.071686 lower than that at coed school and boys at all-boy school are almost the same as that at coed school since its coefficient are not significant which is quite different from the previous case if conditional on 'School'. The following plot gives us an idea directly the achievements of the four groups where the baseline is boy achievement at all-boy school. Those lines are(from top to bottom):'boy','coboy','cogirl' and 'girl'. It seems girls at all-girl school are doing worst. It is interesting to see that 'girl' and 'cogirl' reverse in these two cases. One thing need to mention is that boys are better than girls in these two cases. From the above two analysis, it is hard to say which one is doing a better job. Then we just consider the simple linear regression:

fit4 = lm(MathAch ~ Type,dd),

which gives us

Call: lm(formula = MathAch ~ Type, data = dd) Residuals: Min 1Q Median 3Q Max -17.1974 -5.3279 0.4081 5.4811 13.2991 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 15.3444 0.2282 67.243 <2e-16 *** Typecoboy -2.3005 0.2654 -8.668 <2e-16 *** Typecogirl -3.6505 0.2615 -13.962 <2e-16 *** Typegirl -2.6573 0.3156 -8.421 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 6.785 on 7181 degrees of freedom Multiple R-Squared: 0.02743, Adjusted R-squared: 0.02702 F-statistic: 67.5 on 3 and 7181 DF, p-value: < 2.2e-16

Then we continue the analysis with the same idea, from the overall model, we can tell for sure there are some discrepancy among those four types.Then we conclude there are some difference between girls at all-girl school and at coed school and all those four groups have different achievements. However, girls at all-girl school are signifcantly coef(fit4)[4]-coef(fit4)[3] = 3.6505.2.6573 = 0.9931985, which is also the difference between the two means, higher than that at coed school.

Summarize the above the analysis, we conclude overall, girls at all-girl school is doing better but it can change if we conditional on some other variables.

Secondly, we consider the relationship between 'SES' and 'math achievement' the same among girls and boys. Then look at a scatter plot which is grouped by gender: 'Female' is more dense than 'Male' in the low achievement part, which indicates 'Male' could be higher.

Then we fit models as following together with the results

Female.list <- lmList( MathAch ~ SES | Sex, data = dd,subset = Sex == 'Female')

Call: Model: MathAch ~ SES | Sex Data: dd Coefficients: (Intercept) Estimate Std. Error t value Pr(>|t|) 12.1082290 0.1002315 120.8026585 0.0000000 SES Estimate Std. Error t value Pr(>|t|) 3.2248161 0.1283816 25.1189880 0.0000000 Residual standard error: 6.162026 on 3793 degrees of freedom Male.list <- lmList( MathAch ~ SES | Sex, data = dd,subset = Sex == 'Male') Call: Model: MathAch ~ SES | Sex Data: dd Coefficients: (Intercept) Estimate Std. Error t value Pr(>|t|) 13.474891 0.113909 118.295225 0.000000 SES Estimate Std. Error t value Pr(>|t|) 3.0112488 0.1464696 20.5588646 0.0000000 Residual standard error: 6.614931 on 3388 degrees of freedom.

The above result show there is some discrepancy between the two lines and they are not parallel in the following plot but not too much difference:

We obtain the following result by just fitting a simple linear model:

Call: lm(formula = MathAch ~ SES + Sex + SES * Sex, data = dd) Residuals: Min 1Q Median 3Q Max -19.1736 -4.7287 0.2765 5.0472 16.5869 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12.1082 0.1038 116.681 <2e-16 *** SES 3.2248 0.1329 24.262 <2e-16 *** SexMale 1.3667 0.1511 9.043 <2e-16 *** SES:SexMale -0.2136 0.1940 -1.101 0.271 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since 'SES:SexMale' is insignificant, we can say the relationship between 'math achievement' and 'SES' are almost the same.

**Appendix: /code for example 3**

4) Is a low ses child better off in a high ses school or are they better # off in a school of a similar ses? How about a high ses child in a low # ses school? How would you qualify your findings so parents don't # misinterpret them in making decisions for their children?

Crystal:

We need to consider the combination of school type, gender and the ses score.

- 1 Compare “poor” girl in relatively high SES, public and mixed school with the one in relatively high SES, Catholic and girl school

- 2 Compare “poor” girl in relatively high SES, public and mixed school with the one in relatively high SES, public and girl school

- 3 Compare “rich” girl in relatively high SES, public and mixed school with the one in relatively high SES, Catholic and mixed school

- 4 Compare “rich” girl in relatively low SES, public and mixed school with the one in relatively low SES,Catholic and girl school

- 5 Compare “poor” boy in relatively low SES, public and mixed school with the one in relatively low SES, public and boy school

- 6 Compare “poor” boy in relatively low SES, public and mixed school with the one in relatively low SES, Catholic and mixed school

- 7 Compare “rich” boy in relatively low SES, public and mixed school with the one in relatively low SES, public and boy school

- 8 Compare “rich” boy in relatively low SES, public and mixed school with the one in relatively low SES, Catholic and mixed school

If we add Sector, then R returns an error. So, we only consider id.1 and id.2. Now, we have to consider the following two cases:

- 1 Compare “poor” girl in relatively high SES and mixed school with the one in relatively high SES, and girl school

- 2 Compare “poor” girl in relatively low SES and mixed school with the one in relatively low SES and girl school

The results are as following:

numDF denDF F.value p.value 1 2 102 4.808326 0.0101 Estimate Std.Error DF t-value p-value Lower 0.95 Upper 0.95 1 3.018529 1.015558 102 2.972286 0.00369 1.004175 5.032884 2 1.097838 1.168964 102 0.939154 0.34987 -1.220797 3.416473

The first comparison gives pvalue 0.00369, which is significant. So,based on 1.097838, we say “poor” girl in relatively high SES and girl school have better performance than the one in relatively high SES, and mixed school

- 3 Compare “rich” girl in relatively high SES and mixed school with the one in relatively high SES, and girl school

- 4 Compare “rich” girl in relatively low SES and mixed school with the one in relatively low SES and girl school

Estimate Std.Error DF t-value p-value Lower 0.95 Upper 0.95 1 76.442538 45.806786 95 1.668804 0.09845 -14.495430 167.380506 2 -0.632692 0.660615 95 -0.957732 0.34063 -1.944179 0.678794

Pvalue 0.09845 is not very significant, and anothe pvalue 0.34063 is not significant. So we say for the “rich” girl, the two combinated cases 2.1 and 2.2 do not have siginicant difference.

- 5 Compare “poor” boy in relatively high SES and mixed school with the one in relatively high SES, and boy school

- 6 Compare “poor” boy in relatively low SES and mixed school with the one in relatively low SES and boy school

numDF denDF F.value p.value 1 2 96 9.253682 0.00021 Estimate Std.Error DF t-value p-value Lower 0.95 Upper 0.95 1 5.522658 22.417706 96 0.246352 0.80594 -38.976138 50.02145 2 -4.234878 1.074526 96 -3.941158 0.00015 -6.367796 -2.10196

The second comparison is significant based on the pvalue 0.00015.So, based on -4.234878, we say “poor” boy in relatively low SES and boy school have better performance better than the one in relatively low SES and mixed school.

- 7 Compare “rich” boy in relatively high SES and mixed school with the one in relatively high SES, and boy school

- 8 Compare “rich” boy in relatively low SES and mixed school with the one in relatively low SES and boy school

Value Std.Error DF t-value p-value (Intercept) 12.979318 1.3330135 337 9.736824 0.0000 ses 3.673740 0.9297412 337 3.951358 0.0001 id.2 0.240477 2.6971291 104 0.089160 0.9291 SES.School 2.522787 0.8314768 104 3.034103 0.0030 ses:id.2 -1.664814 1.9298626 337 -0.862660 0.3889

Correlation:

wald( fit.c, L)

numDF denDF F.value p.value 1 2 137 4.562654 0.01207 Estimate Std.Error DF t-value p-value Lower 0.95 Upper 0.95 1 3.898822 20.714629 137 0.188216 0.85099 -37.062931 44.860576 2 -1.510737 0.506364 137 -2.983502 0.00337 -2.512036 -0.509438

The pvalue 0.00337 is significant, so the second comparison is significant. Based on -1.664814, We say “rich” boy in relatively low SES and boy school have better performance than the one in relatively low SES and mixed school

- Conclusion:

“poor” girl in relatively high SES and girl school have better performance than the one in relatively high SES, and mixed school

“rich” girl does not have significant difference

“poor” boy in relatively low SES and boy school have better performance better than the one in relatively low SES and mixed school.

“rich” boy in relatively low SES and boy school have better performance than the one in relatively low SES and mixed school.

**Appendix: /code for example 4**

5) Is a minority status child better off in a school with a higher # proportion of minority status children or are they better # off in a school with a low proportion? How would you qualify # your findings so parents don't misinterpret them in making # decisions for their children?

Bin Sun:

- This this the plot of minority minus majority gap as a function of SES.
- As the plot show, for a majority status kid, it depends, if the kid has low ses, it is better send the kid to the high majority *school;on the other hand, if the kid has high ses, then it is better send the kid to the high proportion of minority school.

- If the model we consider is more complicated, and we have the math achievement prediction like the following plot for the majority kid. The decision of the parents make should be more careful because they need to consider other contextual variables.

The R code is from the exercise2 except that there is a minor change of two places which are the name changing from coefp, coefm to U2 and L2:

xyplot( coef + U2+ L2 ~ ses, dwgap, type = 'l', lwd = 2, ylab = "Gap in Math Achievement (+/- 2 SEs)", xlab = "SES", xlim = c(-2,2), scale = list( x = list( at = seq(-2,2,.5))), sub = 'Minority minus Majority gap as a function of SES', panel = function(x, y, ...) { panel.superpose(x,y,...) panel.abline(v=seq(-1.5,1.5,.5), col = 'gray') panel.abline(h=seq( -6,0, 1), col = 'gray') })

predc$grps <- with( predc, paste( ifelse(Minority=="Yes", "Min: ", "Maj: "), "ses = ", ses.mean, sep = "")) predc$labs <- as.character( 1:nrow(predc)) td( col = c('red','blue','red','blue') , lty = c(1,1,2,2), lwd = 2) xyplot( mathach ~ ses ,predc, groups = grps , type = 'l', auto.key = list(columns = 2, lines = T, points = F), panel = panel.superpose, panel.groups = function(x, y, subscripts, col, cex, ...) { panel.lines( x, y, ... ) panel.text( x, y, labels = predc$labs[subscripts], col = 'black', cex = 1.2 ,...) } )