MATH 6627 2010-11 Practicum in Statistical Consulting/Assignment Teams/Diaconis

From Wiki1

Jump to: navigation, search

Contents

Assignment 1

GoogleVis
examplecode

library(googleVis)

Motion=gvisMotionChart(Fruits, idvar="Fruit", timevar="Year", options=list(height=350, width=400))

  1. Display chart

plot(Motion)

  1. Create Google Gadget

cat(createGoogleGadget(Motion), file="motionchart.xml")

Powerpoint Presentation
File:GoogleVis presention.pdf GoogleVis presentation.
Simpson's Paradox
Example 1
File:Assignment1 MATH6627.pdf Example 1 of Simpson's paradox with 3 categorical variables.
Example 2
File:Paradoxexample2.pdf Example 2 of Simpson's paradox with longitudinal data.

Assignment 2

Presentation Slides File:Assignment2 MATH6627.pdf

Article Discussion ‘Purfect explanation for everything‘

The following questions about the article are discussed as follow:

1

Question
whether the article suggest a causal relationship between two variables? If so which? Are the data observational or experimental?
Discussion

Constance: Yes. The article suggests a causal relationship between a parasite called toxoplasma and specific behaviours in men versus women. It also suggests this parasite is responsible for a million traffic accidents worldwide. The data must be observational. The only way this data could be experimental would be to expose some people to the parasite and not others. This could be the case, but the article is not specific about it, and leads me to believe that the data is observational.

Bin Sun: This data is observational since the experimenter could not make some people infected by the parasite and other not.

2

Question
Can you think of alternative explanations to causality? Confounding factors? Or explanations consistent with causality? Mediating factors?
Discussion

Bin Sun: There could be a mediate factor that causes men and women different reaction to the parasite, like after infected by the parasite, men are more likely to be restless than women, and the factor of restless could cause withdrawn, suspicious, jealous, morose and dogmatic to men. Moreover, the factor of restless causes the traffic accident. What was said in the article is not that convincing at all to me. The cause-reaction relation between this particular parasite and the death on the road worldwide is ridiculous. Maybe this parasite would cause other diseases like mental illnesses which is the real causality to the prolonged the reaction time. Or there is a confounding factor which is the environmental problem. People that live in a urban drive more and could have more chance of car accident. And there are more chances that people infected by the parasite in the urban area.

Crystal: There maybe another confounding factors that the author didn’t take into account, the personality difference. To be affected by the parasite, the people should have close contact with the pets like cat. The women who have reckless, friendly personality are more likely to have cats as pet. The men who have the personality of quiet, withdrawn, suspicious, jealous, morose and dogmatic prefer cats more that dog.

Crystal: As bin explained, there maybe a mediate factor like some hormone level which is different between man and woman. Being affected by the parasite cause the hormone level react differently, and that cause different behaviours between man and woman.


3

Question
Have any confounding factors been accounted for in the analysis?
Discussion

Constance: Very little information about the actual research design is disclosed in the article, and they do not mention how the variables were measured, and nothing about other variables measured is discussed.

Bin Sun: If they consider the confounding factors, I doubted they would still have such conclusions in the article.

4

Question
Have any mediating factors been controlled for in a way that vitiates a causal interpretation of the relationship?
Discussion

Bin Sun: It is not mentioned in the article that they controlled the mediate factor. And they did not mentioned how they matched the control group and case group. In this case, there might be a lot lurking variables in the study.

5

Question
What is your personal assessment of the evidence for causality in the study that is the subject of the article?
Discussion

Bin Sun: As I see, the statement made in the article is not a professional statistical conclusion. But you can read it to have fun. And keep in mind that avoiding to be infected which could be serious in some sense.

Hangjing: After I read the article, I have to say I can't believe the statement of the article, because the statement is lack of scientific and specific proof.

Crystal: Although this is not an academic article, I still think the author should explain more on how he got the conclusion. The article may cause misleading within the readers.

Paradoxes and Fallacies

1.

Question
Suppose you are studying how some measure of health is related to weight. You are looking at a regression of health on height and weight but you observe that what you are really interested in is the relationship between health and excess weight relative to height. What you should do is to compute the residuals of weight on height and replace weight in the model with this new variable. The resulting coefficient of 'excess weight' will give a better estimate of the effect of excess weight.
Discussion

Crystal: False. The traditional residual plot against a new variable does not show a strong pattern and we might be tempted to stick with our model showing no significant relationship between Height and Weight. If we explore all possible plots of residuals against linear combinations of weight and height, the one that produces maximum correlation is the added variable plot. In this case weight has a significant negative coefficient.

4.

Question
A survey of Canadian families yielded average 'equity' (i.e. total owned in real estate, bonds, stocks, etc. minus total owed) of $48,000. Aggregate government data of the total equity in the Canadian population shows that this figure must be much larger, in fact more than three times as large. This shows that respondents much tend to dramatically underreport their equity.
Discussion

Bin Sun: False. Consider the huge gap between these two number of 'equity', the sampling in this survey is questionable and biased. The reason why the 'equity' is so low could be that it samples the population who tends to have low equity due to some factors. Or the survey was done in a special time like the economic depression period during which people sold their properties and stocks.

7.

Question
In a multiple regression, if you drop a predictor whose effect is not significant, the coefficients of the other predictors should not change very much.
Discussion

Constance: It depends. If the non-significant predictor isn't correlated with the other predictors in the regression model, then it is true. However, if it correlated with the other predictors, the effects of the predictors left in the model will change.

10.

Question
In a regression model with two predictors X1 and X2, and an interaction term between the two predictors, we know that it is dangerous to interpret the `main' effects of X1 and X2 when the model includes an interaction term but it is safe to do so provided the interaction term is not significant.
Discussion

Constance: True. The main effects are often "confounded" by the presence of a significant interaction. However, if X1 and X2 do not interact, then it is appropriate to discuss the main effects.

13.

Question
In general, a variable cannot be a ‘confounding factor’ for the effect of another variable unless they are correlated with each other.
Discussion

Constance: I think this is true. A variable confounds the effects of another variable if they share variability, and thus are correlated.

Assignment 3

1) As we saw, 'Sector' appears to be an important predictor.
Consider models using ses and Sector.
Aim to estimate the between Sector gap as a function of ses
     #     if there is an interaction between Sector and ses
     #     Check for and provide for a possible contextual effect of ses.
     #     Plot expected math achievement in each sector.
     #     Plot the gap with SEs.
     #     Consider the possibility that the apparently flatter effect of
     #     ses in Catholic school could be due to a non-linear effect of ses.
     #     How would you test whether this is a reasonable alternative
     #     explanation?
     #

Bin Sun: /R code The model I built for measuring the effects of ses and sector is fit4<-lme( mathach ~ ses * Sector+cvar(ses,id), dd, random = ~ 1+Sector | id), the interaction term between Sector and ses is significant.

  • Plot expected math achievement in each sector.Plot the gap with SEs.

Gap.jpg

  • I use the scatter plot to see the nonlinear effect of ses and I have the plot as follows:

Scatter.jpg

2) Take the example further by incorporating Sex. Consider the
     #     the 'contextual effect' of Sex which is school sex composition.
     #     Note that there are three types of schools: Girls, Boys and Coed
     #     schools. If you consider an interaction between Sector and
     #     school gender composition, you will see that the Public Sector
     #     only has Coed schools. What is the consequence of this fact for
     #     modelling sex composition and Sector effects.
     #
     # 

Bin Sun:

  dd$sex.p <- with(dd, capply( Sex == "Female", school, mean))            
                fit6<- lme( mathach ~ sex.p*Sector+ses, dd, random = ~ 1+ses | id ) 
     plot( fit6 )
  • Create a contextual variable from sex which indicated the proportion of female in each of the school, the contextual variable is a level 2 variable.

Diag.jpg

  • The diagnosis plot shows that the residual is not normally distributed.
  • I guess the consequence of the fact that the Public Sector only has Coed schools is that the sex.p for public school does not have extreme values, but catholic school does, the model can not predict the math achievement for the only female public school or only male public school.
3) Does it appear that boys are better off in a boy's school and girls
     #     in a girl's school or are they better off in coed schools? How would
     #     you qualify your findings so parents don't misinterpret them in making
     #     decisions for their children?

Crystal: Firstly, let us consider the math achievement for girls at all-girls school and coed school.The codes 'boy','coboy','cogirl' and 'girl' stands for math achievement of boys at all-boy school, boys at coed school, girls at coed school and girls at all-girl school. And the following table gives the means and ranges for the four types stated above.

                boy           coboy           cogirl          girl
MathAch      15.34438        13.04391        11.69392       12.68712
range    [-1.853,24.993] [-2.832,24.993] [-2.832,24.993] [-2.832,24.993]

Then take a look at 'xyplot' for those four types of situations which gives us an idea intuitively. P1.jpg

From the above result, it seems girls at all-girl school are doing better than girls who are at coed school according to the average, and obviously they have the same range, moreover the above plot tells us the proportion of low achievement girls at all-girl schools is smaller than that at coed schools. Moreover, this assertion also applies to boy and coboy. Then we need a more accurate procedure to illustrate this.

Then we fit a mixed model which is conditional on school.

Therefore we get the following results:

Linear mixed-effects model fit by REML


Data: dd
      AIC       BIC      logLik
   47050.23   47091.5   -23519.11

Random effects:
 Formula: ~1 | School
           (Intercept)   Residual
StdDev:     2.731431      6.232728

Fixed effects: MathAch ~ 1 + Type
                 Value   Std.Error   DF    t-value     p-value
(Intercept)   15.645749  0.6638950  7023  23.566602    0.0000
Typecoboy     -2.782810  0.7195557  7023  -3.867400    0.0001
Typecogirl    -4.083225  0.7183162  7023  -5.684439    0.0000
Typegirl      -3.026572  0.9467605  158   -3.196766    0.0017
Correlation:
            (Intr)  Typcby  Typcgr
Typecoboy   -0.923
Typecogirl  -0.924  0.971
Typegirl    -0.701  0.647   0.648

Standardized Within-Group Residuals:
     Min           Q1           Med          Q3           Max
-3.17340475   -0.74779623   0.03514232   0.76758192   2.62788617
Number of Observations: 7185
Number of Groups: 160

Coding boy= 0, all those coe±cients are strongly significant since all those p-values are less than 0.01 conditional on 'School'. Then we believe there must be some di®erence between girls at all-girl school and at coed school and all those four groups have different achievements. Moreover, girls at all-girl school are significantly coef(fit2)[1,4]-coef(fit2)[1,3] = 4.083225-3.026572 = 1.056653 higher than that at coed school and boys at all-boy school are signiffcantly 2.782810 higher than that at coed school. The following plot gives us an idea directly the achievements of the four groups where the baseline is boy achievement at all-boy school:

P2.jpg

Then if we conditional on 'Sector', proceed a similar procedure, we can easily get:

Linear mixed-effects model fit by REML

Data: dd
     AIC        BIC      logLik
  47685.76   47727.04   -23836.88

Random effects:
 Formula: ~1 | Sector
        (Intercept)    Residual
StdDev:  2.162514      6.672983

Fixed effects: MathAch ~ 1 + Type
                Value    Std.Error    DF    t-value     p-value
(Intercept)   13.818392  1.5486162   7180   8.923058    0.0000
Typecoboy     -0.193560  0.2940287   7180   -0.658304   0.5104
Typecogirl    -1.585568  0.2893400   7180   -5.479946   0.0000
Typegirl      -2.657254  0.3103628   7180   -8.561766   0.0000
Correlation:
            (Intr)  Typcby  Typcgr
Typecoboy   -0.140
Typecogirl  -0.141  0.803
Typegirl    -0.105  0.552   0.561

Standardized Within-Group Residuals:
    Min             Q1          Med          Q3          Max
-2.69486910   -0.77833388   0.02954596   0.80236262   2.14089549
Number of Observations: 7185
Number of Groups: 2

Pic3.jpg

Again coding boy= 0, all coe±cients except 'coboy'are strongly signifcant since all those p-values are less than 0.01 conditional on 'Sector'. Then we believe there must be some difference between girls at all-girl school and at coed school and all those four groups have di®erent achievements. However, girls at all-girl school are significantly coef(fit3)[1,3]-coef(fit3)[1,4] = 2.657254- 1.585568 = 1.071686 lower than that at coed school and boys at all-boy school are almost the same as that at coed school since its coefficient are not significant which is quite different from the previous case if conditional on 'School'. The following plot gives us an idea directly the achievements of the four groups where the baseline is boy achievement at all-boy school. Those lines are(from top to bottom):'boy','coboy','cogirl' and 'girl'. It seems girls at all-girl school are doing worst. It is interesting to see that 'girl' and 'cogirl' reverse in these two cases. One thing need to mention is that boys are better than girls in these two cases. From the above two analysis, it is hard to say which one is doing a better job. Then we just consider the simple linear regression:

fit4 = lm(MathAch ~ Type,dd),

which gives us

Call:
lm(formula = MathAch ~ Type, data = dd)

Residuals:
   Min        1Q     Median     3Q        Max
-17.1974   -5.3279   0.4081   5.4811   13.2991

Coefficients:
               Estimate   Std. Error   t value    Pr(>|t|)
(Intercept)    15.3444      0.2282     67.243     <2e-16 ***
Typecoboy      -2.3005      0.2654     -8.668     <2e-16 ***
Typecogirl     -3.6505      0.2615     -13.962    <2e-16 ***
Typegirl       -2.6573      0.3156      -8.421    <2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.785 on 7181 degrees of freedom
Multiple R-Squared: 0.02743, Adjusted R-squared: 0.02702
F-statistic: 67.5 on 3 and 7181 DF, p-value: < 2.2e-16

Pic4.jpg

Then we continue the analysis with the same idea, from the overall model, we can tell for sure there are some discrepancy among those four types.Then we conclude there are some difference between girls at all-girl school and at coed school and all those four groups have different achievements. However, girls at all-girl school are signifcantly coef(fit4)[4]-coef(fit4)[3] = 3.6505.2.6573 = 0.9931985, which is also the difference between the two means, higher than that at coed school.

Summarize the above the analysis, we conclude overall, girls at all-girl school is doing better but it can change if we conditional on some other variables.

P5.jpg

Secondly, we consider the relationship between 'SES' and 'math achievement' the same among girls and boys. Then look at a scatter plot which is grouped by gender: 'Female' is more dense than 'Male' in the low achievement part, which indicates 'Male' could be higher.

Then we fit models as following together with the results

Female.list <- lmList( MathAch ~ SES | Sex, data = dd,subset = Sex == 'Female')

Call:
Model: MathAch ~ SES | Sex
Data: dd

Coefficients:
(Intercept)
Estimate     Std. Error      t value       Pr(>|t|)
12.1082290    0.1002315    120.8026585    0.0000000

SES
Estimate     Std. Error      t value       Pr(>|t|)
3.2248161     0.1283816     25.1189880    0.0000000

Residual standard error: 6.162026 on 3793 degrees of freedom
Male.list <- lmList( MathAch ~ SES | Sex, data = dd,subset = Sex == 'Male')

Call:
Model: MathAch ~ SES | Sex
Data: dd

Coefficients:
(Intercept)
Estimate    Std. Error       t value        Pr(>|t|)
13.474891    0.113909       118.295225      0.000000

SES
Estimate Std. Error t value Pr(>|t|)
3.0112488 0.1464696 20.5588646 0.0000000
Residual standard error: 6.614931 on 3388 degrees of freedom.

Pic6.jpg

The above result show there is some discrepancy between the two lines and they are not parallel in the following plot but not too much difference:

We obtain the following result by just fitting a simple linear model:

Call:
lm(formula = MathAch ~ SES + Sex + SES * Sex, data = dd)

Residuals:
   Min        1Q     Median     3Q       Max
-19.1736   -4.7287   0.2765   5.0472   16.5869

Coefficients:
               Estimate     Std. Error      t value    Pr(>|t|)
(Intercept)    12.1082         0.1038       116.681    <2e-16 ***
SES            3.2248          0.1329        24.262    <2e-16 ***
SexMale        1.3667          0.1511         9.043    <2e-16 ***
SES:SexMale    -0.2136         0.1940        -1.101     0.271

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since 'SES:SexMale' is insignificant, we can say the relationship between 'math achievement' and 'SES' are almost the same.


Appendix: /code for example 3

4) Is a low ses child better off in a high ses school or are they better
     #     off in a school of a similar ses? How about a high ses child in a low
     #     ses school? How would you qualify your findings so parents don't
     #     misinterpret them in making decisions for their children?

Crystal:

We need to consider the combination of school type, gender and the ses score.

  • 1 Compare “poor” girl in relatively high SES, public and mixed school with the one in relatively high SES, Catholic and girl school
  • 2 Compare “poor” girl in relatively high SES, public and mixed school with the one in relatively high SES, public and girl school
  • 3 Compare “rich” girl in relatively high SES, public and mixed school with the one in relatively high SES, Catholic and mixed school
  • 4 Compare “rich” girl in relatively low SES, public and mixed school with the one in relatively low SES,Catholic and girl school
  • 5 Compare “poor” boy in relatively low SES, public and mixed school with the one in relatively low SES, public and boy school
  • 6 Compare “poor” boy in relatively low SES, public and mixed school with the one in relatively low SES, Catholic and mixed school
  • 7 Compare “rich” boy in relatively low SES, public and mixed school with the one in relatively low SES, public and boy school
  • 8 Compare “rich” boy in relatively low SES, public and mixed school with the one in relatively low SES, Catholic and mixed school

If we add Sector, then R returns an error. So, we only consider id.1 and id.2. Now, we have to consider the following two cases:

  • 1 Compare “poor” girl in relatively high SES and mixed school with the one in relatively high SES, and girl school
  • 2 Compare “poor” girl in relatively low SES and mixed school with the one in relatively low SES and girl school

The results are as following:

      numDF   denDF     F.value      p.value
1       2      102      4.808326      0.0101

      Estimate    Std.Error   DF    t-value    p-value   Lower 0.95    Upper 0.95
1     3.018529    1.015558    102   2.972286   0.00369   1.004175       5.032884
2     1.097838    1.168964    102   0.939154   0.34987   -1.220797      3.416473

The first comparison gives pvalue 0.00369, which is significant. So,based on 1.097838, we say “poor” girl in relatively high SES and girl school have better performance than the one in relatively high SES, and mixed school


  • 3 Compare “rich” girl in relatively high SES and mixed school with the one in relatively high SES, and girl school
  • 4 Compare “rich” girl in relatively low SES and mixed school with the one in relatively low SES and girl school
      Estimate    Std.Error    DF     t-value     p-value    Lower 0.95    Upper 0.95
1    76.442538    45.806786    95     1.668804    0.09845    -14.495430    167.380506
2    -0.632692    0.660615     95     -0.957732   0.34063    -1.944179     0.678794

Pvalue 0.09845 is not very significant, and anothe pvalue 0.34063 is not significant. So we say for the “rich” girl, the two combinated cases 2.1 and 2.2 do not have siginicant difference.


  • 5 Compare “poor” boy in relatively high SES and mixed school with the one in relatively high SES, and boy school
  • 6 Compare “poor” boy in relatively low SES and mixed school with the one in relatively low SES and boy school
       numDF   denDF    F.value       p.value
1        2       96     9.253682      0.00021

       Estimate      Std.Error    DF    t-value     p-value     Lower 0.95     Upper 0.95
1      5.522658      22.417706    96    0.246352    0.80594     -38.976138     50.02145
2      -4.234878     1.074526     96    -3.941158   0.00015     -6.367796      -2.10196

The second comparison is significant based on the pvalue 0.00015.So, based on -4.234878, we say “poor” boy in relatively low SES and boy school have better performance better than the one in relatively low SES and mixed school.


  • 7 Compare “rich” boy in relatively high SES and mixed school with the one in relatively high SES, and boy school
  • 8 Compare “rich” boy in relatively low SES and mixed school with the one in relatively low SES and boy school
                Value          Std.Error   DF     t-value      p-value
(Intercept)     12.979318      1.3330135   337    9.736824     0.0000
ses              3.673740      0.9297412   337    3.951358     0.0001
id.2             0.240477      2.6971291   104    0.089160     0.9291
SES.School       2.522787      0.8314768   104    3.034103     0.0030
ses:id.2        -1.664814      1.9298626   337   -0.862660     0.3889

Correlation:

wald( fit.c, L)

        numDF     denDF      F.value     p.value
1         2        137       4.562654    0.01207

        Estimate    Std.Error   DF       t-value      p-value     Lower 0.95     Upper 0.95
1       3.898822    20.714629   137      0.188216     0.85099     -37.062931     44.860576
2      -1.510737    0.506364    137      -2.983502    0.00337     -2.512036      -0.509438

The pvalue 0.00337 is significant, so the second comparison is significant. Based on -1.664814, We say “rich” boy in relatively low SES and boy school have better performance than the one in relatively low SES and mixed school


  • Conclusion:

“poor” girl in relatively high SES and girl school have better performance than the one in relatively high SES, and mixed school

“rich” girl does not have significant difference

“poor” boy in relatively low SES and boy school have better performance better than the one in relatively low SES and mixed school.

“rich” boy in relatively low SES and boy school have better performance than the one in relatively low SES and mixed school.


Appendix: /code for example 4

5) Is a minority status child better off in a school with a higher
     #     proportion of minority status children or are they better
     #     off in a school with a low proportion? How would you qualify
     #     your findings so parents don't misinterpret them in making
     #     decisions for their children?

Bin Sun:

  • This this the plot of minority minus majority gap as a function of SES.
  • As the plot show, for a majority status kid, it depends, if the kid has low ses, it is better send the kid to the high majority *school;on the other hand, if the kid has high ses, then it is better send the kid to the high proportion of minority school.

Minority.jpg

  • If the model we consider is more complicated, and we have the math achievement prediction like the following plot for the majority kid. The decision of the parents make should be more careful because they need to consider other contextual variables.

The R code is from the exercise2 except that there is a minor change of two places which are the name changing from coefp, coefm to U2 and L2:

xyplot( coef + U2+ L2 ~ ses, dwgap, type = 'l', lwd = 2,
              ylab = "Gap in Math Achievement (+/- 2 SEs)",
              xlab = "SES", xlim = c(-2,2),
              scale = list( x = list( at = seq(-2,2,.5))),
              sub = 'Minority minus Majority gap as a function of SES',
              panel = function(x, y, ...) {
                    panel.superpose(x,y,...)
                    panel.abline(v=seq(-1.5,1.5,.5), col = 'gray')
                    panel.abline(h=seq( -6,0, 1), col = 'gray')
              })

Predict.jpg

 predc$grps <- with( predc, paste( ifelse(Minority=="Yes", "Min: ", "Maj: "), "ses = ", ses.mean, sep = ""))

     predc$labs <- as.character( 1:nrow(predc))

     td( col = c('red','blue','red','blue') , lty = c(1,1,2,2), lwd = 2)
     xyplot( mathach ~ ses ,predc, groups = grps , type = 'l',
             auto.key = list(columns = 2, lines = T, points = F),
                      panel = panel.superpose,
                      panel.groups = function(x, y, subscripts, col, cex, ...) {
                           panel.lines( x, y,  ... )
                           panel.text( x, y, labels = predc$labs[subscripts], col = 'black', cex = 1.2 ,...)
                      }
            )
Personal tools