MATH 6627 2010-11 Practicum in Statistical Consulting/Assignment Teams/Gray
1. Simpson's Paradox
Example: My friend and I play a basketball game and each shoot 20 shots. Who is the better shooter?
But, who is the better shooter if you control for the distance of the shot? Who would you rather have on your team?
We can see from this figure that the relationship changes from negative to positive when we take the distance of the shot into consideration. The black line links the overall proportions the two of us made; the red line links our performance on far shots, and the blue line our performance on close shots.
Simpson’s paradox arises from one simple mathematical truth. Given eight positive numbers a, b, c, d, A, B, C, D with the properties a/b > A/B and c/d > C/D, it is not necessarily true that (a+c)/(b+d) > (A+C)/(B+D). In fact, it may be true that (a+c)/(b+d) < (A+C)/(B+D).
This is an obvious math reality, yet it has significant ramifications in Bayesian analysis, medical research, science and engineering studies, and societal statistical analysis. It is of concern for any statistical activity involving the calculation and analysis of ratios of two measurements.
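As a sketch, the inequality above can be checked numerically in R. The shot counts below are invented for this illustration (any numbers satisfying the stated properties would do); they mirror the basketball example, with the friend taking almost all shots from close range:

```r
# Invented shot data: each shooter attempts 20 shots, split by distance.
shots <- data.frame(
  shooter  = rep(c("Me", "Friend"), each = 2),
  distance = rep(c("close", "far"), times = 2),
  made     = c(9, 1, 16, 0),
  attempts = c(10, 10, 18, 2)
)

# Within each distance, "Me" has the higher shooting percentage...
with(shots, made / attempts)    # 0.90, 0.10 vs 0.889, 0.00

# ...but aggregated over distance, "Friend" comes out ahead.
aggregate(cbind(made, attempts) ~ shooter, shots, sum)   # Me: 10/20, Friend: 16/20
```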
Example 2 (Real income tax example)
2. Graphics to visualize data
Gray: rgl and p3d (including the use of the 'groups' parameter to produce trajectories)
rgl is a library of functions that offers 3D real-time visualization functionality to the R programming environment (Adler & Murdoch, 2010), providing an OpenGL implementation for R.
p3d is a library of functions which employs functions from RGL to help visualize statistical models expressed as a function of 2 independent variables with the possible addition of a categorical variable (Monette, 2009).
With rgl we create a ‘device’, which is simply a window, within which a ‘world’ is created where we can draw 3-dimensional shapes and through which we can navigate.
Functions within the rgl package can be divided into 6 categories: (1) Device management functions (open and close devices, control the active device) (2) Scene management functions (remove some or all objects from the scene) (3) Export functions (create image files) (4) Shape functions - essential plotting primitives (points, lines, triangles, quads) as well as higher-level functions (text, spheres, surfaces) (5) Environment functions - modify the viewpoint, background and bounding box, and add light sources (6) Appearance function rgl.material(…).
Using shapes and surfaces within an rgl device, statistical data can be represented in 3 dimensions. Some advanced examples are available as demos or provided on the rgl website.
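As a hedged sketch of these categories in action (the data here are random points, invented purely for illustration), a minimal rgl session might look like:

```r
library(rgl)

open3d()                                    # (1) device management: open a device
spheres3d(rnorm(50), rnorm(50), rnorm(50),  # (4) shape function: 50 random points
          radius = 0.1, col = "blue")
axes3d()                                    # (5) environment: add a bounding box with axes
title3d(main = "50 random points")          # (4) higher-level shape function: text
```

The mouse can then be used to rotate and zoom the resulting scene.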
A few of the functions from rgl are useful for manipulating 3D models created using p3d, since p3d contains many functions that inherit from rgl but tailor them to statistical methods. Thus all but a few rgl functions are unnecessary for our purposes, unless you would like to contribute functionality to p3d!
In this section I will focus on example code you may use to familiarize yourself with the capabilities of this package. You will require the File:Gray p3d ex Tuition.txt and File:Gray p3d ex USIndicesIndustrialProd.txt data sets. Note that a few of the commands employed in the sample code are from rgl, but these will likely be superseded by p3d functions as the package matures.
library( lattice )
library( nlme )
library( car )
library( spida )
library( rgl )
library( p3d )
tuit = read.table('tuition.Rdata',header=TRUE)
prod = read.table('USIndicesIndustrialProd.Rdata',header=TRUE)
For the tuition data we will begin by plotting the annual cost of tuition from a sample of American Universities against the rates of faculty compensation and proportion of students who graduate.
Using the mouse you can change the field of view and zoom in and out. Plot3d() creates the 3D plot as shown on the right.
We can remove elements from the device using the function Pop3d(), which removes elements starting with the most recently added item. Multiple items can be removed by adding a numeric argument, e.g. Pop3d(4).
Init3d(cex = .8)
Plot3d(tuition ~ fac_comp + graduat, col = c("blue"), data = tuit)
Init3d(cex = .8)
Plot3d( tuition ~ fac_comp + graduat|public.private, col = c("blue", "red"), data = tuit)
Next we will add regression planes for private (red) and public (blue) schools, using the lm() function to determine the fit and Fit3d() to insert the plane in the graph. Axes and labels are added using Axes3d() and title3d().
fitpub = lm(tuition ~ fac_comp + graduat,subset=(public.private==0),data = tuit)
Fit3d( fitpub, col = c("blue"))
fitpri = lm(tuition ~ fac_comp + graduat,subset=(public.private==1),data = tuit)
Fit3d( fitpri, col = c("red"))
title3d(main='Tuition predicted by grad rates and faculty salary -private (red) and public(blue) institutions')
We can change the viewpoint of our graph using the function view3d(theta, phi, fov, zoom), which takes polar coordinates. Note that view3d(0,0,0) will rotate the image to face the x-z plane (y into the screen) and view3d(270,0,0) will rotate the image to face the y-z plane (x into the screen). The function snap() will capture a still image of the current view. Note that to use movie3d() you must have ImageMagick installed to automatically convert PNGs to a GIF; otherwise you must use external software.
spin(theta = 0, phi = 0)
spins(inc.theta = 1/4, inc.phi = 0, theta = NULL, phi = NULL)
movie3d( spin3d(axis=c(0,1,0), rpm=20), duration=2, dir='movie' )
Here is an additional example, using data on the US indices of industrial products, plotting Mining production (MIN) over months and years. Adding the argument ‘groups=YR’ to Plot3d connects the months in a given year to produce trajectories.
open3d(windowRect=c(100,100,800,800),cex = .8)
prod = read.table('USIndicesIndustrialProd.Rdata',header=TRUE)
Plot3d(MIN ~ YR+MONTH,data=prod,groups=YR)
title3d(main='Industrial Production Mining (1947-1993)')
Statistics in the News: "Spousal support a royal pain?" Journal Abstract
Overly supportive spouses are not necessarily doing their partners a favour.
They could be prolonging the recovery of their injured spouses.
Men with highly attentive spouses reported higher levels of pain and more disability but did well on physical functions tests.
Women with highly attentive spouses didn't report feeling more pain or being more disabled. However, they performed more poorly on physical function tests than did women with less attentive spouses.
- Does the article suggest a causal relationship between two variables? If so, which? Are the data observational or experimental?
Yes, the article did suggest a causal relationship. One variable is the spousal solicitousness (attentiveness & support). Another is the degree of reported pain and disability. A third is actual physical function. Patient gender was also taken into consideration.
According to the report, men with chronic pain report more perceived pain and disability when they receive higher levels of spousal attention, and it is implied this is controlling for actual physical ability. For women there was no difference in self-reports of pain and disability with level of spousal support, but women who received more attention from their husbands had poorer physical function than those who did not.
The data are observational, because there is no manipulation or randomization in collecting the data.
- Can you think of alternative explanations to causality? Confounding factors? Or explanations consistent with causality? Mediating factors?
Andy: Yes, maybe a high degree of pain causes solicitousness, instead of solicitousness causing more pain. I don't think there exists a confounding factor. If there is one, I believe it is LOVE: a man's LOVE wins him a highly attentive spouse, but it doesn't act as a narcotic the way a woman feels it does. He may then ask for more LOVE by reporting hurt, which is a sweet answer to any woman, but tooth-hurting to any healthy man.
Also, LOVE could be the mediating factor. With high solicitousness, a man's LOVE starts to grow; he is more likely to move his body, which may cause more hurt, and then his physical function may recover faster. But a woman likes finding pieces of LOVE through solicitousness; her heart is numbed with LOVE, and she will then report that she is better as an affectionate payback.
- Have any confounding factors been accounted for in the analysis?
Emotional (men) or Physical (women) troubles outside the primary cause of pain. Men who are emotionally unstable may attract and encourage attentiveness in their wives, and also report a disproportionately high level of disability relative to their level of function. Women who are physically weak prior to injury may attract husbands who are highly supportive, but also be more susceptible to injuries that result in chronic pain.
- Have any mediating factors been controlled for in a way that vitiates a causal interpretation of the relationship?
This does not appear to be the case.
- What is your personal assessment of the evidence for causality in the study that is the subject of the article?
There is a good case to be made for both forward and backward causality in this study. In addition, it would be nice to be able to control for emotional and physical stability prior to the chronic injury. Are frequencies distributed as expected? Are the proportions of highly attentive spouses equal across groups? ASSESSMENT: Judgement withheld pending further evidence.
Paradoxes and Fallacies
- You are studying observational data on the relationship between Health and Coffee (measured in grams of caffeine consumed per day). Suppose you want to control for a possible confounding factor 'Stress'. In this kind of study it is more important to make sure that you measure coffee consumption accurately than it is to make sure that you measure 'stress' accurately.
True. It is more important to accurately measure Coffee Consumption if this is the variable of primary interest in the study.
Consider this from the analysis of variance framework. When measurement error is introduced, the variability in the outcome accounted for by that factor decreases, and the error sum of squares increases. If we decrease accuracy in measuring Stress, the error term will accordingly increase, which will in turn somewhat decrease the power to detect the effect of Coffee on Health, but will not affect the coefficient estimates much. However, if we have increased measurement error in Coffee Consumption, then the proportion of variability explained by Coffee will decrease AND the error term will increase. We lose power on two fronts!
Here is a short r-script that demonstrates this idea: File:Gray Q2MeasError.r
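The linked script is not reproduced here; the following is only a minimal sketch of the same idea, with invented variable names and effect sizes:

```r
set.seed(1)
n <- 500
stress <- rnorm(n)
coffee <- 0.5 * stress + rnorm(n)                  # consumption related to stress
health <- -0.3 * coffee - 0.6 * stress + rnorm(n)  # invented effect sizes

# Accurately measured predictors recover the true coefficients:
coef(summary(lm(health ~ coffee + stress)))

# Measurement error in Coffee attenuates its coefficient and inflates its p-value:
coffee.err <- coffee + rnorm(n, sd = 1)
coef(summary(lm(health ~ coffee.err + stress)))

# Measurement error in Stress inflates the residual (error) standard deviation:
stress.err <- stress + rnorm(n, sd = 1)
summary(lm(health ~ coffee + stress.err))$sigma
```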
- In a multiple regression of Y on three predictors, X1, X2 and X3, if the coefficients of both X2 and X3, are not significant, it is safe to drop these two variable and perform a regression on X1 alone.
No. The right way to do this is to start with all 3 variables in the model and drop the least "significant" one at a time, until you are left with only "significant" variables. We can perform a stepwise selection, dropping variables that are no longer "significant" after the introduction of new variables.
- In a multiple regression, if you drop a predictor whose effect is not significant, the p-values of the other predictors should not change very much.
No. After dropping a predictor, the model changes. If we drop a predictor whose effect is not significant, the p-values of the other predictors can change, particularly when the predictors are correlated. We can see this from this example @ page 26-28.
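A minimal sketch of this (predictor names and effect sizes invented): with correlated predictors, dropping one changes the standard error, and hence the p-value, of the other.

```r
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # x2 strongly correlated with x1
y  <- x1 + rnorm(n)             # only x1 truly matters

# x2 adds nothing once x1 is in the model, and the collinearity
# inflates x1's standard error:
coef(summary(lm(y ~ x1 + x2)))

# After dropping x2, x1's standard error shrinks and its p-value changes:
coef(summary(lm(y ~ x1)))
```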
- In a model to assess the effect of a number of treatments on some outcome, we can estimate the difference between the best treatment and the worse treatment by using the difference in the mean outcomes.
No. The outcomes do provide some information about which treatment looks best and which looks worst. But if we pick the "best" and "worst" treatments after looking at the data, the difference in their mean outcomes is a biased (inflated) estimate of the true difference: even if all treatments were identical, the largest and smallest observed means would still differ by chance. In addition, other factors that affect the outcome, such as time since onset of the disease, may differ between the treatment groups.
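A small simulation sketch (the treatment count and sample sizes are invented) shows the selection bias: even with no true differences, the observed best-minus-worst gap is systematically positive.

```r
set.seed(4)
# 5 treatments with identical true effects, 20 patients each
best.minus.worst <- replicate(2000, {
  outcome   <- rnorm(5 * 20)         # same distribution for every treatment
  treatment <- rep(1:5, each = 20)
  means     <- tapply(outcome, treatment, mean)
  max(means) - min(means)            # gap between "best" and "worst"
})
mean(best.minus.worst)   # clearly positive although all treatments are equal
```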
- If two variables have a strong interaction, this implies a strong correlation.
False. An interaction exists if the effect of one independent variable on the dependent variable varies over levels of another independent variable. This tells us nothing about the relationship between the independent variables themselves. An interaction between two variables can occur whether or not the variables are correlated with one another.
Here is an additional example using binary variables: Illustration of the difference between correlation and interaction amongst independent variables
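To illustrate with a sketch (all names and coefficients invented): two uncorrelated binary predictors can still carry a strong interaction.

```r
set.seed(2)
n  <- 1000
x1 <- rbinom(n, 1, 0.5)   # two independent binary predictors,
x2 <- rbinom(n, 1, 0.5)   # generated with no correlation
y  <- 1 + x1 + x2 + 3 * x1 * x2 + rnorm(n)

cor(x1, x2)            # near zero: no correlation between the predictors
coef(lm(y ~ x1 * x2))  # yet the x1:x2 interaction coefficient is near 3
```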
1) As we saw, 'Sector' appears to be an important predictor. Consider models using ses and Sector. Aim to estimate the between-Sector gap as a function of ses if there is an interaction between Sector and ses. Check for and allow for a possible contextual effect of ses. Plot expected math achievement in each sector. Plot the gap with SEs. Consider the possibility that the apparently flatter effect of ses in Catholic schools could be due to a non-linear effect of ses. How would you test whether this is a reasonable alternative explanation?
The interaction between ses and sector is significant (p<0.0001). To fully appreciate the nature of the relationship consider the following plot:
- Estimating the compositional (= between sector) effect of ses
wald( myfit1 )
L <- list( 'Effect of ses' = rbind(
    "Within-school" = c( 0, 1, 0, 0, 0),
    "Contextual"    = c( 0, 0, 0, 1, 0),
    "Compositional" = c( 0, 1, 0, 1, 0)))
wald( myfit1, L )
                numDF denDF  F.value p.value
Effect of ses       2    77 27.43212 <.00001

               Estimate Std.Error DF   t-value p-value Lower 0.95 Upper 0.95
Within-school -1.778535  0.463645 77 -3.835987 0.00025  -2.701769  -0.855300
Contextual     3.090680  0.589674 77  5.241334 <.00001   1.916488   4.264872
Compositional  1.312145  0.832357 77  1.576421 0.11903  -0.345290   2.969581
# L matrices for the gap at a value of ses:
#   2) L matrix for Public:    c( 1, ses, 1, 0, ses )
#   3) L matrix for Catholic:  c( 1, ses, 0, 0, 0 )
#   4) Public - Catholic:      c( 0, 0, 1, 0, ses )
pred.gap <- expand.grid( ses = seq( -2, 2, .05))  # why so many?
L.gap <- cbind( 0, 0, 1, 0, pred.gap$ses)
L.gap
wald( myfit1, L.gap )
pred.gap <- cbind( pred.gap, as.data.frame( wald( myfit1, L.gap), se = 2))
some( pred.gap )
td( lty = c(1,2,2), lwd = 2, col = 'black')
xyplot( coef + L2 + U2 ~ ses, pred.gap, type = 'l', xlim = c(-2,2),
        ylab = "Estimated Catholic - Public gap in MathAch (+/- 2 SEs)")
2) Take the example further by incorporating Sex. Consider the 'contextual effect' of Sex, which is school sex composition. Note that there are three types of schools: Girls, Boys and Coed schools. If you consider an interaction between Sector and school gender composition, you will see that the Public Sector has only Coed schools. What is the consequence of this fact for modeling sex composition and Sector effects?
Since I had no idea how to handle the issue of the lack of non-coed schools in the Public sector, I maturely decided to ignore the problem and delete non-coed schools from the data set. This analysis, treating the proportion of males as a continuous variable, focuses entirely on Coed schools.
dd$sex.p   <- with( dd, cvar( Sex, id ))   # proportion of males
dd$ses.m   <- with( dd, cvar( ses, id ))
dd$sex.cat <- factor( ifelse( dd$sex.p==1, "Girls",
                      ifelse( dd$sex.p==0, "Boys", "Coed")))
ddcd <- dd[(dd$sex.cat=="Coed"),]
With the exception of ses:Sector, interactions were not significant. Because it was of substantive interest for the research question, I decided to leave Sector:Sex in the model.
fit3 <- lme( mathach ~ ses + Sector + Sex + Sector:Sex + ses:Sector + ses.m + sex.p,
             dd, random = ~ 1 | id,
             control = list( maxIter = 200, msMaxIter = 200, niterEM = 100))
summary( fit3 )
wald( fit3 )
 numDF denDF F.value p.value
     7    59     435 <.00001

Coefficients:
                     Estimate Std.Error   DF t-value p-value Lower 0.95 Upper 0.95
(Intercept)            12.752     0.565 3600  22.581 <.00001     11.645      13.86
ses                     1.495     0.223 3600   6.715 <.00001      1.058       1.93
SectorPublic           -1.777     0.510   76  -3.483 0.00083     -2.792      -0.76
SexMale                 1.037     0.412 3600   2.516 0.01191      0.229       1.85
ses.m                   3.048     0.585   76   5.209 <.00001      1.883       4.21
sex.p                   0.945     0.950   76   0.995 0.32312     -0.947       2.84
SectorPublic:SexMale    0.163     0.500 3600   0.327 0.74389     -0.816       1.14
ses:SectorPublic        1.320     0.294 3600   4.488 0.00001      0.743       1.90
Clearly the relationship between ses and math achievement is consistent across schools with low, roughly equal, and high proportions of male students. However, female students consistently do worse than male students, by a fairly constant amount, at all levels of ses, in both school sectors, and regardless of the proportion of males in the school.
3) Does it appear that boys are better off in a boy's school and girls in a girl's school or are they better off in coed schools? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?
From the following graphs, we can say that the overall performance of all three school types (girls', boys' and coed) is about equal. But more students in coed schools perform below the level of students in single-sex schools. Also, very few girls perform well in girls' schools compared with coed schools, so I think girls are better off in coed schools. Also, from the graph we can see that boys perform well in boys' schools, where a larger number of students perform well. So boys are better off in both boys' and coed schools.
I thought another possibility would be to model this with both Sex and the type of school (sex.cat = Coed or NotCoed) as predictors of math achievement. Since the Public sector had no single-sex schools, and it was seen previously that Sector had an effect on math achievement, it seems reasonable to conduct this analysis including only Catholic schools.
dd <- hs1[hs1$Sector=="Catholic",]
dd$sex.cat <- factor( ifelse( dd$sex.p==0, "NotCoed",
                      ifelse( dd$sex.p==1, "NotCoed", "Coed")))
fit <- lme( mathach ~ 1 + sex.cat + Sex + sex.cat:Sex, dd,
            random = ~ 1 | id,
            control = list( maxIter = 200, msMaxIter = 200, niterEM = 100))
summary( fit )
Adding a random term for Sex did not improve the model significantly, thus this was not included in the model (p.value = 0.406). Because only the RE model was altered, it was possible to use the anova function to assess the difference in fit.
fit2 <- lme( mathach ~ 1 + sex.cat + Sex + sex.cat:Sex, dd,
             random = ~ 1 + dvar(Sex,id) | id,
             control = list( maxIter = 200, msMaxIter = 200, niterEM = 100))
summary( fit2 )
anova( fit, fit2 )   # adding a random effect for Sex did not improve the model
Wald tests for the coefficients included in the model showed that the effect of school category (Coed/Not Coed) was not significant, nor was it significant as an interaction with the gender of the student.
Coefficients:
                       Estimate Std.Error   DF t-value p-value Lower 0.95 Upper 0.95
(Intercept)              13.427     0.718 1796  18.699 <.00001     12.019      14.84
sex.catNotCoed           -0.635     1.313   33  -0.484 0.63185     -3.307       2.04
SexMale                   1.144     0.396 1796   2.890 0.00389      0.368       1.92
sex.catNotCoed:SexMale    1.489     1.467   33   1.016 0.31722     -1.494       4.47
4) Is a low ses child better off in a high ses school or are they better off in a school of a similar ses? How about a high ses child in a low ses school? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?
Ans: I have divided the data into two groups, "high ses schools" and "low ses schools". From this graph we see that, in high ses schools, students with high ses perform better than students with low ses. In low ses schools, students perform about equally irrespective of ses status.
dd$hls <- cut(dd$ses.mean, breaks = 2, labels = c("low","high"))
xyplot(mathach ~ ses | hls, dd, auto.key = TRUE)
5) Is a minority status child better off in a school with a higher proportion of minority status children or are they better off in a school with a low proportion? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?
This graph reveals that students in schools with a lower proportion of minority students perform better than students in schools with a higher proportion. So, I think, if parents want their kids to perform well, they should send their kids to a school with a lower proportion of minority students.
However, when, in the following graph, we include the contextual/compositional effects of the variables, it is a different story. These contextual variables can work for or against students' performance. So parents need to be very careful when making their decisions.