MATH 6627 2010-11 Practicum in Statistical Consulting/Assignment Teams/Gray

From Wiki1

Jump to: navigation, search


Assignment 1

1. Simpson's Paradox

Example: My friend and I play a basketball game and each shoot 20 shots. Who is the better shooter?


But, who is the better shooter if you control for the distance of the shot? Who would you rather have on your team?

H32.PNG This is question of Simpson's Paradox.


We can see from this figure, the relationship changed from negative to positive when we took the distance to our consideration. Black line linked probability of we two made. Red line is linked our performance when far, but blue when close.

Simpson’s paradox arises from one simple mathematical truth. Given eight real numbers: a, b, c, d, A, B, C, D with the following properties:12.png, then it is not necessarily true that122.png. In fact, it may be true that:13.png.

This is an obvious math reality, yet it has significant ramifications in Bayesian analysis, medical research, science and engineering studies, and societal statistical analysis. It is of concern for any statistical activity involving the calculation and analysis of ratios of two measurements.

Exmaple 2 (Real Income tax example)


2. Graphics to visualize data

Gray: rgl and p3d (include the use of the 'groups' parameter to produce trajectories


File:Grey rgl p3d 1.pdf File:Grey rgl p3d 2.pdf

rgl is a library of functions that offers 3D real-time visualization functionality to the R programming environment (Adler & Murdoch, 2010), providing OpenGL implemention for R.

p3d is a library of functions which employs functions from RGL to help visualize statistical models expressed as a function of 2 independent variables with the possible addition of a categorical variable (Monette, 2009).

Package rgl

With rgl we create a ‘device’ , which is simply a window, within which a ‘world’ is created where we can create 3 dimensional shapes and through which we can navigate.

Grey World.png

Functions within the rgl package can be divided into 6 categories: (1) Device management functions (open and close devices, control active device) (2) Scene management functions (option to remove certain or all objects from the scene) (3) Export functions (creating image files) (4) Shape functions - essential plotting tools primitives (points, lines, triangles, quads) as well as higher level functions (text, spheres, surfaces).


(5) Environment functions - modify the viewpoint, background and bounding box, adding light sources (6) Appearance function rgl.material(…).

Grey AppearanceOptions.png

Using shapes and surfaces within an rgl device, statistical data can be represented in 3 dimensions. Some advanced examples are available as demos or provided on the rgl website.

Gray Rgl 3d histogram.png Grey rgl example imulated animal abundance.png

A few of the functions from rgl are useful for manipulating 3D models created using p3d, since p3d contains many functions that inherit from rgl but taylor them to statistical methods. Thus all but a few are unnecessary for our purposes unless you would like to contribute functionality to p3d!

Package p3d

In this section I will focus on example code you may use to familiarize yourself with the capabilities of this package. You will require the tuition.Rdata (source:) and USIndicesIndustrialProd.Rdata (source:) data sets. Note that a few of the commands employed in sample code are from rgl, but these will likely be superceded by p3d functions as the package matures.

In this section I will focus on example code you may use to familiarize yourself with the capabilities of this package. You will require the File:Gray p3d ex Tuition.txt and File:Gray p3d ex USIndicesIndustrialProd.txt data sets. Note that a few of the commands employed in sample code are from rgl, but these will likely be superceded by p3d functions as the package matures.

Initialization code:

library( lattice )
library( nlme )
library( car )
library( spida )
library( rgl )
library( p3d )
tuit = read.table('tuition.Rdata',header=TRUE)
prod = read.table('USIndicesIndustrialProd.Rdata',header=TRUE)

For the tuition data we will begin by plotting the annual cost of tuition from a sample of American Universities against the rates of faculty compensation and proportion of students who graduate.

Using mouse keys you can change the field of view and zoom in and out. Plot3d creates the 3D plot as shown on the right.

We can remove elements from the device using the function Pop3d(). This function removes elements starting with the most recently added item. Multiple items can be removed addition an numeric argument, ie.Pop3d(4)
Gray rgl navigation.png Gray p3d ex1.png

Init3d(cex = .8)
Plot3d(tuition ~ fac_comp + graduat, col = c("blue"), data = tuit)

Next we will subdivide the data by category, in this case whether the school is private (red) or public (blue) (variable name public.private).
Gray p3d ex2.png

Init3d(cex = .8)
Plot3d( tuition ~ fac_comp + graduat|public.private, col = c("blue", "red"), data = tuit)

Next we will add regression planes for private(red) and public(blue) schools using the lm() function to determine the fit, and Fit3d() to insert the plane in the graph. Axes and labels are added using Axes3d() and title3d().
Gray p3d ex3.png

fitpub = lm(tuition ~ fac_comp + graduat,subset=(public.private==0),data = tuit)
Fit3d( fitpub, col = c("blue"))

fitpri = lm(tuition ~ fac_comp + graduat,subset=(public.private==1),data = tuit)
Fit3d( fitpri, col = c("red"))

title3d(main='Tuition predicted by grad rates and faculty salary -private (red) and public(blue) institutions')

Data ellipses are useful for understanding our data.
Gray p3d ex4.png


We can change the view point of our graph using function view3d(theta,phi,fov,zoom), which takes polar coordinates. Note that view3d(0,0,0) will rotate the image to to face the x-z plane (y into the screen) and view3d(270,0,0) will rotate the image to to face the y-z plane (x into the screen). Function snap() will capture a still image of the current view. Note that to use movie3d() you must have ImageMagick installed to automatically convert png's to gif, otherwise you must use external software.

Gray p3d ex5.png GrayMovie.gif


spin(theta = 0, phi = 0)

spins(inc.theta = 1/4, inc.phi = 0, theta = NULL, phi = NULL)

movie3d( spin3d(axis=c(0,1,0), rpm=20), duration=2, dir='movie' )

Here is an additional example, using data on the US indices of industrial products, plotting Mining production (MIN) over months and years. Adding the argument ‘groups=YR’ to Plot3d connects the months in a given year to produce trajectories. Gray p3d ex6.png

open3d(windowRect=c(100,100,800,800),cex = .8)
prod = read.table('USIndicesIndustrialProd.Rdata',header=TRUE)
Plot3d(MIN ~ YR+MONTH,data=prod,groups=YR)
title3d(main='Industrial Production Mining (1947-1993)')

Assignment 2

Statistics in the News: "Spousal support a royal pain?" Journal Abstract

Overly supportive spouses are not necessarily doing their partners a favour.

They could be prolonging the recovery of their injured spouses.

Men with highly attentive spouses reported higher levels of pain and more disability but did well on physical functions tests.

Women with highly attentive spouses didn't report feeling more pain or being more disabled. However, they performed more poorly on physical function tests than did women with less attentive spouses

MEDIA: Gray_RoyalPain.pdf


whether the article suggest a causal relationship between two variables? If so which? Are the data observational or experimental?

Yes, the article did suggest a causal relationship. One variable is the spousal solicitousness (attentiveness & support). Another is the degree of reported pain and disability. A third is actual physical function. Patient gender was also taken into consideration.

According to the report, men with chronic pain report more perceived pain and disability when they receive higher levels of spousal attention, and it is implied this is controlling for actual physical ability. For women there was no difference in self-reports of pain and disability with level of spousal support, but women who received more attention from their husbands had poorer physical function than those who did not.

The data are observational, because there is no manipulation or randomization in collecting the data.


Can you think of alternative explanations to causality? Confounding factors? Or explanations consistent with causality? Mediating factors?

Andy: Yes, maybe high degree of pain causes solicitousness, instead of solicitousness causing towards more pain. I don't think there exits a confounding factor. If there is, I believe it is LOVE, because a man's LOVE wins a highly attentive spouse and doesn't work as a narcotic as a woman feels. Then he may ask for more LOVEs by reporting hurt. That is a sweet answer to any woman, but tooth-hurting felt by any healthy man.

Also, LOVE could be the mediating factor too. And with high solicitousness, man's LOVE starts to fall in. He may move his body, more likely, which may cause more hurt. And then his physical function may recover faster. But to a woman, she likes to finding pieces of LOVEs through solicitousness. Her heart is numbed with those LOVEs. And then she will report she is better as an affectionate payback.


Have any confounding factors been accounted for in the analysis?

Emotional (men) or Physical (women) troubles outside the primary cause of pain. Men who are emotionally unstable may attract and encourage attentiveness in their wives, and also report a disproportionately high level of disability relative to their level of function. Women who are physically weak prior to injury may attract husbands who are highly supportive, but also be more susceptible to injuries that result in chronic pain.


Have any mediating factors been controlled for in a way that vitiates a causal interpretation of the relationship?

This does not appear to be the case.


What is your personal assessment of the evidence for causality in the study that is the subject of the article?

There is a good case to be made for both forward and backward causality in this study. In addition it would be nice to be able to ontrol for emotional and physical stability prior to chronic injury. Are frequencies distributed as expected? Are the proportions of highly attentive spouses equal across groups? ASSESSMENT: Judgement withheld pending further evidence

Paradoxes and Fallacies


You are studying observational data on the relationship between Health and Coffee (measured in grams of caffeine consumed per day). Suppose you want to control for a possible confounding factor 'Stress'. In this kind of study it is more important to make sure that you measure coffee consumption accurately than it is to make sure that you measure 'stress' accurately.

True. It is more important to accurately measure Coffee Consumption if this is the variable of primary interest in the study.

Consider this from the analysis of variance framework. When measurement error is introduced, the variability in the outcome accounted for by that factor is decreased, and the error sums of squares increases. If we decrease accuracy in measuring Stress, and the error term will accordingly increase, which will in turn decrease the power to detect the effect of Coffee on Health somewhat, but will not affect the coefficient estimates by much. However, if we have increased measurement error in Coffee Consumption, then the proportion of variability explained by Coffee will decrease AND the error term will increase. We lose power on two fronts!

Here is a short r-script that demonstrates this idea: File:Gray Q2MeasError.r


In a multiple regression of Y on three predictors, X1, X2 and X3, if the coefficients of both X2 and X3, are not significant, it is safe to drop these two variable and perform a regression on X1 alone.

No. The way to do this is starting with 3 variables in the model, and dropping the least "significant", one at a time, until you are left with only "significant" variables.


In a multiple regression, if you drop a predictor whose effect is not significant, the p-values of the other predictors should not change very much.

No. After dropping the predictor, model changes. We can see this from this example @ page 26-28


In a model to assess the effect of a number of treatments on some outcome, we can estimate the difference between the best treatment and the worse treatment by using the difference in the mean outcomes.

No. Size of tumors does provide some info about two treatments, one is the best and another is the worse. But it is not a good idea to estimate the difference between the two treatments by using the difference of size of tumors. It is so biased, because there are other factors which will effect the outcome, such as time in the disease. For example, our target is New York when we drive out of Toronto. But some day, we find everyone around is speaking Spanish. But the answer may be yes too, as Barack Obama's slogan showed: "Yes we can".


If two variables have a strong interaction, this implies a strong correlation.

False. An interaction exists if the effect of one independent on the dependent variable varies over another independent variable. This tells us nothing about the relationship between the IVs. An interaction between two variables can occur whether or not the variables are correlated with one another.

No Interaction Interaction
No Correlation Gray Q14 nCnI.gif X1X2 Uncorr NoInt.png Gray Q14 nCI.gif X1X2 Uncorr Int.png
Correlation Gray Q14 CnI.gif X1X2 Corr NoInt.png Gray Q14 CI.gif X1X2 Corr Int.png

File:Gray Q14.R

Here is an additional example using binary variables: Illustration of the difference between correlation and interaction amongst independent variables

Personal tools