MATH 6627 2010-11 Practicum in Statistical Consulting/Assignment Teams/Gray

From Wiki1

Jump to: navigation, search


Assignment 1

1. Simpson's Paradox

Example: My friend and I play a basketball game and each shoot 20 shots. Who is the better shooter?


But, who is the better shooter if you control for the distance of the shot? Who would you rather have on your team?

H32.PNG This is question of Simpson's Paradox.


We can see from this figure, the relationship changed from negative to positive when we took the distance to our consideration. Black line linked probability of we two made. Red line is linked our performance when far, but blue when close.

Simpson’s paradox arises from one simple mathematical truth. Given eight real numbers: a, b, c, d, A, B, C, D with the following properties:12.png, then it is not necessarily true that122.png. In fact, it may be true that:13.png.

This is an obvious math reality, yet it has significant ramifications in Bayesian analysis, medical research, science and engineering studies, and societal statistical analysis. It is of concern for any statistical activity involving the calculation and analysis of ratios of two measurements.

Exmaple 2 (Real Income tax example)


2. Graphics to visualize data

Gray: rgl and p3d (include the use of the 'groups' parameter to produce trajectories


File:Grey rgl p3d 1.pdf File:Grey rgl p3d 2.pdf

rgl is a library of functions that offers 3D real-time visualization functionality to the R programming environment (Adler & Murdoch, 2010), providing OpenGL implemention for R.

p3d is a library of functions which employs functions from RGL to help visualize statistical models expressed as a function of 2 independent variables with the possible addition of a categorical variable (Monette, 2009).

Package rgl

With rgl we create a ‘device’ , which is simply a window, within which a ‘world’ is created where we can create 3 dimensional shapes and through which we can navigate.

Grey World.png

Functions within the rgl package can be divided into 6 categories: (1) Device management functions (open and close devices, control active device) (2) Scene management functions (option to remove certain or all objects from the scene) (3) Export functions (creating image files) (4) Shape functions - essential plotting tools primitives (points, lines, triangles, quads) as well as higher level functions (text, spheres, surfaces).


(5) Environment functions - modify the viewpoint, background and bounding box, adding light sources (6) Appearance function rgl.material(…).

Grey AppearanceOptions.png

Using shapes and surfaces within an rgl device, statistical data can be represented in 3 dimensions. Some advanced examples are available as demos or provided on the rgl website.

Gray Rgl 3d histogram.png Grey rgl example imulated animal abundance.png

A few of the functions from rgl are useful for manipulating 3D models created using p3d, since p3d contains many functions that inherit from rgl but taylor them to statistical methods. Thus all but a few are unnecessary for our purposes unless you would like to contribute functionality to p3d!

Package p3d

In this section I will focus on example code you may use to familiarize yourself with the capabilities of this package. You will require the tuition.Rdata (source:) and USIndicesIndustrialProd.Rdata (source:) data sets. Note that a few of the commands employed in sample code are from rgl, but these will likely be superceded by p3d functions as the package matures.

In this section I will focus on example code you may use to familiarize yourself with the capabilities of this package. You will require the File:Gray p3d ex Tuition.txt and File:Gray p3d ex USIndicesIndustrialProd.txt data sets. Note that a few of the commands employed in sample code are from rgl, but these will likely be superceded by p3d functions as the package matures.

Initialization code:

library( lattice )
library( nlme )
library( car )
library( spida )
library( rgl )
library( p3d )
tuit = read.table('tuition.Rdata',header=TRUE)
prod = read.table('USIndicesIndustrialProd.Rdata',header=TRUE)

For the tuition data we will begin by plotting the annual cost of tuition from a sample of American Universities against the rates of faculty compensation and proportion of students who graduate.

Using mouse keys you can change the field of view and zoom in and out. Plot3d creates the 3D plot as shown on the right.

We can remove elements from the device using the function Pop3d(). This function removes elements starting with the most recently added item. Multiple items can be removed addition an numeric argument, ie.Pop3d(4)
Gray rgl navigation.png Gray p3d ex1.png

Init3d(cex = .8)
Plot3d(tuition ~ fac_comp + graduat, col = c("blue"), data = tuit)

Next we will subdivide the data by category, in this case whether the school is private (red) or public (blue) (variable name public.private).
Gray p3d ex2.png

Init3d(cex = .8)
Plot3d( tuition ~ fac_comp + graduat|public.private, col = c("blue", "red"), data = tuit)

Next we will add regression planes for private(red) and public(blue) schools using the lm() function to determine the fit, and Fit3d() to insert the plane in the graph. Axes and labels are added using Axes3d() and title3d().
Gray p3d ex3.png

fitpub = lm(tuition ~ fac_comp + graduat,subset=(public.private==0),data = tuit)
Fit3d( fitpub, col = c("blue"))

fitpri = lm(tuition ~ fac_comp + graduat,subset=(public.private==1),data = tuit)
Fit3d( fitpri, col = c("red"))

title3d(main='Tuition predicted by grad rates and faculty salary -private (red) and public(blue) institutions')

Data ellipses are useful for understanding our data.
Gray p3d ex4.png


We can change the view point of our graph using function view3d(theta,phi,fov,zoom), which takes polar coordinates. Note that view3d(0,0,0) will rotate the image to to face the x-z plane (y into the screen) and view3d(270,0,0) will rotate the image to to face the y-z plane (x into the screen). Function snap() will capture a still image of the current view. Note that to use movie3d() you must have ImageMagick installed to automatically convert png's to gif, otherwise you must use external software.

Gray p3d ex5.png GrayMovie.gif


spin(theta = 0, phi = 0)

spins(inc.theta = 1/4, inc.phi = 0, theta = NULL, phi = NULL)

movie3d( spin3d(axis=c(0,1,0), rpm=20), duration=2, dir='movie' )

Here is an additional example, using data on the US indices of industrial products, plotting Mining production (MIN) over months and years. Adding the argument ‘groups=YR’ to Plot3d connects the months in a given year to produce trajectories. Gray p3d ex6.png

open3d(windowRect=c(100,100,800,800),cex = .8)
prod = read.table('USIndicesIndustrialProd.Rdata',header=TRUE)
Plot3d(MIN ~ YR+MONTH,data=prod,groups=YR)
title3d(main='Industrial Production Mining (1947-1993)')

Assignment 2

Statistics in the News: "Spousal support a royal pain?"


whether the article suggest a causal relationship between two variables? If so which? Are the data observational or experimental?


Can you think of alternative explanations to causality? Confounding factors? Or explanations consistent with causality? Mediating factors?


Have any confounding factors been accounted for in the analysis?


Have any mediating factors been controlled for in a way that vitiates a causal interpretation of the relationship?


What is your personal assessment of the evidence for causality in the study that is the subject of the article?

Paradoxes and Fallacies


You are studying observational data on the relationship between Health and Coffee (measured in grams of caffeine consumed per day). Suppose you want to control for a possible confounding factor 'Stress'. In this kind of study it is more important to make sure that you measure coffee consumption accurately than it is to make sure that you measure 'stress' accurately.


In a multiple regression of Y on three predictors, X1, X2 and X3, if the coefficients of both X2 and X3, are not significant, it is safe to drop these two variable and perform a regression on X1 alone.


In a multiple regression, if you drop a predictor whose effect is not significant, the p-values of the other predictors should not change very much.


In a model to assess the effect of a number of treatments on some outcome, we can estimate the difference between the best treatment and the worse treatment by using the difference in the mean outcomes.


If two variables have a strong interaction, this implies a strong correlation.
Personal tools