# MATH 6627 2011-12 Practicum in Statistical Consulting/Students/JaneDoe


Hi. My name is Jane Doe. Welcome to my wiki page for the Statistical Consulting Practicum. A few facts about me: I have a reasonable amount of experience in applied statistics in the social sciences. I have been working as a sociologist for the past decade or so and want to become skilled at generating and interpreting my own data. I have experience with SAS, SPSS, Mathematica and Minitab, although I am kinda rusty at lots of it. I have recent experience with designing and analyzing surveys and with regression. Jane Doe is a fictional student and any resemblance to a real person is purely coincidental!

## Sample Exam Questions

Note that about half of the questions should be suitable for an in-class exam and the other half for a take home exam.

### Week 1

• (in-class exam) We will call a "square root" of a square matrix M any square matrix A such that M = AA'. Show that a matrix has a square root if and only if it is a variance matrix.
• Answer to follow
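As a numerical illustration of the Week 1 question (not a proof), the following sketch checks both directions on simulated data. The variable names and the use of the Cholesky factor as one possible "square root" are assumptions made for the example; other factors, such as one built from an eigendecomposition, would work equally well.

```r
set.seed(1)

# Direction 1: a variance matrix has a square root.
X <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)  # 50 simulated observations on 3 variables
V <- var(X)                                      # sample variance matrix
A <- t(chol(V))                                  # chol() returns upper-triangular R with R'R = V, so A = R'
max_err <- max(abs(A %*% t(A) - V))              # should be zero up to rounding error

# Direction 2: any M = BB' is symmetric and non-negative definite,
# because x'Mx = |B'x|^2 >= 0 for every x.
B <- matrix(rnorm(9), nrow = 3, ncol = 3)
M <- B %*% t(B)
eig <- eigen(M, symmetric = TRUE)$values         # all non-negative up to rounding error
```

Here `max_err` measures how closely AA' reproduces V, and `eig` holds the eigenvalues of a matrix built as BB'.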

### Week 2

• (take-home exam) Generate 100 observations for three variables Y, X and Z so that in the separate simple regressions of Y on each of X and Z neither regression coefficient is significant (at the 5% level) but a test of the hypothesis that both coefficients are 0 in a multiple regression of Y on both X and Z is rejected at the 5% level. Explain your strategy in generating the data. How should the data be generated to produce the required result? Show a data ellipse for X and Z and appropriate confidence ellipses for their two regression coefficients. Explain the relationship between the ellipses and the phenomenon exhibited in this problem. What does this example illustrate about the appropriateness of forward stepwise regression to identify a suitable model to predict Y using both X and Z?
• Answer will follow
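One possible data-generating strategy for the Week 2 question is sketched below; it is not the official answer. The idea is to make X and Z strongly negatively correlated and let Y depend on their sum, so that each predictor alone carries little information about Y while the pair carries a lot. The correlation of -0.95, the error scale and the seed are arbitrary choices for illustration. The columns are orthogonalized so that the sample moments hold exactly rather than only on average.

```r
set.seed(6627)
n <- 100

# Three columns with sample mean 0, sample variance 1 and sample correlation 0
raw <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
Q <- qr.Q(qr(raw)) * sqrt(n - 1)

r <- -0.95                                  # assumed sample correlation between X and Z
X <- Q[, 1]
Z <- r * Q[, 1] + sqrt(1 - r^2) * Q[, 2]    # cor(X, Z) equals r in the sample
Y <- X + Z + sqrt(0.1) * Q[, 3]             # Y depends on X + Z, plus orthogonal noise

# Simple regressions: p-value of each slope
p_x <- summary(lm(Y ~ X))$coefficients["X", "Pr(>|t|)"]
p_z <- summary(lm(Y ~ Z))$coefficients["Z", "Pr(>|t|)"]

# Multiple regression: overall F test that both slopes are 0
fit <- lm(Y ~ X + Z)
f <- summary(fit)$fstatistic
p_joint <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

c(p_x = p_x, p_z = p_z, p_joint = p_joint)
```

With these choices each simple-regression slope is far from significant while the joint test is strongly rejected. The data ellipse and confidence ellipse requested in the question could then be drawn from X, Z and `fit` with the ellipse functions in car.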

## Statistics in the Media

### Week 1

In the Globe and Mail newspaper on Thursday, Margaret Wente raised some questions concerning various measures of the “gender gap” in Canada. You’ll find the full article here. Ms. Wente seems to accuse various labour reports and feminist activists of promoting their agenda with “a heaping cup of statistical abuse.” It is usually fascinating to watch people in extreme ideological and political positions attempt to harness statistics to make their point. In this case, it is not really a case of tampering with the numbers themselves – both sides are using nearly the exact same numbers to make two very different arguments.

One of the ways Ms. Wente believes the statistics are being abused is the reporting that the gender wage gap in Canada is currently 70.5 cents, meaning that for every dollar a male worker earns, a female worker earns 70.5 cents.

Ms. Wente goes on to demonstrate how she thinks this statement is statistical abuse: “Take the gender wage gap. To arrive at 70.5 cents, the report compares full-time annual wages between men and women. What it doesn’t mention is that men work more hours in a year than women do. Once you adjust for that, the gap narrows to 84 cents. And when you adjust for work experience and women’s preference for jobs in the public sector and social services, the gap shrinks to 93 cents.”

Juxtapose this with a recent article by leading feminist researchers on the economy. Dr. Sylvia Fuller of UBC and Dr. Leah Vosko, the Canada Research Chair in Feminist Political Economy, have recently published an article that arrives at very similar statistics. The full article is available here, for a fee or with a subscription.

The difference is the lens used to interpret these statistics. While Ms. Wente believes that the narrowing of the wage gap as you control for different variables demonstrates that the gap is not really as big as it might initially appear, Drs. Fuller and Vosko see the fact that accounting for job characteristics decreases the gap as an indication that women are experiencing discrimination and less choice in those areas as well.

For example, there is currently no dispute that the gender wage gap narrows when you control for the fact that more women work in public sector and social service jobs. But does this mean that women really get paid nearly the same as men but simply prefer to work in sectors and industries that pay less to everyone overall? Or does it mean that not only do women get paid less than men overall, but they are also confined to industries and sectors that pay less?

### Week 2

I found an amusing cartoon about correlation and causality. Let your mouse hover over the cartoon and an interesting message appears. It seems that the cartoonist has thought fairly deeply about correlation and causality.

## Questions and Comments on Groupwork and Class Lectures

### Week 1

• Q: I have not been able to figure out the code to create the graph that we looked at in class that we used to look at the pay equity vs. employment equity issue. I am working on the assignment and want to create one for the question on police discrimination. I have found the functions in car that create data ellipses and confidence ellipses, and I have the R code from your visualizing regression notebook, but I just can't get it to work with the categorical variables. Is this possible? Issues with categorical variables seem to come up for me with a very high frequency!

A: Categorical variables can be tricky in R. We'll probably revisit this during the course. Here's a straightforward way. I'll use the Prestige data frame in library(car) as an example. The variable type has three non-missing levels.

library(car)
data(Prestige)
dd = Prestige

cols = c('red','green','blue')   # the colours you'd like to use for the three types

The line farther below, where we plot income vs education, shows the power and the danger of factors in R. The variable 'type' is a factor. When you print it with

dd$type

it looks like a character variable. But it's 'really' integers that index into a vector of 'levels'. To see the 'real' dd$type, you use:

unclass(dd$type)

The following line uses the two personalities of 'type'. When 'type' is used to index 'cols', it acts like an integer. The first character of 'type' is used as a plotting character (pch):

plot( income ~ education , dd, pch = as.character(type), col = cols[type])
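The two personalities of a factor can be seen on a small made-up example. The vector `type` below is invented for illustration and is not the Prestige data; only base R is used.

```r
# A toy factor standing in for dd$type (values invented for illustration)
type <- factor(c("prof", "bc", "wc", "bc"))
cols <- c("red", "green", "blue")   # one colour per level, in level order

levels(type)                        # the vector of levels, sorted alphabetically
codes <- as.integer(type)           # the integer codes that index into the levels
point_cols <- cols[type]            # as an index, the factor behaves like its integer codes
point_chars <- substring(as.character(type), 1, 1)  # as text, it gives the plotting letters
```

Indexing `cols` by the factor picks a colour by level position, not by the order in which the values appear in the data, which is why the colours in `cols` line up with `levels(type)`.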

There are more 'traditional' ways of doing the following, but here's an 'Rish' approach illustrating the use of the functions 'split' and 'lapply'. It also uses the 'dell' function in 'fun.R', but this could be done with John's functions in car:

source( 'http://www.math.yorku.ca/~georges/R/fun.R')

lapply( split( dd, dd$type), function(x) with( x, lines(dell( education, income), col = cols[type], lwd=2)))

Finally, let people know what they're looking at:

legend( locator(1), levels( dd$type) , col = cols, pch = substring( levels(dd$type), 1,1))

which produces a plot of income against education, with a plotting letter, a colour and an ellipse for each type.

### Week 2

• Q2: What is the difference between running a Generalized Mixed Model Regression and a Repeated Measures ANOVA?

Partial A...Please Help! Using a REPEATED statement with missing occasion

This is an example of an answer or comment provided by someone else: The REPEATED statement defines a model for the Σ matrix. The default choice, Σ = σ²I, does not require the same number of occasions for each subject. More complex choices, e.g. UN (unstructured), FA0(T), AR(1), etc., are not inherently defined if the number of occasions changes from subject to subject. One can think of a Σ matrix defined for the largest number of occasions. For subjects who do not have the maximum number of occasions, it is necessary to determine how the recorded occasions correspond to the rows of the Σ matrix. For example, if the Σ matrix is 5 x 5 and a subject with incomplete data has data for only 3 occasions, it is necessary to know whether these three occasions correspond to the 1st, 2nd and 3rd rows of Σ, or the 1st, 3rd and 4th, etc. --Georges 11:22, 3 January 2011 (EST)

Note: if a response or comment comes from someone else, it is a good idea to sign it. Just click on the 'signature' icon above the editing window.
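The point about matching recorded occasions to rows of Σ can be illustrated in R with a small sketch. The AR(1) structure is one of the choices named in the answer; the values of `rho` and `sigma2`, and the pattern of observed occasions, are made up for the example.

```r
# Hypothetical AR(1) covariance matrix for the maximum of 5 occasions
rho <- 0.6       # assumed autocorrelation (illustrative value)
sigma2 <- 2      # assumed variance (illustrative value)
occ <- 1:5
Sigma <- sigma2 * rho^abs(outer(occ, occ, "-"))

# A subject observed on only 3 occasions: which rows of Sigma apply?
Sigma_134 <- Sigma[c(1, 3, 4), c(1, 3, 4)]   # observed at occasions 1, 3 and 4
Sigma_123 <- Sigma[1:3, 1:3]                 # observed at occasions 1, 2 and 3

Sigma_134[1, 2]   # covariance across a gap of two occasions: sigma2 * rho^2
Sigma_123[1, 2]   # covariance between adjacent occasions:    sigma2 * rho
```

The two 3 x 3 matrices differ, which is why the software needs to know which occasions were actually recorded and cannot simply count them.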