# MATH 6627 2010-11 Practicum in Statistical Consulting/Students/Carrie Smith

## Contents

This year I began working towards my PhD in Psychology in the Quantitative Methods Area. I completed my Master's degree at York studying depth perception (within the Centre for Vision Research) and my undergraduate degree at the University of Toronto in Engineering Science, Aerospace option. As you can see, I have quite a varied (and some might say strange) educational background. I have been attending SCS meetings since September and would like to consult in the coming academic year.

I have experience working with R, Matlab, SPSS and SAS.

I am also a proud member of Team Gray.

## Sample Exam Questions

### Week 1

The cigarette data discussed in class indicated a positive relationship between smoking and life expectancy. Ample evidence exists that smoking causes life expectancy to decrease, so there is likely more going on. Beyond causality, what are some important alternative possibilities that may complicate interpretation of the results of regression from observational data? (1) There may exist mediating variables (2) There may exist confounding factors (3) The sample employed might not be a good representation of the population, either due to the choice of participants or because the sample is small (4) A linear regression may not be capturing the true pattern (5) The dependent variable may in fact be causing the independent variable (6) Selection bias

### Week 2

How, by sketching a few lines on this graph, may we satisfy ourselves that the correlation between education and prestige is significant? Explain why. (P.S. Thanks for lending me your image, Andy!)

### Week 3

A researcher studying a schizophrenia medication in a clinical population discovers that the dosage is positively correlated with strength of symptoms. She is about to begin a recall because the drug appears to be making patients worse, when it occurs to her that perhaps there is another variable in play which restores the good name of her drug. What might that variable be? How could this variable have this effect (sketch!) and would you describe it as a 'confounding' or 'mediating' variable?

### Week 4

In our lecture on simple regression, we saw an example in which there was a gender gap in salary, and the gap was dissected into two parts, an effect due to pure gender gap and an effect due to occupational segregation. We also saw how measurement error in the score would affect the estimates.

Here is a twist. What if we were dealing with data from men and women in a job that relies strongly on the ability to communicate effectively (assume that all other pertinent variables, such as experience and education, are controlled), and the data looks like this:

Question: How might we characterize the gender gap now?


My interpretation is that some of the true gender gap is 'masked' by the fact that women have a higher mean communication score. In fact, the true gender gap is much larger than the raw gender gap expressed as a difference in the mean salaries.

### Week 5

Continuing from here, how would measurement error in 'communication skills' alter the estimates of the gender gap?


With measurement error the raw gender gap is unaffected, but the estimate of the true gender gap would be too small. We are underestimating the extent to which the women's generally higher communication skills are closing the salary gap.
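A small simulation can make both answers concrete. The sketch below uses made-up numbers (a pay penalty of 10 for women at equal skill, a communication slope of 2, and a one-point advantage in mean communication score for women); none of these values come from the lecture data, and the variable names are hypothetical.

```r
set.seed(1)
n <- 5000
female <- rep(c(0, 1), each = n)

# Assumption: women score one point higher on average in true communication skill
comm_true <- rnorm(2 * n, mean = 5 + female, sd = 1)
# Assumption: a true pay penalty of 10 for women at equal communication skill
salary <- 50 + 2 * comm_true - 10 * female + rnorm(2 * n, sd = 2)
# Observed score = true score + measurement error
comm_obs <- comm_true + rnorm(2 * n, sd = 1)

# Raw gap, gap adjusted for the true score, gap adjusted for the noisy score
raw_gap   <- unname(coef(lm(salary ~ female))["female"])
gap_true  <- unname(coef(lm(salary ~ female + comm_true))["female"])
gap_noisy <- unname(coef(lm(salary ~ female + comm_obs))["female"])

round(c(raw = raw_gap, adjusted_true = gap_true, adjusted_noisy = gap_noisy), 2)
```

Under these assumptions the raw gap understates the penalty, adjusting for the true score recovers it, and adjusting for the error-prone score lands in between, pulled back toward the raw gap.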

### Week 6

When should we use a random effects ANOVA rather than the regular fixed effects ANOVA? In this example code (FEvREanova), 6 Catholic schools are chosen (not randomly, so that the result is consistent, but let's pretend!).

How would you expect the estimated school means (reflected as intercepts) to differ between the random effects ANOVA and fixed effects ANOVA?

FE ANOVA (means in red)

Means from the fixed effects ANOVA are exactly the means of the individual schools.

RE ANOVA (means black crosses)

A “group” effect is random if we can think of the levels we observe in that group as samples from a larger population. In this case we are grouping by school, but the schools themselves are not exhaustive; they are a sample from a population (ok, actually I chose the first 6 Catholic schools, which isn't random, but I wanted the output to be consistent).

Because the model allows for the fact that the schools are a random sample of schools, and the students are random samples within the schools, the estimated school means (which are intercepts) are closer to the grand mean than they are with the ordinary fixed effects ANOVA.
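The shrinkage described above can be reproduced with a minimal sketch. It assumes simulated scores for 6 equally sized schools rather than the Catholic-school data behind FEvREanova, and it uses `lme` from the `nlme` package; the object names are hypothetical.

```r
library(nlme)  # recommended package distributed with R

set.seed(2)
n_school <- 6
n_per    <- 10

# Assumption: simulated scores, not the school data used in class
school_effect <- rnorm(n_school, mean = 0, sd = 4)
d <- data.frame(
  school = factor(rep(seq_len(n_school), each = n_per)),
  y      = 12 + rep(school_effect, each = n_per) +
           rnorm(n_school * n_per, sd = 5)
)

# Fixed effects ANOVA: one free mean per school (the raw school means)
fe_means <- coef(lm(y ~ 0 + school, data = d))

# Random effects ANOVA: school intercepts treated as draws from a population
re_fit   <- lme(y ~ 1, random = ~ 1 | school, data = d)
re_means <- coef(re_fit)[["(Intercept)"]]

grand <- mean(d$y)
round(cbind(fixed = fe_means, random = re_means, grand_mean = grand), 2)
```

In the printed table each random effects mean should sit between the corresponding raw school mean and the grand mean. How far it moves depends on the estimated between-school and within-school variances.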

### Week 7

| | Ecological - βB | Pooled - βP | HLM - β? | HLM+CV - βW | FE Model - βW | MANOVA - βW |
|---|---|---|---|---|---|---|
| Catholic | 5.83 (sd = 1.642) | 2.33 (sd = 0.372) | 0.62 (sd = 0.463) | 0.52 (sd = 0.470) | 0.66 (sd = 0.446) | 0.66 (sd = 0.531) |
| Public | 5.25 (sd = 1.644) | 3.59 (sd = 0.395) | 3.01 (sd = 0.480) | 2.81 (sd = 0.483) | 2.87 (sd = 0.453) | 2.87 (sd = 0.531) |

Context Effect: 3.55

| | Ecological - βB | Pooled - βP | HLM - β? | HLM+CV - βW | FE Model - βW | MANOVA - βW |
|---|---|---|---|---|---|---|
| Catholic | 7.74 (sd = 4.309) | 3.13 (sd = 0.482) | 0.55 (sd = 0.526) | 0.51 (sd = 0.536) | 0.66 (sd = 0.446) | 0.66 (sd = 0.531) |
| Public | 10.19 (sd = 4.314) | 5.02 (sd = 0.511) | 2.93 (sd = 0.536) | 2.87 (sd = 0.546) | 2.87 (sd = 0.453) | 2.87 (sd = 0.531) |

Context Effect: 8.49

## Statistics in the Media

### Week 1

Researchers from Newcastle University conducted a study in which signs with pictures of staring eyes were posted in a busy little cafeteria to see whether the images would encourage patrons to act responsibly and tidy up after themselves. I came across this study via an article titled "Fake Watchful Eyes Discourage Naughty Behavior" on Wired Magazine's website.

The Wired article reports that, "The number of people who paid attention to the sign, and cleaned up after their meal, doubled when confronted with a pair of gazing peepers." Doubled, huh? Would that be 1 in 100 people to 2? 40% to nearly 80%? So I went to the source, an article by Ernest-Jones, Nettle & Bateson (2010). As it turns out the sample size was reasonable, and I think the effect size in terms of proportions is quite respectable (see figure below), but Wired did make a mistake in reading the graph. The proportion of individuals who left litter decreased from .4 to .2, thus the proportion of individuals who cleaned up after themselves increased from .6 to .8. In other words, under the watchful eye of the creepy posters, cafeteria patrons increased their pro-social behaviour by a factor of about 1.3.

Though this error is hardly earth-shattering, it is an example of how results can easily be misinterpreted and misrepresented in public forums.
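The arithmetic behind that correction takes only a few lines to check. The sketch below uses just the two proportions quoted above; the variable names are mine.

```r
# Proportions of patrons leaving litter, as read from the figure in the paper
litter_before <- 0.4
litter_after  <- 0.2

# Complementary proportions of patrons who cleaned up
tidy_before <- 1 - litter_before
tidy_after  <- 1 - litter_after

litter_ratio <- litter_after / litter_before  # littering relative to baseline
tidy_ratio   <- tidy_after / tidy_before      # tidying relative to baseline

c(litter_ratio = litter_ratio, tidy_ratio = tidy_ratio)
```

Halving the littering rate is not the same as doubling the tidying rate, because the two ratios are taken on different baselines.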

### Week 2

A blog written by a well-respected author on Business Insider reports on a study of Twitter usage. The business news blogger concluded that Twitter is not a viable marketing tool, because half of Twitter users never read anyone else's Tweets ("THE TRUTH ABOUT TWITTER: Half Of Twitter Users Never Listen To A Word Anyone Else Says").

However, according to commentary on other sites (such as "Lies Damned Lies and Statistics"), it would seem that the researchers who conducted the study and the Business Insider blogger who reported on it failed to account for a very important detail: it is estimated that 40%-60% of Twitter accounts are abandoned. So, of course many people never read anybody's Tweets, because they don't use Twitter!!

### Week 3

[Zyzmor’s Revenge?] This short article summarizes some cute findings regarding the alphabetical position of surnames. One study showed that researchers with alphabetically early surnames are more likely to gain tenure at a top university, become a fellow of the Econometric Society (it was a study by economists), and even win the Nobel Prize. This effect is explained by the fact that in many areas authors are listed alphabetically by last name, so authors with early surnames are likely to have more citations, since many people (incorrectly) cite papers as Smith et al., 2000.

On the flip-side, people with late surnames also had to wait longer in lines at school, and as a result are slower (and presumably more thoughtful) at making buying decisions because they weren't rushed like the kids with early last names. I think causality there is pretty thin, and I wonder just how tiny this effect size is!!

### Week 4

CBC reports on a study finding that video game play is associated with anxiety and depression in youth (Kids' excess video gaming tied to anxiety). Researchers used latent growth mixture modeling, and concluded that pathological video game play causes depression and anxiety. This counters the conventional assumption that youth who are depressed and/or anxious 'retreat' into game play to avoid their feelings. I look forward to learning more about this methodology in the future! The study was published in the January 2011 issue of Pediatrics (Pathological Video Game Use Among Youths: A Two-Year Longitudinal Study).

### Week 5

One of my favourite sources of fun statistical stories is the blog written by the creators of the free online dating site OkCupid. Because the site has 3.5 million active members who fill in 'quizzes', ostensibly for the purpose of improving dating matches, it generates massive data sets. I am a little suspicious of the true intentions behind these quizzes, because the creators of the site were statistics students at Harvard, and I get the impression that they are as driven by collecting fun and interesting data about people as they are by making love matches!

With that said, this week's blog post is titled The Best Questions For A First Date. In this post they report correlations between quiz answers and self-reported demographic information so that you can find out more about your date by asking seemingly innocuous questions. My favourite result was that if you want to know: "Is my date religious?", you could ask: "Do spelling and grammar mistakes annoy you?" According to the author: "If your date answers 'no'—i.e. is okay with bad grammar and spelling — the odds of him or her being at least moderately religious is slightly better than 2:1."

### Week 6

"Pot can double psychosis risk, new study says": This article reported on a longitudinal study that followed nearly 2000 young adults aged 14-24 years. The researchers found that "incident" cannabis use (i.e. starting use of cannabis after the start of the study) almost doubled the risk of new psychotic symptoms. The researchers did account for a long list of factors, including age, sex, socioeconomic status, use of other drugs and other psychiatric problems. They did acknowledge that "it not yet clear whether the link between cannabis and psychosis is causal, or whether it is because people with psychosis use cannabis to self-medicate to calm their symptoms."

I feel that an important variable that might have improved the study is family history of psychiatric illness, as there have been many studies suggesting genetic links. If there were no relationship between family history and cannabis use, or if the relationship between cannabis use and psychosis at least persisted while controlling for family history, I think one could make a stronger argument against the possibility that youth who begin smoking cannabis are simply self-medicating against early sub-clinical symptoms of psychosis and other mental illness.

### Week 7

"Say what? Two-thirds of seniors have hearing loss": I didn't find it overly surprising that the incidence of hearing loss among older adults is extremely high. What I did appreciate about the report is that the purpose of the study was to inform policy on screening for hearing loss and to contrast this statistic with the fact that most US insurance companies do not cover the cost of hearing aids, despite the fact that these devices are likely prohibitively expensive for many on fixed incomes. I also liked that they pointed out that the total number of people with hearing loss is going up, but that this is driven primarily by the fact that life-spans are increasing, which is an important note that can be omitted to make a statistic seem more sensational!

### Week 8

I would like to preface my post today with the statement that, regardless of politics, individuals on the extremes are sometimes wont to misrepresent information in a way that sways public opinion in their favour.

In this post on the National Post website ("Opinion: Abortion statistics show reality of a land without restrictions"), Pro-Life activist Anastasia Bowles presents some very sensational statistics on abortion rates in Canada. For example: "Even more disturbing, almost one fifth of teens aged 15-19 said they had already had at least one abortion."

Fortunately, the site allowed the authors of the study to which Ms. Bowles was referring to present their side (Counterpoint: Ontario abortion statistics show less dire situation than reported). It turns out that Ms. Bowles failed to mention an important conditional statement regarding the statistics she discussed in her article. Ontario-wide, "well under 1% of Ontario’s young women in that age group report having had a previous abortion". According to the researchers, of teens aged 15-19 *who had an abortion in hospital in 2007*, 18% reported having had at least one abortion previously, for a total of 0.4% of women in this age group. To reiterate, less than 1% of women 15-19 have had at least one abortion, and 0.4% report having more than one.

The authors of the study present quite a different picture. But what I wonder is how many of Ms. Bowles' followers would seek to confirm her numbers?

### Week 9

Today's Metro free paper reports on a study in which it was found that there continues to be a pay discrepancy in Canada between individuals who are visible minorities and those who are not (Race plays big role in pay: Study). The study was based on 2006 long-form census data.

The study reports that male first-generation immigrants of visible minority earn just 68.7 per cent of what white males earn. The article was not specific about which control variables were included in the analysis, and in this context it appears to use the term "first-generation immigrants" to refer to individuals born abroad.

What I thought was very nice is that they also reported on second-generation Canadians, meaning individuals who were born in Canada to immigrant parents. I am supposing that second-generation Canadians would have comparable English language skills, which makes the comparison cleaner. Male second-generation Canadians earn 75.6 cents for every dollar white men earn (presumably white second-generation men), and they specified this was controlling for education and age. I feel that this is a very meaningful and powerful statistic. On the upside, this gap has decreased by almost 7 cents since 2000, so at least there is a trend toward pay equity.

### Week 10

Researchers report, "our main finding was that people with a high frequency of religious participation in young adulthood were 50 percent more likely to become obese by middle age than those with no religious participation in young adulthood." They did control for age, race, gender, education, income, and baseline body mass index.

I also enjoyed this quote: “We didn’t look specifically at the potluck factor, but anecdotally, we know that oftentimes at these religious gatherings people will eat traditional comfort foods which are often high in fat and calories and salt,” says Feinstein. “But, again, that’s not something we looked at in this particular study.”

## Questions and Comments on Groupwork and Class Lectures

### Week 1

This is an oft-cited problem in Psychology research, but still an interesting one. A large portion of the studies coming out of university Psychology departments exclusively use undergraduate students as participants. Often, these are students who are required to participate to receive some credit for their courses. How much of what we think we 'know' about psychology might not actually apply to the general population??

Comment/Answer: - A whole lot!! Let's also consider the fact that these psychology students are primarily North American and Caucasian! That makes the results of the vast majority of psychological studies even less generalizable! - Constance

### Week 2

As a few other members of the course have already commented, it is becoming increasingly clear from the course that consulting is quite a bit more involved than it might at first seem. Taking the time to step away from the problem and assess the basic elements of the analysis at hand couldn't be more important. I know I have rushed head first into experiments and/or analysis before really sitting down to get a clear idea of my objectives, and wasted a lot of time in the process!!

### Week 3

I would like to do something in R similar to what Team Rubin showed with the lattice package, but also want the fit summaries for the sub-plots.

This code splits the continuous X2 variable into 3 'shingles' and plots: `X2group <- equal.count(data$X2, number=3, overlap=0)` followed by `xyplot(Y ~ X1 | X2group, data=data)`

But this doesn't work: `fit = lm(Y ~ X1 | X2group, data=data)` followed by `summary( fit )`

I would also love to have the data presented in one Y vs X1 plot, with data corresponding to the 3 levels (the categorization of a continuous var) of X2 in different colours.

Any help?

Hi Carrie,

You can try `fit<-lm(Y~X1 | X2group, data=data)` instead of `fit = lm(Y ~ X1 | X2group, data=data)` and then try the `summary(fit)` statement again. Hopefully that works!

--Lawarren 23:56, 25 January 2011 (EST)

Thanks for the suggestion, but unfortunately it didn't fix the problem. It doesn't throw an error, but gives 'NA' as the Estimate, Std. Error, t-value and p-value! I had to write some ugly unwieldy code to get what I wanted, and I'm sure there must be a better way...
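For anyone landing on this thread with the same question, here is one possible approach in base R. It is a sketch under assumptions: the data are simulated stand-ins for the `data` frame above (columns `Y`, `X1`, `X2`), and the three groups are a plain factor built with `cut()` rather than a lattice shingle. The `|` conditioning syntax belongs to lattice formulas; `lm()` does not interpret it that way, so the fits are done group by group.

```r
set.seed(3)
# Assumption: simulated stand-in for the `data` frame with columns Y, X1, X2
data <- data.frame(X1 = rnorm(150), X2 = runif(150))
data$Y <- 1 + (1 + 2 * data$X2) * data$X1 + rnorm(150, sd = 0.5)

# Cut X2 into three equal-count groups
breaks <- quantile(data$X2, probs = c(0, 1/3, 2/3, 1))
data$X2group <- cut(data$X2, breaks = breaks, include.lowest = TRUE,
                    labels = c("low", "mid", "high"))

# One regression per group
fits <- lapply(split(data, data$X2group), function(d) lm(Y ~ X1, data = d))
lapply(fits, summary)    # fit summary for each sub-plot
t(sapply(fits, coef))    # intercepts and slopes side by side

# Single Y vs X1 plot, coloured by X2 group, with the three fitted lines
cols <- c(low = "blue", mid = "darkgreen", high = "red")
plot(Y ~ X1, data = data, col = cols[as.character(data$X2group)], pch = 16)
for (g in names(fits)) abline(fits[[g]], col = cols[g])
legend("topleft", legend = names(cols), col = cols, pch = 16, title = "X2 group")
```

A single model, `lm(Y ~ X1 * X2group, data = data)`, gives the same three fitted lines with a pooled residual variance, which may be preferable when the groups are small.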

### Week 4

In the slides for Visualizing Multiple Regression, in the section on the data ellipse for predictors (page 30), the general predictor has a notation I'm not familiar with and missed in lecture. What does the plus sign in a circle mean? Thanks!

Andy: As a logical operation, the circled plus sign is called "exclusive or". Wiki link is here. In the notes, however, it is specially defined in the summary on page 88.

### Week 5

I understand why we need to adjust our confidence ellipse to be larger for the joint prediction of the coefficients. I don't quite understand why we cast the shadows for the confidence intervals for individual coefficients from the unadjusted ellipse. For example, when we conduct a series of pairwise t-tests, we make an adjustment to control the familywise Type I error rate. Why don't we do the same for a 'family' of coefficients?
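Comparing the two scaling factors directly may help frame the question. The sketch below assumes two coefficients and a hypothetical residual degrees of freedom (`df_resid`); it computes the radius multiplier for the joint 95% ellipse and the multiplier for the smaller ellipse whose shadows are the ordinary one-at-a-time 95% intervals.

```r
df_resid <- 50   # hypothetical residual degrees of freedom
level    <- 0.95

# Radius multiplier for the joint 95% confidence ellipse of two coefficients
r_joint <- sqrt(2 * qf(level, df1 = 2, df2 = df_resid))

# Radius multiplier whose shadows on each axis are the usual individual
# 95% t intervals (note that qt(...)^2 equals qf(level, 1, df_resid))
r_single <- qt(1 - (1 - level) / 2, df = df_resid)

c(joint = r_joint, single = r_single)
```

The shadows of the larger ellipse are Scheffé-type intervals that hold simultaneously for the whole family. The shadows of the smaller one each have 95% coverage only when taken one at a time.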

### Week 6 (sorta)

Does anyone have expertise in converting wide-form data to long-form in R that they would like to share?

### Week 7

I understand why adding an additive contextual variable can be helpful, in that it lets us differentiate between the between-school and within-school effects: `fitc <- lme( mathach ~ ses * Sector + cvar(ses,id), dd, random = ~ 1 + ses | id )`

What about higher order terms involving ses.mean? Lab1 seems to jump straight to an additive effect of ses.mean (though maybe not, I have gone through it and then written my own code so many times I forget what happens where).

cvar(ses,id):Sector would, I presume, allow the between school effect to be different for the two Sectors. This is convenient as Sector is binary.

How about ses.mean:ses or ses.mean:ses:Sector? I would interpret these to mean that the effect of student ses varies with school ses (again across Sectors for the 3rd order term).

Is there a reason not to check the 3-way interaction (which actually is very nearly significant) other than parsimony and interpretation?
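One way to weigh the higher-order terms is to fit both models and compare them with a likelihood ratio test. The sketch below is only illustrative. It simulates data in place of the course data set `dd`. It builds `ses.mean` with base R `ave()`, which I am assuming plays the same role as `cvar(ses, id)`. It also uses random intercepts only, where the model above has a random slope for ses as well.

```r
library(nlme)

set.seed(4)
n_school <- 40
n_per    <- 25

# Assumption: simulated data standing in for the course data set `dd`
dd <- data.frame(
  id     = factor(rep(seq_len(n_school), each = n_per)),
  Sector = factor(rep(rep(c("Catholic", "Public"), length.out = n_school),
                      each = n_per)),
  ses    = rep(rnorm(n_school, sd = 0.5), each = n_per) +
           rnorm(n_school * n_per, sd = 0.7)
)
# School mean of ses (assumed equivalent to cvar(ses, id))
dd$ses.mean <- ave(dd$ses, dd$id)
dd$mathach  <- 12 + 2 * dd$ses + 3 * dd$ses.mean +
  rep(rnorm(n_school, sd = 1.5), each = n_per) + rnorm(nrow(dd), sd = 6)

# Additive contextual effect versus all interactions up to the three-way term.
# ML rather than REML so that models with different fixed effects are comparable.
fit_add  <- lme(mathach ~ ses * Sector + ses.mean, data = dd,
                random = ~ 1 | id, method = "ML")
fit_full <- lme(mathach ~ ses * Sector * ses.mean, data = dd,
                random = ~ 1 | id, method = "ML")

anova(fit_add, fit_full)   # likelihood ratio test for the added terms
```

The simulated outcome contains no interactions, so the test here only demonstrates the mechanics; on the real data the same comparison would speak to the three-way term directly.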

### Week 8

In Lab1.R, while comparing random effects models, this line of code shows up: `cond( getVarCov( fit.sm ) ) # High but not astronomical`

The last part of the output appears to be the correlation matrix for the random effects. What is the rest of the output? And what is "high but not astronomical" about?
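The comment suggests that `cond` reports a condition number for the estimated covariance matrix of the random effects. That reading is an assumption, since the function's definition is not shown here. If it is right, the idea can be sketched in base R with a made-up 2 x 2 covariance matrix (the values in `V` are hypothetical):

```r
# Hypothetical covariance matrix for a random intercept and a random slope
V <- matrix(c(4.0, 1.9,
              1.9, 1.0), nrow = 2, byrow = TRUE)

# Condition number: ratio of the largest to the smallest eigenvalue
ev <- eigen(V, symmetric = TRUE)$values
cond_number <- max(ev) / min(ev)
cond_number

kappa(V, exact = TRUE)   # base R computes the same 2-norm condition number
cov2cor(V)               # the correlation matrix implied by V
```

A large condition number means the matrix is close to singular, which typically goes with a random effects correlation near 1 or -1 and a model that is hard to estimate.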

### Week 9

Example code to convert wide-form data into long-form data: ExCode
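As a separate illustration (not the linked code), base R's `reshape()` can do this conversion. The sketch below uses a made-up data frame with one row per subject and three repeated measures; the column names are hypothetical.

```r
# Hypothetical wide-form data: one row per subject, one column per occasion
wide <- data.frame(
  id  = 1:3,
  sex = c("F", "M", "F"),
  y.1 = c(5, 6, 7),
  y.2 = c(6, 7, 9),
  y.3 = c(8, 8, 10)
)

# Stack the y.* columns into a single `y` column indexed by `time`
long <- reshape(wide, direction = "long",
                varying = c("y.1", "y.2", "y.3"),
                v.names = "y", timevar = "time", times = 1:3,
                idvar = "id")

# Sort so that each subject's occasions are together
long <- long[order(long$id, long$time), ]
long
```

Time-invariant columns such as `sex` are repeated on every row for that subject, which is the layout `lme` expects for longitudinal models.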

### Week 10

Want to put the longitudinal analysis techniques you learned in this class to a real test? Heritage Health and Kaggle are offering a *\$3.2 Million* prize "to build a statistical model to predict the number of days a person is likely to spend in hospital over the next year, based on (anonymized) factors such as demographics, medical visits and treatments, and other factors" [\$3.2M in prizes for predicting hospitalization]. I will be accepting a 10% finder's fee if any of you win!!  ;)