# MATH 6627 2008-09 Practicum in Statistical Consulting


NEWS
• NEW Starting Wednesday, March 18 2009, we will meet in Bethune College 202 from 4 pm to 7 pm.

## General Information

### Meetings

The class will meet every second week on Wednesdays from 7:00 pm to 10:00 pm in Vari Hall 1016. Consult the schedule below for exact dates.

### Goals

As undergraduates we learn statistics through a sequence of courses each focusing on some part of statistical theory. When we solve problems in these courses the tools we are expected to use are obvious. When you have to solve real-world statistical problems, it is rare that there are clear clues about the correct theory or method that needs to be used.

In fact, many problems are best handled with eclectic solutions borrowing from many statistical fields. The goal of this course is to help you develop the skills and confidence to solve real-world problems. You will learn about the key role of many statistical concepts that are rarely seen in standard courses. You will also learn the vital role of visualization and graphics, communication (listening even more than talking) and presentation skills.

The course will help you develop skills in a number of areas:

1. programming and data management skills in R: Although the emphasis in this course is entirely on R, many jobs expect a strong knowledge of SAS -- take every opportunity you can to also learn SAS. Consider, if you are a beginner, the courses offered through the Statistical Consulting Service
2. graphics to visualize data and models
3. how to work as a statistical consultant/collaborator in the analysis of scientific problems
4. developing presentation skills
5. developing an understanding of the role of statistics as a discipline and as a profession in science and business
6. understanding ethical issues related to statistical practice

### Course Work

• In the first term the work for the course consists primarily of assignments done individually and posted on the 'private wiki' with help from your group.
• Since almost all the interesting consulting problems I have seen in recent years have required an understanding of multilevel models, which are a natural extension of traditional linear models, we will develop ideas and concepts for the application of linear and multilevel models to real research problems by working through the course text. The text uses many real data sets and presents methods of analysis in R.
• In the second term we continue working through the text. In addition, you will work on a major consultation project in which you will collaborate with a real client to produce a deep and probing consulting report. The project is very likely to involve multilevel models. Students who are interested may opt to work on a Statistical Society of Canada (SSC) case study for presentation at the SSC meeting in the spring of 2009. The Case Study team may include students who are not currently enrolled in MATH 6627.
• You will also attend some real statistical consultations and prepare brief reports which will count towards your assignment grade.
• Another important part of the course work is your contribution to the 'public wiki'. In particular we will develop two types of information on the wiki
• how to's in R: these are brief articles describing how to do something simple in R, either a graph, an analysis or a type of data manipulation.
• Paradoxes and fallacies in statistics: As your knowledge of statistics becomes deeper you abandon many simple suppositions and replace them with more sophisticated ones. An important part of communication between statisticians and clients -- for that matter between statisticians and the public or between statisticians and students -- involves understanding simple, often fallacious, suppositions and how they can lead to a deeper understanding. We will develop wiki pages to discuss important paradoxes and fallacies.

Assignments (40%)
Consulting Project (40%)
The grade will be based on the appropriateness and quality of the statistical analyses, the quality of exposition, and the quality of presentation.
Contributions to the public wiki (10%)
Class participation (10%)
This is based on the preparation of questions on the assigned readings (to be posted on the wiki), attendance, preparation for class and participation in class discussions.

### Class list and teams

Class list
Number Family name Given name e-mail Week 1 Status
1 Boulsaien Khaled bouls801@yorku.ca A
3 Chane Parminder pchane@yorku.ca C
4 Gao Isabel isabelg@mathstat.yorku.ca D
5 Khan Sabria sabria@yorku.ca A
6 Leeza Nusrat nsleeza@yorku.ca B
7 Liao Yang liaoyang@yorku.ca C
X Meschian Mehran mmeschia@yorku.ca D dropped
9 Nabipoor Sanjebad Majid masaba@yorku.ca A
10 Palma Luis luispal@yorku.ca B inactive
11 Shakya Sulin sshakya@yorku.ca C
12 Shi Xiaoping xpshi@yorku.ca D
13 Xu Hong hongxu@yorku.ca A
14 Pope Chris u843603@mathstat.yorku.ca C
8 Faroque Shahela TBD D dropped

## Week 1: September 10, 2008

Topics
Course organization
Participation in SCS seminars
You are welcome to attend SCS (Statistical Consulting Service) weekly meetings which consist of bi-weekly 'staff meetings' and bi-weekly seminars on a statistical topic of interest to statistical consultants. The exact topic for this year will be determined in two weeks. Meetings take place every Friday at 2:30 in TEL 5082. Please send an e-mail message to Georges Monette to have your name added to the SCS mailing list. Note that SCS also offers short courses some of which might be of interest to you.
Consulting, communication, writing reports
Statistical consulting environment
Writing reports: Secret of good writing: write so your reader understands you!
Notes on writing reports
Seven basic principles
Not all consulting activities require a formal report. Often a phone call, a verbal report in a face-to-face meeting, a letter or a memo is the most efficient way of communicating with a client.
Communication:
Interpersonal aspects of statistical consulting: Janice Derr, Statistical Consulting Video
Contributions by Doug Zahn
The role of statistics in society -- understanding evidence
One of the greatest challenges in understanding evidence is bridging the gap between observational data and causal inference, i.e. understanding the links between statistical significance and statistical meaning.
Statistics in the news: Lies or Statistics
Smoking: Observational vs. Experimental data: rough notes

The Fundamental Contingency Table of Statistics

| Type of inference | Type of data: Experimental | Type of data: Observational |
|---|---|---|
| Causal | Where Fisher would like us to be | Where we often are |
| Predictive | Very rare but problematic | Good for 'prediction' not 'causal inference': this is the topic of Frank Harrell's 'Regression Modeling Strategies' |

Finding meaning in observational data -- examples
Hans Rosling: Myths about the developing world
Al Gore: An Inconvenient Truth
Peter Donnelly: How juries get fooled by statistics
Statistician Peter Donnelly explores the common mistakes humans make in interpreting statistics, and the devastating impact these errors can have on the outcome of criminal trials.
Piet Groeneboom: Lucia de Berk and the amateur statisticians
Andrey Feuerverger: The Lost Tomb of Jesus
Software
A working statistician should be proficient with at least SAS and R. This course uses R. A good consultant should also be familiar with packages that are likely to be used by clients, e.g. SPSS.
Getting started with R
After installing R, you should install the packages designed for the textbook:
      > install.packages("arm")
      > install.packages("BRugs")

(From http://www.stat.columbia.edu/~gelman/bugsR/) Set up R in 'single window mode': Click on Edit, then GUI Preferences, then at the top click SDI. Add a couple of zeroes to the "buffer" and "lines" options near the middle of the screen. Then save the preferences.
Whenever you start R, issue the command:
      > library(arm)

to use the software with the text and issue the command:
      > source("http://www.math.yorku.ca/~georges/R/fun.R")

to use software written for this course.
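Before loading the packages it can help to check which ones are actually installed. The following is only a sketch: the variable names `wanted`, `is_installed` and `missing_pkgs` are made up for illustration, and the install line is left commented out because it needs an internet connection.

```r
# Packages named in the setup notes above
wanted <- c("arm", "BRugs")

# requireNamespace() returns FALSE (rather than stopping) when a
# package cannot be loaded, so this lists what still needs installing.
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- wanted[!is_installed]
missing_pkgs

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```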
It is a good idea to use separate project directories for different projects. See Using R with project directories under Windows.
To begin learning R work through Maindonald (2008) Using R for Data Analysis and Graphics
Another excellent tutorial is Christopher Green: R Primer
When you're ready to really plunge into R, work your way through the manual that comes with R. From the R window click on Help|Manuals|An Introduction to R.

Wikis
Public wiki
wiki.math.yorku.ca :
Open for reading to the world
Need an account for editing --- I will create accounts for all members of the class so you can make contributions to the information on the wiki
Private wiki
statswiki.math.yorku.ca :
Need a userid and password for access
The private wiki will be used for course assignments, course materials, etc.
Using a wiki for group assignments
Editing hints for course assignments

#### Gelman and Hill Chapters 1 and 2

• Web page for the book http://www.stat.columbia.edu/~gelman/arm/
• Data directory: http://www.stat.columbia.edu/~gelman/arm/examples/
1. Start R
2. Open a script window: File|New script
3. Use a web browser to open a data file
4. Cut and paste the data file into the R script window and save it with a suitable name: e.g. police.dat
5. Open a new script window for commands to read in the data file as an R 'data.frame'
1. Count the number of non-data lines to skip at the top, then use the command:
2. Submit the command with Ctrl-R
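The read step can be sketched as follows, assuming a whitespace-delimited file with a header row. The file contents, the number of skipped lines and the column names below are invented for illustration and are not taken from the book's data directory.

```r
# Hypothetical stand-in for a saved data file: two lines of
# description, then a header row, then whitespace-separated data.
path <- tempfile(fileext = ".dat")
writeLines(c(
  "Example data file (illustration only)",
  "Source: made up for this sketch",
  "stops pop eth",
  "75 1720 1",
  "36 1368 2",
  "74 23854 3"
), path)

# skip = 2 drops the two non-data lines; header = TRUE uses the
# next line for the column names.
police <- read.table(path, skip = 2, header = TRUE)

str(police)   # check that the columns were read as numbers
head(police)
```

With a real file, replace `path` with the saved file name (e.g. "police.dat") and set `skip` to the number of non-data lines counted in the previous step.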

#### Assignment 1 and things to do

Deadline: 5 pm, Wednesday, September 24.

1. Private wiki
2. R
1. Install R on your computer(s), including your laptop if you have one. See R: Getting started
2. Install the software that goes with our textbook with the R command install.packages("arm")
3. Work through the first two chapters of Maindonald (2008) Using R for Data Analysis and Graphics
3. Questions on readings for next class
Read Chapters 1 to 3 of Gelman & Hill and formulate at least one question on Chapters 1 or 2 and one question on Chapter 3. Add them to the questions at MATH 6627 2008 Questions
4. Statistics in the News
Find a current or recent topic in the news that involves, explicitly or implicitly, an interesting statistical issue. Prepare an analysis of the topic together with a review of scientific evidence. Are there gaps between the science and the public presentation of the topic?
5. Class photo

[Deferred to the next class -- I forgot to take a photo!] See the class photo at MATH 6627 2008-09 Class Photo and enter your name for the caption.

## Week 1.5: September 17, 2008

Topic
This is an optional tutorial on the use of R or the wiki for those who have had little or no experience with either. Be sure to have downloaded R and started covering some of the material in Maindonald (2008) Using R for Data Analysis and Graphics or another tutorial in [2] before the tutorial. If you have a laptop, install R on it and bring it to the class.
In this tutorial we will work through:
1. the sample session in Venables and Ripley (2002) [3] and
2. the tutorial by John Fox prepared for a short course at UCLA: http://socserv.mcmaster.ca/jfox/Courses/UCLA/index.html
To continue learning R:
3. Work through http://cran.r-project.org/doc/manuals/R-intro.html, also available as a pdf file through the help menu on the R console.
4. Highly recommended for learning R systematically: work through the on-line textbook by J. H. Maindonald at http://wiki.math.yorku.ca/index.php/R:_Getting_started#Exploring_much_more_deeply

## Week 2: September 24, 2008

#### Assignment 2 and things to do

Deadline: 5 pm, Wednesday, October 22.

1. R
1. Work through chapters 3 and 4 of Maindonald (2008) Using R for Data Analysis and Graphics
2. Questions on readings for next class
Read Chapters 4 to 6 of Gelman & Hill and formulate at least one question on each chapter. Add them to the questions at MATH 6627 2008 Questions
3. Class photo

See the class photo at MATH 6627 2008-09 Class Photo and enter your name for the caption.

4. Do your part of the assignment for Week 2. Wherever you can, produce plots showing your fitted models even if not required by the question in the book. When two students work on the same question, work independently. You may look at each other's work but you should do your work with your own group.

## Week 3

#### Visualizing Regression

See Visualizing Regression [4] pp 1-84 for

• Regression to the mean
• The regression paradox and the regression fallacy
• The geometry and interpretation of the data (or concentration) ellipse: the regression line and the data ellipse
• Visualizing correlation and the confidence interval for the slope using the data ellipse

#### Notes on Chapter 3: Linear Regression: the basics

3.2 Multiple predictors
Interpretation of coefficient βi:
"expected change in Y when you change Xi keeping other X's constant".
Not always directly meaningful: e.g. a quadratic model:
E(Y | X) = β0 + β1X + β2X²
The rate of change of E(Y | X) with respect to X depends on X and is equal to β1 + 2β2X. Note that β1 is the rate of change of E(Y | X) with respect to X when X = 0. Similar considerations hold for models with interactions, etc.
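A small sketch of this point in R, using noise-free data from a quadratic with made-up coefficients (1, 2 and 3) so that the fit recovers them up to rounding. The helper `slope_at` is hypothetical and only illustrates the formula β1 + 2β2X.

```r
# Noise-free data from a known quadratic: E(Y | X) = 1 + 2 X + 3 X^2
# (coefficients chosen for illustration only).
x <- seq(-2, 2, by = 0.5)
y <- 1 + 2 * x + 3 * x^2

fit <- lm(y ~ x + I(x^2))
b <- unname(coef(fit))          # b[1] = beta0, b[2] = beta1, b[3] = beta2

# Rate of change of E(Y | X) at x0: beta1 + 2 * beta2 * x0
slope_at <- function(x0) b[2] + 2 * b[3] * x0

slope_at(c(-1, 0, 1))           # the slope at x = 0 is beta1 itself
```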

Counterfactuals (causal) versus predictive interpretation of βi
When is each interpretation correct?
3.3 Interactions
See R script for example
3.4 Statistical inference
Where does $SE(\hat{\beta})$ come from?
With simple regression it's easy:
$SE(\hat{\beta})= \frac{s_e}{\sqrt{\sum{(X_i-\bar{X})^2}}}$
which is easier to interpret when written as:
$SE(\hat{\beta})= \frac{s_e}{\sqrt{n} \times \sigma_X}$
where σX is the 'population' standard deviation of X, i.e. the standard deviation using n as a divisor. Compare with $SE(\bar{Y}) = \frac{s_y}{\sqrt{n}}$. So the information on β is proportional to $\sqrt{n} \times \sigma_X$ and inversely proportional to se.
For multiple regression the common formula is:
$SE(\hat{\beta}_k)= \frac{s_e}{\sqrt{ (1-R_k^2) \sum{(X_{ki}-\bar{X}_k)^2}}}$, where ...
$SE(\hat{\beta}_k)= \frac{s_e}{\sqrt{n} \times \sigma_{X_k \mid \text{other } X\text{s}}}$
where $\sigma_{X_k \mid \text{other } X\text{s}}$ is the standard deviation (again using n as a divisor) of the residual of Xk after regression on all the other regressors.
The importance of this formula is that it suggests how you might try to improve the estimate of βk. You can increase n or decrease the error of regression or increase the variability in Xk keeping other X's constant.
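The two versions of the formula can be checked numerically against the standard error reported by `lm`. This is a sketch on simulated data; the seed, sample size and coefficients are arbitrary choices made for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)        # x2 is correlated with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
s_e <- summary(fit)$sigma        # residual standard error

# Version 1: uses R^2 from regressing x1 on the other regressor(s)
R2_1  <- summary(lm(x1 ~ x2))$r.squared
se_r2 <- s_e / sqrt((1 - R2_1) * sum((x1 - mean(x1))^2))

# Version 2: uses the SD (divisor n) of the residual of x1 given x2
r1     <- resid(lm(x1 ~ x2))
sd_pop <- sqrt(mean(r1^2))       # the residuals have mean 0
se_sd  <- s_e / (sqrt(n) * sd_pop)

se_lm <- summary(fit)$coefficients["x1", "Std. Error"]
c(r2_version = se_r2, sd_version = se_sd, lm = se_lm)
```

All three numbers should agree up to rounding, since (1 - R²) times the sum of squares of X1 about its mean is the residual sum of squares of X1 regressed on the other regressors.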
3.5 Graphical display of fitted model
See R script for alternative approach
3.6 Assumptions and diagnostics
If the assumptions hold, the residuals look approximately like a random sample from a normal distribution and should show no patterns when plotted in various ways. Common diagnostics: study and plot the residuals. GH mention only the traditional diagnostics of plotting residuals against fitted values and Xs. In addition, there are other plots that have served me very well.
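As a minimal sketch of the traditional diagnostics only (the additional plots alluded to are not specified here), the following fits a straight line to simulated data whose true relationship is curved, then plots residuals against fitted values and a normal quantile plot. The data and seed are invented for illustration.

```r
set.seed(2)                                  # arbitrary seed
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + 0.1 * x^2 + rnorm(100)    # true relationship is curved

fit <- lm(y ~ x)                 # deliberately misspecified straight line
res <- resid(fit)

op <- par(mfrow = c(1, 2))
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted(fit), res), col = "red") # smooth through the residuals
qqnorm(res); qqline(res)         # check approximate normality
par(op)
```

A systematic bend in the smooth line is the kind of pattern that signals a problem with the linearity assumption.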
3.7 Prediction and validation
Broad and important topic. Statisticians often pay too little attention to validation.

#### Notes on Chapter 4: Before and after fitting the model

4.1 Linear transformations
Standardizing:
Using z-scores
In passing: simple regression using z-scores:
$\hat{z}_y = r \times z_x$ where r is the correlation.
Using reasonable centre and scale
4.2 Centering and standardizing (especially for models with interactions)
If
E(Y) = β0 + β1X1 + β2X2 + β3X1X2
then
$\frac{\partial E(Y)}{\partial X_1}= \beta_1 + \beta_3 X_2$
So β1 is the 'effect' of X1 when X2 = 0. If we recenter X2 we change the meaning of β1 and vice-versa. The information on β1 is 'maximized' when X2 is centered so that $\bar{X}_2 = 0$. But this is no reason to centre X2 at $\bar{X}_2$ since recentering also changes the meaning of β1.
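The effect of recentering can be seen in a short simulation. This is a sketch with made-up coefficients and an arbitrary seed; `x2c` is a hypothetical name for the centred copy of X2.

```r
set.seed(3)                      # arbitrary seed
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n, mean = 5)         # x2 is far from 0 on its raw scale
y  <- 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rnorm(n)

raw <- lm(y ~ x1 * x2)
x2c <- x2 - mean(x2)             # centred copy of x2
cen <- lm(y ~ x1 * x2c)

b <- coef(raw)
# Coefficient of x1 after centring equals beta1 + beta3 * mean(x2)
# computed from the raw fit: same model, different meaning of beta1.
b1_raw     <- unname(b["x1"])
b1_cen     <- unname(coef(cen)["x1"])
b1_implied <- unname(b["x1"] + b["x1:x2"] * mean(x2))
c(raw = b1_raw, centred = b1_cen, implied = b1_implied)

# Standard error of the x1 coefficient in each parametrization
se_raw <- summary(raw)$coefficients["x1", "Std. Error"]
se_cen <- summary(cen)$coefficients["x1", "Std. Error"]
c(raw = se_raw, centred = se_cen)
```

The two fits give identical fitted values; only the question answered by the coefficient of X1 changes, from the effect at X2 = 0 to the effect at the mean of X2.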
4.3 Correlation and regression to the mean
Four lines: the principal axis (principal component line), the regression of Y on X, the SD line and the regression of X on Y. In z-scores:
Y on X: $\hat{z}_y = r \times z_x$
X on Y: $\hat{z}_x = r \times z_y$
SD line: $\hat{z}_y = z_x$
If sy = sx then the SD line and the principal axis are identical. Otherwise it's more complicated.
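A quick numerical check of the two regression lines in z-scores, using simulated data with arbitrary parameters. Note that `scale()` standardizes with divisor n - 1, which does not affect the slopes.

```r
set.seed(4)                      # arbitrary seed
x <- rnorm(300, mean = 50, sd = 10)
y <- 20 + 0.6 * x + rnorm(300, sd = 8)

zx <- as.numeric(scale(x))       # z-scores of x
zy <- as.numeric(scale(y))       # z-scores of y
r  <- cor(x, y)

slope_y_on_x <- unname(coef(lm(zy ~ zx))["zx"])
slope_x_on_y <- unname(coef(lm(zx ~ zy))["zy"])

c(r = r, y_on_x = slope_y_on_x, x_on_y = slope_x_on_y)
```

Both slopes equal r. Drawn in the (z_x, z_y) plane, the regression of X on Y therefore has slope 1/r, so the SD line (slope 1) lies between the two regression lines whenever 0 < r < 1.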
Exercise:
In a course in which the final grade is the average of the mark on a mid-term and on a final exam (both are graded out of 100), a professor would like to impute the mid-term grade of a student who missed the mid-term for a legitimate reason. What's the best way? Just use the final grade? Use the predicted mid-term grade after doing a regression of the mid-term on the final? Impute a z-score for the mid-term using the z-score on the final? Use reverse regression by regressing the final on the mid-term and imputing the value for the mid-term that would predict the student's grade on the final? Use principal axis regression? What are the consequences of using these various methods and which one do you think is best? Examine, at least briefly, the meaning of 'best' in this context.
4.4 Log transformations
Interpreting β's.
4.5 Other transformations
4.6 Building models for prediction (in contrast with causal inference)
i.e. models that fit well whether parameters have a causal interpretation or not.
GH omit one important consideration: the number of 'degrees of freedom' should not be too large relative to n. Harrell (2001) discusses this in detail. See sample size and validity.

#### Some notes on Chapter 5: Logistic regression

This will create a data set named 'data' with data from elections from 1972 to 2000.

Use:

      > d92 <- data[data$year == 1992,]

to get data on the Bush/Clinton race of 1992.

#### Assignment 3 and things to do

NEW Deadline: 5 pm, Wednesday, November 12.

1. Questions on readings for next class
   Read Chapter 7 of Gelman & Hill and formulate at least one question. Add it to the questions at MATH 6627 2008 Questions
2. Individual assignments
   The following assignment should be done individually. Email your work to me by the deadline. You can send a text file, a Word file or a pdf file. If you wish to use some other format, please let me know so I can make sure that I will be able to read it.
   1. Look at the data set http://www.math.yorku.ca/~georges/Data/coffee.csv. It has three relevant variables: 'Heart', a measure of heart condition (the higher, the less healthy); 'Coffee', a measure of coffee consumption; and 'Stress', a measure of occupational stress. How could you use this data to address the question whether coffee consumption is harmful to the heart? Discuss assumptions needed to get anywhere with the data and discuss the nature of various assumptions that might lead to different interpretations, if relevant.
   2. Look at the data set http://www.math.yorku.ca/~georges/Data/hwX.csv where X is the remainder when you divide your 'class number' (the number from 1 to 20 on the class list on the web) by 4. Thus X will be 0, 1, 2, or 3. For example, if your number is 7 then X is 3 and you would use the data set http://www.math.yorku.ca/~georges/Data/hw3.csv. The data set contains data on three variables: Health (the higher the better), Height and Weight. All are in standardized units. What would this data set have to say about the relationship between Weight and Health? Discuss assumptions needed to get anywhere with the data and discuss the nature of various assumptions that might lead to different interpretations, if relevant.
   3. Do the exercise above (in the notes on Section 4.3) on imputing a mid-term grade.
3. Review
   Review your textbooks on multiple regression. What is a confidence ellipse? What is its connection with hypothesis testing? What is a Scheffé confidence interval? What is a Bonferroni confidence interval?

## Week 3.5

Here are the 'blackboard' notes.

## Week 4

March 4, 2009

#### News

Assignment 3 was officially due after the start of the strike, which means that it wasn't due until now. We will discuss when it ought to be completed.

#### Plans

The major activity in the course is the analysis of a real data problem. All the data problems I have involve hierarchical or longitudinal data, so our priority is to learn enough about the analysis of this kind of data so you can get started on projects by April 1. I propose to meet every week for the next three weeks and then we will reassess our progress.

#### Visualizing Simple Regression

#### Visualizing Multiple Regression

#### For next week

1. Readings for next week
   Read Chapters 9 and 10 of Gelman & Hill (skip 8 unless you wish to read it on your own). These two chapters are on causal inference with observational data. They are challenging but very important for professional statisticians to understand. Formulate at least one question. Add it to the questions at MATH 6627 2008 Questions
2. Formulate questions on the material we have seen in class this week: MATH 6627 2008 Questions
3. Finish outstanding assignments.
4. Something to think about
   1. Look at the data set http://www.math.yorku.ca/~georges/Data/hs.csv. This data consists of math achievement scores and 'ses' (socio-economic status) of 1977 students in 40 U.S. schools, 21 of which are Catholic and 19 public. A goal in analyzing this data is to describe the relationship between math achievement and ses, and to examine whether the relationship is similar in different school sectors and among boys and girls. Explore the data and think about how one could address these questions. A few specific questions to think about:
      • Is a low ses child better off in a high ses school or in a lower ses school? If there is a difference, are we confident that it is the school that makes the difference?
      • Is there any evidence that students in boys or girls schools do better than students in coed schools?
      • How do public schools compare with Catholic schools?

A brief description of some variables:

• school: a numeric id for each school
• mathach: a math achievement score
• ses: socio-economic status (education and income of parents)
• Size: size of school
• PRACAD: priority given to academics in a school
• DISCLIM: disciplinary climate
• Minority: hispanic or black
• HIMINTY: high proportion of minorities in school

A first look at the data in R:

      > hs <- read.csv("http://www.math.yorku.ca/~georges/Data/hs.csv")
      > source("http://www.math.yorku.ca/~georges/R/fun.R")
      > dim(hs)
      [1] 1977   13
      > library( car )
      > some( hs )
              X school mathach    ses sector female    Sex Minority Size   Sector PRACAD DISCLIM
      176  1003   2458   9.142  0.242      1      1 Female      Yes  545 Catholic   0.89  -1.484
      500  1777   3013  18.846  0.032      0      1 Female       No  760   Public   0.56  -0.213
      680  3022   4292  16.442 -0.048      1      0   Male      Yes 1328 Catholic   0.76  -0.674
      880  3705   5619  21.451  0.412      1      1 Female       No 1118 Catholic   0.77  -1.286
      1023 3909   5720   8.259 -0.238      1      1 Female       No  381 Catholic   0.65  -0.352
      1064 3950   5720  18.241  1.132      1      0   Male       No  381 Catholic   0.65  -0.352
      1160 4210   6074  12.553  0.042      1      1 Female       No 2051 Catholic   0.32  -1.018
      1178 4228   6074  18.875 -0.508      1      1 Female       No 2051 Catholic   0.32  -1.018
      1818 6302   8707  22.102  0.792      0      0   Male       No 1133   Public   0.48   1.542
      1942 7150   9586  10.626  1.132      1      1 Female       No  262 Catholic   1.00  -2.416
           HIMINTY
      176        1
      500        0
      680        1
      880        0
      1023       0
      1064       0
      1160       0
      1178       0
      1818       0
      1942       0
      > tab(size = table(hs$school))
      size
         29    32    34    35    36    37    38    41    42    44    45    48    49    51    52
          1     2     1     1     1     2     1     1     1     1     2     2     1     1     2
         53    54    55    56    57    58    59    60    63    64    65    66 Total
          5     1     1     2     3     2     1     1     1     1     1     1    40
      > tab(~ Sex + school, hs)
             school
      Sex    1317 1906 2208 2458 2626 2629 2639 2658 2771 3013 3610 3992 4292 4511 4530 4868
      Female   48   27   35   57   18    0   24   27   28   19   29   21    0   58   63   11
      Male      0   26   25    0   20   57   18   18   27   34   35   32   65    0    0   23
      Total    48   53   60   57   38   57   42   45   55   53   64   53   65   58   63   34
             school
      Sex    5619 5640 5650 5720 5761 5762 6074 6484 6897 7172 7232 7342 7345 7688 7697 7890
      Female   30   24   32   24   52   21   56   20   29   22   30    0   29    0   11   24
      Male     36   33   13   29    0   16    0   15   20   22   22   58   27   54   21   27
      Total    66   57   45   53   52   37   56   35   49   44   52   58   56   54   32   51
             school
      Sex    7919 8531 8627 8707 8854 8874 9550 9586 Total
      Female   16   23   24   26   17   21   19   59  1074
      Male     21   18   29   22   15   15   10    0   903
      Total    37   41   53   48   32   36   29   59  1977

      > tab( ~Sector, up(hs, ~school))
      Sector
      Catholic   Public    Total
            21       19       40



## Week 5

March 11, 2009

#### For next week

1. Read Chapters 11 and 12 of Gelman & Hill. Formulate at least one question. Add it to the questions at MATH 6627 2008 Questions
2. Start working on the following individual assignment due April 1:
Using the full high school data set at http://www.math.yorku.ca/~georges/Data/hsfull.csv address the following questions:
1) Describe the relationship between math achievement and SES. How does it seem to vary between school sectors, between girls and boys?
2a) In what kind of school does a 'poor' girl (ses = -1) seem to be better off? Would she be better off in a school with relatively low mean SES or a school with relatively high SES, a Catholic or a public school, a girls school or a mixed school?
2b) Do question 2a with 'poor' replaced with 'rich' (ses = 1).
2c) Do question 2a with 'girl' replaced with 'boy'.
2d) Do question 2b with 'girl' replaced with 'boy'.
Compare the 'effect' of SES among boys in each combination of contexts: public, Catholic, poor school, rich school, girls, boys or mixed schools.
Compare the 'effect' of SES among girls in each combination of contexts: public, Catholic, poor school, rich school, girls, boys or mixed schools.

## Week 6

March 18, 2009

Course materials for this week

#### Hierarchical Models Part I

• Hierarchical Models Part I version 2 (reasonably clean): Hierarchical_Models_I/Hierarchical_Models_I_v2.pdf
• R scripts for Hierarchical Models Part I: Hierarchical_Models_I/PartI.R

#### Hierarchical Models Part I (in progress)

• Hierarchical Models Part I version 3 (still a mess): Hierarchical_Models_I/Hierarchical_Models_I_v3_CURRENT_DRAFT.pdf
• R scripts for Hierarchical Models Part I(b) (in progress): Hierarchical_Models_I/PartI-b.R

#### For next week

2. Last week's assignment deadline is extended by 1 week to April 1.

## Weeks 7 & 8

March 25 and April 1, 2009

Course materials

#### Hierarchical Models Part II

• Longitudinal Data Analysis with R: Hierarchical_Models_II/Workshop-Longitudinal_with_R-2009_03_25.pdf
• Non-linear mixed models and generalized linear mixed models: Hierarchical_Models_II/TalkOnComasAndMigraines.pdf
• R script for a sample analysis: Hierarchical_Models_II/Sample_Analysis.R

#### For next week

The assignment that was due April 1 is now due April 8. Preferably mail me a pdf or Word file.

## Week 9

Here's the script we wrote in class: MATH6627 Sample analysis 2009 04 22

### Other

New York Times, April 8, 1984