MATH 6627 2011-12 Practicum in Statistical Consulting


Practicum in Statistical Consulting

News 
  • A link to the video recorded in the 'tutorial' on March 21 is shown under the heading for Week 12 below.
  • Links to videos of the lectures of Feb. 29 and March 14 are shown under the heading for respective weeks below.
  • The final exam is scheduled for Thursday, April 5, 2012 from 3pm to 5pm in N638 Ross. Here are some sample questions.
  • Sep. 1, 2011: Fifteen places have been reserved for graduate students in Mathematics and Statistics and graduate students in Quantitative Methods in Psychology in the SCS course:
    Statistical Analysis and Programming with R
    offered in three sessions every second Wednesday 7 pm to 10 pm starting September 21. If you are a graduate student in Mathematics and Statistics or in Quantitative Methods in Psychology there is no fee for attending the course. Register by sending an email message to Georges Monette before September 15. Do not register through the SCS website.

General Information

Instructor

Meetings

  • The class will meet every second Wednesday from 7:00 pm to 10:00 pm in South Ross 101A.
  • The first meeting will be on September 14.
  • On three alternating Wednesdays starting September 21, there is a course on R organized by the Statistical Consulting Service [1]. You are strongly encouraged to consider taking this course. Some places have been reserved for graduate students in Statistics and in Quantitative Methodology. These places are funded by your programs and you do not need to pay the SCS fee yourselves. Please write directly to Georges Monette (georges@yorku.ca) if you intend to attend and DO NOT register through SCS.

Goals

As undergraduates we learn statistics through a sequence of courses each focusing on some part of statistical theory. The problems we solve in these courses use the tools learned in the course. When you have to solve real-world statistical problems, it is rare that there are clear clues about the appropriate theories or methods you need to use.

In fact, many problems are best handled with eclectic solutions borrowing from many areas of knowledge. Not only do you need to draw on your statistical knowledge but also on all your accumulated knowledge and experience in life: your understanding of the subject matter of the problem, your creativity with mathematical models, your ability to visualize and communicate, your interpersonal skills to help you understand your clients' possible anxieties with statistics, your insight in working with your own anxieties, and so on.

The goal of this course is to help you develop the skills and confidence to solve real-world problems. You will learn about the key role of many statistical concepts that are rarely seen in detail in standard courses. You will also learn the vital role of visualization and graphics, communication (listening even more than talking) and presentation skills.

The course will help you develop skills in a number of areas:

  1. programming and data management skills in R: Although the emphasis in this course is entirely on R, many jobs expect a strong knowledge of SAS -- take every opportunity you can to also learn SAS. Consider, if you are a beginner, the courses offered through the Statistical Consulting Service. If you work with clients who use another package, e.g. SPSS or Stata, you might have to learn enough about them to show your clients how to perform their analyses using their own packages. A recent feature in SPSS and SAS allows R to be called from these packages. If you have a solution that is too advanced for SPSS or SAS, you can prepare code in R and provide a client with the ability to perform the analysis from the package that is familiar to them.
  2. graphics to visualize data and models
  3. how to work as a statistical consultant/collaborator in the analysis of scientific problems
  4. developing presentation skills
  5. developing an understanding of the role of statistics as a discipline and as a profession in science and business
  6. acquiring basic concepts and techniques related to the analysis of hierarchical and longitudinal data -- a large proportion of the problems our students deal with in consulting involve problems that can be approached through these techniques
  7. understanding ethical issues related to statistical practice

References

There is no single textbook this year. Consult and add to our list of useful links and references: Links and References

Course Work

There are 4 components:

  1. Individual student pages (or 'blogs') and contributions to the course wiki:
    Use your course page, which you create as a subpage of the Student pages for the course, to prepare after each class and before the next class:
    1. Sample exam questions: A sample exam question and answer on the material of the previous class.
    2. A posting with links and comments on
      • Statistics in the News
        OR
      • Statistical paradoxes and fallacies
        OR
      • (optional) Reflections on statistical consultations that you arrange to attend in the SCS. The reflections must be constructed to avoid any violation of privacy of either the client or the consultant whose session you attended.
    3. Questions and comments on group work and class lectures
    4. In addition, throughout the course, contribute questions, answers, comments, reflections, etc. to the various discussion pages for the course, e.g. Statistics: Questions and Answers, R: Questions and Answers, etc. In fact, you can create new discussion pages.
    See more details.
  2. Group assignments and short presentations
    Four or five group assignments on material covered in the course. Some assignments will also involve short group presentations to the class. These presentations will be timed (typically 15 minutes). Getting the interesting aspects of your message across in a limited time is challenging and requires good preparation and coordination.
    The groups for assignments are created using a quasi-random algorithm. Their composition will be sent to you shortly after the deadline for completing your wiki pages.
  3. Group consulting project
    You will work on a major consulting project in which you will collaborate with a real client to produce a deep and probing consulting report and analysis. The project is very likely to involve multilevel models. Students who are interested may opt to work on one of two case studies for possible presentation at conferences in the summer of 2012. The case study team may include students who are not currently enrolled in MATH 6627. Further details will be available later. The consulting project groups are not necessarily identical to the assignment groups.
  4. A final in-class 3-hour exam on the material of the course. If you produce good sample questions, you might find a question very close to yours on the exam.

The weight of components is

  1. 25% for individual assignments, pages and contributions to the wiki,
  2. 20% for group assignments (10% a common group mark and 10% based on your individual contribution, e.g. as seen in presentations, answering questions, etc.),
  3. 30% for the consulting project (15% group and 15% individual based on presentations, answering questions, and a statement of individual contributions to the project) and
  4. 25% for the final exam.
  • Note that your contributions to the wiki in general are as easy to identify as those in your individual course page. They are given credit similar to contributions on your own course page or, indeed, to comments added to other students' course pages.

Course organization

  • In the first few classes we will look at general statistical questions that are important in consulting. The following five or six weeks will be devoted to multilevel models and the final portion of the course will be devoted to discussions and presentations of your projects.
  • You will have the opportunity to organize and present short presentations on assignments starting from the third week of classes.
  • Starting from the first week, you should work on your course pages, generating questions, links, comments, etc., and you should start working with your Assignment Team on the first assignment.
  • The course assumes a working knowledge of R including trellis (lattice) graphics. We can schedule a few special tutorials for anyone who feels the need.
  • The final exam will give you a chance to show how you have reflected on the material of the course. It will consist mainly of 'essay' questions, possibly drawn from your own sample questions if they turn out to be good.

Week 1 (September 14)

Files and links used in class

Assignments

  • See /Students for:
    • information on creating your home page (deadline: Noon Friday September 16)
    • the biweekly assignment due at noon the day of each class
    • Assignment I1 (individual assignment 1): Assignments 1 and 2 (the answers to Assignment 1 should be sent as an e-mail attachment by noon on September 28)


Other files and links

Week 2 (September 28)

Assignment I2 (individual)

Assignment I2 (individual assignment 2): Assignments 1 and 2 (the answers to Assignment 2 should be sent as an e-mail attachment by noon on October 26)

Assignment W2 (individual web assignment)

Prepare the second contribution in your home page in /Students due by noon on October 26.

Assignment T1 (team)

Deadline for files on wiki: noon, October 26, 2011; Presentations: 7 pm on October 26, 2011

1. Simpson's Paradox 
Simpson's Paradox describes a situation where the direction of the association between two variables, X and Y, changes when conditioning on a third variable, Z. A troubling classical example, based on data obtained from the State of Florida, is discussed in Agresti (1990) "Categorical Data Analysis". In this example, Y is a dichotomous variable indicating whether a prisoner found guilty of murder was sentenced to capital punishment (execution), X is the race (white or black) of the prisoner and Z is the race of the victim. Overall, a lower proportion of blacks than whites are sentenced to be executed but, controlling for the race of the victim, the association is dramatically reversed. Possibly, the judicial process goes easier on blacks because their victims tend to be black. In this example of Simpson's Paradox, X, Y and Z are all dichotomous variables and Z can be thought of as a confounding factor. Although, in principle, the phenomenon is the same whether X, Y or Z are dichotomous or numerical (continuous or discrete) variables, our ability to visualize the phenomenon and to transfer the concept to other situations seems to depend crucially on the nature of the variables. Considering that the role of Z can be that of a confounding factor, a mediating factor or some mixture of both, there are 2 x 2 x 2 x 3 = 24 possible combinations of potential examples illustrating Simpson's Paradox. Find two different ones. They may be just plausible examples or, better, real examples or, best, real examples with real data. Discuss each example and post at least one graph or hand-drawn sketch (use something like Microsoft Paint, save as a .png file and upload to the wiki) for each to help visualize the example. Prepare a 5-minute presentation.
2. Graphics to visualize data 
Explore the graphical possibilities of one of the following packages in R:
  • Gosset: lattice (be sure to include the use of panels and groups)
  • Tukey: googleVis (the data set should be longitudinal, i.e. include a time variable)
  • Efron: rgl and p3d (include the use of the 'groups' parameter to produce trajectories)
Prepare some files on the wiki that would be useful to help someone else understand how to use and exploit the package and prepare a 15-minute (absolute maximum -- rehearse to make sure you won't go over) presentation.
Part of the motivation for this assignment is to give you the experience of exploring new ideas and methodologies, not previously covered in class, as a group and preparing documents and presentations to help others understand what you have discovered. In preparing information and a presentation, actively take the perspective of your audience. The test of success is not showing how much you know but helping people in your audience increase what they know.
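As one possible starting point for item 1 (Simpson's Paradox) with numerical X and Y and a dichotomous Z, here is a minimal R sketch on simulated data. The group labels, means, slopes, and the data frame name `dd` are all made up for illustration and have nothing to do with the Agresti example. The sketch compares the slope of Y on X when Z is ignored with the slope when Z is controlled for.

```r
# Simulated illustration of Simpson's Paradox (all values are made up).
set.seed(6627)
n <- 200
z <- rep(c("A", "B"), each = n)                 # dichotomous third variable
x <- c(rnorm(n, mean = 2), rnorm(n, mean = 6))  # X is higher in group B
# Within each group Y decreases with X, but group B sits much higher overall.
y <- ifelse(z == "A", 2, 12) - 1 * x + rnorm(2 * n, sd = 1)
dd <- data.frame(x = x, y = y, z = factor(z))

marginal    <- unname(coef(lm(y ~ x, data = dd))["x"])      # ignores Z
conditional <- unname(coef(lm(y ~ x + z, data = dd))["x"])  # controls for Z
print(c(marginal = marginal, conditional = conditional))

# Base-graphics view: solid line ignores Z, dashed lines are fit within groups.
if (interactive()) {
  plot(y ~ x, data = dd, col = ifelse(dd$z == "A", "blue", "red"))
  abline(lm(y ~ x, data = dd), lwd = 2)
  for (g in levels(dd$z)) abline(lm(y ~ x, data = dd[dd$z == g, ]), lty = 2)
}
```

With these settings the marginal slope should come out positive and the conditional slope negative, which is the reversal the assignment asks you to visualize. Real data would be preferable to this kind of simulation.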

Week 3 (October 26)

  • Simulation as a tool for exploring statistical ideas

Assignment I3 (individual)

Assignment I3 (individual assignment 3). Send by email by noon on Wednesday, November 9, 2011. Include any computer code you have written in an appendix.

  1. Read Visualizing Multiple Regression carefully and work through the commands in Visualizing Multiple Regression - R script. In preparation for an in-depth discussion in the next class, prepare 3 questions on this material.
  2. Do a simulation to get some insight into alternative possibilities in answer to question 3 in assignment 1.
    The idea of the simulation is to get some insight into the behaviour of samples from highly skewed, kurtotic populations (individual wealth, for example) and to get insight into the limitations of simulation.
    Consider the distribution of lottery winnings in a lottery based on '6/49' (do a bit of research to find out what this is if you don't know). Suppose there is a prize of $10,000,000 for getting all 6 numbers correct and that there are no other prizes for getting fewer numbers correct.
    1. What is the expected value of a single ticket? (i.e. the population mean of an extremely large population of lottery ticket buyers).
    2. Suppose you decide to estimate the expected value of a single ticket by taking a sample of N = 100 ticket buyers and recording how much each person won. Answer the following questions using two methods:
      • theoretical: using what you know of the binomial distribution and using R functions such as 'qbinom' and 'pbinom', and
      • simulation: simulate taking such a sample 10,000 times (perhaps using 'rbinom') and use the simulation results to guess the answers to the first three questions.
      1. What is the expected value of the mean of this sample?
      2. What is the median value of the distribution of the mean of this sample?
      3. What is the probability that the mean of the sample is larger than the mean of the population?
      4. Theoretical only: What is the asymptotic value of this probability as N tends to infinity?
    3. Repeat the above with a sample size of N = 1,000. Note that if you set things up well, repeating for N = 1,000 should take very little of your time although it may take a lot of computer time.
    4. Discuss the connection of these results with the Central Limit Theorem.
    5. Discuss the connection between these results and question 3 on assignment 1.
  3. Use simulations to explore the consequences of the various imputation methods suggested in question 7 of assignment 1. Address, for example, what are the consequences of each method for the final grades of poor, average, and very strong students?
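The lottery setup in item 2 can be sketched in R roughly as follows. This is only one way to set it up, under the stated assumptions of a single $10,000,000 prize and no smaller prizes. The variable names are illustrative. The sketch relies on the observation that, for N = 100, the sample mean exceeds the population mean exactly when the sample contains at least one winner.

```r
# Sketch of the 6/49 setup: one $10,000,000 prize, no smaller prizes.
set.seed(649)
prize <- 1e7
p_win <- 1 / choose(49, 6)      # 1 in 13,983,816
ev_ticket <- prize * p_win      # population mean winnings per ticket

N <- 100                        # sample size; change to 1000 to repeat
# With zero winners the sample mean is 0; a single winner already gives
# prize / N, which is far above ev_ticket. So the event "sample mean >
# population mean" is the event "at least one winner in the sample".
p_above_theory <- 1 - pbinom(0, size = N, prob = p_win)

# Simulation: number of winners in each of 10,000 samples of size N.
winners <- rbinom(10000, size = N, prob = p_win)
sample_means <- prize * winners / N
p_above_sim <- mean(sample_means > ev_ticket)

print(c(ev_ticket = ev_ticket,
        p_above_theory = p_above_theory,
        p_above_sim = p_above_sim,
        median_of_means = median(sample_means)))
```

Because a win is so rare, 10,000 simulated samples will usually contain no winners at all, so the simulated probability is likely to be 0 even though the theoretical value is not. That gap is one of the limitations of simulation the assignment asks you to think about.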

Week 4 (November 9)

/Week 4 Questions

Week 5 (November 16)

/R-script for Wald Tests

Week 6 (November 23)

Week 7 (January 4)

Week 8 (January 18)

Week 9 (February 1)

Week 10 (February 15)

Week 11 (February 29)

Week 12 (March 14)

Week 13 (March 28)

More sample exam questions

  1. Consider the following models where Y, X, Z1, Z2, Z3 are numerical variables. All but one of these models will produce the same regression coefficient for X but they will produce different standard errors. Identify the model that produces a different coefficient. Rank the others where you can according to the se of the estimated coefficient stating which would be equal if any (assume a very large n and ignore the effect of slight differences in degrees of freedom for the error term). Explain your reasoning.
    • Y ~ X + Z1 + Z2 + Z3
    • Yr ~ Xr where Yr is the residual of Y regressed on Z1, Z2, Z3 and Xr is the same for X
    • Y ~ Xr
    • Y ~ X + Xh where Xh is the predictor of X in the regression of X on Z1, Z2 and Z3
    • Y ~ X
    • Y ~ X + Xh + Z1
  2. Let Y and X be numerical variables and let G be a factor. Consider the following models. All but one of these models will produce the same regression coefficient for X but they will produce different standard errors. Identify the model that produces a different coefficient. Rank the others where you can according to the se of the estimated coefficient stating which would be equal if any (assume a very large n and ignore the effect of slight differences in degrees of freedom for the error term). Explain your reasoning.
    • Y ~ X + G
    • Y ~ X
    • Yr ~ Xr where Yr is the residual of Y regressed on G and Xr is the same for X
    • Y ~ Xr
    • Y ~ X + Xh where Xh is the predictor of X in the regression of X on G
    • Y ~ X + Xh + Zg where Zg is a 'G-level' numerical variable, i.e. it has the same value for all observations within any value of G.
  3. Longitudinal data analysis with mixed models: Consider a mixed model with random intercept and slope with respect to time, T. Suppose that the G matrix is
    G = \begin{bmatrix}
    \tau_{00} & \tau_{01} \\
    \tau_{10} & \tau_{11}
    \end{bmatrix}
    • Find the value of T for which the variance of Y is minimized.
    • Show that recentering T on this value (if known) turns the G matrix into one with only two free parameters.
    • Explain why recentering and rescaling T can thus help with convergence problems.
  4. Consider the following output:
> head(hs)
  school mathach    ses    Sex Minority Size   Sector PRACAD DISCLIM
1   1317  12.862  0.882 Female       No  455 Catholic   0.95  -1.694
2   1317   8.961  0.932 Female      Yes  455 Catholic   0.95  -1.694
3   1317   4.756 -0.158 Female      Yes  455 Catholic   0.95  -1.694
4   1317  21.405  0.362 Female      Yes  455 Catholic   0.95  -1.694
5   1317  20.748  1.372 Female       No  455 Catholic   0.95  -1.694
6   1317  18.362  0.132 Female      Yes  455 Catholic   0.95  -1.694
> fit <- lme( mathach ~  ses * cvar(ses,school), hs, 
+             random = ~ 1 + ses|school)
> summary(fit)
Linear mixed-effects model fit by REML
 Data: hs 
       AIC      BIC    logLik
  12846.85 12891.54 -6415.423

Random effects:
 Formula: ~1 + ses | school
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 1.6293867 (Intr)
ses         0.6614903 -0.469
Residual    6.1109156       

Fixed effects: mathach ~ ses * cvar(ses, school) 
                          Value Std.Error   DF  t-value p-value
(Intercept)           12.681917 0.3054760 1935 41.51526  0.0000
ses                    2.243374 0.2416545 1935  9.28339  0.0000
cvar(ses, school)      3.687892 0.7699000   38  4.79009  0.0000
ses:cvar(ses, school)  0.873953 0.5771829 1935  1.51417  0.1301
 Correlation: 
                      (Intr) ses    cv(,s)
ses                   -0.188              
cvar(ses, school)      0.022 -0.261       
ses:cvar(ses, school) -0.258  0.065  0.014

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-3.2291287 -0.7433282  0.0306118  0.7770370  2.6906899 

Number of Observations: 1977
Number of Groups: 40 
    • Sketch the estimated response function for a school with mean ses of 0 and for a school with a mean ses of 1. Assume that the range of ses is from -2 to 2.
    • Show clearly where each of the linear regression coefficients estimated in the model is reflected on the graph.
    • For what value of ses is the variance of mathach estimated to be minimized?
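For the last bullet of question 4, one way to check an answer is to plug the 'Random effects' estimates from the output above into the variance function implied by a random intercept and slope model, Var(mathach | ses) = tau00 + 2 tau01 ses + tau11 ses^2 + sigma^2. The R sketch below does this. The variable names are illustrative, and the numbers are copied from the printed summary rather than extracted from the fitted object.

```r
# Values copied from the 'Random effects' block of the summary above.
sd_int <- 1.6293867   # StdDev of the random intercept
sd_ses <- 0.6614903   # StdDev of the random ses slope
r_is   <- -0.469      # Corr between random intercept and slope
sd_res <- 6.1109156   # Residual StdDev

tau00 <- sd_int^2
tau11 <- sd_ses^2
tau01 <- r_is * sd_int * sd_ses

# Variance of mathach implied by the model, as a function of ses.
var_y <- function(ses) tau00 + 2 * tau01 * ses + tau11 * ses^2 + sd_res^2

ses_min <- -tau01 / tau11                                 # vertex of the quadratic
ses_num <- optimize(var_y, interval = c(-5, 5))$minimum   # numerical check

print(c(closed_form = ses_min, numerical = ses_num))
```

The closed form -tau01 / tau11 is the same expression asked for in question 3, with T replaced by ses. Note that the minimizing value need not lie inside the observed range of ses.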