MATH 6627 2008-09 Practicum in Statistical Consulting

Practicum in Statistical Consulting

Quick links 
NEWS
  • NEW Starting Wednesday, March 18, 2009, we will meet in Bethune College 202 from 4 pm to 7 pm.

NEWS ARCHIVE

WORLD NEWS

SHARE KNOWLEDGE

PROGRAMMING SKILLS

General Information

Instructor

Meetings

The class will meet every second week on Wednesdays from 7:00 pm to 10:00 pm in Vari Hall 1016. Consult the schedule below for exact dates.

Goals

As undergraduates we learn statistics through a sequence of courses each focusing on some part of statistical theory. When we solve problems in these courses the tools we are expected to use are obvious. When you have to solve real-world statistical problems, it is rare that there are clear clues about the correct theory or method that needs to be used.

In fact, many problems are best handled with eclectic solutions borrowing from many statistical fields. The goal of this course is to help you develop the skills and confidence to solve real-world problems. You will learn about the key role of many statistical concepts that are rarely seen in standard courses. You will also learn the vital role of visualization and graphics, communication (listening even more than talking) and presentation skills.

The course will help you develop skills in a number of areas:

  1. programming and data management skills in R: Although the emphasis in this course is entirely on R, many jobs expect a strong knowledge of SAS -- take every opportunity you can to also learn SAS. Consider, if you are a beginner, the courses offered through the Statistical Consulting Service
  2. graphics to visualize data and models
  3. how to work as a statistical consultant/collaborator in the analysis of scientific problems
  4. developing presentation skills
  5. developing an understanding of the role of statistics as a discipline and as a profession in science and business
  6. understanding ethical issues related to statistical practice


Text and references

Course Work

  • In the first term the work for the course consists primarily of assignments done individually and posted on the 'private wiki' with help from your group.
  • Since almost all the interesting consulting problems I have seen in recent years have required an understanding of multilevel models, which are a natural extension of traditional linear models, we will develop ideas and concepts for the application of linear and multilevel models to real research problems by working through the course text. The text uses many real data sets and presents methods of analysis in R.
  • In the second term we continue working through the text. In addition, you will work on a major consultation project in which you will collaborate with a real client to produce a deep and probing consulting report. The project is very likely to involve multilevel models. Students who are interested may opt to work on a Statistical Society of Canada (SSC) case study for presentation at the SSC meeting in the spring of 2009. The Case Study team may include students who are not currently enrolled in MATH 6627.
  • You will also attend some real statistical consultations and prepare brief reports which will count towards your assignment grade.
  • Another important part of the course work is your contribution to the 'public wiki'. In particular we will develop two types of information on the wiki
    • how to's in R: these are brief articles describing how to do something simple in R, either a graph, an analysis or a type of data manipulation.
    • Paradoxes and fallacies in statistics: As your knowledge of statistics becomes deeper you abandon many simple suppositions and replace them with more sophisticated ones. An important part of communication between statisticians and clients -- for that matter between statisticians and the public or between statisticians and students -- involves understanding simple, often fallacious, suppositions and how they can lead to a deeper understanding. We will develop wiki pages to discuss important paradoxes and fallacies.

Grading

Assignments (40%)
At each class meeting you will be given individual assignments to be completed and posted on the private wiki before the start of the following class (typically two weeks later). For each assignment, you will be assigned to a group that is expected to help its members with any questions or problems in completing the assignment. Group members may help each other by editing each other's work. Your work will be graded and the instructor will send you comments. You can then improve your work, which will be marked again two weeks later. Your grade for each assignment will be the sum of your grade at the original due date, the grade on the corrected work two weeks later, and the average of your group's grades at the original due date. The purpose is to encourage group cooperation so that each member produces strong work by the first due date. Grades will be based not only on the correctness of the solution but also on the effectiveness and creativity of appropriate graphical presentations. You will receive 8 or 9 out of 10 for work that is correct and complete. To receive 10 out of 10, you must also show extra effort to explain the solution or find an appropriate graphical representation of the solution.
Consulting Project (40%)
The grade will be based on the appropriateness and quality of the statistical analyses, the quality of exposition, and the quality of presentation.
Contributions to the public wiki (10%)
Class participation (10%)
This is based on the preparation of questions on the assigned readings (to be posted on the wiki), attendance, preparation for class and participation in class discussions.
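
As a rough illustration of the assignment grade arithmetic described under Assignments, the sketch below adds the three components. The marks are hypothetical, and it assumes that the group average includes your own original mark, which the description above does not specify.

```r
# Hypothetical marks, for illustration only (not real grades).
assignment_score <- function(original, corrected, group_originals) {
  # Sum of: own mark at the first due date, own mark on the corrected
  # work, and the mean of the group's marks at the first due date.
  # Assumption: 'group_originals' includes your own original mark.
  original + corrected + mean(group_originals)
}

score <- assignment_score(original = 7, corrected = 9,
                          group_originals = c(7, 8, 9))
score
```

Because the group average enters every member's score, raising a teammate's first submission raises your own grade as well.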

Class list and teams

Class list
Number Family name Given name e-mail Week 1 Status
1 Boulsaien Khaled bouls801@yorku.ca A
2 Cao Jianzhe adamcao@yorku.ca B
3 Chane Parminder pchane@yorku.ca C
4 Gao Isabel isabelg@mathstat.yorku.ca D
5 Khan Sabria sabria@yorku.ca A
6 Leeza Nusrat nsleeza@yorku.ca B
7 Liao Yang liaoyang@yorku.ca C
X Meschian Mehran mmeschia@yorku.ca D dropped
9 Nabipoor Sanjebad Majid masaba@yorku.ca A
10 Palma Luis luispal@yorku.ca B inactive
11 Shakya Sulin sshakya@yorku.ca C
12 Shi Xiaoping xpshi@yorku.ca D
13 Xu Hong hongxu@yorku.ca A
14 Pope Chris u843603@mathstat.yorku.ca C
8 Faroque Shahela TBD D dropped


Schedule

Week 1: September 10, 2008

Topics
Course organization
Participation in SCS seminars
You are welcome to attend SCS (Statistical Consulting Service) weekly meetings which consist of bi-weekly 'staff meetings' and bi-weekly seminars on a statistical topic of interest to statistical consultants. The exact topic for this year will be determined in two weeks. Meetings take place every Friday at 2:30 in TEL 5082. Please send an e-mail message to Georges Monette to have your name added to the SCS mailing list. Note that SCS also offers short courses some of which might be of interest to you.
Consulting, communication, writing reports 
Statistical consulting environment
Writing reports: Secret of good writing: write so your reader understands you!
Notes on writing reports
Seven basic principles
Not all consulting activities require a formal report. Often a phone call, a verbal report in a face-to-face meeting, a letter or a memo is the most efficient way of communicating with a client.
Communication:
Interpersonal aspects of statistical consulting: Janice Derr, Statistical Consulting Video
Contributions by Doug Zahn
The role of statistics in society -- understanding evidence 
One of the greatest challenges in understanding evidence is bridging the gap between observational data and causal inference, i.e. understanding the links between statistical significance and statistical meaning.
Statistics in the news: Lies or Statistics
Smoking: Observational vs. Experimental data: rough notes


The Fundamental Contingency Table of Statistics
Types of Inference | Types of Data: Experimental | Types of Data: Observational
Causal | Where Fisher would like us to be | Where we often are
Predictive | Very rare but problematic | Good for 'prediction' not 'causal inference': this is the topic of Frank Harrell's 'Regression Modeling Strategies'


Finding meaning in observational data -- examples
Hans Rosling: Myths about the developing world
Al Gore: An Inconvenient Truth
Peter Donnelly: How juries get fooled by statistics
Statistician Peter Donnelly explores the common mistakes humans make in interpreting statistics, and the devastating impact these errors can have on the outcome of criminal trials.
Piet Groeneboom Lucia de Berk and the amateur statisticians
Andrey Feuerverger: The Lost Tomb of Jesus
Software 
A working statistician should be proficient with at least SAS and R. This course uses R. A good consultant should also be familiar with packages that are likely to be used by clients, e.g. SPSS.
Getting started with R
After installing R, you should install the packages designed for the textbook:
      > install.packages("arm")
      > install.packages("BRugs")
(From http://www.stat.columbia.edu/~gelman/bugsR/) Set up R in 'single window mode': Click on Edit, then GUI Preferences, then at the top click SDI. Add a couple of zeroes to the "buffer" and "lines" options near the middle of the screen. Then save the preferences.
Whenever you start R, issue the command:
      > library(arm)
to use the software with the text and issue the command:
      > source("http://www.math.yorku.ca/~georges/R/fun.R")
to use software written for this course.
It is a good idea to use separate project directories for different projects. See Using R with project directories under Windows.
To begin learning R work through Maindonald (2008) Using R for Data Analysis and Graphics
Another excellent tutorial is Christopher Green: R Primer
When you're ready to really plunge into R, work your way through the manual that comes with R. From the R window click on Help|Manuals|An Introduction to R.


Wikis 
Public wiki
wiki.math.yorku.ca :
Open for reading to the world
Need an account for editing --- I will create accounts for all members of the class so you can make contributions to the information on the wiki
Private wiki
statswiki.math.yorku.ca :
Need a userid and password for access
Once you are in the wiki, you can create your own userid and password. Please use the same userid you have for mail @yorku.ca (e.g. if your York e-mail address is 'maryjones@yorku.ca' then use the userid 'maryjones'); the password can be anything you choose. The page in which you create your account says that your real name and email address are optional, but you will need to fill them in to get properly graded for your work. This will avoid name 'collisions' on the private wiki.
The private wiki will be used for course assignments, course materials, etc.
Using a wiki for group assignments
Editing hints for course assignments

Gelman and Hill Chapters 1 and 2

  • Web page for the book http://www.stat.columbia.edu/~gelman/arm/
  • Data directory: http://www.stat.columbia.edu/~gelman/arm/examples/
  • Downloading data and R scripts:
    1. Start R
    2. Open a script window: File|New script
    3. Use a web browser to open a data file
    4. Cut and paste the data file into the R script window and save it with a suitable name: e.g. police.dat
    5. Open a new script window for commands to read in the data file as an R 'data.frame'
      1. Count the number of non-data lines to skip at the top, then use the command:
        > dd <- read.table( 'police.dat', header = TRUE, skip = 6)
      2. Submit the command with Ctrl-R
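
The steps above can be tried on a small stand-in file before working with the real data. The sketch below is only an illustration: the file contents and column names are invented and are not the police data, and the number of lines to skip (2 here) depends on the file you actually download.

```r
# Write a small stand-in file: 2 non-data lines, then a header row and data.
# The contents are invented for illustration; they are not the police data.
tmp <- tempfile(fileext = ".dat")
writeLines(c(
  "Example data file",
  "Source: made up for this sketch",
  "precinct stops pop",
  "1 75 1720",
  "2 36 1368",
  "3 74 23854"
), tmp)

# Inspect the top of the file to count the non-data lines.
readLines(tmp, n = 4)

# Skip the 2 non-data lines; the next line supplies the column names.
dd <- read.table(tmp, header = TRUE, skip = 2)
str(dd)
```

Writing header = TRUE rather than header = T is slightly safer, because T is an ordinary variable that can be reassigned while TRUE is a reserved word.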

Assignment 1 and things to do

Deadline: 5 pm, Wednesday, September 24.

1. Private wiki 
Log in to the 'private wiki' http://statswiki.math.yorku.ca using the password sent to you by e-mail (you are free to change this password). Go to your user page by clicking on your userid at the top of the page and write a few details about yourself, e.g. where you did your previous studies, your academic interests, the software you know how to use, etc. Remember that this material is not accessible to the public but can be viewed by anyone who has access to the statswiki.
2. R 
  1. Install R on your computer(s), including your laptop if you have one. See R: Getting started
  2. Install the software that goes with our textbook with the R command install.packages("arm")
  3. Work through the first two chapters of Maindonald (2008) Using R for Data Analysis and Graphics
3. Questions on readings for next class 
Read Chapters 1 to 3 of Gelman & Hill and formulate at least one question on Chapters 1 or 2 and one question on Chapter 3. Add them to the questions at MATH 6627 2008 Questions
4. Statistics in the News 
Find a current or recent topic in the news that involves, explicitly or implicitly, an interesting statistical issue. Prepare an analysis of the topic together with a review of scientific evidence. Are there gaps between the science and the public presentation of the topic?
5. Class photo 

[Deferred to the next class -- I forgot to take a photo!] See the class photo at MATH 6627 2008-09 Class Photo and enter your name for the caption.

Assignment Week 1
Number Group Problems
1 A GH Chapter 2, p. 26 q. 1 Statistics in the News 1
2 B GH Chapter 2, p. 26 q. 2 Statistics in the News 2
3 C GH Chapter 2, p. 26 q. 3 answer 1 Statistics in the News 3
4 D GH Chapter 2, p. 26 q. 3 answer 2 Statistics in the News 4
5 A GH Chapter 2, p. 26 q. 4 answer 1 Statistics in the News 5
6 B GH Chapter 2, p. 26 q. 5 Statistics in the News 6
7 C GH Chapter 2, p. 26 q. 4 answer 2 Statistics in the News 7
8 D Maindonald Chapter 1, p. 8 qq 1,2 Statistics in the News 8
9 A Maindonald Chapter 1, p. 8 q. 3 Statistics in the News 9
10 B Maindonald Chapter 2, pp 19-20 qq 1,2 Statistics in the News 10
11 C Maindonald Chapter 2, pp 19-20 q. 3 Statistics in the News 11
12 D Maindonald Chapter 2, pp 19-20 q. 5 Statistics in the News 12
13 A Maindonald Chapter 2, pp 19-20 q. 6 Statistics in the News 13
14 C Maindonald Chapter 2, pp 19-20 q. 4 Statistics in the News 14

Week 1.5: September 17, 2008

Topic 
This is an optional tutorial on the use of R or the wiki for those who have had little or no experience with either. Be sure to have downloaded R and started covering some of the material in Maindonald (2008) Using R for Data Analysis and Graphics or another tutorial in [2] before the tutorial. If you have a laptop, install R on it and bring it to the class.
In this tutorial we will work through:
  1. the sample session in Venables and Ripley (2002) [3] and
  2. the tutorial by John Fox prepared for a short course at UCLA: http://socserv.mcmaster.ca/jfox/Courses/UCLA/index.html
    To continue learning R:
  3. Work through http://cran.r-project.org/doc/manuals/R-intro.html, also available as a pdf file through the help menu on the R console.
  4. Highly recommended for learning R systematically: work through the on-line textbook by J. H. Maindonald at http://wiki.math.yorku.ca/index.php/R:_Getting_started#Exploring_much_more_deeply

Week 2: September 24, 2008

Examples of multilevel data

Fitting and looking at models with R

Assignment 2 and things to do

Deadline: 5 pm, Wednesday, October 22.

1. R 
  1. Work through chapters 3 and 4 of Maindonald (2008) Using R for Data Analysis and Graphics
2. Questions on readings for next class 
Read Chapters 4 to 6 of Gelman & Hill and formulate at least one question on each chapter. Add them to the questions at MATH 6627 2008 Questions
3. Class photo 

See the class photo at MATH 6627 2008-09 Class Photo and enter your name for the caption.

4. Do your part of the assignment for Week 2. Wherever you can, produce plots showing your fitted models even if not required by the question in the book. When two students work on the same question, work independently. You may look at each other's work but you should do your work with your own group.
Assignment Week 2
Number Group Problems
5 A GH Chapter 3, p. 49 q. 1 Maindonald Chapter 3, p. 30 q. 7 answer 2 (use other distributions)
13 B GH Chapter 3, p. 49 q. 2 Maindonald Chapter 4, p. 34 q. 13
14 C GH Chapter 3, p. 50 q. 3 answer 1 Maindonald Chapter 4, p. 33 q. 12
1 D GH Chapter 3, p. 50 q. 4 answer 1 Maindonald Chapter 4, p. 33 q. 2
3 A GH Chapter 3, p. 51 q. 5 answer 1 Maindonald Chapter 4, p. 33 q. 1
12 B GH Chapter 3, p. 51 q. 5 answer 2 Maindonald Chapter 3, p. 30 q. 8
4 C GH Chapter 3, p. 50 q. 4 answer 2 (reverse d) Maindonald Chapter 3, p. 30 q. 7 answer 1
6 D GH Chapter 3, p. 50 q. 3 answer 2 Maindonald Chapter 3, p. 30 q. 6
2 A GH Chapter 3, p. 50 q. 3 answer 3 Maindonald Chapter 3, p. 30 q. 4
11 B GH Chapter 3, p. 50 q. 3 answer 4 Maindonald Chapter 3, p. 30 q. 3
7 C GH Chapter 3, p. 51 q. 5 answer 3 Maindonald Chapter 3, p. 30 q. 2
9 D GH Chapter 3, p. 51 q. 5 answer 4 Maindonald Chapter 3, p. 29 q. 1

Week 3

R script

Visualizing Regression

See Visualizing Regression [4] pp 1-84 for

  • Regression to the mean
  • The regression paradox and the regression fallacy
  • The geometry and interpretation of the data (or concentration) ellipse: the regression line and the data ellipse
  • Visualizing correlation and the confidence interval for the slope using the data ellipse

Notes on Chapter 3: Linear Regression: the basics

3.2 Multiple predictors
Interpretation of coefficient βi:
"expected change in Y when you change Xi keeping other X's constant".
Not always directly meaningful: e.g. a quadratic model:
E(Y | X) = β0 + β1X + β2X^2
The rate of change of E(Y | X) with respect to X depends on X and is equal to β1 + 2β2X. Note that β1 is the rate of change of E(Y | X) with respect to X when X = 0. Similar considerations hold for models with interactions, etc.
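
A small simulation may make the quadratic case concrete. This is only a sketch: the data are simulated and the coefficients 1, 2 and -0.5 are arbitrary choices, not taken from the text.

```r
set.seed(1)
# Simulated data; the coefficients 1, 2 and -0.5 are arbitrary choices.
x <- runif(200, 0, 4)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(200, sd = 0.3)

fit <- lm(y ~ x + I(x^2))
b <- coef(fit)

# Slope of the fitted curve at a given x0: b1 + 2 * b2 * x0
slope_at <- function(x0) unname(b["x"] + 2 * b["I(x^2)"] * x0)

slope_at(0)   # identical to the coefficient of x
slope_at(2)   # the true slope here is 2 + 2 * (-0.5) * 2 = 0
```

The coefficient of x on its own describes the curve only at x = 0, which may lie outside the range of interest.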


Counterfactuals (causal) versus predictive interpretation of βi
When is each interpretation correct?
3.3 Interactions
See R script for example
3.4 Statistical inference
Where does SE(\hat{\beta}) come from?
With simple regression it's easy:
SE(\hat{\beta})= \frac{s_e}{\sqrt{\sum{(X_i-\bar{X})^2}}}
which is easier to interpret when written as:
SE(\hat{\beta})= \frac{s_e}{\sqrt{n} \times \sigma_X}
where σ_X is the 'population' standard deviation of X, i.e. the standard deviation using n as a divisor. Compare with SE(\bar{Y}) = \frac{s_y}{\sqrt{n}}. So the information on β is proportional to \sqrt{n} \times \sigma_X and inversely proportional to s_e.
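
The formula for simple regression can be checked numerically against the standard error reported by lm(). The data below are simulated and the coefficients are arbitrary; this is a sketch of the check, not part of the original notes.

```r
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 3 + 1.5 * x + rnorm(n)   # arbitrary simulated relationship

fit <- lm(y ~ x)
s_e <- summary(fit)$sigma                 # residual standard error
sigma_x <- sqrt(mean((x - mean(x))^2))    # SD of x with divisor n

se_formula <- s_e / (sqrt(n) * sigma_x)
se_lm <- summary(fit)$coefficients["x", "Std. Error"]

all.equal(se_formula, se_lm)
```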
For multiple regression the common formula is:
SE(\hat{\beta_k})= \frac{s_e}{\sqrt{ (1-R_k^2) \sum{(X_{ki}-\bar{X_k})^2}}}, where ...
Much more informative, however, is the formula:
SE(\hat{\beta_k})= \frac{s_e}{\sqrt{n} \times \sigma_{X_{k|other Xs}}}
where \sigma_{X_{k|other Xs}} is the standard deviation of the residual of Xk after regression on all the other regressors.
The importance of this formula is that it suggests how you might try to improve the estimate of βk. You can increase n or decrease the error of regression or increase the variability in Xk keeping other X's constant.
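
The same kind of numerical check works for multiple regression, using the residual of one regressor after regression on the others. Again the data are simulated with arbitrary coefficients, so this is an illustration under those assumptions.

```r
set.seed(3)
n <- 100
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)          # x2 is correlated with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # arbitrary simulated relationship

fit <- lm(y ~ x1 + x2)
s_e <- summary(fit)$sigma

# Residual of x1 after regressing it on the other regressor
r1 <- resid(lm(x1 ~ x2))
sigma_x1_given_rest <- sqrt(mean(r1^2))   # divisor n; residuals have mean 0

se_formula <- s_e / (sqrt(n) * sigma_x1_given_rest)
se_lm <- summary(fit)$coefficients["x1", "Std. Error"]

all.equal(se_formula, se_lm)
```

Making x2 more strongly correlated with x1 shrinks the residual standard deviation of x1 and so inflates the standard error of its coefficient.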
3.5 Graphical display of fitted model
See R script for alternative approach
3.6 Assumptions and diagnostics
If the assumptions are true, the residuals look approximately like a random sample from a normal distribution and should not show patterns when plotted in various ways. Common diagnostics: study the residuals and plot them. GH only mentions the traditional diagnostics of plotting residuals against fitted values and Xs. In addition, there are other plots that have served me very well.
3.7 Prediction and validation
Broad and important topic. Statisticians often pay too little attention to validation.


Notes on Chapter 4: Before and after fitting the model

4.1 Linear transformations
Standardizing:
Using z-scores
In passing: simple regression using z-scores:
 \hat{z}_y = r \times z_x where r is the correlation.
Using reasonable centre and scale
4.2 Centering and standardizing (especially for models with interactions)
If
E(Y) = β0 + β1X1 + β2X2 + β3X1X2
then
 \frac{\partial E(Y)}{\partial X_1}= \beta_1 + \beta_3 X_2
So β1 is the 'effect' of X1 when X2 = 0. If we recenter X2 we change the meaning of β1 and vice-versa. The information on β1 is 'maximized' when X2 is centered so that  \bar{X}_2 = 0. But this is no reason to centre X2 at  \bar{X}_2 since recentering also changes the meaning of β1.
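
The effect of recentering on the meaning of β1 can be seen in a short simulation. The sketch below uses simulated data with arbitrary coefficients and a predictor x2 whose mean is far from 0.

```r
set.seed(4)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n, mean = 5)           # x2 is far from 0 on its raw scale
y  <- 1 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rnorm(n)

fit_raw <- lm(y ~ x1 * x2)
x2c <- x2 - mean(x2)               # centered version of x2
fit_ctr <- lm(y ~ x1 * x2c)

b_raw <- coef(fit_raw)
b_ctr <- coef(fit_ctr)

# Coefficient of x1 after centering = slope in x1 at x2 = mean(x2)
implied <- unname(b_raw["x1"] + b_raw["x1:x2"] * mean(x2))
c(raw = unname(b_raw["x1"]), centered = unname(b_ctr["x1"]), implied = implied)

# Standard errors of the x1 coefficient under the two parameterizations
se_raw <- summary(fit_raw)$coefficients["x1", "Std. Error"]
se_ctr <- summary(fit_ctr)$coefficients["x1", "Std. Error"]
c(se_raw = se_raw, se_ctr = se_ctr)
```

The two fits are the same model with the same fitted values. Only the question answered by the x1 coefficient changes: the slope at x2 = 0 in one case and the slope at the mean of x2 in the other.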
4.3 Correlation and regression to the mean
Four lines: the principal axis (principal component line), the regression of Y on X, the SD line and the regression of X on Y. In z-scores:
Y on X: \hat{z}_y = r \times z_x
X on Y: \hat{z}_x = r \times z_y
SD line: \hat{z}_y =  z_x
If s_y = s_x then the SD line and the principal axis are identical. Otherwise it's more complicated.
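
The slopes of the four lines can be computed directly. The following sketch uses simulated data with a positive correlation (the numbers are arbitrary) and obtains the principal axis from the leading eigenvector of the covariance matrix.

```r
set.seed(5)
n <- 500
x <- rnorm(n, mean = 60, sd = 10)
y <- 20 + 0.6 * x + rnorm(n, sd = 12)   # arbitrary simulated relationship

r  <- cor(x, y)
sx <- sd(x)
sy <- sd(y)

# Slopes in the original (x, y) coordinates, with y on the vertical axis
slope_y_on_x <- r * sy / sx      # regression of Y on X
slope_x_on_y <- sy / (r * sx)    # regression of X on Y
slope_sd     <- sy / sx          # SD line (for r > 0)

# Principal axis: direction of the leading eigenvector of the covariance matrix
e <- eigen(cov(cbind(x, y)))$vectors[, 1]
slope_pa <- e[2] / e[1]

c(y_on_x = slope_y_on_x, sd_line = slope_sd,
  principal_axis = slope_pa, x_on_y = slope_x_on_y)

# In z-scores the SDs are equal, so the principal axis has slope 1,
# the same as the SD line
ez <- eigen(cor(cbind(x, y)))$vectors[, 1]
slope_pa_z <- ez[2] / ez[1]
slope_pa_z
```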
Exercise:
In a course in which the final grade is the average of the mark on a mid-term and on a final exam (both are graded out of 100), a professor would like to impute the mid-term grade of a student who missed the mid-term for a legitimate reason. What's the best way? Just use the final grade? Use the predicted mid-term grade after doing a regression of the mid-term on the final? Impute a z-score for the mid-term using the z-score on the final? Use reverse regression by regressing the final on the mid-term and imputing the value for the mid-term that would predict the student's grade on the final? Use principal axis regression? What are the consequences of using these various methods and which one do you think is best? Examine, at least briefly, the meaning of 'best' in this context.
4.4 Log transformations
Interpreting β's.
4.5 Other transformations
4.6 Building models for prediction (in contrast with causal inference)
i.e. models that fit well whether parameters have a causal interpretation or not.
GH omit one important consideration: the number of 'degrees of freedom' should not be too large relative to n. Harrell (2001) discusses this in detail. See sample size and validity.

Some notes on Chapter 5: Logistic regression

Download data from http://www.stat.columbia.edu/~gelman/arm/examples/nes/

To install the data download all three files and run 'nes_chap4.R'

This will create a data set named 'data' with data from elections from 1972 to 2000.

Use:

  d92 <- data[data$year == 1992,] 

to get data on the Bush/Clinton race of 1992.
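
Once a single year has been extracted, a logistic regression can be fitted with glm(). The sketch below uses a simulated stand-in for the 'data' object: the column names 'income' and 'vote' and the coefficients used to generate the data are assumptions made for illustration, not the contents of the NES files.

```r
set.seed(6)
# Stand-in for the 'data' object created by the NES script; the columns
# 'income' and 'vote' are hypothetical names used only in this sketch.
data <- data.frame(
  year   = rep(c(1988, 1992, 1996), each = 300),
  income = sample(1:5, 900, replace = TRUE)
)
data$vote <- rbinom(900, 1, plogis(-1.4 + 0.33 * data$income))

# Same subsetting idiom as above
d92 <- data[data$year == 1992, ]

fit <- glm(vote ~ income, family = binomial(link = "logit"), data = d92)
coef(fit)

# Fitted probability at each income level
p_hat <- predict(fit, newdata = data.frame(income = 1:5), type = "response")
p_hat
```

If the year variable contains missing values, data[data$year == 1992, ] returns rows of NAs for them. Using subset(data, year == 1992) or wrapping the condition in which() avoids this.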

Assignment 3 and things to do

NEW Deadline: 5 pm, Wednesday, November 12.

1. Questions on readings for next class 
Read Chapter 7 of Gelman & Hill and formulate at least one question. Add it to the questions at MATH 6627 2008 Questions
2. Individual assignments 

The following assignment should be done individually. Email your work to me by the deadline. You can send a text file, a Word file or a pdf file. If you wish to use some other format, please let me know so I can make sure that I will be able to read it.

  1. Look at the data set http://www.math.yorku.ca/~georges/Data/coffee.csv. It has three relevant variables: 'Heart', a measure of heart condition (the higher, the less healthy); 'Coffee', a measure of coffee consumption; and 'Stress', a measure of occupational stress. How could you use this data to address the question of whether coffee consumption is harmful to the heart? Discuss the assumptions needed to get anywhere with the data and discuss the nature of various assumptions that might lead to different interpretations, if relevant.
  2. Look at the data set http://www.math.yorku.ca/~georges/Data/hwX.csv where X is the remainder when you divide your 'class number' (the number from 1 to 20 on the class list on the web) by 4. Thus X will be 0, 1, 2, or 3 (i.e. if your number is 7 then X is 3 and you would use the data set http://www.math.yorku.ca/~georges/Data/hw3.csv). The data set contains data on three variables: Health (the higher the better), Height and Weight. All are in standardized units. What would this data set have to say about the relationship between Weight and Health? Discuss the assumptions needed to get anywhere with the data and discuss the nature of various assumptions that might lead to different interpretations, if relevant.
  3. Do the exercise above (at the end of the notes on Section 4.3) on imputing a mid-term grade.
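
For question 2, the data set name can be worked out in R with the remainder operator %%. The helper name hw_url below is hypothetical and introduced only for this sketch.

```r
# 'class_number' is your number on the class list; 7 is the example in the text.
# hw_url is a hypothetical helper name, not part of the course software.
hw_url <- function(class_number) {
  x <- class_number %% 4   # remainder on division by 4: 0, 1, 2 or 3
  paste0("http://www.math.yorku.ca/~georges/Data/hw", x, ".csv")
}

hw_url(7)
```

The resulting string can be passed to read.csv() to load the data, e.g. hw <- read.csv(hw_url(7)).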
3. Review 
Review your textbooks on multiple regression. What is a confidence ellipse? What is its connection with hypothesis testing? What is a Scheffé confidence interval? What is a Bonferroni confidence interval?


Week 3.5

Here are the 'blackboard' notes.


Week 4

March 4, 2009

News

Assignment 3 was officially due after the start of the strike which means that it wasn't due until now. We will discuss when it ought to be completed.

Plans

The major activity in the course is the analysis of a real data problem. All the data problems I have involve hierarchical or longitudinal data so our priority is to learn enough about the analysis of this kind of data so you can get started on projects by April 1. I propose to meet every week for the next three weeks and then we will reassess our progress.


Visualizing Simple Regression

Visualizing Multiple Regression

For next week
1. Readings for next week
Read Chapters 9 and 10 of Gelman & Hill (skip 8 unless you wish to read it on your own). These two chapters are on causal inference with observational data. They are challenging but very important for professional statisticians to understand. Formulate at least one question. Add it to the questions at MATH 6627 2008 Questions
2. Formulate questions on the material we have seen in class this week: MATH 6627 2008 Questions
3. Finish outstanding assignments.
4. Something to think about
  1. Look at the data set http://www.math.yorku.ca/~georges/Data/hs.csv. This data consists of math achievement scores and 'ses' (socio-economic status) of 1977 students in 40 U.S. schools, 21 of which are Catholic and 19 public. A goal in analyzing this data is to describe the relationship between math achievement and ses, and to examine whether the relationship is similar in different school sectors and among boys and girls. Explore the data and think about how one could address these questions. A few specific questions to think about:
Is a low ses child better off in a high ses school or in a lower ses school? If there is a difference, are we confident that it is the school that makes the difference?
Is there any evidence that students in boys or girls schools do better than students in coed schools?
How do public schools compare with Catholic schools?


A brief description of some variables:

school: a numeric id for each school
mathach: a math achievement score
ses: socio-economic status (education and income of parents)
Size: size of school
PRACAD: priority given to academics in a school
DISCLIM: disciplinary climate
Minority: hispanic or black
HIMINTY: high proportion of minorities in school
> hs <- read.csv("http://www.math.yorku.ca/~georges/Data/hs.csv")
> source("http://www.math.yorku.ca/~georges/R/fun.R")

> dim(hs)
[1] 1977   13
> library( car )
> some( hs )
        X school mathach    ses sector female    Sex Minority Size   Sector PRACAD DISCLIM
176  1003   2458   9.142  0.242      1      1 Female      Yes  545 Catholic   0.89  -1.484
500  1777   3013  18.846  0.032      0      1 Female       No  760   Public   0.56  -0.213
680  3022   4292  16.442 -0.048      1      0   Male      Yes 1328 Catholic   0.76  -0.674
880  3705   5619  21.451  0.412      1      1 Female       No 1118 Catholic   0.77  -1.286
1023 3909   5720   8.259 -0.238      1      1 Female       No  381 Catholic   0.65  -0.352
1064 3950   5720  18.241  1.132      1      0   Male       No  381 Catholic   0.65  -0.352
1160 4210   6074  12.553  0.042      1      1 Female       No 2051 Catholic   0.32  -1.018
1178 4228   6074  18.875 -0.508      1      1 Female       No 2051 Catholic   0.32  -1.018
1818 6302   8707  22.102  0.792      0      0   Male       No 1133   Public   0.48   1.542
1942 7150   9586  10.626  1.132      1      1 Female       No  262 Catholic   1.00  -2.416
     HIMINTY
176        1
500        0
680        1
880        0
1023       0
1064       0
1160       0
1178       0
1818       0
1942       0
> 
> tab(size = table(hs$school))
size
   29    32    34    35    36    37    38    41    42    44    45    48    49    51    52 
    1     2     1     1     1     2     1     1     1     1     2     2     1     1     2 
   53    54    55    56    57    58    59    60    63    64    65    66 Total 
    5     1     1     2     3     2     1     1     1     1     1     1    40 
> tab(~ Sex + school, hs)
        school
Sex      1317 1906 2208 2458 2626 2629 2639 2658 2771 3013 3610 3992 4292 4511 4530 4868
  Female   48   27   35   57   18    0   24   27   28   19   29   21    0   58   63   11
  Male      0   26   25    0   20   57   18   18   27   34   35   32   65    0    0   23
  Total    48   53   60   57   38   57   42   45   55   53   64   53   65   58   63   34
        school
Sex      5619 5640 5650 5720 5761 5762 6074 6484 6897 7172 7232 7342 7345 7688 7697 7890
  Female   30   24   32   24   52   21   56   20   29   22   30    0   29    0   11   24
  Male     36   33   13   29    0   16    0   15   20   22   22   58   27   54   21   27
  Total    66   57   45   53   52   37   56   35   49   44   52   58   56   54   32   51
        school
Sex      7919 8531 8627 8707 8854 8874 9550 9586 Total
  Female   16   23   24   26   17   21   19   59  1074
  Male     21   18   29   22   15   15   10    0   903
  Total    37   41   53   48   32   36   29   59  1977

> tab( ~Sector, up(hs, ~school))
Sector
Catholic   Public    Total 
      21       19       40 
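
The functions tab() and up() used above come from the course file fun.R. Judging only from the output shown, tab() appears to tabulate with totals and up() appears to reduce the data to one row per school. Under that assumption, roughly similar summaries can be sketched in base R. The data frame below is a small simulated stand-in, not the real hs data.

```r
set.seed(7)
# Small simulated stand-in for 'hs'; values are invented for illustration.
school_sizes <- c(3, 5, 4, 5)
hs_sim <- data.frame(
  school = rep(c(101, 102, 103, 104), times = school_sizes),
  Sector = rep(c("Catholic", "Public", "Public", "Catholic"),
               times = school_sizes),
  ses    = round(rnorm(sum(school_sizes)), 2)
)

# Number of students per school, then the distribution of those sizes
n_per_school <- table(hs_sim$school)
size_dist <- table(size = as.vector(n_per_school))
addmargins(size_dist)

# One row per school, keeping variables that are constant within school
schools <- hs_sim[!duplicated(hs_sim$school), c("school", "Sector")]
sector_counts <- table(schools$Sector)
addmargins(sector_counts)
```

Counting schools rather than students matters here: a table of Sector on the student-level data would weight each school by its size.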

Week 5

March 11, 2009

Links to course materials

For next week

1. Readings for next week
Read Chapters 11 and 12 of Gelman & Hill. Formulate at least one question. Add it to the questions at MATH 6627 2008 Questions
2. Start working on the following individual assignment due April 1:
Using the full high school data set at http://www.math.yorku.ca/~georges/Data/hsfull.csv address the following questions:
1) Describe the relationship between math achievement and SES. How does it seem to vary between school sectors, between girls and boys?
2a) In what kind of school does a 'poor' girl (ses = -1) seem to be better off? Would she be better off in a school with relatively low mean SES or a school with relatively high SES, a Catholic or a public school, a girls school or a mixed school?
2b) Do question 2a with 'poor' replaced with 'rich' (ses = 1).
2c) Do question 2a with 'girl' replaced with 'boy'.
2d) Do question 2b with 'girl' replaced with 'boy'.
Compare the 'effect' of SES among boys in each combination of contexts: public, Catholic, poor school, rich school, girls, boys or mixed schools.
Compare the 'effect' of SES among girls in each combination of contexts: public, Catholic, poor school, rich school, girls, boys or mixed schools.
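
One possible starting point for questions of this kind is a multilevel model that separates a student's own SES from the mean SES of the school. The sketch below is not the intended solution to the assignment. It uses simulated data with arbitrary coefficients, and it assumes the nlme package (one of several options for fitting such models) and a random intercept for school only.

```r
library(nlme)   # ships with standard R installations

set.seed(8)
n_school <- 40
n_per <- 50
school <- rep(seq_len(n_school), each = n_per)
sector <- rep(rep(c("Catholic", "Public"), each = n_school / 2), each = n_per)
school_ses <- rep(rnorm(n_school, sd = 0.5), each = n_per)   # school-level SES
ses <- school_ses + rnorm(n_school * n_per, sd = 0.7)        # student SES
u <- rep(rnorm(n_school, sd = 1.5), each = n_per)            # school effect
mathach <- 12 + 2 * ses + 3 * school_ses +
  1.5 * (sector == "Catholic") + u + rnorm(n_school * n_per, sd = 6)

d <- data.frame(school, sector, ses, mathach)
d$ses_mean <- ave(d$ses, d$school)   # observed school mean SES
d$ses_dev  <- d$ses - d$ses_mean     # student's deviation from the school mean

# Within-school SES slope, between-school SES slope and a sector difference,
# with a random intercept for each school
fit <- lme(mathach ~ ses_dev + ses_mean + sector,
           random = ~ 1 | school, data = d)
fixef(fit)
```

In this parameterization the coefficient of ses_dev compares students within the same school, while the coefficient of ses_mean compares schools. Whether either has a causal interpretation is a separate question that the model alone cannot answer.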

Week 6

March 18, 2009

Course materials for this week

Hierarchical Models Part I

  • Hierarchical Models Part I version 2 (reasonably clean): Hierarchical_Models_I/Hierarchical_Models_I_v2.pdf
  • R scripts for Hierarchical Models Part I: Hierarchical_Models_I/PartI.R

Hierarchical Models Part I (in progress)

  • Hierarchical Models Part I version 3 (still a mess): Hierarchical_Models_I/Hierarchical_Models_I_v3_CURRENT_DRAFT.pdf
  • R scripts for Hierarchical Models Part I(b) (in progress): Hierarchical_Models_I/PartI-b.R

Data

For next week

1. No new readings. Consolidate previous readings.
2. Last week's assignment deadline is extended by 1 week to April 1.

Weeks 7 & 8

March 25 and April 1, 2009

Course materials

Hierarchical Models Part II

  • Longitudinal Data Analysis with R: Hierarchical_Models_II/Workshop-Longitudinal_with_R-2009_03_25.pdf
  • Non-linear mixed models and generalized linear mixed models: Hierarchical_Models_II/TalkOnComasAndMigraines.pdf
  • R script for a sample analysis: Hierarchical_Models_II/Sample_Analysis.R
  • A few comments: Hierarchical_Models_II/Longitudinal_Data_Analysis_with_Mixed_Models_using_R_Concluding_Thoughts.pdf

Splines


For next week

The assignment that was due April 1 is now due April 8. Preferably mail me a pdf or Word file.

Week 9

Here's the script we wrote in class: MATH6627 Sample analysis 2009 04 22

Links

Organizations

Consulting

TED Talks on Statistics

Other

New York Times, April 8, 1984
