MATH 6627 2010-11 Practicum in Statistical Consulting/Assignment Teams/Rubin





  • Simpson's paradox: an apparent paradox in which the association between two variables (X and Y) changes, or even reverses, when a third variable (Z) is taken into account.
  • Z is called a confounding factor.


A confounding factor is associated with both the outcome variable and the primary risk factor (or independent variable) of interest.



The University of California, Berkeley was sued for gender bias in its graduate admissions: men applying to graduate school were more likely to be accepted than women.


However, when acceptance rates for men and women were compared within each individual department, there was actually a slight bias in favour of women.
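A minimal sketch of how such a reversal can arise. The counts and department labels below are hypothetical, invented for illustration; they are not the Berkeley figures. Women have the higher acceptance rate within each department, yet men have the higher rate overall, because most women apply to the more selective department.

```python
# Hypothetical counts, invented for illustration only (not the Berkeley data).
# Each entry is (admitted, applicants).
applications = {
    "Dept A": {"men": (80, 100), "women": (18, 20)},
    "Dept B": {"men": (4, 20), "women": (30, 100)},
}


def rate(admitted, applicants):
    """Acceptance rate as a proportion."""
    return admitted / applicants


def pooled_rate(sex):
    """Acceptance rate for one sex after pooling all departments."""
    admitted = sum(dept[sex][0] for dept in applications.values())
    applicants = sum(dept[sex][1] for dept in applications.values())
    return rate(admitted, applicants)


# Within each department, compare the two rates.
for name, dept in applications.items():
    print(name, "men:", rate(*dept["men"]), "women:", rate(*dept["women"]))

# Pooled over departments, the direction of the difference flips.
print("Pooled men:", pooled_rate("men"), "women:", pooled_rate("women"))
```

Here department plays the role of Z: it is associated with both gender (who applies where) and the outcome (how selective the department is).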




Male gender as a risk factor for malaria.


Odds Ratio = 1.7 (P<0.05).

Confounder vs exposure


Odds Ratio = 7.8

Confounder vs outcome


Odds Ratio = 5.3


Odds Ratio, Outdoor Occupation = 1.06

Odds Ratio, Indoor Occupation = 1.00
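The odds ratios above can be reproduced with a short calculation. The 2x2 counts in this sketch are an assumption: they do not appear on this page and were chosen only because they give the reported values. The helper name odds_ratio is also just illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for a 2x2 table laid out as
    row 1 = (a, b), row 2 = (c, d)."""
    return (a * d) / (b * c)


# ASSUMED counts, not taken from the source page.
# Layout: (male cases, male controls, female cases, female controls)
outdoor = (53, 15, 10, 3)
indoor = (35, 53, 52, 79)

# Crude table: ignore occupation by adding the two strata.
crude = tuple(o + i for o, i in zip(outdoor, indoor))

crude_or = odds_ratio(*crude)
outdoor_or = odds_ratio(*outdoor)
indoor_or = odds_ratio(*indoor)

# Confounder vs exposure: outdoor work among males vs females.
males_outdoor, females_outdoor = outdoor[0] + outdoor[1], outdoor[2] + outdoor[3]
males_indoor, females_indoor = indoor[0] + indoor[1], indoor[2] + indoor[3]
conf_exposure_or = odds_ratio(males_outdoor, females_outdoor,
                              males_indoor, females_indoor)

# Confounder vs outcome: outdoor work among cases vs controls.
cases_outdoor, controls_outdoor = outdoor[0] + outdoor[2], outdoor[1] + outdoor[3]
cases_indoor, controls_indoor = indoor[0] + indoor[2], indoor[1] + indoor[3]
conf_outcome_or = odds_ratio(cases_outdoor, controls_outdoor,
                             cases_indoor, controls_indoor)

print("Crude OR (male vs female):", round(crude_or, 2))
print("Confounder vs exposure OR:", round(conf_exposure_or, 2))
print("Confounder vs outcome OR:", round(conf_outcome_or, 2))
print("Outdoor stratum OR:", round(outdoor_or, 2))
print("Indoor stratum OR:", round(indoor_or, 2))
```

Under these assumed counts, the crude association between male gender and malaria disappears within each occupation stratum, which is the pattern expected when occupation confounds the relationship.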





Pulse Article: For our own good [1].


Does the article suggest a causal relationship between two variables? If so, which? Are the data observational or experimental?


Yes, this article suggests a causal relationship between the type of advertising and unnecessary prescriptions. The article also places heavy emphasis on the claim that drug companies spend too much money on direct-to-consumer advertising of prescription medicines, which may drive up the health care costs associated with unnecessary prescriptions. The data appear to be observational, because the article does not mention that people were assigned to be exposed or not exposed to advertising.


Can you think of alternative explanations to causality? Confounding factors? Or explanations consistent with causality? Mediating factors?


I think that there could be a mediating factor leading to unnecessary health care costs: the side effects of the drug. Side effects can be important, because people can spend a lot of money as a result of drug complications. I agree with the causal argument of this article, which rejects the claim by the president of Pfizer that advertising constitutes one of the largest and most successful public health campaigns in US history. A confounding factor may also exist that leads people to require additional prescriptions. Gender is one example, since in some cases side effects differ between men and women.


Have any confounding factors been accounted for in the analysis?


No, confounding factors were not accounted for. The authors focus only on the type of advertising.


Have any mediating factors been controlled for in a way that vitiates a causal interpretation of the relationship?


The article does not mention controlling for any mediating factor. The authors also did not offer a formal causal interpretation, and there was no mention of a relationship between the type of drug and the case group.


What is your personal assessment of the evidence for causality in the study that is the subject of the article?


When I started reading the article I expected it to be more interesting, but I was disappointed by the lack of information. The authors do not support their claims. In other words, they do not prove what they assert, and they give only a very general point of view. Instead of helping us understand their argument, they leave us with too many questions.

Laura: I found the topic of the article quite interesting, but like Luis, felt the article was lacking in substantive information.


Q3: A survey of students at York reveals that the average class size of the classes they attend is 130. A survey of faculty shows an average class size of 30. The students must be exaggerating their class sizes or the faculty under-reporting.

A3: This could be an instance of sampling bias. Maybe all of the students surveyed were coming from the same 130 person class, while the professors surveyed were all coming from smaller classes. It could also reflect the difference between surveying 1st and 2nd year courses vs. 3rd year courses, etc.
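A further explanation in the same spirit is that neither group needs to be misreporting: a class of size s is reported once by its instructor but s times by its students, so the student average is weighted toward large classes. A minimal sketch with hypothetical class sizes (invented for illustration, not York data):

```python
# Hypothetical class sizes: nine small classes and one large one.
class_sizes = [10] * 9 + [210]

# Faculty view: each class counts once.
faculty_mean = sum(class_sizes) / len(class_sizes)

# Student view: a class of size s is reported by s students,
# so it contributes s * s to the total of reported sizes.
student_mean = sum(s * s for s in class_sizes) / sum(class_sizes)

print("Average class size reported by faculty:", faculty_mean)
print("Average class size reported by students:", student_mean)
```

Both averages are computed from the same set of classes, and they differ only in how each class is weighted.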

Q6: If smoking really is bad for your health, you expect a comparison of a group of people who have quit smoking with a group that have continued to reveal that the group quitting is, on average, healthier than the group that continued.

A6: Maybe the people who quit are quitting for health reasons (i.e. heavy smokers tend to quit while light smokers tend to continue smoking). Also, when you stop smoking your stress level may go up, which would decrease your overall level of health.

Q9: If you want to reduce the number of predictor variables in a model, a technique like forward stepwise regression will generally do a good job of identifying which variables you should keep.

A9: For a regression model with n possible predictor variables, the first step involves evaluating n predictor variable subsets, each consisting of a single predictor variable, and selecting the one with the highest evaluation criterion. The next step selects from among n-1 subsets, the next step from n-2 subsets, and so on. It is not guaranteed to find the subset with the highest evaluation criterion. Stepwise selection also ignores the problem of multiple inference (performing more than one statistical inference procedure on the same data set), which can lead to erroneous conclusions. We should not rely solely on automated procedures: it may make sense from a logical perspective to include some variables irrespective of their significance (e.g. confounders).
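The counting argument above can be made concrete. This sketch only counts candidate models; it does not fit any regression, and the function names are illustrative. It assumes forward selection is allowed to run until every predictor has entered.

```python
def forward_stepwise_fits(n):
    """Candidate models evaluated by forward selection over n predictors:
    n at the first step, n - 1 at the second, and so on down to 1."""
    return sum(n - k for k in range(n))


def all_subsets_fits(n):
    """Number of non-empty subsets of n predictors."""
    return 2 ** n - 1


for n in (5, 10, 20):
    print(n, forward_stepwise_fits(n), all_subsets_fits(n))
```

Because forward selection visits only a small fraction of the possible subsets, the subset with the best criterion value can easily be one it never evaluates.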

Q12: In general we don’t need to worry about interactions between variables unless there is a correlation between them.

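A12: Correlation measures the strength of the linear relationship between quantitative variables, and the relationship may not be linear. More importantly, interaction and correlation are different things: two predictors can be uncorrelated and still interact in their effect on the response. Also, inference about a correlation often assumes that the variables are normally distributed, which is not always the case.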
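A small sketch of two predictors that are exactly uncorrelated yet interact strongly. The balanced design and the response y = x1 * x2 are hypothetical, chosen only to make the point.

```python
# Hypothetical balanced 2x2 design with response y = x1 * x2.
data = [(x1, x2, x1 * x2) for x1 in (-1, 1) for x2 in (-1, 1)]


def pearson(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def effect_of_x1(x2_level):
    """Change in y when x1 moves from -1 to 1 with x2 held fixed."""
    y_at = {x1: y for x1, x2, y in data if x2 == x2_level}
    return y_at[1] - y_at[-1]


x1_values = [row[0] for row in data]
x2_values = [row[1] for row in data]

print("Correlation of x1 and x2:", pearson(x1_values, x2_values))
print("Effect of x1 when x2 = 1:", effect_of_x1(1))
print("Effect of x1 when x2 = -1:", effect_of_x1(-1))
```

The effect of x1 changes sign depending on x2, which is an interaction, even though the two predictors have zero correlation.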

Q15: Statistical theory shows that the best way to impute a mid-term grade for a student who missed the test with a valid excuse is to use the predicted mid-term grade based on the other grades in the course.

A15: The mid-term mark could be affected by numerous factors. What if the student was cheating on his/her assignments? What if the difficulty level of the mid-term is not equivalent to that of the other assigned work in the course? In addition, the other grades in the course come from a variety of assessment methods, and students do not necessarily perform equivalently across all of them. For example, John may do well on assignments and presentations but perform poorly on tests.



As we saw, 'Sector' appears to be an important predictor. Consider models using ses and Sector. Aim to estimate the between Sector gap as a function of ses if there is an interaction between Sector and ses. Check for and provide for a possible contextual effect of ses. Plot expected math achievement in each sector. Plot the gap with SEs. Consider the possibility that the apparently flatter effect of ses in Catholic school could be due to a non-linear effect of ses. How would you test whether this is a reasonable alternative explanation?

The interaction between Sector and ses was significant (p<0.05) and the contextual effect of ses was also significant. This implies that the effect of ses on math achievement differs between public and Catholic schools.

wald( fitn, L.context )

                         numDF denDF  F.value p.value
Contextual effect of ses     1    77 27.47158 <.00001


Figure 1: Plot of the gap between ses scores for Catholic and public schools


Figure 2: Plot of expected math achievement in each sector for ses levels of -0.5, 0 and 0.5

To test whether the flatter effect of ses in Catholic schools could be due to a non-linear effect of ses, we would add a quadratic ses term to the model and check whether it is significant. If the quadratic term is significant and the Sector by ses interaction weakens once it is included, then the flatter effect of ses in Catholic schools may be a result of a non-linear effect of ses.


Take the example further by incorporating Sex. Consider the 'contextual effect' of Sex, which is school sex composition. Note that there are three types of schools: Girls, Boys and Coed schools. If you consider an interaction between Sector and school gender composition, you will see that the Public Sector only has Coed schools. What is the consequence of this fact for modelling sex composition and Sector effects?

To see why the public sector has only coed schools when we consider an interaction between Sector and school gender composition, we look at the following results.

> tab( ~ Sex + school, hs)

> by ( hs, hs$Sector, function( dd ) tab( ~ Sex + school, dd))

hs$Sector: Catholic


Sex     1317 1906 2208 2458 2629 2658 3610 3992 4292 4511 4530 4868 5619 5650
 Female   48   27   35   57    0   27   29   21    0   58   63   11   30   32
 Male      0   26   25    0   57   18   35   32   65    0    0   23   36   13
 Total    48   53   60   57   57   45   64   53   65   58   63   34   66   45

Sex     5720 5761 6074 7172 7342 7688 9586 Total
 Female   24   52   56   22    0    0   59   651
 Male     29    0    0   22   58   54    0   493
 Total    53   52   56   44   58   54   59  1144

hs$Sector: Public


Sex     2626 2639 2771 3013 5640 5762 6484 6897 7232 7345 7697 7890 7919 8531
 Female   18   24   28   19   24   21   20   29   30   29   11   24   16   23
 Male     20   18   27   34   33   16   15   20   22   27   21   27   21   18
 Total    38   42   55   53   57   37   35   49   52   56   32   51   37   41

Sex     8627 8707 8854 8874 9550 Total
 Female   24   26   17   21   19   423
 Male     29   22   15   15   10   410
 Total    53   48   32   36   29   833

This output is hard to interpret directly, so we summarize it to find the consequence for modelling sex composition and Sector effects.


As we see in this output, the public sector has only coed schools. Consequently, the Sector effect can only be compared among coed schools, and the interaction between Sector and sex composition cannot be estimated.

We plot these data to see what is going on.




- The first two graphs show that if we model sex composition and Sector effects, our model will not be normally distributed.

- The last graph shows that we cannot give a prediction for either mathach or ses, but we can pool the data to get around this situation and obtain a better analysis of the data.


Does it appear that boys are better off in a boy's school and girls in a girl's school or are they better off in coed schools? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?

As can be seen in Figure 3, Figure 4 and the following R output, girls and boys both perform better in single-sex schools. However, all public schools are coed, while Catholic schools can be all male, all female or coed, so Sector may be acting as a confounding factor in the analysis of the effect of sex category on math achievement.


Figure 3: Plot of Math Achievement by Sex Category


Figure 4: Plot of Math Achievement by ses with respect to sex category

Call: lm(formula = mathach ~ ses + factor( +, data = hs)

Residuals:
     Min       1Q   Median       3Q      Max
-19.0096  -4.9276   0.1901   4.9705  15.2090

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       14.6252     0.4398  33.251  < 2e-16 ***
ses                1.6955     0.6140   2.761 0.005807 **
factor(           -2.1118     0.4726  -4.469 8.32e-06 ***
factor(           -1.8262     0.5442  -3.355 0.000807 ***
ses:Sex.catCoed    2.0028     0.6545   3.060 0.002242 **
ses:Sex.catGirls   0.7560     0.7519   1.005 0.314815
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.351 on 1971 degrees of freedom
Multiple R-squared: 0.1396, Adjusted R-squared: 0.1374
F-statistic: 63.95 on 5 and 1971 DF, p-value: < 2.2e-16

lm(formula = mathach ~ ses + factor( * factor(Sector),
   data = hs)

Residuals:
     Min       1Q   Median       3Q      Max
-19.4102  -4.8167   0.1175   5.0436  15.3054

Coefficients: (2 not defined because of singularities)
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           14.9586     0.4175  35.832  < 2e-16 ***
ses                    3.1046     0.1952  15.907  < 2e-16 ***
factor(               -1.5700     0.5090  -3.084 0.002068 **
factor(               -2.1432     0.5257  -4.077 4.74e-05 ***
factor(Sector)Public  -1.4099     0.3634  -3.879 0.000108 ***
factor(                    NA         NA      NA       NA
factor(                    NA         NA      NA       NA
---


  • Is a low ses child better off in a high ses school or are they better off in a school of a similar ses? How about a high ses child in a low ses school? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?
  • Based on the first graph, the gap in math achievement between low ses students and high ses students is smaller for schools with a low ses than for schools with a high ses. So, a low ses child is better off in a low ses school.
  • Contrary to what one would expect, it appears as though students with a low ses outperform students with a high ses irrespective of their school's ses status (i.e. high ses or low ses).




Is a minority status child better off in a school with a higher proportion of minority status children or are they better off in a school with a low proportion? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?



According to the first graphs, it is better to send a child with low SES to a school with a low proportion of minority status children. On the other hand, it is better to send a child with higher SES to a school with a high proportion of minority status children. The same holds for a child with low Mathach.


This graph helps in understanding the effect of the proportion of minority status children, because it lets us compare Mathach against SES for the minority and majority groups together and see the gap between them. Parents should be careful when using it to make decisions for their children: it differs from the other graphs because it introduces a new predictor, and with it a new analysis of contextual effects.


Preliminary analysis 1


Preliminary analysis 2

File:Consulting pres2(pdf).pdf
