MATH 6627 2010-11 Practicum in Statistical Consulting/Assignment Teams/Rubin


Revision as of 23:33, 15 March 2011




  • An apparent paradox in which the association between two variables (X and Y) changes when a third variable (Z) is taken into account.
  • Z is called a confounding factor.


A confounding factor is associated with both the outcome variable and the primary risk factor (or independent variable) of interest.



The University of California, Berkeley was sued for gender bias in its acceptance rates: men applying to graduate school were more likely to be accepted than women.


However, when male and female acceptance rates were compared within each individual department, there was actually a slight bias in favour of accepting women.
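The reversal can be reproduced with a small numerical sketch. The counts below are hypothetical and are not the actual Berkeley figures; they are chosen only so that women have the higher acceptance rate in each department while men have the higher rate overall, because most women apply to the more selective department.

```python
# Hypothetical counts (NOT the real Berkeley data): (applicants, admitted)
applications = {
    "Dept A": {"men": (800, 480), "women": (100, 65)},
    "Dept B": {"men": (200, 40), "women": (900, 225)},
}

def rate(applied, admitted):
    """Acceptance rate for one group in one department."""
    return admitted / applied

def pooled_rate(group):
    """Acceptance rate for a group after pooling all departments."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admitted / applied

# Within each department women are accepted at the higher rate ...
for dept, groups in applications.items():
    print(dept, {g: round(rate(*counts), 2) for g, counts in groups.items()})

# ... but pooling over departments reverses the comparison.
print("Pooled", {g: round(pooled_rate(g), 2) for g in ("men", "women")})
```

Here department plays the role of Z: it is associated with both gender (who applies where) and the outcome (how selective the department is).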




Male gender as a risk factor for malaria.


Odds Ratio = 1.7 (P<0.05).

Confounder vs exposure


Odds Ratio = 7.8

Confounder vs outcome


Odds Ratio = 5.3


Odds Ratio Outdoor Occupation = 1.06

Odds Ratio Indoor Occupation = 1.00
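As a rough illustration of how a crude odds ratio can exceed 1 while the stratum-specific odds ratios are close to 1, here is a sketch using hypothetical 2x2 tables. The counts are invented for illustration and do not reproduce the odds ratios quoted above; the helper name `odds_ratio` is likewise an assumption.

```python
def odds_ratio(exp_cases, exp_noncases, unexp_cases, unexp_noncases):
    """Odds ratio of a 2x2 table: (a * d) / (b * c)."""
    return (exp_cases * unexp_noncases) / (exp_noncases * unexp_cases)

# Hypothetical counts, in the order:
# (male cases, male non-cases, female cases, female non-cases)
outdoor = (50, 50, 10, 10)   # mostly men, high malaria risk for everyone
indoor = (5, 45, 10, 90)     # mostly women, low malaria risk for everyone

# Crude table: add the two strata cell by cell, ignoring occupation.
crude = tuple(o + i for o, i in zip(outdoor, indoor))

print("Outdoor OR:", odds_ratio(*outdoor))
print("Indoor OR: ", odds_ratio(*indoor))
print("Crude OR:  ", round(odds_ratio(*crude), 2))
```

In this made-up example men are over-represented in outdoor work, and outdoor work carries the higher risk, so the crude comparison attributes the occupational effect to gender.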





Pulse Article: For our own good [1].


Does the article suggest a causal relationship between two variables? If so, which? Are the data observational or experimental?


Yes, this article suggests a causal relationship between direct-to-consumer advertising and unnecessary prescriptions. It places heavy emphasis on the claim that drug companies spend too much money on direct-to-consumer advertising of prescription medicines, which may increase health care costs through unnecessary prescriptions. The data appear to be observational, because the article does not mention any assignment of people to be exposed or not exposed to advertising.


Can you think of alternative explanations to causality? Confounding factors? Or explanations consistent with causality? Mediating factors?


I think that there could be a mediating factor leading to unnecessary health care costs. This mediating factor could be the secondary reactions (side effects) of the drug. In some cases these reactions matter a great deal, because people can spend a lot of money as a result of drug complications. I agree with the causal argument of this article, because the authors do not accept what the president of Pfizer said about the unspoken truth of advertising, namely that it constitutes one of the largest and most successful public health campaigns in US history. A confounding factor may also exist that leads people to require additional prescriptions. Gender is one example, since in some cases secondary reactions differ between men and women.


Have any confounding factors been accounted for in the analysis?


No, confounding factors were not accounted for. They emphasize the type of advertising.


Have any mediating factors been controlled for in a way that vitiates a causal interpretation of the relationship?


The article does not mention controlling for any mediating factors. It also does not offer a formal causal analysis, and there is no mention of a relationship between the type of drug and the case group.


What is your personal assessment of the evidence for causality in the study that is the subject of the article?


I expected the article to be more interesting, but it was disappointing because of the lack of information. The authors don't support their claims. In other words, they don't prove what they are writing and only give a very general point of view. Instead of helping us understand the issue, they leave us with too many questions.

Laura: I found the topic of the article quite interesting, but like Luis, felt the article was lacking in substantive information.


Q3: A survey of students at York reveals that the average class size of the classes they attend is 130. A survey of faculty shows an average class size of 30. The students must be exaggerating their class sizes or the faculty under-reporting.

A3: This could be an instance of sampling bias. Maybe all of the students surveyed came from the same 130-person class, while the professors surveyed all taught smaller classes. It could also reflect the difference between surveying 1st and 2nd year courses vs. 3rd year courses, etc.
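It is also worth noting that both averages can be accurate at once, because a survey of students counts each class once per enrolled student, so large classes are over-represented. The sketch below uses hypothetical class sizes, chosen only so that the two averages come out at 30 and 130; they are not York data.

```python
# Hypothetical class sizes (not real data): 15 small classes and 2 large ones.
class_sizes = [10] * 15 + [180] * 2

# Faculty view: each class counts once.
faculty_mean = sum(class_sizes) / len(class_sizes)

# Student view: each class is reported by every student in it,
# so a class of size s contributes s reports of the value s.
student_mean = sum(s * s for s in class_sizes) / sum(class_sizes)

print("Average class size per class:  ", faculty_mean)
print("Average class size per student:", student_mean)
```

Neither group needs to be misreporting; the two surveys are simply averaging over different sampling units.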

Q6: If smoking really is bad for your health, you expect a comparison of a group of people who have quit smoking with a group that have continued to reveal that the group quitting is, on average, healthier than the group that continued.

A6: Maybe the people who quit are quitting for health reasons (i.e. heavy smokers tend to quit while light smokers tend to continue smoking). Also, when you stop smoking your stress level may go up, which would decrease your overall level of health.

Q9: If you want to reduce the number of predictor variables in a model, a technique like forward stepwise regression will generally do a good job of identifying which variables you should keep.

A9: For a regression model with n possible predictor variables, the first step involves evaluating n predictor variable subsets, each consisting of a single predictor variable, and selecting the one with the highest value of the evaluation criterion. The next step selects from among n-1 subsets, the next step from n-2 subsets, and so on. The procedure is not guaranteed to find the subset with the highest value of the evaluation criterion. Stepwise selection also ignores the problem of multiple inference (performing more than one statistical inference procedure on the same data set), which can lead to erroneous conclusions. One should not rely solely on automated procedures: it may make sense on substantive grounds to include some variables irrespective of their significance (e.g. confounders).
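To give a sense of how little of the model space forward selection actually explores, the counting argument above can be sketched as follows. The function names are illustrative assumptions, and the forward count assumes the procedure runs until every predictor has been added.

```python
def forward_selection_fits(n_predictors):
    """Models fitted if forward selection runs until all predictors are in:
    n at the first step, n - 1 at the second, and so on down to 1."""
    return sum(n_predictors - step for step in range(n_predictors))

def all_subsets_fits(n_predictors):
    """Non-empty predictor subsets an exhaustive search would have to fit."""
    return 2 ** n_predictors - 1

for n in (5, 10, 20):
    print(n, forward_selection_fits(n), all_subsets_fits(n))
```

With 10 candidate predictors the forward procedure fits at most 55 models out of 1023 possible non-empty subsets, which is one reason it can miss the best-scoring subset.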

Q12: In general we don’t need to worry about interactions between variables unless there is a correlation between them.

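A12: Correlation measures the strength of the linear relationship between quantitative variables, but the relationship may not be linear. Also, inference about a correlation often assumes that the variables are normally distributed, which is not always the case. More fundamentally, an interaction means that the effect of one variable on the response depends on the level of the other, and this can happen whether or not the two predictors are correlated with each other.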
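A minimal sketch of an interaction between two uncorrelated predictors, using a hypothetical balanced design in which the response is simply the product of the two predictors (the variable and function names are assumptions made for illustration):

```python
from itertools import product

# Hypothetical balanced design: each combination of x1, x2 in {-1, +1} once.
design = list(product([-1, 1], repeat=2))
x1 = [a for a, _ in design]
x2 = [b for _, b in design]
y = [a * b for a, b in design]  # response driven purely by the interaction

def pearson(u, v):
    """Pearson correlation coefficient computed from first principles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def effect_of_x1(at_x2):
    """Change in mean y when x1 goes from -1 to +1, holding x2 fixed."""
    high = [yy for a, b, yy in zip(x1, x2, y) if a == 1 and b == at_x2]
    low = [yy for a, b, yy in zip(x1, x2, y) if a == -1 and b == at_x2]
    return sum(high) / len(high) - sum(low) / len(low)

print("corr(x1, x2):", pearson(x1, x2))
print("effect of x1 when x2 = +1:", effect_of_x1(1))
print("effect of x1 when x2 = -1:", effect_of_x1(-1))
```

The predictors are exactly uncorrelated, yet the effect of x1 changes sign depending on x2, so the lack of correlation says nothing about whether an interaction is present.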

Q15: Statistical theory shows that the best way to impute a mid-term grade for a student who missed the test with a valid excuse is to use the predicted mid-term grade based on the other grades in the course.

A15: The mid-term mark could be affected by numerous factors. What if the student was cheating on his/her assignments? What if the difficulty level of the mid-term is not equivalent to that of the other assigned work in the course? In addition, the other grades in the course come from a variety of testing methods, and students do not necessarily perform equivalently across all of them. For example, John does well on assignments and presentations, but he performs poorly on tests.



As we saw, 'Sector' appears to be an important predictor. Consider models using ses and Sector. Aim to estimate the between-Sector gap as a function of ses if there is an interaction between Sector and ses. Check for and provide for a possible contextual effect of ses. Plot expected math achievement in each sector. Plot the gap with SEs. Consider the possibility that the apparently flatter effect of ses in Catholic schools could be due to a non-linear effect of ses. How would you test whether this is a reasonable alternative explanation?

The interaction between sector and ses was significant (p<0.05) and the contextual effect of ses was also significant. This implies that the effect of ses on math achievement differs between public and Catholic schools.

wald( fitn, L.context )

                         numDF denDF  F.value p.value
Contextual effect of ses     1    77 27.47158 <.00001


Figure 1: Plot of the gap between ses scores for Catholic and public schools


Figure 2: Plot of expected math achievement in each sector for ses levels of -0.5, 0 and 0.5

To test whether the flatter effect of ses in Catholic schools could be due to a non-linear effect of ses, we would add a quadratic ses term to the model and then check whether the quadratic term was significant. If it was, then the flatter effect of ses in Catholic schools may be a result of a non-linear effect of ses.


Take the example further by incorporating Sex. Consider the 'contextual effect' of Sex, which is school sex composition. Note that there are three types of schools: Girls, Boys and Coed schools. If you consider an interaction between Sector and school gender composition, you will see that the Public Sector only has Coed schools. What is the consequence of this fact for modelling sex composition and Sector effects?




Does it appear that boys are better off in a boys' school and girls in a girls' school, or are they better off in coed schools? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?

As can be seen in Figure 3, Figure 4 and the following R output, girls and boys both perform better in single-sex schools. All public schools are co-ed, while Catholic schools can be all male, all female or co-ed, so sector may be acting as a confounding factor in the analysis of sex category on math achievement.


Figure 3: Plot of Math Achievement by Sex Category


Figure 4: Plot of Math Achievement by ses with respect to sex category

Call: lm(formula = mathach ~ ses + factor( +, data = hs)


     Min       1Q   Median       3Q      Max
-19.0096  -4.9276   0.1901   4.9705  15.2090

                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        14.6252     0.4398  33.251  < 2e-16 ***
ses                 1.6955     0.6140   2.761 0.005807 **
factor(            -2.1118     0.4726  -4.469 8.32e-06 ***
factor(            -1.8262     0.5442  -3.355 0.000807 ***
ses:Sex.catCoed     2.0028     0.6545   3.060 0.002242 **
ses:Sex.catGirls    0.7560     0.7519   1.005 0.314815
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.351 on 1971 degrees of freedom
Multiple R-squared: 0.1396, Adjusted R-squared: 0.1374
F-statistic: 63.95 on 5 and 1971 DF, p-value: < 2.2e-16

lm(formula = mathach ~ ses + factor( * factor(Sector),

   data = hs)


     Min       1Q   Median       3Q      Max
-19.4102  -4.8167   0.1175   5.0436  15.3054

Coefficients: (2 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            14.9586     0.4175  35.832  < 2e-16 ***
ses                     3.1046     0.1952  15.907  < 2e-16 ***
factor(                -1.5700     0.5090  -3.084 0.002068 **
factor(                -2.1432     0.5257  -4.077 4.74e-05 ***
factor(Sector)Public   -1.4099     0.3634  -3.879 0.000108 ***
factor(                     NA         NA      NA       NA
factor(                     NA         NA      NA       NA
---


  • Is a low ses child better off in a high ses school or are they better off in a school of a similar ses? How about a high ses child in a low ses school? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?




Is a minority status child better off in a school with a higher proportion of minority status children or are they better off in a school with a low proportion? How would you qualify your findings so parents don't misinterpret them in making decisions for their children?



According to the first graphs, it is better to send a child with low SES to a school with a low proportion of minority status students, whereas it is better to send a child with higher SES to a school with a high proportion of minority status students. The same holds for a child with low Mathach.


This graph helps in understanding the effect of the proportion of minority status students, because it compares Mathach against SES for the minority and majority groups together and shows the gap between them. Parents should be careful about using it to make a decision for their children: it differs from the other graphs because it introduces a new predictor, and with it a new analysis of contextual effects.
