MATH 6627 2010-11 Practicum in Statistical Consulting/Students/Crystal Cao

From Wiki1


Revision as of 19:42, 2 April 2011

About Me

My name is Yurong, and I am also known as Crystal in our department.

As a PhD student in Applied Math, my research involves a lot of statistical analysis. The project I am currently working on is "The Climate and Environmental Impact on the Distribution Properties of Mosquito Abundance in Peel Region". Dynamic models are very popular for studying the transmission of vector-borne diseases. In those models, the parameters are determined by simulation or estimation. The models have been successful in predicting the dynamics of vector populations under different circumstances, but they cannot reflect how the vector population changes with climate and environmental factors. I am now using statistical analysis to establish the association between vector abundance and climate and environmental factors.

By learning more statistics, I hope to gain a deeper understanding of statistical methods and to combine mathematical modeling with statistical analysis in my research.

Sample Exam Questions

Week 1

  • Q: List the possible methods of controlling for the effects of confounding factors while using observational data.
  • A: 1) Randomization; 2) Matching; 3) Stratification: analyzing each stratum with similar values of the confounding factor(s); 4) Building a statistical model that includes the confounding factor(s) and using multiple regression. A known confounding factor may be measured with error, so that it is not fully controlled, and some important confounding factors may not be known at all. There is therefore no perfect solution, and judgment must be used in applying these methods and in assessing studies based on them.

Week 2

  • Q: Considering the coffee consumption problem, how should we choose the explanatory variables in the model?
  • A: The choice of variables does not depend only on their significance. The cost of obtaining the data, the quality of the data, and so on also need to be considered. How to build the model depends on the question being asked. If we need to answer how a variable affects heart damage, we need to keep the variable in the model even if it is not significant. If we are answering whether the variable affects heart damage, we may drop it if it is not significant.

Week 3

  • Q: In the coffee consumption and heart damage example in the lecture notes, we found that the marginal confidence interval for the estimated coefficient of coffee is narrower than the conditional one. How did this happen?
  • A: Since the two predictors, coffee consumption and stress, are related to each other, the marginal confidence interval is not equal to the conditional one. Conditioning on stress leaves only the variation in coffee consumption that is not explained by stress to estimate the coffee coefficient, so its standard error grows. If the two predictors are not related, the two intervals will be equal.
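
A minimal R sketch of this point, using simulated data rather than the data from the lecture notes. The variable names (`coffee`, `stress`, `heart`), the correlation of about 0.9 and the effect sizes are all assumptions chosen for illustration.

```r
# Simulated illustration only: names and effect sizes are assumptions,
# not the coffee/heart-damage data from the lecture notes.
set.seed(1)
n <- 200
stress <- rnorm(n)
coffee <- 0.9 * stress + sqrt(1 - 0.9^2) * rnorm(n)  # correlation with stress about 0.9
heart  <- 1.0 * coffee + 0.2 * stress + rnorm(n)

fit_marginal    <- lm(heart ~ coffee)           # ignores stress
fit_conditional <- lm(heart ~ coffee + stress)  # holds stress fixed

# Width of the 95% confidence interval for the coffee coefficient in each model
width_marginal    <- unname(diff(confint(fit_marginal)["coffee", ]))
width_conditional <- unname(diff(confint(fit_conditional)["coffee", ]))
print(c(marginal = width_marginal, conditional = width_conditional))
```

With predictors this strongly correlated, the conditional interval should come out clearly wider than the marginal one, because little independent variation in `coffee` is left once `stress` is held fixed.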

Week 4

  • Q: What is the difference between interaction between variables and correlation between variables?
  • A: In regression analysis, an interaction may arise when considering the relationship among three or more variables; it describes a situation in which the simultaneous influence of two variables on a third is not additive. Correlation measures the strength of the linear relationship between quantitative variables, although their relationship may not be linear. Interaction and correlation are two different concepts, and neither implies the other. Two variables may interact but not be correlated, interact and be correlated, neither interact nor be correlated, or be correlated without interacting.

Week 5

  • Q: What is the relationship between interaction and collinearity?
  • A: Interaction is often confused with collinearity. Collinearity refers to associations among predictors, i.e. the extent to which the predictor data ellipse is tilted and eccentric. It is not related to Y. Interaction refers to a situation in which the relationship between Y and the predictor variables is not 'additive', i.e. the effect of one variable depends on the level of the other variable. It has nothing to do with the relationships among the Xs, only with the way they affect Y. Interaction and collinearity can each exist with or without the other. The presence of either does not even suggest the likely presence of the other.
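
The two cases can be sketched in R with simulated data. All names and coefficients below are hypothetical and chosen only to make the contrast visible: the first data set has a strong interaction between nearly uncorrelated predictors, and the second has strongly correlated predictors with a purely additive effect on the response.

```r
set.seed(2)
n <- 500

# Case 1: interaction without collinearity (x1 and x2 generated independently)
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + x2 + 2 * x1 * x2 + rnorm(n)
r_case1 <- cor(x1, x2)
p_interaction <- summary(lm(y ~ x1 * x2))$coefficients["x1:x2", "Pr(>|t|)"]

# Case 2: collinearity without interaction (z2 is built from z1; w is additive)
z1 <- rnorm(n)
z2 <- 0.95 * z1 + sqrt(1 - 0.95^2) * rnorm(n)
w  <- 1 + z1 + z2 + rnorm(n)
r_case2 <- cor(z1, z2)
fit_case2 <- lm(w ~ z1 * z2)  # the z1:z2 term has no true effect here

print(c(r_case1 = r_case1, p_interaction = p_interaction, r_case2 = r_case2))
```

Collinearity is read off the predictors alone (`cor(x1, x2)`, `cor(z1, z2)`), while interaction only shows up once the response enters the model.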

Week 6

  • Q: What is the difference between a characteristic of the school and a 'derived' variable?
  • A: A derived variable could have a different value with a different sample of students. A characteristic of the school would not.

Week 7

  • Q: Which confidence interval is better, Scheffe or Bonferroni when goes to infinite?
  • A: For models with one or two degrees of freedom for error, Scheffe is always superior to Bonferroni. For three or more degrees of freedom for error, Bonferroni is superior to Scheffe for standard confidence levels, but the reverse is true for nonstandard levels.
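
As a rough numerical companion, the R sketch below compares the usual critical multipliers of the two methods: the Bonferroni multiplier for `m` simultaneous intervals and the Scheffe multiplier for all linear combinations in a `p`-dimensional space. The values of `alpha`, `df_error`, `p` and `m_values` are hypothetical, and the sketch does not attempt to reproduce the degrees-of-freedom statement above. It only shows how the comparison depends on the number of intervals.

```r
alpha <- 0.05

# Bonferroni: t quantile with the error rate split over m two-sided intervals
bonferroni_mult <- function(m, df_error) qt(1 - alpha / (2 * m), df_error)

# Scheffe: based on the F distribution for a p-dimensional family of contrasts
scheffe_mult <- function(p, df_error) sqrt(p * qf(1 - alpha, p, df_error))

df_error <- 20                # hypothetical error degrees of freedom
p        <- 3                 # hypothetical dimension of the family
m_values <- c(1, 3, 10, 100)  # hypothetical numbers of intervals

comparison <- data.frame(
  m          = m_values,
  bonferroni = sapply(m_values, bonferroni_mult, df_error = df_error),
  scheffe    = scheffe_mult(p, df_error)
)
print(comparison)
```

The smaller multiplier gives the narrower intervals. The Bonferroni multiplier grows with the number of intervals, while the Scheffe multiplier stays fixed.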

Week 8

  • Q: How does lack of fit contribute to autocorrelation?
  • A: Lack of fit will generally contribute positively to autocorrelation. For example, if trajectories are quadratic but you are fitting a linear trajectory, the residuals will be positively autocorrelated. Strong positive autocorrelation can be a symptom of lack of fit. This is an example of poor identification between the FE model and the R model, that is, between the deterministic and the stochastic aspects of the model.
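
The quadratic example can be sketched in R with simulated data. The trajectory, noise level and the helper `lag1` are assumptions made for illustration. A straight line is fitted to a quadratic trajectory, and the lag-1 autocorrelation of the residuals is compared with that from a correctly specified quadratic fit.

```r
set.seed(3)
n <- 100
x <- seq(-3, 3, length.out = n)
y <- x^2 + rnorm(n, sd = 0.5)  # true trajectory is quadratic

# Lag-1 autocorrelation of a residual series (hypothetical helper)
lag1 <- function(res) cor(res[-1], res[-length(res)])

r_linear    <- lag1(resid(lm(y ~ x)))           # misspecified: straight line
r_quadratic <- lag1(resid(lm(y ~ x + I(x^2))))  # correctly specified
print(c(linear = r_linear, quadratic = r_quadratic))
```

The residuals from the straight-line fit inherit the unmodelled curvature, so neighbouring residuals tend to have the same sign and the lag-1 autocorrelation is strongly positive. Once the quadratic term is added, it should drop to near zero.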

Week 9

  • Q: According to the principle of marginality, we do not necessarily drop terms because they are not significant. This raises the question: how should we decide whether to drop a term that is not significant?
  • A: It depends on the question we need to answer. If we are dealing with observational data and are trying to predict the dependent variable from the independent variables, we may drop a term if its p-value is large. If we are trying to find the causal relationship between Y and the Xs, we cannot drop a term based on the p-value alone, since it may be a confounding factor or an intermediate factor.

Week 10

  • Q: Why does ordinary least-squares analysis fail on the pooled Orthodont data?
  • A: Because the residuals within clusters are not independent; they tend to be highly correlated with each other. We could instead use repeated measures (univariate and multivariate) or a two-stage approach.
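
A minimal R sketch of the problem, assuming the `Orthodont` data shipped with the nlme package. A random-intercept model is used here only as one convenient way to quantify the within-subject correlation; it is not one of the approaches named in the answer, and the model formula is an assumption.

```r
library(nlme)  # provides the Orthodont data and lme()

# Pooled OLS treats all 108 measurements as independent
fit_ols <- lm(distance ~ age, data = Orthodont)

# Random intercept per subject: one simple way to model within-subject correlation
fit_lme <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Intraclass correlation: share of the variance that is between subjects
vc  <- as.numeric(VarCorr(fit_lme)[, "Variance"])  # (Intercept), Residual
icc <- vc[1] / sum(vc)

# Standard error of the age slope under each model
se_ols <- summary(fit_ols)$coefficients["age", "Std. Error"]
se_lme <- summary(fit_lme)$tTable["age", "Std.Error"]
print(c(icc = icc, se_ols = se_ols, se_lme = se_lme))
```

A large intraclass correlation means that measurements on the same child are far from independent, so the standard errors reported by the pooled OLS fit cannot be trusted.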

Week 11

  • Q: In multiple linear regression, we get the formula <math>\hat{y}=\hat{\beta_0}+\hat{\beta_1}x_1+\hat{\beta_2}x_2+\hat{\beta_3}x_3</math>. We can then standardize <math>\hat{\beta_1}, \ldots, \hat{\beta_3}</math> to obtain the standardized coefficients <math>B_1, \ldots, B_3</math>. Suppose <math>B_1 \approx B_2</math>, but the tests of the coefficients show that the P-value for <math>B_1</math> is 0.01 while the P-value for <math>B_2</math> is 0.08. Why?
  • A: <math>B_i=\frac{\hat{\beta_i}S_{x_i}}{S_y}</math> and <math>SE(\hat{\beta_i})=\frac{S_e}{\sqrt{n}\,S_{x_i|others}}</math>, where <math>S_{x_i|others}</math> is the residual standard deviation of <math>x_i</math> given the other predictors. The t-score is therefore <math>t_i=\frac{\hat{\beta_i}}{SE(\hat{\beta_i})}=k\times B_i \times C_i</math>, where <math>k=\frac{\sqrt{n}\,S_y}{S_e}</math> and <math>C_i=\frac{S_{x_i|others}}{S_{x_i}}</math>, and the P-value is a function of the t-score. Although <math>B_1 \approx B_2</math>, <math>C_1</math> is not necessarily equal to <math>C_2</math>, so the P-value for <math>B_1</math> is not necessarily equal to that for <math>B_2</math>. Only when there are two predictor variables does <math>B_1 \approx B_2</math> imply <math>P_1 \approx P_2</math>.

==Statistics in the Media, Paradoxes and Fallacies==

===Week 1===
The following article, “Why a Cloned Cat Isn't Exactly Like the Original: New Statistical Law for Cell Differentiation”, was published on Science Daily (Dec. 15, 2010). Source: [http://www.sciencedaily.com/releases/2010/12/101215082939.htm].

A new answer to that question has been found by researchers at the Institute of Physical Chemistry of the Polish Academy of Sciences in Warsaw. Using computer simulations and theoretical calculations they discovered a new statistical law. The law they discovered describes how a random disorder inside individual cells transforms into an order leading to a differentiation of population that is of benefit for its survival, as Dr Ochab-Marcinek sums up.

The new statistical mechanism may illuminate one of the sources of bacterial resistance to antibiotics and help explain why monozygotic twins and cloned organisms are not identical copies. I hope this new statistical law will help the medical field make new progress in drug trials.

===Week 2===
*[http://www.financialpost.com/Canada+recession+less+severe+than+other+countries/2909946/story.html/ Canada recession less severe than other G7 countries]

This article stated that Canada experienced a recession that was less severe and shorter than in the other Group of Seven nations. Furthermore, it was nowhere near as “severe” or nearly as long as the downturns the country faced in the early 1980s and 1990s.

In terms of real GDP output, the early 1980s recession had the largest peak-to-trough decline, 4.9%; the early 1990s witnessed a four-quarter recession in which GDP dropped 3.4% peak to trough; and the most recent downturn saw a peak-to-trough decline of 3.3%.

But in a new paper published Thursday [http://www.statcan.gc.ca/pub/11-010-x/2011001/part-partie3-eng.htm], Philip Cross, chief economic analyst at Statistics Canada, said this does not capture how painful an impact the financial crisis had on the Canadian economy. And it shows how difficult it is to say one recession was less painful than another. The 2008-2009 downturn was not as deep overall as the 1981-1982 recession. Nevertheless, said Mr. Cross, it set a post-war record for the most months (eight) of uninterrupted decline, a testament to the severity of the onset of the global economic and financial crisis. As for the recession of the 1990s, it was not as severe as the 1980s or most recent downturn in terms of the rate of decline or the total peak-to-trough contraction. “However, it was the most prolonged before recovery clearly was sustained,” he said.

Therefore it is not very meaningful to simply compare the numbers and jump to a conclusion.

===Week 3===
*[http://www.nlm.nih.gov/medlineplus/news/fullstory_107861.html/ High Altitude Linked to Higher Suicide Risk -- Again]

The latest observation of Dr. Barry E. Brenner, a professor of emergency medicine and internal medicine, as well as program director, in the department of emergency medicine at University Hospital Case Medical Center in Cleveland, seems to confirm the findings of previous research that unearthed a complex and as-yet not fully explained relationship between higher than average suicide rates and residency in higher elevations.

"And yet as you go up in altitude the overall death rate, or all-cause mortality, actually decreases," Brenner noted. "So, the fact that suicide rates are increasing at the same time is a really significant and really striking finding."

Even after adjusting for traditional risk factors such as age, race, household income, population density, gender, greater isolation, lower income and greater access to firearms, the authors found that suicide rates (whether involving a firearm or not) were significantly higher than average in those counties with higher altitudes. "It may be related to obesity levels and sleep apnea that may be more common in higher altitudes," Brenner suggested.

*Comment: Since the authors have already controlled for all the traditional and non-traditional factors and the suicide rates are still higher in high-altitude counties, the results are very reliable. So people with predisposing factors for suicide, such as depression and emotional distress, should be relocated to lower altitudes, or suicide monitoring and prevention services should be provided.

===Week 4===
*[http://www.nlm.nih.gov/medlineplus/news/fullstory_108138.html/ High Rates of Early Elective Delivery at Some U.S. Hospitals: Report]

Five to 40 percent or more of births in the United States are induced early without any good medical reason, according to a new hospital-by-hospital report. Babies delivered early also face a higher risk of death, of spending time in a neonatal intensive care unit and of life-long health problems, according to a statement from the Leapfrog Group. Why are so many early elective deliveries occurring? According to Maureen Corry, Childbirth Connection's executive director, a recent survey found that the leading reason (accounting for about 25 percent of early births) was caregiver concern that the mother was overdue. About 19 percent were medical inductions, another 19 percent were due to the mother's desire "to get the pregnancy over with," and a further 17 percent came from concern about the size of the baby.
According to Fleischman, large babies aren't a valid reason for early delivery.

*Comment: According to the report, most induced deliveries are done because of the mother's or the caregiver's concern, which is unnecessary and harmful to the babies. Hospitals should therefore be called on to put policies in place to prevent early elective deliveries, and mothers should be better educated in order to reduce early elective deliveries.

===Week 5===
*[http://www.nlm.nih.gov/medlineplus/news/fullstory_108498.html/ U.S. Sees Slowdown in Spending on Mental Health]

A new federal government study has found that the amount of money spent on psychiatric drugs in the United States continues to grow but at a much slower rate than in previous years. From 2004 to 2005, spending on psychiatric drugs rose 5.6 percent, compared with an increase of 27.3 percent between 1999 and 2000, according to the Substance Abuse and Mental Health Services Administration.

*Comment: Behavioral health services are critical to health systems, and they lower costs for individuals, families, businesses and governments. But too many people don't get needed help for substance abuse or mental health problems. One of the reasons may be that people don't know they could get help, or where to get it, since they have less control of the situation or of themselves. These people need more attention and care from society and the community. Spending more money on behavioral health services could dramatically decrease health costs, criminal and juvenile justice costs, educational costs and lost productivity.

===Week 6===
*[http://www.nlm.nih.gov/medlineplus/news/fullstory_108953.html/ 3.1 Million Hispanic Americans Struggle with Arthritis]

Arthritis affected about 3.1 million Hispanics in the United States between 2002 and 2009, and there were wide variations in arthritis rates among Hispanic subgroups, according to a new federal study.
The age-adjusted prevalence of arthritis ranged from a low of 11.7 percent among Cuban Americans to a high of 21.8 percent among Puerto Ricans, according to the analysis of National Health Interview Survey data from 2002, 2003, 2006 and 2009.

*Comment: The researchers did suggest that the wide-scale use of culturally adapted, community-level interventions that are proven to increase physical activity and self-management skills would likely lead to meaningful improvements in the quality of life for Hispanic adults with arthritis. But they didn't explain why there were wide variations in arthritis rates among Hispanic subgroups. I believe they may be caused by racial differences and lifestyle habits. More data and studies are needed to find the reasons.

===Week 7===
*[http://www.broadcastingcable.com/article/453033-TVB_Study_Adults_Spend_Twice_as_Much_Time_on_TV_Than_Web.php/ TVB Study: Adults Spend Twice as Much Time on TV than Web]

People age 18-plus watched 319 minutes of television a day, according to the Media Comparisons Study 2010 commissioned by the Television Bureau of Advertising (TVB). That figure more than doubles the time spent on the Internet (156.6 minutes), and dwarfs daily time spent engaging with radio (91.2 minutes), newspapers (26.4 minutes) and mobile (19.2 minutes).

*Comment: Television reaches more consumers every day than newspapers, magazines, radio, the internet and mobile media, and more time is spent with television. Investing in television ads is still the most efficient way to reach consumers.

===Week 8===
*See the link below on the Ethics in Statistics and Consulting page [http://scs.math.yorku.ca/index.php/MATH_6627_2010-11_Practicum_in_Statistical_Consulting]

===Week 9===
*[http://www.nlm.nih.gov/medlineplus/news/fullstory_110003.html/ Costlier prostate cancer treatments gain popularity]

Newer technologies for treating prostate cancer have surged in popularity in the last decade -- and they have come with a hefty price tag. Researchers estimate that the increased use of certain prostate cancer treatments, such as less invasive surgery and advanced radiation therapy, tacked on an additional $350 million in healthcare spending in one year alone.

*Comment: The big question about these technologies is whether they are worth the cost. "There really isn't data on whether this saves more lives," Dr. Paul Nguyen, the lead author of the study, told Reuters Health. It is said that technology is being adopted in hospitals and doctors' offices faster than researchers can track whether the advantages balance the additional costs.

===Week 10===
*[http://stats.org/stories/2011/media_evolution_mar7_11.html/ Media play down fear of teaching evolution]

Two social scientists, Michael Berkman and Eric Plutzer, just published the results of a far-reaching survey of U.S. high school biology teachers in Science Magazine. They report that as many as 60 percent of them are inadequately teaching evolutionary biology. In fact, the authors say, this group plays “a far more important role in hindering scientific literacy in the United States than the smaller number of explicit creationists.”

*Comment: It is shocking to learn that evolutionary biology is not taught adequately. Students should be taught the variety of views on the world's origins, but each subject should be taught adequately, letting the students develop their own thoughts freely.

===Week 11===
*[http://www.nlm.nih.gov/medlineplus/news/fullstory_110482.html/ U.S. Reports Continuing Drop in Birth Rate]

The number of births in the United States has declined since reaching an all-time high in 2007, according to a new federal government report. The analysis of birth data found that the decline in the U.S. fertility rate -- based on births among women in their childbearing years, ages 15 to 44 -- between 2007 and 2009 was the largest for any two-year period in more than 30 years. From 2007 to 2009, the rate fell 4 percent, from 69.5 to 66.7 births per 1,000 women. Birth rates declined for all women younger than 40, but the rate fell 9 percent, to 96.3 births per 1,000 women, among those 20 to 24 years old -- the lowest rate ever recorded for this age group, according to the report. The birth rate fell 6 percent among women 25 to 29 years old and 2 percent among women in their 30s. The only increase in birth rates occurred among women in their 40s, rising 6 percent among women 40 to 44 years old. Births in this age group, however, accounted for just 3 percent of all U.S. births in 2009.

*Comment: Some researchers have linked the recent fertility decline to the economic recession. I firmly believe the drop in the birth rate is closely related to changes in economic structure. Nowadays, young people tend to receive more education than before. Higher pressure and competition at work have delayed the age at which women give birth, and reduced the number of babies they have.

==Questions and Comments on groupwork and class lectures==

===Week 1===
I am deeply impressed by the R package googleVis, whose motion charts were popularized by Hans Rosling. I installed the googleVis package from CRAN, then followed the tutorial and started with the Fruits example, using the following lines of R code:

 library(googleVis)
 data(Fruits)  # a data frame with 9 observations on 7 variables
 M <- gvisMotionChart(Fruits, idvar="Fruit", timevar="Year")
 plot(M)

I am so amused by the output of the package.
I will try to explore more functions of this powerful tool.

===Week 2===
Statistical consulting is far more complicated than I imagined. Unlike mathematical analysis, there is no direct or straightforward answer to most statistical consulting problems. The correct analysis depends on the question we're asking, the data and the true model. Consider the coffee consumption and heart damage example again. The right answer depends on assumptions that can't be verified from the data, and the interpretation of the regression coefficients depends on suppositions about the relationships among the variables. Broad knowledge and a deep understanding of the problem itself are required, and proficient statistical skill is a must. This week's lecture was an eye-opener for me.

===Week 3===
The three assignment teams all did wonderful work on their Assignment 1 presentations. Andy used a simple but very mathematical way to explain Simpson's paradox, and it was very thought-provoking for me. Our team only studied the googleVis package. I learned about the other visualization packages, lattice, rgl and p3d, from the other two groups' presentations.

===Week 4===
Prof. Monette talked about longitudinal data analysis in this week's lecture. He mentioned that most statistical analysis involves longitudinal data, and that knowing how to analyze longitudinal data is crucial to our future careers. The next couple of weeks' lectures will focus on longitudinal data analysis. I am looking forward to that.

===Week 5===
In the new topic, Mixed Models with R: Hierarchical Models to Mixed Models, some new concepts were introduced. After reading, I am still confused about hierarchical models. I don't know how to tell whether a model is a hierarchical model. I hope to understand more in the next few lectures.

===Week 6===
*1) Describe the relationship between math achievement and SES. How does it seem to vary between school sectors, between girls and boys?

 hs <- read.csv( "http://www.math.yorku.ca/~georges/Data/hsfull.csv")
 source( "http://www.math.yorku.ca/~georges/R/fun.R")     # source( "/R/fun.R")
 source( "http://www.math.yorku.ca/~georges/R/Plot3d.R")  # source( "/R/Plot3d.R")
 library( lattice )
 library( car )
 library( nlme )
 # Pull the columns of interest out of hs by position
 school   <- hs[,2]
 minority <- hs[,3]
 sex      <- hs[,4]
 ses      <- hs[,5]
 mathach  <- hs[,6]
 size     <- hs[,7]
 sector   <- hs[,8]
 xyplot( mathach ~ ses | sector, hs, groups = sex, auto.key = TRUE)

[[File:F1.jpg]]

The question asks us to describe the relationship between mathematics achievement (mathach) and socioeconomic status (SES) from two different aspects: one is the school type, and the other is the student's gender. School type is divided into two major classes, the Catholic and Public school systems; gender is divided into female and male students. In the plot above, the panels are separated by school system, and the color denotes the gender. However, since the data come from 160 schools and 7185 students, it is extremely hard to draw a clear conclusion about the relationship. Therefore, we break it down further as follows.

 xyplot( mathach ~ ses | sector, hs,
   panel = function(x, y, ...) {
     panel.xyplot( x, y, ...)
     panel.lmline( x, y, ...)
   })

[[File:F2.jpg]]

We now look at the relationship between math scores and SES by school sector. In general, students from Catholic schools do better than students from Public ones. The regression line has a flatter slope for Catholic than for Public schools, so the advantage of Catholic schools is largest for students with low SES. For example, a rather poor student with SES = -2 would score just under 5 in a Public school, whereas a student with the same status would score almost 10 in a Catholic school.
For an average student with SES = 0, the student would do about two points better in a Catholic school. A well-off student with SES = 2 would do about one point better in a Catholic school. Moreover, taking a second look at the two plots as a whole, the data for Catholic schools are mostly concentrated towards high math achievement, while Public schools have the majority of their data in a rather lower range of math scores.

 xyplot( mathach ~ ses|sex,
   panel = function( x, y, ...) {
     panel.xyplot( x, y, ...)
     panel.lmline( x, y, ...)
   })

[[File:F3.jpg]]

Now we examine the relationship between math scores and SES from the gender aspect. Although both genders perform roughly the same in math, there is a slight difference for students with lower SES values. For example, when SES = -2 a female student would score about 5 on mathach and a male student around 7.

 xyplot( mathach ~ ses|sex*sector,
   panel = function( x, y, ...) {
     panel.xyplot( x, y, ...)
     panel.lmline( x, y, ...)
   })

[[File:F4.jpg]]

If we look at school systems and gender together, we get the same qualitative observations as above. Catholic students perform better than Public school students, and males tend to do better than females, so that female students from Public schools score the worst and male Catholic students score the best in math.

===Week 7===
As I understand it, in the multilevel model the within-group regression coefficient expresses the effect of the explanatory variable within a given group; the between-group regression coefficient expresses the effect of the group mean of the explanatory variable on the group mean of the response variable.

===Week 8===
Comment on [[../Andy_Li|Andy Li's page]]

===Week 9===
The Principle of Marginality is very useful in statistical consulting. Interaction terms in multiple regression models with continuous predictors are formed by taking the products of predictors.
The inclusion of such cross-product terms has the geometric effect of changing a planar surface into a curved surface, often referred to as a response surface. The general rule is that any time an interaction is included in a model, the individual terms that comprise the interaction should also be included in the model. Although models that violate the principle of marginality are legitimate, they are nearly always silly and hence not worth considering.

===Week 10===
In lab 3, it is said that the separate effects of age, period and cohort can be estimated under the assumption that there are no interactions, i.e., the separate effects are additive. I could not quite understand this. Since they are “non-linear” effects, there should be interactions. How can we assume that there are no interactions?

===Week 11===
In Wednesday's class, George said it is O.K. to learn from our own mistakes. But he also mentioned that people who work in the financial field say they learn from others' mistakes, since it is too costly to make a mistake. My supervisor said you should do it correctly the first time. I am very confused about how I could avoid making mistakes. If nobody makes any mistakes, how would we learn from others' mistakes?