# MATH 6627 2010-11 Practicum in Statistical Consulting/Students/Crystal Cao



My name is Yurong, and I am also known as Crystal in our department.

As a PhD student in Applied Math, my research involves a lot of statistical analysis. The project I am currently involved in is "The Climate and Environmental Impact on the Distribution Properties of Mosquito Abundance in Peel Region". Dynamic models are very popular for studying the transmission of vector-borne diseases. In those models, the parameters are determined by simulation or estimation. The models have been successful in predicting the dynamics of vector populations under different circumstances, but they do not reflect how the vector population responds to changes in climate and environmental factors. I am now using statistical analysis to establish the association between vector abundance and climate and environmental factors.

By learning more statistics, I hope to gain a deeper understanding of statistical methods and to combine mathematical modeling with statistical analysis in my research.

## Sample Exam Questions

### Week 1

• Q: List the possible methods of controlling for the effects of confounding factors while using observational data.
• A: Randomization is not available with observational data, so the usual options are: 1) Matching; 2) Stratification: analyzing each stratum with similar values for the confounding factor(s); 3) Building a statistical model that includes the confounding factor(s), for example by multiple regression. A known confounding factor may be measured with error, so that it is not fully controlled, and some important confounding factors may not be known at all. There is therefore no perfect solution, and judgment must be used in applying these methods and in assessing studies based on them.
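The difference between an unadjusted analysis, stratification and regression adjustment can be sketched in R. This is only a minimal illustration on a small made-up data set, not data from the course. The names `x`, `z` and `y` are hypothetical: `z` is a binary confounder that drives the outcome `y` and is associated with the exposure `x`, while `x` itself has no effect by construction.

```r
# Hypothetical toy data: z is a binary confounder, x the exposure, y the outcome.
# By construction y depends only on z, yet x and z are associated.
d <- data.frame(
  z = rep(c(0, 1), each = 4),
  x = c(0, 0, 0, 1, 0, 1, 1, 1)
)
d$y <- 5 + 2 * d$z

# Unadjusted analysis: the slope for x picks up the effect of z.
unadjusted <- coef(lm(y ~ x, data = d))[["x"]]

# Stratification: estimate the slope for x separately within each level of z.
by_stratum <- sapply(split(d, d$z),
                     function(s) coef(lm(y ~ x, data = s))[["x"]])

# Regression adjustment: include the confounder in the model.
adjusted <- coef(lm(y ~ x + z, data = d))[["x"]]

print(c(unadjusted = unadjusted, adjusted = adjusted))
print(by_stratum)
```

In this constructed example the unadjusted slope for `x` is nonzero even though `x` has no effect, while the stratified and adjusted slopes vanish up to rounding error. With real data the confounder may be mismeasured or unknown, so the adjustment is rarely this clean.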

### Week 2

• Q: Considering the coffee consumption problem, how should we choose the explanatory variables in the model?
• A: The choice of variables does not depend only on their statistical significance. The cost of obtaining the data, the quality of the data and similar considerations also matter. How the model is built depends on the question being answered. If the question is how a variable affects heart damage, the variable should be kept in the model even if it is not significant. If the question is whether the variable affects heart damage, we may drop it when it is not significant.

### Week 3

• Q: In the coffee consumption and heart damage example in the lecture notes, we found that the marginal confidence interval for the estimated coefficient of coffee is smaller than the conditional one. How did this happen?
• A: Because the two predictors, coffee consumption and stress, are related to each other, the marginal confidence interval is not equal to the conditional one. If the two predictors were not related, the two intervals would be equal.
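A small simulation can make this concrete. The sketch below assumes that "marginal" refers to the regression of heart damage on coffee alone and "conditional" to the regression that also adjusts for stress. That reading, the variable names (`coffee`, `stress`, `heart`) and the simulated data are all assumptions for illustration and are not taken from the lecture notes.

```r
# Simulated (hypothetical) data in which coffee and stress are strongly related
# and only stress affects the outcome.
set.seed(42)
n <- 200
coffee <- rnorm(n)
stress <- coffee + 0.3 * rnorm(n)
heart  <- 1 * stress + rnorm(n)

# "Marginal" model: coffee alone. "Conditional" model: coffee adjusted for stress.
fit_marginal    <- lm(heart ~ coffee)
fit_conditional <- lm(heart ~ coffee + stress)

# 95% confidence intervals for the coffee coefficient in each model.
ci_marginal    <- confint(fit_marginal)["coffee", ]
ci_conditional <- confint(fit_conditional)["coffee", ]

# Interval widths (upper limit minus lower limit).
width_marginal    <- unname(diff(ci_marginal))
width_conditional <- unname(diff(ci_conditional))

print(cor(coffee, stress))
print(rbind(marginal = ci_marginal, conditional = ci_conditional))
print(c(width_marginal = width_marginal, width_conditional = width_conditional))
```

Because `coffee` and `stress` are highly correlated here, the interval for the coffee coefficient widens once stress is included in the model. Replacing the second line of the simulation with an independent `stress <- rnorm(n)` shows how the two intervals behave when the predictors are unrelated.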

### Week 4

• Q: What is the difference between interaction between variables and correlation between variables?
• A: In regression analysis, an interaction may arise when considering the relationship among three or more variables. It describes a situation in which the simultaneous influence of two variables on a third is not additive.

Correlation measures the strength of the linear relationship between two quantitative variables, although the actual relationship between them may not be linear. In regression analysis, interaction and correlation are two different concepts, and neither implies the other. Two variables may interact but not be correlated, interact and be correlated, neither interact nor be correlated, or be correlated but not interact.
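Two of these four cases can be sketched in R with small constructed data sets. The data frames `a` and `b` and the coefficients used to generate `y` are hypothetical and chosen only to make the contrast visible.

```r
# Case A (hypothetical): balanced 2 x 2 design, so x1 and x2 are uncorrelated,
# but their effects on y are not additive (interaction term 1.5 * x1 * x2).
a <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))
a$y <- 1 + 2 * a$x1 + 3 * a$x2 + 1.5 * a$x1 * a$x2
cor_a <- cor(a$x1, a$x2)
int_a <- coef(lm(y ~ x1 * x2, data = a))[["x1:x2"]]

# Case B (hypothetical): x1 and x2 are strongly correlated,
# but their effects on y are purely additive (no interaction).
b <- data.frame(x1 = 1:10)
b$x2 <- b$x1 + rep(c(1, -1), times = 5)
b$y <- 1 + 2 * b$x1 + 3 * b$x2
cor_b <- cor(b$x1, b$x2)
int_b <- coef(lm(y ~ x1 * x2, data = b))[["x1:x2"]]

print(round(c(cor_A = cor_a, interaction_A = int_a,
              cor_B = cor_b, interaction_B = int_b), 3))
```

Correlation is a property of the two predictors alone, whereas interaction describes how they jointly act on the response. This is why case A can have an interaction with zero correlation, and case B a high correlation with no interaction.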

## Statistics in the Media, Paradoxes and Fallacies

### Week 1

The following article “Why a Cloned Cat Isn't Exactly Like the Original: New Statistical Law for Cell Differentiation” was published on Science Daily (Dec. 15, 2010). Source: [1].

A new answer to that question has been found by researchers at the Institute of Physical Chemistry of the Polish Academy of Sciences in Warsaw. Using computer simulations and theoretical calculations, they discovered a new statistical law. The law describes how random disorder inside individual cells transforms into an order that leads to a differentiation of the population that benefits its survival, as Dr Ochab-Marcinek sums up.

The new statistical mechanism may illuminate one of the sources of bacterial resistance to antibiotics and help explain why monozygotic twins and cloned organisms are not identical copies. I hope this new statistical law will help the medical field make progress in drug trials.

### Week 2

This article stated that Canada experienced a recession that was less severe and shorter than in the other Group of Seven nations. Furthermore, it was nowhere near as “severe” or nearly as long as the downturns the country faced in the early 1980s and 1990s.

In terms of real GDP output, the early 1980s recession had the largest peak-to-trough decline, 4.9%. The early 1990s saw a four-quarter recession in which GDP dropped 3.4% peak to trough, whereas the most recent downturn saw a peak-to-trough decline of 3.3%.

But in a new paper published Thursday [2], Philip Cross, chief economic analyst at Statistics Canada, said this does not capture how painful an impact the financial crisis had on the Canadian economy, and it shows how difficult it is to say that one recession was less painful than another. The 2008-2009 downturn was not as deep overall as the 1981-1982 recession. Nevertheless, Mr. Cross said, it set a post-war record for the most months (eight) of uninterrupted decline, a testament to the severity of the onset of the global economic and financial crisis. As for the recession of the 1990s, it was not as severe as the 1980s or the most recent downturn in terms of the rate of decline or the total peak-to-trough contraction. "However, it was the most prolonged before recovery clearly was sustained," he said.

Therefore, it is not very meaningful to compare the numbers alone and jump to a conclusion.

### Week 3

The latest observation of Dr. Barry E. Brenner, a professor of emergency medicine and internal medicine, as well as program director, in the department of emergency medicine at University Hospital Case Medical Center in Cleveland seems to confirm the findings of previous research that unearthed a complex and as-yet not fully explained relationship between higher than average suicide rates and residency in higher elevations.

"And yet as you go up in altitude the overall death rate, or all-cause mortality, actually decreases," Brenner noted. "So, the fact that suicide rates are increasing at the same time is a really significant and really striking finding."

Even after adjusting for traditional risk factors such as age, race, household income, population density, gender, greater isolation, lower income and greater access to firearms, the authors found that suicide rates (whether involving a firearm or not) were significantly higher than average in those counties with higher altitudes. "It may be related to obesity levels and sleep apnea that may be more common in higher altitudes," Brenner suggested.

• Comment: Since suicide rates remain higher in high-altitude counties even after the authors controlled for the traditional risk factors, the results appear reliable. People with predisposing factors for suicide, such as depression and emotional distress, should therefore either be relocated to lower altitudes or be provided with suicide monitoring and prevention services.

### Week 4

Five to 40 percent or more of births in the United States are induced early without any good medical reason, according to a new hospital-by-hospital report. Babies delivered early face a higher risk of death, of spending time in a neonatal intensive care unit and of life-long health problems, according to a statement from the Leapfrog Group. Why are so many early elective deliveries occurring? According to Maureen Corry, Childbirth Connection's executive director, a recent survey found that the leading reason (accounting for about 25 percent of early births) was caregiver concern that the mother was overdue. About 19 percent were medical inductions, another 19 percent were due to the mother's desire "to get the pregnancy over with," and the final reason (17 percent) was concern about the size of the baby. According to Fleischman, large babies aren't a valid reason for early delivery.

• Comment: According to the report, most induced deliveries are based on the concerns of the mother or the caregiver. Such deliveries are unnecessary and harmful to the babies. Hospitals should therefore be called on to put policies in place to prevent early elective deliveries, and mothers should be better educated in order to reduce their number.

## Questions and Comments on groupwork and class lectures

### Week 1

I am deeply impressed by the R package googleVis, whose motion charts were popularized by Hans Rosling. I installed the googleVis package from CRAN, followed the tutorial and started with the Fruits example, using the following lines of R code:

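library(googleVis) # load the package that provides Fruits and gvisMotionChart()

data(Fruits) # a data frame with 9 observations on 7 variables

M <- gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year") # build the motion chart object

plot(M) # display the chart in the web browser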

I am delighted by the output of the package, and I will try to explore more functions of this powerful tool.

### Week 2

Statistical consulting is far more complicated than I imagined. Unlike mathematical analysis, most statistical consulting problems have no direct or straightforward answer. The correct analysis depends on the question we are asking, the data and the true model. Consider the coffee consumption and heart damage example again. The right answer depends on assumptions that cannot be verified from the data, and the interpretation of the regression coefficients depends on suppositions about the relationships among the variables. Broad knowledge and a deep understanding of the problem itself are required, and proficient statistical skill is a must. This week's lecture was an eye-opener for me.

### Week 3

All three assignment teams did wonderful work in their Assignment 1 presentations. Andy used a simple but very mathematical way to explain Simpson's paradox, which gave me a lot to think about. Our team studied only the googleVis package. I learned about other visualization packages (lattice, rgl and p3d) from the other two groups' presentations.

### Week 4

Prof. Monette talked about longitudinal data analysis in this week's lecture. He mentioned that most statistical analyses involve longitudinal data, and that knowing how to analyze longitudinal data is crucial for a future career. The lectures over the next couple of weeks will focus on longitudinal data analysis. I am looking forward to them.