MATH 6627 2010-11 Practicum in Statistical Consulting/Students/Bin Sun

Hi, my name is Bin Sun. I am a master's student in applied statistics, which is my second major; my first was electronic engineering. I am happy to be a member of this course and to be a classmate, teammate and friend of the others. I find statistics more attractive, and I really enjoy what I am learning and working on. I have some familiarity with consulting, since I have been working as a statistician in a hospital, but I have never been trained as a consultant. I simply try to use my knowledge to solve statistical problems for people who have no background in statistics. That is challenging but interesting. The software I use is mostly SAS and R.

Sample Exam Questions

Note that about half of the questions should be suitable for an in-class exam and the other half for a take home exam.

Week 1

• Q: How can we assess the effect of smoking on life expectancy in an observational study?
• A: In an observational study, we should match subjects so that those within the same smoking-consumption group are as homogeneous as possible in their other characteristics, in order to reduce the effects of lurking or mediating variables.

Week 2

• Q: Is it appropriate to delete a variable that is not significant during the model building process?
• A: Not always. Clients may want to see the effects of some important variables adjusted for the non-significant variable, in which case they would prefer to keep it in the model.

Week 3

• Q: How can we use a plot to assess the interaction between coffee and stress in the example shown in the lecture?
• A: The fitted surface in the 3D plot of the model with coffee and stress is a plane that is not twisted, which means the effect of coffee does not depend on the stress level. Therefore, we do not include an interaction term in this example.

Week 4

• Q: How does the confidence interval for beta differ between simple regression and multiple regression?
• A: In multiple regression the interval describes the effect of the predictor after controlling for the other covariates. The confidence interval in simple regression is usually narrower than the one in multiple regression.

Week 5

• Q: What is the main feature of longitudinal data analysis?
• A: In longitudinal analysis, we include time as a covariate.

Week 6

• Q: What are the different applications of maximum likelihood (ML) and restricted maximum likelihood (REML)?
• A: ML can be used to compare two models with different fixed effects (betas). REML should be used only to compare models that differ in the parameters for their random effects.

Week 7

• Q: Why is the confidence interval for a mean in the multivariate case narrower than the one-at-a-time confidence interval in the univariate case for the same variable?
• A: In the multivariate case we consider several variables simultaneously, so the confidence interval for one variable can be seen as an interval conditional on the other variables. Because of this restriction, it is narrower than the interval in the univariate case. Mathematically, the only thing that matters for the length of the interval here is the critical value, which we can regard as a correction factor. Different tests will lead to different standard errors.

Week 8

• Q: What are the apply functions in R? How many kinds are there, and what does each of them do?
• A: The apply functions execute a function repeatedly on each row or column of a matrix or data frame, or on every element of a list.
• apply: Apply Functions Over Array Margins
• eapply: Apply a Function Over Values in an Environment
• lapply: Apply a Function over a List or Vector
• mapply: Apply a Function to Multiple List or Vector Arguments
• rapply: Recursively Apply a Function to a List
• tapply: Apply a Function Over a Ragged Array
• sapply: A user-friendly version of lapply that by default returns a vector or matrix if appropriate

To see more, you can visit the site http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/ .

Week 9

• Q: What should we do if a main effect and an interaction term involving that main effect are both non-significant?
• A: It is not appropriate to delete both terms at once based only on their p-values. We can try deleting the interaction term first and then look at how the estimates for the remaining covariates change.

Week 10

• Q: How can we understand and interpret complex models?
• A: There are basically two ways: numerical summaries and graphical presentation. The latter is preferred: we can visualize the estimated response as a function of the predictors.

Week 11

• Q: When do we need to standardize a covariate, and what is the consequence?
• A: We can standardize variables when they are not measured on a conventional scale and extremely large or small values occur. There are various reasons to standardize, for example to make sure all variables contribute evenly to a scale when items are added together, or to make the results of a regression or other analysis easier to interpret. The consequence is that the effects on the original scale are harder to read off directly.
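
As a small illustration of that consequence, the sketch below fits a simple least-squares slope on a raw covariate and on its standardized version. The data and the names `income` and `score` are made up for this example. It suggests that the standardized slope is just the raw slope multiplied by the covariate's standard deviation, so the original-scale effect can be recovered by dividing by that standard deviation.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope of y on x for a simple (one-covariate) regression."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data, invented for illustration only.
income = [21000.0, 35000.0, 48000.0, 52000.0, 75000.0, 98000.0]
score = [52.0, 58.0, 61.0, 66.0, 71.0, 80.0]

# Standardize the covariate: subtract the mean, divide by the standard deviation.
sd_income = stdev(income)
z_income = [(v - mean(income)) / sd_income for v in income]

slope_raw = ols_slope(income, score)    # change in score per unit of income
slope_std = ols_slope(z_income, score)  # change in score per standard deviation of income
slope_back = slope_std / sd_income      # back-transformed to the original scale

print(slope_raw, slope_std, slope_back)
```

The standardized slope is easier to compare across covariates, but a reader who wants the effect per original unit has to do the back-transformation shown in the last step.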

Statistics in the Media

Week 1

• Viral Evasion Gene Reveals New Targets for Eliminating Chronic Infections: Walter and Eliza Hall Institute researchers have discovered how a key viral gene helps viruses evade early detection by the immune system. Their finding is providing new insights into how viruses are able to establish chronic infections, leading scientists to reevaluate their approaches to viral vaccine development.--ScienceDaily (Jan. 11, 2011)
http://www.sciencedaily.com/releases/2011/01/110105091142.html

The team has been investigating a virus called gamma herpesvirus-68, which establishes chronic infections in mice and provides a model of the workings of the human gamma herpesvirus Epstein-Barr Virus, commonly known to cause infectious mononucleosis, or 'kissing disease'. Their results, which have been published in the Journal of Immunology, show that a viral gene called K3 rapidly disables the antigen-processing machinery normally used by dendritic cells to alert the immune system to infections.

"This gene quickly helps the virus to hide from the immune system by subverting normal antigen presentation to T cells, which have the critical task of destroying virally-infected cells," Dr Belz said. "The virus carries out a top-secret operation. It shuts down the normal mechanisms that allow the immune system to recognise an infection and then boards the antigen-presenting cells which ferry the virus through the blood and tissues, allowing it to spread throughout the body and establish system infection."

• Comment:

This article seems to report a purely medical discovery. I wondered how they found this one particular gene among so many and related it to chronic infection, and whether the result rests only on medical knowledge and experiment. With the development of statistical genetics, this discovery should also be supported by statistical genetic models.

Week 2

• Existing home sales fall 14% in December
http://www.cbc.ca/consumer/story/2011/01/14/december-home-resales-crea.html

Sales of existing homes in Canada fell 14.4 per cent last month from a record-setting December a year ago, the Canadian Real Estate Association reported Friday. About 447,010 homes traded on the Canadian Multiple Listing Service in 2010, down 3.9 per cent from 2009.

• Comment:

Based on these numbers, two weeks ago I might have said that it could be a good time to be a homeowner, since the market seemed to be starting to plateau. Now I would say that we need to combine these simple numbers with other indicators, that is, other important covariates such as interest rates.

Week 3

• Mortality Rates from Liver Diseases Underestimated, Researchers Say

http://www.sciencedaily.com/releases/2010/11/101101115616.html Statistics from the Centers for Disease Control and Prevention (CDC) rank mortality related to chronic liver disease and cirrhosis as the 12th most common cause of death in adults in the U.S. Using a modified definition that includes diseases such as viral hepatitis, liver cancer and obesity-related fatty liver disease (liver diseases), Mayo Clinic-led researchers have found that liver-related mortality is as high as fourth for some age groups, and eighth overall.--ScienceDaily (Nov. 1, 2010)

• Comment:

Using a different definition of the event can lead to hugely different results. This article used a modified definition of the event, which is death in this case, and reached a very strong conclusion. So in survival analysis we should be careful about how events are defined.

Week 4

• New Statistical Model Moves Human Evolution Back Three Million Years

http://www.sciencedaily.com/releases/2010/11/101105124241.htm Evolutionary divergence of humans and chimpanzees likely occurred some 8 million years ago rather than the 5 million year estimate widely accepted by scientists, a new statistical model suggests.--ScienceDaily (Nov. 5, 2010)

• Comment:

We could treat the incomplete fossil record as missing information and the waiting time until genes diverge as a stochastic process, memoryless or with memory, so perhaps the EM algorithm could be used to estimate the time at which humans emerged. This is just my first thought on how this study relates to statistics. The model could be far more complicated than this, but it is just a matter of time before scientists figure it out.

Week 5

• Model terrorism

http://stats.org/stories/2011/model_terrorism_feb7_11.html Can statistics help us predict and pre-empt terrorist attacks?

• Comment:

In contrast to the well-known models based on the bell-shaped normal distribution, this article introduces models that fit a power law distribution, in which there are a high number of small-casualty attacks, a few large-scale attacks, and a sharply descending curve in between. Traditional normal models treat extreme events as outliers, which is not appropriate in cases like this, since the events of interest are largely extreme ones. Power law models seem to fit better for both prediction and pre-emption. Although some questions might arise about such models, they broaden the view of statisticians and leave open problems for other statisticians to work out.
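
To make the contrast concrete, here is a rough sketch comparing the upper-tail probability of a power law (Pareto) distribution with that of a normal distribution that has the same mean and standard deviation. The parameters `x_min` and `alpha` are assumptions chosen only for illustration; they are not taken from the article.

```python
from math import sqrt
from statistics import NormalDist


def pareto_tail(x, x_min, alpha):
    """P(X > x) for a Pareto distribution with scale x_min and tail index alpha."""
    if x < x_min:
        return 1.0
    return (x_min / x) ** alpha


# Hypothetical parameters (alpha > 2 so that the variance is finite).
x_min, alpha = 1.0, 2.5

# Mean and standard deviation of that Pareto distribution.
pareto_mean = alpha * x_min / (alpha - 1)
pareto_sd = sqrt(alpha * x_min ** 2 / ((alpha - 1) ** 2 * (alpha - 2)))

# A normal distribution matched on mean and standard deviation.
normal = NormalDist(mu=pareto_mean, sigma=pareto_sd)


def normal_tail(x):
    """P(X > x) under the matched normal, using symmetry to keep precision in the far tail."""
    return normal.cdf(2 * pareto_mean - x)


for size in (5, 10, 50):
    print(size, pareto_tail(size, x_min, alpha), normal_tail(size))
```

Under these assumed parameters the normal model assigns almost no probability to very large events, while the power law still gives them a small but non-negligible chance, which is why treating them as outliers would be misleading.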

Week 6

• Air Fare

http://www.statcan.gc.ca/daily-quotidien/110222/dq110222c-eng.htm The article says that the average domestic and international air fare (all types) paid by passengers was $235.50 in the first quarter, up 1.2% from the first quarter of 2009. This marked the first increase following five consecutive quarterly declines. As we all know, since 2010 there have been more unexpected fees, such as a fee for luggage over 20 kg (compared with 40 kg before) and a food fee, due to dramatically increasing gas prices. It appears to me that the number shown in the article is an underestimate.

Week 7

The article 'How safe are Canadians abroad?' from CBC News compares incident rates over the past 5 years among the countries that Canadians travel to. The statistics are based on figures from the Department of Foreign Affairs and International Trade. Only assaults and deaths reported to the government are included. The figures at the beginning of the article suggest that the data could support a typical longitudinal analysis, although the article uses only simple statistics such as proportions and averages.

Week 8

Are chemicals killing us?

I was attracted by the title. Everyone would ask such a question, since chemicals are everywhere in our daily lives. We intuitively want to protect ourselves and our families, so we are easily misled by the media. As the survey tells us, the media overplays individual studies relative to the overall body of evidence and gives too much attention to the views of individual scientists relative to those of the broader toxicological community.

Week 9

According to a new study, fathers experiencing depression may be four times more likely to spank and half as likely to read to their children compared to fathers not suffering from depression.

The study consisted of over 1,700 fathers of one-year olds, who reported living with their children all or most of the time. The fathers were assessed for depression and 7 percent had symptoms consistent with major depression at some point within the last year.

Overall, 15 percent of fathers reported spanking their children in the past month. Out of the fathers who were not suffering from depression, 13 percent reported spanking, while among depressed dads, 41 percent had spanked their kids.
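
As a quick arithmetic check, the sketch below uses only the percentages quoted above to see whether they are consistent with each other, and to compare the unadjusted risk ratio and odds ratio with the "four times more likely" headline. That the headline figure comes from an adjusted model is my assumption; the summary above does not say.

```python
# Percentages quoted in the article summary above.
p_depressed = 0.07        # share of fathers with symptoms of major depression
p_spank_depressed = 0.41  # share of depressed fathers who reported spanking
p_spank_not = 0.13        # share of non-depressed fathers who reported spanking

# Overall spanking rate implied by the two groups (law of total probability).
overall = p_depressed * p_spank_depressed + (1 - p_depressed) * p_spank_not

# Unadjusted comparisons between the two groups.
risk_ratio = p_spank_depressed / p_spank_not
odds_ratio = (p_spank_depressed / (1 - p_spank_depressed)) / (
    p_spank_not / (1 - p_spank_not)
)

print(f"implied overall rate: {overall:.3f}")
print(f"risk ratio: {risk_ratio:.2f}")
print(f"odds ratio: {odds_ratio:.2f}")
```

The implied overall rate is close to the reported 15 percent, so the numbers hang together. The raw ratio of proportions is a little over 3, while the odds ratio is above 4, which shows how much the wording "times more likely" depends on which measure is being reported.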

Week 10

This is a longitudinal study in which researchers followed 891 patients with diabetes treated by 29 doctors for three years. They discovered that the patients of the doctors who scored highest on empathy were 16% more likely to have good control over their blood sugar and 15% more likely to have better cholesterol levels than patients of physicians with the lowest empathy scores.

Week 11

The article disputes the research results and lists several reasons why the research conclusion is questionable.

Questions and Comments on Groupwork and Class Lectures

Week 1

• Q: For the example about smoking and life expectancy that we discussed in class, we know that smoking is bad for health, so we looked into what was wrong in the modeling. In reality, how could we detect such mistakes if we did not know this fact, or did not know about a mediating variable like health spending in our example? It seems to me that modeling makes no sense in such cases.

A: I agree. The saying "Do not put your faith in what statistics say until you have carefully considered what they do not say." (William W. Watt) certainly holds true in this instance. This is why experimental design is such an important aspect of statistics. If your experiment is poorly designed it is easy to arrive at misleading conclusions, which was the case in the smoking example.--Lawarren 17:11, 9 January 2011 (EST)

Week 2

Comment: This is the first time I have seen the relations among three variables through a 3D plot instead of by calculating correlations, and the first time I have understood the coefficients through an ellipse plot instead of, again, formulas. I find it amazing how vividly all these kinds of plots let us deal with statistical problems.

Week 3

The saying 'Correlation is not always causation' is very interesting; it gives us some clues about causal relations.

Week 4

The relations among the coefficients can be explained by the chain rule in calculus. This is another interesting finding that I learned in this course. I don't know the rationale behind it, but it appears to be reasonable.

Week 5

I have just learned that, to detect an interaction between two variables in a model, we can condition on each level of one variable A and look at the parameter estimates for the other variable B. If the estimates change detectably, which means that the effect of B depends on the level of A, we include the interaction; otherwise, we do not consider the interaction between A and B.
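
The idea can be sketched as follows: split the data by the level of A and estimate the slope of the response on B within each level. The data below are invented for illustration. This only eyeballs the difference in slopes; a formal check would test the interaction term in a single model.

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y on x for a simple (one-covariate) regression."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical rows: (level of A, value of B, response y).
rows = [
    ("low", 1.0, 2.1), ("low", 2.0, 3.9), ("low", 3.0, 6.1), ("low", 4.0, 8.0),
    ("high", 1.0, 2.0), ("high", 2.0, 2.4), ("high", 3.0, 3.1), ("high", 4.0, 3.5),
]

# Condition on each level of A and estimate the effect of B within that level.
slopes = {}
for level in sorted({a for a, _, _ in rows}):
    b_values = [b for a, b, _ in rows if a == level]
    y_values = [y for a, _, y in rows if a == level]
    slopes[level] = ols_slope(b_values, y_values)

for level, slope in slopes.items():
    print(f"A = {level}: slope of y on B = {slope:.2f}")
```

In this made-up data the slope of B is about 2 when A is low and about 0.5 when A is high, which is the kind of change that would suggest including an A by B interaction.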

Week 6

I have some questions about hierarchical modeling. First, can the between and within effects of a level 2 variable be detected directly from plots? Second, when we see an obvious nonlinear trend in a bivariate plot, why were no nonlinear mixed effects models considered in the model building process?

Week 7

After learning about the hierarchical model, I was at first confused about the concept of levels, and I even thought that a j*i*n contingency table could be treated as variables with levels. I think this is a simple mistake that beginners might make. When we have a data set and a bunch of models in our heads, it is better to sit back and look at the data for a while instead of immediately writing code for some arbitrary model.

Week 8

About the result of the Wald test from R: what is the denDF that is reported?

Week 9

So far we have only learned how to deal with longitudinal models with 2 levels. What about models with more than 2 levels? In our project, for example, there are 4 levels of variables. How complicated could the problem be, and what methods are there to deal with it?

Week 10

A question about the rationale behind using the second derivative to see the effect of x on y when the model has response y and the covariate term x*z*w.

Week 11

During the modeling in our project, I made the mistake of treating a factor as a numerical variable in the model. Sometimes this is appropriate, for example when we want to measure the overall effect of one variable. In genetic association studies, the SNP variable with 3 levels is treated as numerical to obtain the additive model.
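
The difference between the two codings can be sketched as below. The genotypes, the choice of "G" as the minor allele, and the function names are all hypothetical. The additive coding counts copies of the minor allele, giving one numeric column, while the factor coding gives one indicator column per non-reference genotype.

```python
def additive_code(genotype, minor_allele):
    """Number of copies of the minor allele in the genotype (0, 1 or 2)."""
    return genotype.count(minor_allele)


def dummy_code(genotype, levels):
    """Indicator columns for every level except the first (reference) level."""
    return [1 if genotype == level else 0 for level in levels[1:]]


# Hypothetical genotypes at one SNP, with "G" assumed to be the minor allele.
genotypes = ["AA", "AG", "GG", "AG", "AA"]
levels = ["AA", "AG", "GG"]

# Numeric (additive) coding: one column, so one slope per extra copy of G.
numeric = [additive_code(g, "G") for g in genotypes]

# Factor coding: two columns, so a separate effect for AG and for GG versus AA.
factor = [dummy_code(g, levels) for g in genotypes]

print(numeric)
print(factor)
```

With the numeric coding the model estimates a single per-allele effect, which is the additive assumption; with the factor coding the effects of AG and GG are free to differ in any way, at the cost of one more parameter.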