MATH 6627 2010-11 Practicum in Statistical Consulting/Students/Jessica Li
I am a Master's student in Applied Statistics. I have a Bachelor's degree in Actuarial Science, and I have some background in statistics and R.
Sample Exam Questions
- Q: In the example of modelling cigarette consumption vs. life expectancy, we compare cigarette consumption across countries. In reality, these countries could differ in all sorts of ways other than this factor. We know that the ideal solution, introduced by Fisher, is the experiment. What is Fisher's idea of an experiment?
- A: 1) Make sure all the groups being compared are similar except for chance. 2) Take one country of willing subjects and randomly split it into several groups with different amounts of cigarette consumption. 3) Compare the life expectancy of these groups to see whether the groups with lower cigarette consumption have higher life expectancy. Because the groups differ only by chance, a difference can be attributed to cigarette consumption.
- Q: In the example of studying the inheritance of height from father to son, and from the graph of the ellipse, we find that the SD line is not the regression line of son's height on father's height: the son's expected deviation from the mean is only a proportion of the father's deviation from the mean. Why do we use ellipses?
- A: Ellipses make it easy to derive the formulas for statistical theory and models. In addition, ellipses drawn on scatterplots and other graphs help with data analysis, for example in studying the correlation between variables, judging how significant a variable is, and calculating confidence intervals.
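The difference between the SD line and the regression line in the question above can be checked numerically. The following is only a minimal sketch in Python: the heights are made-up numbers for illustration, not the father/son data used in class, and the variable names are my own.

```python
from math import sqrt

# Hypothetical father/son heights in inches (made-up values for illustration).
fathers = [64, 66, 67, 68, 69, 70, 71, 73]
sons = [66, 66, 68, 68, 70, 69, 70, 71]

n = len(fathers)
mean_f = sum(fathers) / n
mean_s = sum(sons) / n

# Sums of squares and cross-products around the means.
sxx = sum((f - mean_f) ** 2 for f in fathers)
syy = sum((s - mean_s) ** 2 for s in sons)
sxy = sum((f - mean_f) * (s - mean_s) for f, s in zip(fathers, sons))

r = sxy / sqrt(sxx * syy)          # correlation coefficient
sd_line_slope = sqrt(syy / sxx)    # SD(son) / SD(father)
regression_slope = sxy / sxx       # least-squares slope of son on father

print(f"r = {r:.3f}")
print(f"SD line slope = {sd_line_slope:.3f}")
print(f"regression slope = {regression_slope:.3f}")
```

The regression slope equals r times the SD line slope, so whenever the correlation is below 1 the regression line is flatter than the SD line. This is the "proportion" mentioned in the question.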
- Q: What is beta space, and why do we want to look at our data in beta space?
- A: In data space, the axes are variables and the points are observations. To understand hierarchical data better, we want a more natural display of our models, in a space called beta space. In beta space, the axes are coefficients and the points are models represented by their coefficients.
- Q: Thinking about strategy with observational data, what tools can we use if we want to control for all the Zs that we can?
- A: We can use any one of the three tools below. The first is statistical control: include the Zs in a statistical model and adjust statistically. Modern thinking is that we don't need all the Zs in the model to avoid bias, only the “propensity score”. This can fail if the model is wrong. The second is matching: this technique only performs comparisons between observations with similar Zs. Again, if you have the Zs, all that really matters to avoid bias is the propensity score. The third is structural methods, which build a structural causal model.
- Q: In the example data set, Health as predicted by Weight and Height, we have three canonical outliers. What are the types of these outliers, and how do we distinguish them?
- A: The first type has typical values for the predictors but an atypical Y. It has little impact on beta-hat, but it increases the size of confidence intervals and decreases power. The second type has atypical values for the predictors but a Y consistent with the other data. It has little impact on beta-hat and shrinks confidence intervals, which creates a false sense of power if the point is not valid. The third type has atypical values for the predictors and a Y that is not consistent with the other data. It has a large impact on beta-hat, could shrink or expand CIs, and makes a mess of everything.
- Q: In the hierarchical data (high school) example, what is the difference between Robinson's Paradox and Simpson's Paradox?
- A: Robinson's Paradox refers to the fact that Beta-W (the within-school slope) and Beta-B (the between-school slope) can have different signs. Simpson's Paradox refers to the fact that Beta-W and Beta-P (the pooled slope, which ignores schools) can have different signs.
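To see how the three slopes can disagree, here is a minimal sketch in Python. The three "schools" and their (x, y) values are invented for illustration (they are not the high school data from the course), and the helper names `slope`, `centre` and `group_mean` are my own.

```python
def group_mean(points):
    """Return (mean of x, mean of y) for a list of (x, y) pairs."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    mean_x, mean_y = group_mean(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def centre(points):
    """Subtract the group's own means from every point in the group."""
    mean_x, mean_y = group_mean(points)
    return [(x - mean_x, y - mean_y) for x, y in points]


# Hypothetical data: inside each school y falls as x rises,
# but schools with a higher average x also have a higher average y.
schools = {
    "A": [(-1, 1), (0, 0), (1, -1)],
    "B": [(9, 11), (10, 10), (11, 9)],
    "C": [(19, 21), (20, 20), (21, 19)],
}

all_points = [p for pts in schools.values() for p in pts]
centred_points = [p for pts in schools.values() for p in centre(pts)]
school_means = [group_mean(pts) for pts in schools.values()]

beta_w = slope(centred_points)  # within-school slope
beta_b = slope(school_means)    # between-school slope (schools have equal size here)
beta_p = slope(all_points)      # pooled slope, ignoring schools

print(f"Beta-W = {beta_w:.3f}, Beta-B = {beta_b:.3f}, Beta-P = {beta_p:.3f}")
```

In this made-up example the within-school slope is negative while both the between-school and pooled slopes are positive, so the same data show both paradoxes at once.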
- Q: Compare the Scheffé method and the Bonferroni method for confidence intervals.
- A: The Bonferroni adjustment applies very generally, but it suffers from the limitation that the number of simultaneous confidence intervals must be specified in advance in order to maintain the validity of the simultaneous coverage. The Scheffé method is an alternative that controls the simultaneous coverage over all contrasts in a subspace. If a large number of contrasts are of interest, or it is not known in advance which ones are of interest, the Scheffé method provides a way to “snoop” through all the possibilities while maintaining control of the coverage of confidence intervals and the false positive rates of tests. When the number of contrasts to be estimated is small, Bonferroni is better than Scheffé. In fact, unless the number of desired contrasts is at least twice the number of factors, Scheffé will always show wider confidence bands than Bonferroni.
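The trade-off can be illustrated by comparing the multipliers that each method applies to a standard error. This is only a rough sketch in Python under simplifying assumptions that are mine, not from the lecture: the sample is large enough that normal and chi-square quantiles can stand in for t and F quantiles, and the subspace of contrasts has dimension 2, where the chi-square quantile has the closed form -2 ln(alpha). The function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

ALPHA = 0.05


def bonferroni_multiplier(m, alpha=ALPHA):
    """Large-sample multiplier for m pre-specified two-sided intervals."""
    return NormalDist().inv_cdf(1 - alpha / (2 * m))


def scheffe_multiplier_dim2(alpha=ALPHA):
    """Large-sample Scheffé multiplier for a 2-dimensional subspace.

    It is the square root of the chi-square(2) quantile at 1 - alpha,
    which equals -2 * ln(alpha) because chi-square(2) is exponential.
    """
    return sqrt(-2 * log(alpha))


scheffe = scheffe_multiplier_dim2()
for m in range(1, 7):
    bonf = bonferroni_multiplier(m)
    narrower = "Bonferroni" if bonf < scheffe else "Scheffé"
    print(f"m = {m}: Bonferroni {bonf:.3f}, Scheffé {scheffe:.3f}, narrower: {narrower}")
```

The Scheffé multiplier does not depend on how many contrasts we look at, while the Bonferroni multiplier grows with the number of intervals m, so Bonferroni gives narrower intervals only while m stays small.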
- Q: Briefly introduce the Wald test. What is the problem with the Wald test?
- A: The Wald test is a way of testing the significance of particular explanatory variables in a statistical model. In logistic regression, for example, we have a binary outcome variable and one or more explanatory variables, and each explanatory variable in the model has an associated parameter. The Wald test is one of a number of ways of testing whether the parameters associated with a group of explanatory variables are zero. If the Wald test is significant for a particular explanatory variable, or group of explanatory variables, then we conclude that the parameters associated with these variables are not zero, so the variables should be included in the model. If the Wald test is not significant, then these explanatory variables can be omitted from the model. However, there are problems with the Wald statistic: for large coefficients the standard error is inflated, which lowers the Wald statistic (chi-square) value, and for small sample sizes the likelihood-ratio test is more reliable than the Wald test.
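For a single coefficient, the Wald statistic is just the estimate divided by its standard error. A minimal sketch in Python, where the function name `wald_test` and both sets of numbers are hypothetical and chosen only to illustrate the problem with large coefficients:

```python
from statistics import NormalDist


def wald_test(estimate, std_error):
    """Wald test of H0: parameter = 0 for one coefficient.

    Returns the z statistic, the equivalent 1-df chi-square statistic,
    and the two-sided p-value from the standard normal distribution.
    """
    z = estimate / std_error
    chi_square = z ** 2
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, chi_square, p_value


# Hypothetical moderate coefficient with a modest standard error.
z1, chi1, p1 = wald_test(0.8, 0.3)

# Hypothetical very large coefficient whose standard error is inflated.
z2, chi2, p2 = wald_test(12.0, 9.0)

print(f"moderate coefficient: z = {z1:.2f}, chi-square = {chi1:.2f}, p = {p1:.4f}")
print(f"large coefficient:    z = {z2:.2f}, chi-square = {chi2:.2f}, p = {p2:.4f}")
```

In the second case the coefficient is much larger but the test is not significant, because the inflated standard error pulls the statistic down. This is the situation in which a likelihood-ratio test is the safer choice.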
- Q: In the process of fitting a model, is it correct to drop variables from the model by looking only at the p-values of these variables in the fitted model?
- A: We cannot drop more than one variable at a time from the model based on the individual p-values, since after we drop one variable the p-values of the other variables change. Therefore, if we would like to drop two or more variables at the same time, we should do a Wald test of those variables jointly.
- Q: What is EBLUP, and what does it do?
- A: If we replace the unknown parameters in the BLUP (best linear unbiased predictor) with their estimates, we get the EBLUP (Empirical BLUP). The EBLUP optimally combines the information from the ith cluster with the information from the other clusters. We borrow strength from the other clusters.
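The idea of borrowing strength can be sketched for the simplest case, a random-intercept model. This is only an illustration in Python under assumptions of my own: the data are made up, and the variance estimates `tau2` (between clusters) and `sigma2` (within clusters) are treated as already estimated rather than fitted here. The function name is hypothetical.

```python
def eblup_cluster_means(clusters, tau2, sigma2):
    """EBLUP of each cluster's mean in a random-intercept model.

    clusters: dict mapping cluster name to a list of observations
    tau2:     estimated between-cluster variance (assumed given)
    sigma2:   estimated within-cluster variance (assumed given)
    """
    summary = {name: (sum(ys) / len(ys), len(ys)) for name, ys in clusters.items()}

    # Overall mean: cluster means weighted by the precision of each mean.
    precision = {name: 1 / (tau2 + sigma2 / n) for name, (_, n) in summary.items()}
    mu_hat = sum(precision[k] * summary[k][0] for k in summary) / sum(precision.values())

    predictions = {}
    for name, (ybar, n) in summary.items():
        weight = tau2 / (tau2 + sigma2 / n)  # weight on the cluster's own mean
        predictions[name] = mu_hat + weight * (ybar - mu_hat)
    return mu_hat, predictions


# Made-up data: one small cluster and one large cluster.
clusters = {
    "small": [10, 14],
    "large": [20, 22, 19, 21, 20, 22, 18, 22],
}
mu_hat, predictions = eblup_cluster_means(clusters, tau2=4.0, sigma2=8.0)

print(f"overall mean estimate: {mu_hat:.2f}")
for name, value in predictions.items():
    print(f"{name}: raw mean {sum(clusters[name]) / len(clusters[name]):.2f}, EBLUP {value:.2f}")
```

Each prediction lies between the cluster's own mean and the overall mean, and the small cluster is pulled further toward the overall mean because its own mean is less reliable.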
- Q: Compare model fitting for non-linear and linear models. What are the differences?
- A: 1. With a linear model we only need to specify the predictors. We don't need to say anything about the parameters because it is understood that there is exactly one parameter for each regressor and each parameter multiplies its regressor. The formula for a non-linear model needs to specify both the parameters and the regressors. 2. The algorithm for fitting a non-linear model is iterative and needs starting values, which you generally need to supply. 3. In non-linear mixed effects models (with nlme), parameters in the non-linear model can themselves be modeled through linear models, potentially based on other predictors. This allows the non-linear model to be simpler, since it only needs to capture the essentially non-linear aspects of the model. This formulation is easier to fit numerically.
Statistics in the Media
- Computer Model for Projecting Severity of Flu Season: Researchers have developed a statistical model for projecting how many people will get sick from seasonal influenza based on analyses of flu viruses circulating that season.
The research, conducted by scientists at the National Institutes of Health, appears December 8 in the open-access publication PLoS Currents: Influenza.
Using data from the 1993/1994 through 2008/2009 seasons, the research showed that the severity of infections with the influenza A virus is related to its novelty. In addition, 90% of the variation in influenza severity over the period studied could be explained by the novelty of the virus's hemagglutinin protein. The hope is that this result can help improve the ability to accurately predict influenza severity, so that scientists can use appropriate surveillance methods to make more informed decisions in planning for influenza, including the selection of vaccines. However, in reality there are many other sources of variation. For example, people differ in age, live in different areas, and have different health conditions.
- High Tech Crime Fighting Tool: Computer Science Analyzes And Predicts Crime
Engineers at the University of Virginia have developed a new program, called web-cat, that allows police to easily access crime data online and spot trends that show what types of crimes happen most often, and where. Users can look for criminal activity by typing in specific dates, choosing types of crimes, choosing locations, or finding out what weapons are used most. The system then produces graphs, reports, and maps of high crime areas. The crime-fighting tool is also being upgraded to predict locations of future crimes. In the news story, the people involved said: "We found that a lot of our residential break-ins were occurring Mondays and Wednesdays -- then we were able to pass that on to the patrol officers, this has been shown to help police officers because if they can get an idea of what's going on, they can make better predictions as to where crimes are likely to happen in the future." Obviously, crimes are more likely at night and at entertainment venues. However, criminals rarely choose the same place or the same person as a target twice. Crimes occur all over the city. It is also possible that the criminals come from other cities or countries. The situation is very complicated, and it is very hard to predict the time, location and type of future crimes.
- Drinkers Down Under switching from beer to wine
In a report titled "No Longer a Nation of Beer Drinkers," the Australian Bureau of Statistics said that beer consumption has fallen gradually but consistently since the 1960s, while consumption of wine and spirits has increased. From the statistical data, the researchers found that at the start of the 1960s beer made up 76 percent of all pure alcohol consumed in Australia, but in recent years this has fallen to 44 percent. Over the same period, wine consumption has increased threefold, to 36 percent, while the intake of spirits has nearly doubled, to 20 percent. Why have wine and spirits increased so much over this period? The researchers noted that the increased consumption was likely to have been affected by numerous factors, such as different age patterns in the population, increasing affluence and the growth of the Australian wine industry. Moreover, changing taxes and the introduction of random breath testing are a few of the factors that could have cut consumption. Over time, people may also have come to believe that wine is better for health than beer. On the other hand, the rise in alcohol consumption comes at a cost: alcohol abuse is costing Australians $36 billion a year.
- Canadians spend most of waking life sedentary
StatsCan collected the data in a survey of the physical activity patterns of Canadian adults and kids, and divided its findings into two reports: one addressing physical activity in Canadian adults between the ages of 20 and 79, and the other examining young people between 6 and 19 years old. According to the results of this study, from the Canadian Health Measures Survey (CHMS) released by Statistics Canada, only 15 per cent of adults achieve the minimum amount of daily recommended exercise. In addition, the survey mentions that young people fare even worse, with just 7 per cent of those aged 5 to 17 attaining the minimum level of physical activity each day. The data reveal that adults are sedentary for an average of 9.5 hours each day, while children and youth spend 8.6 hours engaged in sedentary activities such as watching television. The survey also found that men are more physically active than women on average. The same pattern appears among teenagers: more than 80 per cent of boys and 70 per cent of girls manage to squeeze in 30 minutes of activity three days a week. From my point of view, the survey should take into account people whose jobs require them to do heavy lifting and to be out and about for most of their work day. This would be more than enough exercise for most people and would improve these numbers dramatically. Similarly for teenagers, the survey should also consider whether they work part-time while studying as full-time students.
- More private liquor stores, more alcohol deaths?
The research is based on a study of 89 local areas of British Columbia. The researchers found that the number of private alcohol retailers rose by 40 percent between 2003 and 2008, while the liters of alcoholic beverages sold at those stores each year shot up 84 percent. The researchers say the findings raise concerns about the public health impact of privatizing alcohol sales, a move being considered by a number of Canadian and U.S. jurisdictions that currently restrict liquor sales to government-run stores. Over the same period, the number of government-run stores in the province declined slightly. The findings are based on government data for the 89 distinct local "health areas" in British Columbia. Across those communities, the study found, an average of eight out of every 10,000 people died of an alcohol-related cause each year between 2003 and 2008. Alcohol-related deaths included any death with "alcohol" listed on the death certificate, including causes like alcohol poisoning and drunk driving as well as chronic conditions related to drinking, like liver cirrhosis. Province-wide, the annual alcohol-related death rate actually dipped somewhat in the latter years of the study period, but within local health areas the researchers found a correlation between the concentration of private liquor stores and alcohol-related deaths.
- Video Games Are Good for Girls
Researchers from Brigham Young University's School of Family Life conducted a study on video games and children between 11 and 16 years old. They found that girls who played video games with a parent enjoyed a number of advantages. A father who still hasn't given up video games now has some justification to keep on playing, if he has a daughter. The study involved 287 families with an adolescent child. The researchers measured several outcomes, such as positive behavior, aggression, family connection and mental health. The result is interesting: for boys, playing with a parent was not a statistically significant factor for any of the outcomes the researchers measured. Yet for girls, playing with a parent accounted for as much as 20 percent of the variation in those measured outcomes.
Ranking tennis players is a novel way to show how complex network analysis can reveal interesting facts hidden in statistical data. Male tennis players who played in at least one Association of Tennis Professionals match between 1968 and 2010 were evaluated through network analysis. The researchers ran an algorithm, similar to the one used by Google to rank Web pages, on digital data from hundreds of thousands of matches. The data was pulled from the Association of Tennis Professionals website. They quantified the importance of players and ranked them by a "tennis prestige" score. This score is determined by a player's competitiveness, the quality of his performance and his number of victories. Fans may think of Jimmy Connors as an "old school" tennis player, but he ranks on top under the new ranking system, in part because he played for more than 20 years and had the opportunity to win a lot of matches against other very good players. "The rankings are a snapshot of who is at the top at this time," Radicchi said. "Players who have yet to retire are penalized with respect to those who have ended their careers. Prestige scores strongly correlate with the number of victories, and active players haven't played all the matches of their careers yet." Researching and ranking sports stars gives a glimpse of the power of complex network analysis.
Hi Jessica, I'm not sure why, but the link you have provided for the tennis article isn't working. I searched for the article in google and found it at the following link in the event somebody else is interested in viewing the article. --Lawarren 10:48, 9 March 2011 (EST)
- Thank you very much, Lawarren.
In 2011, the USRC Hotel Investment survey released a prediction of good growth in the hotel industry, based on its data viewed in the context of longer-term trends. The data include parameters such as capitalization rates, discount rates, income and expense growth expectations, marketing time, debt parameters, and other data for both full-service and limited-service hotels. The researchers found that full-service discount rates are now at their lowest level since the Mid Year 2007 survey, just prior to rates beginning to creep up in 2008 and the subsequent economic and credit collapse of the Fall of 2008. They are now just 10 basis points higher than the record low seen in the Winter 2007 survey, and are hovering very near the survey results of the last recovery years of 2005-2006, following the post 9-11 downturn. As yield requirements lessened, anticipated growth continued to increase, even beyond the very healthy movement seen in the Mid Year 2010 survey.
Risk statistics can be used persuasively to present health interventions in different lights. The different ways of expressing risk can prove confusing and there has been much debate about how to improve the communication of health statistics. Choosing the appropriate way to present risk statistics is key to helping people make well-informed decisions. A new Cochrane Systematic Review found that health professionals and consumers may change their perceptions when the same risks and risk reductions are presented using alternative statistical formats. In the new study, Cochrane researchers reviewed data from 35 studies assessing understanding of risk statistics by health professionals and consumers. They found that participants in the studies understood frequencies better than probabilities. Although the researchers say further studies are required to explore how different risk formats affect behaviour, they believe there are strong logical arguments for not reporting relative values alone.
A recent poll of global travellers studied travellers' preference between rail and air. 90% of respondents said they would like to see rail options displayed alongside flights when searching for travel. The poll concluded that time, cost and comfort are the 3 key factors considered by consumers when booking travel, and that flying is coming in second increasingly often. On time: the results reveal that travellers consider total travel time, getting from door to door, and the full travel experience when choosing a mode of transport. Most people would accept a longer door-to-door time to avoid the process of checking in, security and boarding; they would willingly add an hour or more of total travel to their trips to avoid the hassles of long lines, airport security and baggage fees. On cost: most people take action to avoid paying bag fees, strategizing packing days in advance and stuffing carry-ons to maximum capacity to avoid checking bags. On comfort: most people are expecting more comfortable waiting areas for their travel in the future.
This survey provides a comprehensive look at the prevalence of common mental disorders. It was conducted from 2001 to 2004 and had 3042 participants. The results include data from children and adolescents ages 8 to 15. In the study, the young people were interviewed directly. Family members also provided information about their children's mental health. The researchers tracked six mental disorders: generalized anxiety disorder (GAD), panic disorder, eating disorders (anorexia and bulimia), depression, attention deficit hyperactivity disorder (ADHD) and conduct disorder. The participants were also asked about what treatment, if any, they were receiving. The results show that 13 percent of respondents met the criteria for at least one of the six mental disorders within the last year. About 1.8 percent of the respondents had more than one disorder, usually a combination of ADHD and conduct disorder. The researchers found that males are more likely than females to have ADHD, but females are more likely than males to have depression. I think the reasons might be that boys are usually more active than girls, and girls are more emotional than boys. The purpose of the study is to transform the understanding and treatment of mental illnesses through basic and clinical research, paving the way for prevention, recovery and cure.
Questions and Comments
- The Question posted on Bin's blog.
I guess that one could select the sample from different cities in one country instead of selecting different countries, in order to reduce the differences in conditions.
- Before I took this course, I had no idea what consulting involves or how to work through a problem step by step to help others make the right decision. From this week's lecture, I realize that it is not easy to be a good consultant. Usually the situation is very complicated, and what we can manipulate is limited by many restrictions at the same time. During the consulting process, we need to identify the variables in the problem, then pick out the relationships between these variables that we are interested in. Moreover, we have to decide whether to use observational data or experimental data to analyze the problem. There are many things to consider, such as how to select the sample for analysis, and whether y is really caused by x or the association arose by chance. I think it is a big challenge for me, and I find it very interesting as well.
- In class we saw the multiple regression model represented by a plane in data space, which is represented in beta space by a point showing the fitted slope with respect to Coffee and the fitted slope with respect to Stress. Here we only need 2 dimensions, but for some complex models we may need 3 or more dimensions for the beta space. I just wonder, in these situations, how can we draw ellipses to illustrate the interactions between the variables?
- Comment: If we have 3 or more parameters, then we need 3D or higher dimensional ellipse. This ellipse in higher dimension would be larger because it is trying to be correct for 3 or more parameters at the same time.-Gurpreet
- Thank you very much!
- For controlling for all the Zs, it was mentioned that we can use one of those three techniques. The first one is statistical control. I am not sure about the meaning of the propensity score, although it says "prediction of X from Zs".
Comment: A propensity score is the probability of a particular person being allocated to a specific group in a study given certain covariates. In a truly randomized study, we are usually safe to assume that participants are randomly assigned to groups, so that any particular covariate does not make a certain person more likely to be assigned to a certain group. In observational or "quasi-experimental" designs, we don't have random assignment, so we measure how much selection bias is present by regressing the odds of being in a particular group on the covariates measured in the study. We want group allocation to be "unconfounded" with any covariates (such as individual differences). - Constance Thank you Constance!
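To make the definition concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the records are invented, there is a single binary covariate Z so that the propensity score can be estimated by simple proportions instead of a logistic regression, and the outcome was constructed so that the true effect of X is 2.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical records (z, x, y): covariate Z, group indicator X (1 = treated),
# outcome Y. Units with Z = 1 are more likely to be treated and also have higher Y.
records = (
    [(0, 1, 2)] * 2 + [(0, 0, 0)] * 6
    + [(1, 1, 12)] * 6 + [(1, 0, 10)] * 2
)

# Propensity score: estimated probability of being in the treated group given Z.
propensity = {
    z: mean([x for zz, x, _ in records if zz == z])
    for z in sorted({zz for zz, _, _ in records})
}

# Naive comparison that ignores Z.
naive = (mean([y for _, x, y in records if x == 1])
         - mean([y for _, x, y in records if x == 0]))

# Compare treated and untreated units only within strata that share
# the same propensity score, then average over strata by size.
strata = {}
for z, x, y in records:
    strata.setdefault(propensity[z], []).append((x, y))

adjusted = sum(
    len(rows) * (mean([y for x, y in rows if x == 1])
                 - mean([y for x, y in rows if x == 0]))
    for rows in strata.values()
) / len(records)

print("propensity scores by Z:", propensity)
print(f"naive difference: {naive:.2f}, adjusted difference: {adjusted:.2f}")
```

The naive difference is much larger than the built-in effect because Z influences both group membership and the outcome, while the comparison within propensity-score strata recovers the built-in effect.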
- In a multiple regression model, it is a challenge to determine the type of a canonical outlier and how to deal with these outliers, whether to remove them or not. The choice can change our results hugely.
- From the slides on Hierarchical Models, it is good to know the idea of the two-stage approach (derived variables approach): estimate the slope and intercept, then use them as a multivariate sample and do a MANOVA test.
- How does the Scheffe method connect with the F-test?
- The F-test for the model is significant if and only if some contrast in the model space has a significant Scheffe confidence interval. If the F-test is not significant, then none of the Scheffe confidence intervals will be significant.
- TED Talks on Statistics
- Wernicke:Lies, damned lies and statistics
- Jordan pictures some shocking stats
- Official Statistics and Statistical Ethics : Selected Issues 
- Meaning of Apply Functions in R:
1. lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
2. sapply is a user-friendly version of lapply by default returning a vector or matrix if appropriate. The lapply() function works on any list. The "l" in "lapply" stands for list. The "s" in "sapply" stands for simplify.
3. vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
4. tapply is a very powerful function that lets us break a vector into pieces and apply some function to each of the pieces. We need to specify how to break the vector into pieces.
- I cannot find the R code that George explained to us during last lecture. Could someone show me where it is, please? Thank you.
- Our group had a meeting with our client, and this meeting was very helpful for our analysis. We had received several different datasets, some of which were duplicates with fewer variables, so we were not sure which datasets we should use for our analysis questions. In addition, we have now settled on the response variables for our analysis.