From Wiki1




Cause, correlation, or ...


Recommended sources on statistics:

There are many excellent sources for information on current statistical issues (Psychonomic Society Journals):

  • Confidence Intervals:
    • Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge/Taylor & Francis Group.
    • Masson, M. E. J., & Loftus, G. R. (2003). Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology/Revue Canadienne de Psychologie Expérimentale, 57, 203-220. doi:10.1037/h0087426
  • Effect Size Estimates:
    • Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis and the interpretation of research results. Cambridge University Press. ISBN 978-0-521-14246-5.
    • Fritz, C. O., Morris, P. E., & Richler, J. J. (2011). Effect size estimates: Current use, calculations and interpretation. Journal of Experimental Psychology: General, 141, 2-18.
    • Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications (2nd ed.). New York, NY: Routledge/Taylor & Francis Group.
  • Meta-analysis:
    • Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge/Taylor & Francis Group.
    • Littell, J. H., Corcoran, J., & Pillai, V. (2008). Systematic reviews and meta-analysis. New York: Oxford University Press.
  • Bayesian Data Analysis:
    • Kruschke, J. K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. San Diego, CA: Elsevier Academic Press.
    • Kruschke, J. K. (in press). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General.
  • Power Analysis:


Pythagoras Diagram

Recent changes

/Recent Changes /Contributions


SCS R course
/R packages
/SPIDA 20102 preparation

Data scraping

RStudio: Shiny

Notes for 6643

  • Assignment: Can we produce an estimate of AIC based just on the Wald test?

On Pedagogy

Advice for students

Questions (e.g. for survey papers)

  • Implement more diagnostics in R for lme models
  • Explore duality of the whole data matrix
  • Extend the UD representation to hyperbola, etc., and include a way of plotting osculation loci
  • Explore the geometry of harmonic combinations and its implications for mixed model estimates. What happens as you shift weight from G to (X'X) − 1? How does the result wander outside the convex combination? When does it happen and what does it mean?
  • Refine Lform and related tools


Geometry, and Complexity]

R course

Day 2 - add

  • final recap of 'lm' interface: subset, na.action, etc., etc.
  • discuss formula syntax
  • final recap of methods for 'lm'
  • note easy extension to 'glm', 'lme', etc.
  • note that many 'new' functions do not use this interface, only more 'mature' functions
    • lm.formula
  • discuss OO showing methods and dispatching
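The points in the Day 2 list can be sketched with a short example. This is only an illustration: it uses the built-in mtcars data, the model is a placeholder, and the copy dd with one value set to NA is invented here to show what na.action does.

```r
# Copy of a built-in data set with one missing value introduced for illustration
dd <- mtcars
dd$hp[1] <- NA

# The standard modelling interface: formula, data, subset, na.action
fit <- lm(mpg ~ wt + hp, data = dd, subset = cyl > 4, na.action = na.exclude)

# Methods dispatch on the class of the fitted object
class(fit)                 # "lm"
methods(class = "lm")      # lists summary, predict, residuals, anova, ...
summary(fit)               # calls summary.lm

# na.exclude drops the incomplete row for fitting but pads residuals with NA
nobs(fit)
length(residuals(fit))

# The same interface carries over to glm (and, with a 'random' argument, to lme)
gfit <- glm(am ~ wt, family = binomial, data = mtcars)
class(gfit)                # c("glm", "lm"): glm objects inherit many lm methods
```

With na.action = na.omit (the usual default) the residual vector would instead be one element shorter than the subset.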

Day 3

  • the most useful tools:
    • seq
    • rep
    • replacement functions
  • data input
  • more programming
    • object oriented programming
    • using a function in C
  • using attributes
  • systematic treatment of graphics, including
    • par
    • xyplot
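A minimal sketch of the first Day 3 tools (seq, rep, replacement functions, attributes). The function name second<- and all values are made up for illustration.

```r
# seq and rep: building regular sequences and patterns
s1 <- seq(0, 1, by = 0.25)                 # 0.00 0.25 0.50 0.75 1.00
s2 <- seq_len(4)                           # 1 2 3 4
r1 <- rep(c("a", "b"), times = c(2, 3))    # "a" "a" "b" "b" "b"
r2 <- rep(1:2, each = 2)                   # 1 1 2 2

# A built-in replacement function: names(x)[2] <- ... calls `names<-`
x <- c(a = 1, b = 2, c = 3)
names(x)[2] <- "B"

# A user-defined replacement function (hypothetical name 'second<-'):
# the last argument must be called 'value' and the modified object is returned
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
v <- c(10, 20, 30)
second(v) <- 99                            # same as v <- `second<-`(v, value = 99)

# Attributes are set with another replacement function, `attr<-`
attr(v, "units") <- "cm"
attributes(v)
```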

/Day2 Guided Tour of Linear Models.R

SCS Reads 2011 Links

Capstone courses

Links to recent courses

Links to add somewhere

  • Battling bad science
  • D W Hosmer, S Taber and S Lemeshow () "The importance of assessing the fit of logistic regression models: a case study." American Journal of Public Health, Vol. 81, Issue 12 1630-1635
  • Quick R for SPSS, SAS and Stata users.



Simpson's Paradox

In the 1979 Canadian federal election an unusual event occurred in the Northwest Territories: the Liberals won the popular vote in the territory, but won neither seat.

Lee Lorch

  • Marybeth Gasman (1999) "Scylla and Charybdis: Navigating the Waters of Academic Freedom at Fisk University During Charles S. Johnson's Administration (1946–1956)" American Educational Research Journal
    A prominent sociologist and race relations activist, Charles S. Johnson dedicated his life to the advancement of Blacks. His presidency at Fisk University, a historically Black college, was the culmination of his career. During the latter part of his administration, he faced a dilemma involving an outspoken professor named Lee Lorch, who, in 1954, was accused of being a communist. Johnson and the Board of Trustees dismissed Lorch because he refused to answer a congressional committee's questions about his previous political affiliations. In 1959, the American Association of University Professors found the late President Johnson guilty of violating the principles of academic freedom. This article explores the ways in which academic freedom, civil liberties, and civil rights clashed in the Lee Lorch case. Furthermore, it examines the ways in which the setting of a historically Black college alters traditional assumptions about the application of these principles.
  • [ Charles V. Bagli (November 21, 2010), "A New Light on a Fight to Integrate Stuyvesant Town", New York Times.

Multilevel Models


Missing Data


Software for multilevel models

Package Function Notes

clmm {ordinal}

Ordinal response: Fits cumulative link mixed models, i.e. cumulative link models with random effects via the Laplace approximation or the standard and the adaptive Gauss-Hermite quadrature approximation. The functionality in clm is also implemented here. Currently only a single random term is allowed in the location-part of the model.
R: {lme4a} Development version of lme4

Download: svn checkout svn://

R: {MCMCglmm} MCMC Methods for Multi-response Generalized Linear Mixed Models
R: {plm} Econometric Analysis of Panel Survey Data Vignette

See p. 3 for comments on first-differencing.

See Snijders and Bosker (2012) for longer list
R: {lme4:nlmm} Non-linear models with lme4 Presentation by Doug Bates


Check for changes and reconcile


On the age-period-cohort problem:




R notes

Items to cover

  • Wrap up language:
    • Selection (give context): indices: index, names, logical, matrix of coordinates, 'subset'
      • Example: dropping NAs from selected variables. Necessary because functions that are most sophisticated methodologically are generally least sophisticated in their interface
        • contrast sophisticated program: lm with unsophisticated lowess
  • Using variables in data frames:
    • formula oriented functions: xyplot( y ~ x, data = dd )
    • explicit: plot( dd$x, dd$y )
    • with: with( dd, plot(x,y)); with(dd, xyplot( y ~ x, dd))
    • attach: As usual the easiest is deprecated! (why is it only easy and pleasurable things that are ever deprecated)
      plot(x, y)
      • Problem with 'attach':
        • names in data frame may be masked by names in workspace
        • assignments in workspace not saved in data frame
  • Overview of graphics
  • Programming structures
  • Add to graphics:
    Colours: pal(grepv('red',colors())); pals() # for all
    modified tablemissing
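Two items from the list above can be sketched together: dropping NAs on selected variables before calling a function with a bare-vector interface such as lowess, and the two problems noted for attach. The data frame and its values are invented for illustration.

```r
dd <- data.frame(x = c(1, 2, NA, 4, 5, 6),
                 y = c(2.1, NA, 3.2, 4.8, 5.1, 6.3),
                 z = c(NA, 1, 1, 1, 1, 1))

# Logical selection: keep rows complete on x and y only (NAs in z are ignored)
ok <- complete.cases(dd[, c("x", "y")])
dd.xy <- dd[ok, ]

# lm handles NAs itself through its formula interface (na.action) ...
fit <- lm(y ~ x, data = dd)
# ... whereas lowess takes bare vectors, so the NAs are dropped by hand first
sm <- lowess(dd.xy$x, dd.xy$y)

# with() evaluates the expression inside the data frame
m.with <- with(dd.xy, mean(x))

# attach: a workspace variable masks the data frame variable of the same name
x <- c(100, 200)
attach(dd.xy)
m.attach <- mean(x)      # uses the workspace x, not dd.xy$x
y <- y * 2               # creates a workspace y; dd.xy$y is not modified
detach(dd.xy)
```

Here m.with is computed from the data frame while m.attach silently uses the workspace x, which is the masking problem described in the list.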

debugging in R


Importing files

From Excel

  • Easy: save file in Excel as .csv, then read into R with read.csv
  • If you have a lot of files, or get the files from some other source that produces .xls or .xlsx files:
  • The winner: package gdata:
    • First install perl.
    • read.xls in gdata handles both .xls and .xlsx files
    • works on both 32-bit and 64-bit machines
  • Package XLConnect seems to work only on xlsx files
  • The smaller xlsx package also works only on xlsx files
  • Package xlsReadWrite works on xls files but only on 32-bit systems
  • Alternatively, use xls2csv, a Perl script, to convert files to csv first.
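A minimal sketch of the easy route described above. The file is a temporary stand-in for a .csv saved from Excel, and its contents are invented for illustration; the commented gdata call uses a hypothetical file name.

```r
# Stand-in for a file saved from Excel as .csv
csv.file <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,71.5,a",
             "2,64.0,b",
             "3,,a"),            # empty cell, read as NA
           csv.file)

dd <- read.csv(csv.file)
str(dd)

# With gdata installed (and perl available), a workbook can be read directly:
# dd <- gdata::read.xls("myfile.xlsx", sheet = 1)   # hypothetical file name
```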

Getting lines vs points for different groups in xyplot

Ideally, type = c('l','p') would work but it doesn't seem to. So one way is to use type = 'b' with an invisible line for one group and an invisible point for the other:

 library(spida.beta)  # also loads 'car'
 dd <- Prestige
 dd$income.pred <- predict( lm( income ~ education*type, dd), newdata = dd)
 td( lty = c(1,0), pch = c(32, 16), lwd = 2)  
                            # lty = 0 produces an invisible line
                            # and pch = 32 seems to be an invisible point
 xyplot( income.pred + income ~ education|type, dd[order(dd$education),], type = 'b',
        auto.key = list( columns = 2, lines = TRUE, points = TRUE))

Also show example using panel.superpose.2
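One possible version of that example, as a sketch: lattice documents panel.superpose.2 as equivalent to panel.superpose with distribute.type = TRUE, in which case the elements of type are assigned to the groups one by one instead of all being applied to every group. The data below are simulated stand-ins for Prestige so that the block runs without car or spida.beta.

```r
library(lattice)

# Simulated stand-in for the Prestige data (values are invented)
set.seed(1)
dd <- data.frame(education = sort(runif(60, 6, 16)),
                 type = rep(c("bc", "wc", "prof"), length.out = 60))
dd$income <- 1000 + 600 * dd$education + rnorm(60, sd = 1500)
dd$income.pred <- predict(lm(income ~ education * type, dd))

# The first group (income.pred) is drawn with type 'l', the second (income) with 'p'
p <- xyplot(income.pred + income ~ education | type, dd,
            type = c("l", "p"), distribute.type = TRUE,
            auto.key = list(columns = 2, lines = TRUE, points = TRUE))
print(p)
```

The data frame is already sorted by education, so the fitted lines are drawn left to right within each panel.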


grade <- function(x ,
     cos = c(-Inf,40,50,55,60,65,70,75,80,90,Inf) - 0,
     grade = c("F","E","D","D+","C","C+","B","B+","A","A+")) {
     factor(cut(x, cos, grade, right = FALSE), levels = grade)
}
dg$Grade <- grade( dg$Final )
tab(dg, ~ Grade)
# gets indexing of levels wrong
# the following seems to work correctly
grade  <- function(x ,
     cos = c(-Inf,40,50,55,60,65,70,75,80,90,Inf) - 0,
     grade = c("F","E","D","D+","C","C+","B","B+","A","A+")) {
     ret <- cut(x, cos, grade, right = FALSE)
     factor(ret, levels = grade)
}

Getting the G matrix in nlme

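library(nlme)
fit <- lme( y ~ x, dd, random = ~1+x |id)
# reStruct stores the matrix relative to the residual variance, so rescale by sigma^2
G <- pdMatrix( fit$modelStruct$reStruct)$id * fit$sigma^2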
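A quick check, as a sketch: in nlme the reStruct component holds the random-effects variance relative to the residual variance, so it should match getVarCov only after multiplying by sigma^2. The Orthodont data set shipped with nlme is used here purely as an example.

```r
library(nlme)

fit <- lme(distance ~ age, data = Orthodont, random = ~ 1 + age | Subject)

# Relative (scaled) matrix as stored in the fitted object
G.rel <- pdMatrix(fit$modelStruct$reStruct)$Subject

# Rescale to obtain the estimated G matrix
G <- G.rel * fit$sigma^2

# getVarCov() returns the same matrix directly
G.check <- getVarCov(fit)
all.equal(as.numeric(G), as.numeric(G.check))

# VarCorr() reports the same information as standard deviations and correlations
VarCorr(fit)
```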

Building R packages in 2.14

  1. Install R
  2. Install tools:
  1. Info:


Thumbnail test

Here is a graphic file in raw form:


And here is the same file with a thumbnail:


Math check

Please click on the 'discussion' tab above

Test how math renders:
\begin{align}
 f(x) & = (a+b)^2 \\
      & = a^2+2ab+b^2
\end{align}

 x \perp y

glmmPQL etc

Good discussion between Doug and Ben:

Combining unbiased estimators

This is an example:

  • bullet 1
  • bullet 2
    • again
  • bullet3

Numbered bullets:

  1. one
  2. two
    1. dkjkdj
      • djkdj
    • djfkd


new stuff

sub sub

more stuff

Let \hat{\phi}_1 and \hat{\phi}_2 be uncorrelated unbiased estimators of \phi \in \mathbb{R}^k with non-singular variances V_1 and V_2 respectively.

Then the minimum variance linear unbiased estimator of \phi is obtained by combining \hat{\phi}_1 and \hat{\phi}_2 using weights that are proportional to the inverses of their variances. The result can be expressed in a variety of ways:

\begin{align}
\hat{\phi} & = \left( V_1^{-1}+V_2^{-1} \right)^{-1}\left( V_1^{-1}\hat{\phi}_1+V_2^{-1}\hat{\phi}_2 \right) \\
 & = \left( V_1^{-1}+V_2^{-1} \right)^{-1}\left( V_1^{-1}\hat{\phi}_1+V_2^{-1}\hat{\phi}_2 \right) + \left[ \left( V_1^{-1}+V_2^{-1} \right)^{-1}V_2^{-1}\hat{\phi}_1 - \left( V_1^{-1}+V_2^{-1} \right)^{-1}V_2^{-1}\hat{\phi}_1 \right] \\
 & = \hat{\phi}_1 + \left( V_1^{-1}+V_2^{-1} \right)^{-1}V_2^{-1}\left( \hat{\phi}_2-\hat{\phi}_1 \right) \\
 & = \hat{\phi}_1 + \left( I+V_2V_1^{-1} \right)^{-1}\left( \hat{\phi}_2-\hat{\phi}_1 \right) \\
 & = \left( I+V_1V_2^{-1} \right)^{-1}\left( \hat{\phi}_1+V_1V_2^{-1}\hat{\phi}_2 \right)
\end{align}

The proof is an application of the principle of Generalized Least-Squares. The problem can be formulated as a GLS problem by considering that:

\begin{align}
\left[ \begin{matrix}
   \hat{\phi}_1  \\
   \hat{\phi}_2  \\
\end{matrix} \right] & = \left[ \begin{matrix}
   I  \\
   I  \\
\end{matrix} \right]\phi +\left[ \begin{matrix}
   \varepsilon_1  \\
   \varepsilon_2  \\
\end{matrix} \right]
\end{align}

with

\begin{align}
\operatorname{Var}\left( \left[ \begin{matrix}
   \varepsilon_1  \\
   \varepsilon_2  \\
\end{matrix} \right] \right) & = \left[ \begin{matrix}
   V_1 & 0  \\
   0 & V_2  \\
\end{matrix} \right]
\end{align}

Applying the GLS formula yields:

\begin{align}
\hat{\phi} & = \left( \left[ \begin{matrix}
   I  \\
   I  \\
\end{matrix} \right]^{\prime}\left[ \begin{matrix}
   V_1 & 0  \\
   0 & V_2  \\
\end{matrix} \right]^{-1}\left[ \begin{matrix}
   I  \\
   I  \\
\end{matrix} \right] \right)^{-1}\left[ \begin{matrix}
   I  \\
   I  \\
\end{matrix} \right]^{\prime}\left[ \begin{matrix}
   V_1 & 0  \\
   0 & V_2  \\
\end{matrix} \right]^{-1}\left[ \begin{matrix}
   \hat{\phi}_1  \\
   \hat{\phi}_2  \\
\end{matrix} \right] \\
 & = \left( V_1^{-1}+V_2^{-1} \right)^{-1}\left( V_1^{-1}\hat{\phi}_1+V_2^{-1}\hat{\phi}_2 \right)
\end{align}
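The equivalence of these expressions can be checked numerically. The following is a small sketch with made-up estimates and randomly generated positive definite variance matrices; none of the numbers come from real data.

```r
set.seed(42)
k <- 3
I <- diag(k)

# Hypothetical inputs: two estimates of the same k-vector and their variances
A1 <- matrix(rnorm(k * k), k)
A2 <- matrix(rnorm(k * k), k)
V1 <- crossprod(A1) + I          # positive definite by construction
V2 <- crossprod(A2) + I
phi1 <- c(1.0, 2.0, 3.0)
phi2 <- c(1.5, 1.8, 2.6)

# Inverse-variance weighted combination
f1 <- solve(solve(V1) + solve(V2), solve(V1, phi1) + solve(V2, phi2))
# phi1 plus a correction towards phi2
f2 <- phi1 + solve(solve(V1) + solve(V2), solve(V2, phi2 - phi1))
f3 <- phi1 + solve(I + V2 %*% solve(V1), phi2 - phi1)
# Form using only V1 V2^{-1}
f4 <- solve(I + V1 %*% solve(V2), phi1 + V1 %*% solve(V2, phi2))

# Direct GLS with design matrix [I; I] and block-diagonal variance
X <- rbind(I, I)
W <- solve(rbind(cbind(V1, 0 * V1), cbind(0 * V1, V2)))
gls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% c(phi1, phi2))

# All five should agree up to rounding error
cbind(f1, f2, f3, drop(f4), drop(gls))

# Variance of the combined estimator
V.comb <- solve(solve(V1) + solve(V2))
```

The combined variance V.comb is smaller than both V1 and V2 in the sense that V1 - V.comb and V2 - V.comb are positive definite.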

From Nassif Ghoussoub

Beware the “useful idiocy” of Mr. Morgan

The latest commentary of Gwyn Morgan in the Globe and Mail, “If universities were in business, they’d be out of business”, has crossed another line. Far from being an analysis of the state of Canadian universities, his rant is personal, bitter, demeaning, and insulting to university professors across the country.

Back in his Globe article of April 29, 2009, “Not all research deserves public funding“, the retired CEO of EnCana Corp. proceeded to rip into the “ivory towers of academia”, attack “esoteric research” and disparage any graduate degree not hailing from medicine or engineering. He also dismissed the 2300 scientists who joined the “Don’t Leave Canada Behind” campaign, which called on the government to include R&D, the lifeblood of the new economy, in its stimulus budget. To its credit, the government of Canada responded positively to the call of its scientists, but from his -dubiously earned- platform at the Globe and Mail, Mr. Morgan kept at it.

In Saturday’s paper, Mr. Morgan employs sweeping generalizations and ghost statistics to come to the conclusion that, among other things, Canada’s university professors are “poorly prepared” for their lectures, “show up occasionally” to class, and give “poorly thought out assignments”. He claims “the reaction of universities to widespread student dissatisfaction is to blame insufficient financing, rather than their own dysfunction”. He offers that in the new age, formal lectures should be altogether ended.

His commentary provides neither data about student learning, nor any direct quotations from professors or students. A 1991 study is cited, and then baptized as the truth with a simple "Nineteen years later, little has changed." The article does not attack a particular university, faculty, or teaching method, but rather an apparently archetypal "university professor".

So what if he hasn't been on campus in 40 years? He knows how it is. Even then, he "stopped going to classes and dedicated his time to learning from textbooks and reviewing friends’ notes". But Mr. Morgan ignores that a professor somewhere, sometime, must have produced and dictated these textbooks and notes. He finds "no reason why all written course material can’t be delivered via the Internet", obviously not aware that since the 90’s, most course material has been made available on the Internet, thanks to dedicated professors. Morgan's suggestion that we replace large classes with "small informal discussions" sounds great, but how does the CEO propose we pay for the much larger number of professors required to do the job? He wants universities to run like businesses, but as one reader suggested: “If Universities were run like the oil and gas industry we would be back in the dark ages where the only skill required would be to count your money... at least until the oil runs out”.

It is obvious that we embattled post-secondary teachers and researchers need to worry more about the “very useful idiocy” of Mr. Morgan, the permanent platform he has been provided, and the damage that drivel like this can cause to higher education and advanced research in Canada.

For the Globe and Mail, Mr. Morgan has been pure comic gold for years. Writing on a variety of subjects, ranging from environmental issues and health care, to research and post-secondary education, he has been a bottomless trove of shameless misrepresentations, extreme views and sheer wackiness. But ultimately, this is not only about Gwyn Morgan nor about the Globe and Mail. It is about us.

It is about Canada’s University Presidents countering his dangerous Tea Party style rhetoric on our post-secondary institutions.

It is about the Deans of Canada’s Faculties facing up to Mr. Morgan when he writes: “Many qualified applicants are turned away from areas such as engineering and medicine, while universities continue to graduate thousands with knowledge that is neither useful in getting a job, nor in helping our country succeed in a competitive world.”

It is about the Royal Society of Canada, and other learned societies responding to his views about “esoteric research that doesn’t have the slightest chance of yielding any real value”.

It is also up to our schools of journalism, to point out to mainstream media the irresponsibility in printing shallow, empty articles full of generalizations and devoid of facts.

Mr. Morgan may be one of those individuals who get so many things wrong at once that the thought of challenging them or setting the record straight is just too daunting. But it is incumbent upon us not to let his rhetoric negate the exemplary contributions of thousands of Canada’s scholars, teachers and researchers.

Nassif Ghoussoub, Professor of Mathematics, The University of British Columbia

Cell Phones

Date: Sun, 10 Oct 2010 20:02:29 -0400
From: Stuart Newman <newman@NYMC.EDU>
Reply-To: Science for the People Discussion List
To: SCIENCE-FOR-THE-PEOPLE@LIST.UVM.EDU
Subject: "Disconnect": Why cellphones may be killing us

Though I haven't yet read it, this book is presumably not based on anecdotal evidence. The author, Devra Davis, is the founding director of the toxicology and environmental studies board at the U.S. National Academy of Sciences.

"Disconnect": Why cellphones may be killing us
A new book probes the connection between mobile devices and a host of health problems -- with frightening results
By Thomas Rogers


Notes on mediation

The question of mediation is essentially a question about causality. Is the putative mediator, M say, caused by X and, in turn, a cause of Y? But M, in a mediational analysis, cannot have been randomized even if X has been. The question of mediation is essentially a question about causality with observational, not experimental, data.

To get a perspective on the problem we need to start by considering the general problem of causality with observational data. Let Y be the response variable and let X be the 'target' variable which is seen as a possible 'cause' of Y. For X to cause Y means that the expected value of Y would change in some target experimental condition in which X was manipulated (perhaps through random allocation) while other variables were left untouched -- not necessarily unchanged.

For causal inference with observational data, we are interested in what would happen under circumstances that are different from those we have actually observed. Our analysis of our observational data will yield an accurate estimate of the causal effect of X if the model for the observational data has the same coefficient for X as it would have if it were applied to data gathered under the target experimental condition. The challenge is to specify and estimate a model that is 'transferable' from the observational condition to the experimental condition. We need a set of concepts to help us critically assess whether a model is transferable. It is not sufficient to have a model that 'fits' well. It may be necessary to include potential confounding factors even if they are not significant in the prediction model for Y. And it may be necessary to exclude strong predictors that are potential mediators -- variables that must not be held constant as one examines the causal relationship between X and Y. One needs a good understanding of the causal model that is valid under experimental conditions in order to properly specify a transferable observational model.
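The point about excluding mediators can be illustrated with a small simulation. This is a sketch under invented assumptions: X is randomized, M is affected by X, and Y depends on both, with a direct effect of 0.5 and an indirect effect of 0.8 x 1.0, so that the total causal effect of X is 1.3.

```r
set.seed(2)
n <- 5000

x <- rbinom(n, 1, 0.5)                 # randomized target variable
m <- 0.8 * x + rnorm(n)                # mediator: caused by x
y <- 0.5 * x + 1.0 * m + rnorm(n)      # response: direct effect 0.5, via m 0.8

# Model without the mediator: estimates the total effect (about 1.3)
total <- coef(lm(y ~ x))[["x"]]

# Model holding the mediator constant: fits better, but the coefficient of x
# now estimates only the direct effect (about 0.5)
direct <- coef(lm(y ~ x + m))[["x"]]

c(total = total, direct = direct)
```

The second model has the smaller residual variance, which shows why fit alone cannot decide which model is the 'transferable' one for the total effect of X.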

The problem can be approached in a surprisingly different way, which is the basis for propensity scores. Instead of focusing on a 'transferable' model for Y, one focuses on a model for the assignment of the target causal variable X using potential confounding variables. As in models for Y, it is important to avoid potential mediators between X and Y. However, the model for X based on confounding factors is a prediction model. Confounding variables may be included, raw or transformed, as long as they are predictive of X. It is not necessary to include variables that are not predictive of X. The criterion for developing the model is statistical fit, a criterion that -- apart from the actual selection of confounding predictors -- is empirical, i.e. it is based on the analysis of the data at hand without reference to external theory that is not verifiable with the data. The assignment model need only be valid for the observational condition. Its validity for the experimental condition is irrelevant.
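As a rough sketch of the idea, the simulation below models the assignment of X from a single confounder Z and then uses inverse-probability weighting. Weighting is only one of several ways of using a propensity score (matching and stratification are others), and the data-generating model, with a true effect of 1, is invented for illustration.

```r
set.seed(1)
n <- 5000

z <- rnorm(n)                          # confounder
x <- rbinom(n, 1, plogis(z))           # assignment of x depends on z
y <- 1 * x + 2 * z + rnorm(n)          # true causal effect of x is 1

# Naive comparison of group means is confounded by z
naive <- mean(y[x == 1]) - mean(y[x == 0])

# Assignment (propensity) model: a prediction model for x, judged by fit
ps <- fitted(glm(x ~ z, family = binomial))

# Inverse-probability weights, then a weighted difference in means
w <- ifelse(x == 1, 1 / ps, 1 / (1 - ps))
ipw <- weighted.mean(y[x == 1], w[x == 1]) -
       weighted.mean(y[x == 0], w[x == 0])

c(naive = naive, ipw = ipw)
```

No model for Y is fitted here: only the assignment model is needed, and it only has to hold for the observational condition.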

What are some of the pros and cons of the two approaches? A good transferable model for Y may provide more precise estimates of the effect of X because more of the variability in Y is accounted for in the model. On the other hand, the validity of a causal estimate based on the propensity score approach depends on assumptions that may be much easier to sustain than those required for the approach based on modeling Y. Broadly, the propensity score approach offers lower bias but not necessarily lower variability. Note that the two approaches are not mutually exclusive. They may be better viewed as two sets of concepts that could be combined in an analysis that draws from both.

How does this all relate to the analysis of mediation? The Baron and Kenny approach and its variants -- in which I include the various ways of estimating direct and indirect causal effects -- are all based on methods analogous to models for Y. As mentioned earlier, estimating the causal effect of the mediator involves causal inference with observational data -- even in the context of an experiment randomizing X. This invites the question whether propensity score methods could be used in assessing more accurately the causal effect of M. The answer lies in the relatively recent theory of principal stratification [Constantine Frangakis and Donald Rubin (2002) "Principal Stratification in Causal Inference", Biometrics, 58, 21--29].

An accessible reference for the concepts behind propensity scores is Donald Rubin (1997) "Estimating Causal Effects from Large Data Sets Using Propensity Scores," Annals of Internal Medicine, 127, 757--763.

A recent treatment of mediation using principal stratification is given in Chapter 8, "Intermediate Causal Factors," of Herbert Weisberg (2010) Bias and Causation: Models and Judgment for Valid Comparisons, Wiley.

With the large number of seemingly competing approaches to causal inference, students as well as experienced researchers may feel quite puzzled as to which approach they should use. The answer, possibly, is all of them. Each approach seems to shed light on some aspect of the challenge of causal inference in the absence of pristine randomization. They do not offer recipes so much as sets of concepts that can be applied to help understand research projects and analyses.

Shock of the New

Robert Hughes (1980)


Notes for NATS 1500

  • Topics
    • single-sex schools?

Notes for MATH 6627

  • Collection of misleading graphs
  • ASA consulting page
  • set up student home page
  • first assignment. Find and explore a dataset using
    • Ernest Kwan's correlagram
    • Lattice (use panels and groups)
    • p3d
    • gapminder
    • should have included candisc
  • present a 15-minute (crucial) presentation on the data set and on the method
  • prepare a wiki page with links and materials
  • Address a few questions:
    • What are its strengths and weaknesses?
    • For what kind of dataset is it well suited and what kind not?
    • Can you find a dataset that illustrates well the features of this approach?
    • Can you compare your approach with other approaches?


  • develop checklists:
  • initial exploration of data
  • missing data (explicit and implicit)
  • do simulation of parallel methods: check estimation of variance parameters
  • use nlme to estimate knot placement in gsp


Notes for R course

  • Start: It had to be U ... on the SVD [1]
  • Use SPSS dates both ways to illustrate
    • sub using regular expressions
    • import: reading dates into 'Date' format using formats: Include all %a %b %Y and others?
    • export: writing a date into a character string using format( Date.object, "%d-%m-%Y") to create variable SPSS can read
  • Variable references
    deal with plethora of ways used differently in different places:
    • formula ( ~id ), good for variables in different roles ( y ~ log(x) + x2 | id)
    • interpreted in data: (id), good for single var but can use list : (list(x1,x2))
    • fully reference: dd$id
  • Beware:
  • aggregate with a formula drops rows with NAs even though the FUN might be able to handle them
  • multiple barplot:


  • Discussion of memory issues: what happens when you work on two computers
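Two items from these notes are sketched here: the behaviour of aggregate with a formula, and date import/export with format strings. All data values and the date string are invented for illustration.

```r
# --- aggregate with a formula drops rows with NAs -------------------------
dd <- data.frame(g  = c("a", "a", "b", "b"),
                 y1 = c(1, NA, 3, 4),
                 y2 = c(10, 20, 30, 40))

# Default na.action = na.omit removes the whole second row before FUN is
# called, so the mean of y2 in group 'a' uses only the first row
a.formula <- aggregate(cbind(y1, y2) ~ g, data = dd, FUN = mean, na.rm = TRUE)

# Passing NAs through lets FUN handle them itself
a.pass <- aggregate(cbind(y1, y2) ~ g, data = dd, FUN = mean, na.rm = TRUE,
                    na.action = na.pass)

a.formula
a.pass

# --- dates: import, export, and regular expressions -------------------------
# import: character string to 'Date' using a format
d <- as.Date("15/03/2011", format = "%d/%m/%Y")

# export: 'Date' to a character string in a layout another program can read
out <- format(d, "%d-%m-%Y")

# sub with a regular expression: reorder day/month/year into ISO form
iso <- sub("^([0-9]+)/([0-9]+)/([0-9]+)$", "\\3-\\2-\\1", "15/03/2011")
```

In a.formula the group 'a' mean of y2 is based on one row, while in a.pass it is based on both rows.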


Notes for High School Talks

Climate change

Excel techniques

  • Regular expressions and string substitution
  • [2]

Ellipse Seminar

/Ellipse Seminar

Setting up mathstat email in Thunderbird

IMAP mailserver:  Port: 143  Security: STARTTLS

Outgoing:  Port: 587  Security: none?

Statistical amusement

On Careers in Statistics and Mathematics

On Teaching Science

A few videos

