Simulation

A good way to test your intuitions about how statistics work is to run simulations.

The basics

You'll need to generate random numbers: rnorm, rbinom, and rpois will all come in handy.
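
For instance (a quick sketch of the three generators):

set.seed(42)                     # make the draws reproducible
rnorm(5, mean = 100, sd = 15)    # 5 draws from a normal distribution
rbinom(5, size = 10, prob = .5)  # 5 binomial counts out of 10 trials each
rpois(5, lambda = 3)             # 5 Poisson counts with mean 3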

Some examples

Between-subjects data: overlapping 95% CIs need not imply no significant difference
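
A minimal sketch of this (my addition, not from the original page): the significance of a difference depends on the standard error of the difference, sqrt(SE_a^2 + SE_b^2), which is smaller than SE_a + SE_b, so two group CIs can overlap while the two-sample test is still significant.

set.seed(1)
a = rnorm(50, 100, 15)
b = rnorm(50, 108, 15)

t.test(a)$conf.int   # 95% CI for the mean of a
t.test(b)$conf.int   # 95% CI for the mean of b; will usually overlap with a's
t.test(a, b)         # yet the two-sample test is typically significant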

Why you can't use 95% CIs for within-subject data
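
Again a sketch, not from the original page: with large between-subject variability, the per-condition CIs are wide and overlap, yet the paired test easily detects the effect because the subject variability cancels in the within-subject differences.

set.seed(1)
n = 30
subj = rnorm(n, 0, 20)          # large between-subject variability
a = 100 + subj + rnorm(n, 0, 5)
b = 105 + subj + rnorm(n, 0, 5)

t.test(a)$conf.int              # wide, overlapping per-condition CIs...
t.test(b)$conf.int
t.test(a, b, paired = TRUE)     # ...but a clear within-subject effect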

Example showing how adding random intercepts can improve fit
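
A sketch of what such an example might look like, assuming the lme4 package is installed; the data are simulated with subject-specific intercepts, so the random-intercept model should fit better than plain regression.

library(lme4)   # assumed available

set.seed(1)
subjects = 20
trials = 10
id = factor(rep(1:subjects, each = trials))
u  = rep(rnorm(subjects, 0, 10), each = trials)   # subject-specific intercepts
x  = rnorm(subjects * trials)
y  = 50 + 2 * x + u + rnorm(subjects * trials, 0, 5)

m0 = lm(y ~ x)                              # ignores the grouping
m1 = lmer(y ~ x + (1 | id), REML = FALSE)   # adds random intercepts
AIC(m0, m1)                                 # m1 should have the lower AIC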

Generating the null distribution of Pearson's r for various sample sizes

Simulate a series of correlational studies, each with n participants, in which there is no true correlation between x and y. This gives an estimate of the null sampling distribution of r.

studies = 1000   # number of simulated studies
n = 50           # participants per study
r1 = c()         # collects the sample correlation from each study

for (i in 1:studies) {
  x = rnorm(n,0,1)
  y = rnorm(n,0,1)   # x and y are independent, so the true correlation is 0
  r1[i] = cor(x,y)
}

n = 200
r2 = c()

for (i in 1:studies) {
  x = rnorm(n,0,1)
  y = rnorm(n,0,1)
  r2[i] = cor(x,y)
}

par(mfrow = c(1,2))
hist(r1, main="n = 50", col = "lightblue",xlim=c(-.7,.7), xlab = "r")
hist(r2, main="n = 200", col = "lightblue",xlim=c(-.7,.7), xlab = "r")
par(mfrow = c(1,1))

Figure: corr_null.png (null distributions of r for n = 50 and n = 200)
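
One use of these simulated null distributions (an extension, not on the original page) is to read off empirical critical values for r:

quantile(r1, c(.025, .975))   # |r| beyond these is 'significant' at n = 50
quantile(r2, c(.025, .975))   # narrower bounds at n = 200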

Generating data with different means and variances

# covariate values
x <- runif(100, -5, 5)

# Example 1
# model for variance
sds <- 0.5 + abs(x) * 2

# model for responses
y <- 2 + 3 * x + rnorm(100, sd = sds)

fm <- lm(y ~ x)
plot(fm, 1)   # residuals vs. fitted values
plot(fm, 3)   # scale-location plot: the residual spread is clearly not constant

# Example 2
# model for variance: sd increases linearly with x
sds <- 1 + (abs(min(x)) + x) * 2

# model for responses
y <- 2 + 3 * x + rnorm(100, sd = sds)

fm <- lm(y ~ x)
plot(fm, 1)   # residuals vs. fitted values
plot(fm, 3)   # scale-location plot
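
For contrast, a homoscedastic version of the same model (an added sketch) gives a roughly flat scale-location plot:

# constant error variance for comparison
y0 <- 2 + 3 * x + rnorm(100, sd = 2)
fm0 <- lm(y0 ~ x)
plot(fm0, 3)   # roughly flat: no trend in the spread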

Thank you to Dimitris Rizopoulos

Simulating mixture distributions

Here we simulate an experiment with two (between-subject) conditions. In both conditions the responses follow a mixture distribution (intuitively there are two groups of responses: very fast ones and slightly slower ones, perhaps produced by different systems); in the second condition (represented by x2), the upper distribution is shifted upwards.

# condition 1: a fast component plus a slower component
x1a = rnorm(2000,100,5)
x1b = rnorm(5000,110,5)
x1  = c(x1a,x1b)

# condition 2: same fast component, slower component shifted upwards
x2a = rnorm(2000,100,5)
x2b = rnorm(5000,115,5)
x2  = c(x2a,x2b)

par(mfrow =c(1,2))
plot(density(x1),main="", xlab = "RT")
lines(density(x2), lty=2)
plot(density(x1b),main="", xlab = "RT")
lines(density(x2b), lty=2)
par(mfrow = c(1,1))

t.test(x1,x2)
t.test(x1b,x2b)

The t-test on the mixtures gives a significant result, but note that the means are pulled together by the presence of the lower distribution (mixture means: roughly 107 vs. 111; pure-effect means: 110 vs. 115).
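
The mixture means quoted above are just the weighted averages of the component means:

(2000 * 100 + 5000 * 110) / 7000   # ~107.1, condition 1
(2000 * 100 + 5000 * 115) / 7000   # ~110.7, condition 2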

The density plot below shows both the full mixture distributions and the upper ("pure effect") components on their own.

Figure: density.png

Positive correlation isn't transitive

Let U, V, and W be independent random variables, and define X = U + V, Y = V + W, and Z = W - U. Then the correlation between X and Y is positive (Cov(X, Y) = Var(V) > 0), the correlation between Y and Z is positive (Cov(Y, Z) = Var(W) > 0), but the correlation between X and Z is negative (Cov(X, Z) = -Var(U) < 0). See Langford, E., Schwertman, N., & Owens, M. (2001). Is the property of being positively correlated transitive? The American Statistician, 55, 322-325.

U = rnorm(100,0,1)
V = rnorm(100,0,1)
W = rnorm(100,0,1)

X = U+V
Y = V+W
Z = W-U

cor.test(X,Y)
cor.test(Y,Z)
cor.test(X,Z)
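
Since U, V, and W here all have variance 1, X, Y, and Z each have variance 2, so the theoretical correlations are .5, .5, and -.5 respectively. The sample covariances should likewise be close to their theoretical values:

cov(X, Y)   # estimates Var(V) = 1
cov(Y, Z)   # estimates Var(W) = 1
cov(X, Z)   # estimates -Var(U) = -1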

par(mfrow = c(2,2))

plot(Y~X,xlim=c(-5,5),ylim=c(-5,5))
abline(lm(Y~X),col="red")
plot(Y~Z,xlim=c(-5,5),ylim=c(-5,5))
abline(lm(Y~Z),col="red")
plot(Z~X,xlim=c(-5,5),ylim=c(-5,5))
abline(lm(Z~X),col="red")

par(mfrow = c(1,1))

Figure: correlation_transitive.png

Median splits are a bad idea…

> x = rnorm(100,100,15)
> y = 2*x + 5 + rnorm(100,0,35)
>
> summary(lm(y ~ x))

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.442     21.503    1.04      0.3
x              1.885      0.212    8.88  3.3e-14 ***

Multiple R-squared: 0.446,      Adjusted R-squared: 0.44
F-statistic: 78.8 on 1 and 98 DF,  p-value: 3.32e-14

> summary(lm(y ~ cut(x,2)))

                    Estimate Std. Error t value Pr(>|t|)
(Intercept)           178.45       6.58   27.12  < 2e-16 ***
cut(x, 2)(94.2,135]    49.26       8.10    6.08  2.3e-08 ***

Multiple R-squared: 0.274,      Adjusted R-squared: 0.267
F-statistic:   37 on 1 and 98 DF,  p-value: 2.30e-08

(Note the drop in R-squared, the smaller t, and the much larger p: dichotomising x discards information.)
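
One caveat worth flagging (my note, not in the original output): cut(x, 2) splits at the midpoint of the range of x rather than at the median. A genuine median split looks like this, and suffers the same loss of information:

split = x > median(x)     # TRUE/FALSE median split
summary(lm(y ~ split))    # R-squared and t again drop relative to lm(y ~ x)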

Sums of uniformly distributed random variables

As more independent uniform random variables are summed, the distribution of the sum approaches a normal distribution (the central limit theorem).

x1 = runif(1000,1,6)
x2 = runif(1000,1,6)
x3 = runif(1000,1,6)
x4 = runif(1000,1,6)
x5 = runif(1000,1,6)
x6 = runif(1000,1,6)
x7 = runif(1000,1,6)
x8 = runif(1000,1,6)
x9 = runif(1000,1,6)

par(mfrow=c(3,3))
hist(x1, main="",xlab="x")
hist(x1+x2, main="",xlab="x")
hist(x1+x2+x3, main="",xlab="x")
hist(x1+x2+x3+x4, main="",xlab="x")
hist(x1+x2+x3+x4+x5, main="",xlab="x")
hist(x1+x2+x3+x4+x5+x6, main="",xlab="x")
hist(x1+x2+x3+x4+x5+x6+x7, main="",xlab="x")
hist(x1+x2+x3+x4+x5+x6+x7+x8, main="",xlab="x")
hist(x1+x2+x3+x4+x5+x6+x7+x8+x9, main="",xlab="x")
par(mfrow=c(1,1))

Figure: sum_random_ind_uniform.png
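
To see how close the nine-fold sum already is to normal, one can overlay the matching normal density (a uniform on [1, 6] has mean 3.5 and variance 25/12, so the sum has mean 9 * 3.5 = 31.5 and variance 9 * 25/12):

s9 = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
hist(s9, freq = FALSE, main = "", xlab = "x")
curve(dnorm(x, mean = 9 * 3.5, sd = sqrt(9 * 25 / 12)), add = TRUE, lty = 2)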