# Refresher in inferential statistics

## Hypothesis testing

• Are men brighter than women?
• Does school raise IQ?
• Are reading and language separate systems?
• Do people use nouns more frequently when they have recently heard them?
• Does counting our blessings make us happier?
• Is homosexuality due to sublimated love for one's mother?

These questions, if they are to be answered, must be framed as [wikipedia:hypothesis|hypotheses]: claims about the world.

### The null-hypothesis

In significance-testing, attention is focussed on the null hypothesis, i.e., the case where our speculative hypothesis is wrong and any differences observed occur by chance.

Thus "Are men brighter than women?" can be framed as a null hypothesis: "Men and women have the same intelligence", and, alternatively, as an experimental hypothesis "Men are brighter than women".

As Fisher (1966) pointed out "the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the 'problem of distribution,' of which the test of significance is the solution."

The null hypothesis is typically the claim that any observed differences are due to chance (typically, that no difference exists), and is symbolized H0.
The experimental hypothesis is the claim that some difference exists, and is symbolized H1.

Again following Fisher, the null hypothesis is not "proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (1935, p.19)

Is "Men and women have the same intelligence" a well-framed null hypothesis?

• H0: Adult men and women have the same mean IQ-test scores.
• H1: Adult men have higher mean IQ-test scores than adult women.

If we measured all people on the planet, we could answer this as a simple fact. That is nearly always impossible. Instead, we will have to measure a sample of women and a sample of men and try to decide, on the basis of that evidence, whether we can reject the null hypothesis.

The fundamental concept of inferential statistics is the construction of a system for deciding, on the basis of a sample of data, whether we can reject the null hypothesis or, as Neyman and Pearson put it, "deciding whether or not a particular sample may be judged as likely to have been randomly drawn from a certain population".
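This decision procedure can be sketched in R (a hypothetical simulation, not data from any real study): draw samples of "men" and "women" from the same population, so that H0 is true by construction, and ask whether a test on the samples would nonetheless reject it.

```r
set.seed(1)                               # for reproducibility
men   <- rnorm(50, mean = 100, sd = 15)   # 50 simulated IQ scores
women <- rnorm(50, mean = 100, sd = 15)   # drawn from the SAME population
t.test(men, women)$p.value                # usually > .05: no grounds to reject H0
```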

Alternatives: Bayesian methods, and confidence intervals.
Meehl: the null hypothesis, taken literally, is quasi-always false.

## Type I and Type II errors

If the world can be in one or another state, and we can believe it is in one or the other state, four relationships between what we think and the true state of the world are possible:

|            | we say "no"       | we say "yes"      |
|------------|-------------------|-------------------|
| world: no  | correct rejection | false alarm       |
| world: yes | miss              | correct detection |

Two of these cells represent errors: claiming the world is in some state when it is not (false-alarm), and missing the fact that the world is in some state.

In statistics, a less memorable terminology is used: False alarms are called Type I errors (sometimes, α error, or false positive). A miss is called a Type II error (or sometimes β error, or a false negative).

In terms of hypothesis testing:
• Type I (α): reject the null hypothesis when it is true (there is no difference)
• Type II (β): fail to reject the null hypothesis when the null hypothesis is false

e.g., if 7-year-olds are brighter than 6-year-olds, but my test of the difference is not significant, I will have missed the true state of the world and made a Type II error.

These concepts were explicitly identified in 1928 by Jerzy Neyman and Egon Pearson.
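The long-run meaning of the Type I error rate can be illustrated by simulation (a sketch, assuming the conventional 5% significance level): when the null hypothesis is true, roughly 5% of tests reject it anyway, and each such rejection is a Type I error.

```r
set.seed(1)
# 2000 t-tests on pairs of samples for which H0 is true by construction
pvals <- replicate(2000, t.test(rnorm(30), rnorm(30))$p.value)
mean(pvals < 0.05)   # proportion of Type I errors: close to 0.05
```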

## Chi-square χ2

• I count 7 men and 29 women in this class: Are males and females equally likely to enroll in MSc degrees in Psychology?
• If I roll a dice six times, and get a six 3 times, is the dice fair?
• One hundred women under 33 and 100 aged 33 to 38 wishing to achieve pregnancy are recruited at random from GP offices. After 12 months, the proportions shown below are obtained. Is fertility lower from age 33?

|             | fertile | infertile |
|-------------|---------|-----------|
| under 33    | 84      | 16        |
| 33 or older | 75      | 25        |

To answer these questions, χ2 can be used. More generally, if you have counts over a fixed set of outcomes, exactly one of which must occur on each trial (so the outcome probabilities sum to 1), you can test the obtained distribution of counts in a sample against an expected distribution of counts using Pearson's χ2.
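The mechanics can be sketched by hand for the class-enrolment counts above: Pearson's statistic is the sum of (observed - expected)^2 / expected across the outcomes.

```r
observed <- c(male = 7, female = 29)
expected <- sum(observed) * c(0.5, 0.5)         # 18 and 18 if enrolment is 50:50
stat <- sum((observed - expected)^2 / expected)
stat                                            # 13.444..., matching chisq.test
chisq.test(observed)$statistic                  # same value, computed by R
```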

### Degrees of freedom

The degrees of freedom in this test are equal to the number of possible outcomes minus one (e.g., for a dice, df = 6 - 1 = 5).
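For instance, chisq.test reports 5 degrees of freedom for any six-outcome goodness-of-fit test (the counts below are made up for illustration):

```r
rolls <- c(10, 8, 12, 9, 11, 10)               # hypothetical counts of faces 1-6
chisq.test(rolls, p = rep(1/6, 6))$parameter   # df = 5, i.e. 6 - 1
```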

### Problems

• I count 7 men and 29 women in this class: Are males and females equally likely to enroll in MSc degrees in Psychology?
```
> x <- c(male=7, female=29); chisq.test(x)
# X-squared = 13.4444, df = 1, p-value = 0.0002457

> x <- c(male=27, female=29); chisq.test(x)
# X-squared = 0.0714, df = 1, p-value = 0.7893
```

nb: Because this test has exactly two outcomes, the underlying distribution of the null-hypothesis is binomial. An exact test in this case is:

```
binom.test(c(7,29), p = .5)
# number of successes = 7, number of trials = 36, p-value = 0.0003126
# alternative hypothesis: true probability of success is not equal to 0.5
# 95 percent confidence interval: 0.08194364 0.36024802
# sample estimates: probability of success  0.194
```

```
binom.test(c(27,29), p = .5)
# p-value = 0.8939
```
• If I roll a dice six times, and get a six 3 times, is the dice fair?

### What's wrong with this?

```
> x <- c(six=3, notsix=3); chisq.test(x)
# X-squared = 0, df = 1, p-value = 1
```
• I got 3 sixes on six rolls: Is the dice fair?

```
x <- c(notsix=3, six=3)
p <- c(notsix=5/6, six=1/6)
chisq.test(x, p=p)
# X-squared = 4.8, df = 1, p-value = 0.02846
```

### The importance of framing the hypothesis unambiguously

Compare "I got three 6s on six rolls: Is the dice fair?" to "I got 3 sixes (and 1 each of a 3, a 2, and a 5) out of six rolls: Is this dice fair?"

• "I got 3 sixes (and 1 each of a 3, a 2, and a 5) out of six rolls: Is this dice fair?"

```
x <- c(0, 1, 1, 0, 1, 3)
p <- rep(1/6, 6)   # equivalent to c(1/6,1/6,1/6,1/6,1/6,1/6)
chisq.test(x, p=p)
# X-squared = 6, df = 5, p-value = 0.3062
```

One test is significant (evidence that the dice is crooked); the other is not (no evidence against fairness).

The difference lies in the questions. One asked (ahead of time) "Is the number of sixes unusual?", a question with 1 degree of freedom. The other asks "Does each face of this dice appear as often as expected?", which has 5 degrees of freedom and is a much less powerful test. Let's push the n up, rolling twice as many times with the same outcome:

```
x <- c(0, 2, 2, 0, 2, 6)
p <- rep(1/6, 6)
chisq.test(x, p=p)
# X-squared = 12, df = 5, p-value = 0.03479
```

What power do I have to detect an unfair dice with 6 rolls?

```
# n and p2 are illustrative: 6 rolls per group, and a die assumed
# to show a six half the time instead of 1/6 of the time
power.prop.test(n = 6, p1 = 1/6, p2 = .5)
```
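power.prop.test is designed for comparing two sample proportions, so it is at best a rough guide here. A more direct sketch (assuming, for illustration, a die so biased that sixes come up half the time) finds the exact power of the binomial test with only 6 rolls:

```r
n <- 6
p_biased <- 1/2   # assumed bias: P(six) = 1/2 instead of 1/6
# which counts of sixes would the exact test reject at alpha = .05?
pvals  <- sapply(0:n, function(x) binom.test(x, n, p = 1/6)$p.value)
reject <- (0:n)[pvals < 0.05]     # rejects only for 4, 5 or 6 sixes
sum(dbinom(reject, n, p_biased))  # power = 0.34375: low with only 6 rolls
```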

### Correct expectations

```
x <- c(notsix=3, six=3)
p <- c(notsix=5/6, six=1/6)
chisq.test(x, p=p)
# X-squared = 4.8, df = 1, p-value = 0.02846
# Warning message:
# In chisq.test(x, p = p) : Chi-squared approximation may be incorrect
```

(The warning arises because the expected count of sixes, 6 × 1/6 = 1, is well below the conventional minimum of 5.)

### Another view

```
x <- c(1, 1, 1, 0, 0, 3)
p <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
chisq.test(x, p=p)
# X-squared = 6, df = 5, p-value = 0.3062
```
• If I count fertile and infertile women among 100 recruited under age 33 and 100 aged 33 or older, and find the table below, does maternal age affect fertility?

|             | fertile | infertile |
|-------------|---------|-----------|
| under 33    | 84      | 16        |
| 33 or older | 75      | 25        |
(The counts in this transcript are deliberately scaled up, preserving each group's fertile:infertile ratio, so that each group has 184 fertile women; with this larger sample the same proportions reach significance.)

```
> n = 184; fertility <-
+ matrix(c(n, (n/84)*16,
+           n, (n/75)*25),
+        nrow = 2,
+        byrow = TRUE,
+        dimnames = list(r=c("under33", "over33"), c=c("fertile", "infertile")))
> fertility
         c
r         fertile infertile
  under33     184  35.04762
  over33      184  61.33333
> fisher.test(fertility)

	Fisher's Exact Test for Count Data

data:  fertility
p-value = 0.02149
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.072184 2.859440
sample estimates:
odds ratio
  1.740769

Warning message:
In fisher.test(fertility) :
  'x' has been rounded to integer: Mean relative difference: 0.003952569
```

```
under33 <- c(fertile=84, infertile=16)
over33  <- c(fertile=75, infertile=25)
x <- rbind(under33, over33)
chisq.test(x)
```
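As a hedged check on the original counts (100 women per group, before the scaling-up in the transcript above), neither Fisher's exact test nor the chi-square test reaches significance at this sample size:

```r
fertility <- matrix(c(84, 16,
                      75, 25),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(age = c("under33", "33plus"),
                                    outcome = c("fertile", "infertile")))
fisher.test(fertility)$p.value   # > .05 with only 100 per group
chisq.test(fertility)$p.value    # agrees (Yates-corrected by default)
```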

## Non-parametric statistics

page revision: 12, last edited: 01 Jul 2009 18:22