Refresher in inferential statistics
Table of Contents

Hypothesis testing
 Are men brighter than women?
 Does school raise IQ?
 Are reading and language separate systems?
 Do people use nouns more frequently when they have recently heard them?
 Does counting our blessings make us happier?
 Is homosexuality due to sublimated love for one's mother?
These questions, if they are to be answered, must be framed as hypotheses: claims about the world.
The null hypothesis
In significance testing, attention is focussed on the null hypothesis, i.e., the case where our speculative hypothesis is wrong and any differences observed have occurred by chance.
Thus "Are men brighter than women?" can be framed as a null hypothesis: "Men and women have the same intelligence", and, alternatively, as an experimental hypothesis "Men are brighter than women".
As Fisher (1966) pointed out "the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the 'problem of distribution,' of which the test of significance is the solution."[10]
The null hypothesis is the claim that any observed differences are due to chance (typically, that no difference exists); it is symbolized H_{0}.
The experimental hypothesis is the claim that some difference exists; it is symbolized H_{1}.
Again following Fisher, the null hypothesis is not "proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (1935, p.19)
Is "Men and women have the same intelligence" a well-framed null hypothesis?
 H_{0} Adult men and women have the same mean IQ-test scores.
 H_{1} Adult men have higher mean IQ-test scores than adult women.
If we measured all people on the planet, we could answer this as a simple fact. That is nearly always impossible. Instead, we have to measure a sample of women and a sample of men and try to decide, on the basis of that evidence, whether we can reject the null hypothesis.
The fundamental concept of inferential statistics is the construction of a system for deciding on the basis of sample of data whether we can reject the null hypothesis or, as Neyman and Pearson put it "deciding whether or not a particular sample may be judged as likely to have been randomly drawn from a certain population".
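A minimal sketch of this decision procedure in R. The sample sizes and the "true" IQ distributions here are invented for illustration (both groups are drawn from the same population, so the null hypothesis is true by construction):

```r
set.seed(1)                               # reproducible illustration
men   <- rnorm(50, mean = 100, sd = 15)   # hypothetical sample of 50 men
women <- rnorm(50, mean = 100, sd = 15)   # hypothetical sample of 50 women
result <- t.test(men, women)              # two-sample t-test of H0: equal means
result$p.value                            # by construction H0 is true here, so
                                          # a rejection would be an error
```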
Alternatives: Bayesian inference, and confidence intervals.
Meehl: the null hypothesis is quasi-always false.
Type I and Type II errors
If the world can be in one or another state, and we can think it is in one or the other state, 4 relationships between what we think and the true state of the world are possible:
                we decide "no"      we decide "yes"
 world: no      correct rejection   false alarm
 world: yes     miss                correct detection
Two of these cells represent errors: claiming the world is in some state when it is not (a false alarm), and missing the fact that the world is in some state (a miss).
In statistics, a less memorable terminology is used: false alarms are called Type I errors (sometimes α errors, or false positives). A miss is called a Type II error (sometimes a β error, or a false negative).
In terms of hypothesis testing:
Type I (α): rejecting the null hypothesis when it is true (there is no difference)
Type II (β): failing to reject the null hypothesis when it is false (there is a difference)
e.g. if 7-year-olds are brighter than 6-year-olds, but my test of the difference is not significant, I will have missed the true state of the world and made a Type II error.
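The Type I error rate can be seen directly by simulation (the group size of 30 and the number of simulated experiments are arbitrary choices for illustration). When the null hypothesis is true, tests at α = .05 reject it about 5% of the time:

```r
set.seed(2)
alpha <- 0.05
# 2000 simulated experiments in which H0 is TRUE
# (both groups drawn from the same standard normal population):
pvals <- replicate(2000, t.test(rnorm(30), rnorm(30))$p.value)
mean(pvals < alpha)   # long-run Type I error rate: close to alpha (0.05)
```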
These concepts were explicitly identified in 1928 by Jerzy Neyman and Egon Pearson.
P-values
Power
Correlation
Chi-square (χ^{2})
 I count 7 men and 29 women in this class: Are males and females equally likely to enroll in MSc degrees in Psychology?
 If I roll a dice six times, and get a six 3 times, is the dice fair?
 One hundred women under 33 and 100 aged 33 to 38 wishing to achieve pregnancy are recruited at random from GP offices. After 12 months, the proportions shown below are obtained. Is fertility lower from age 33?
               fertile  infertile
 under 33         84        16
 33 or older      75        25
To answer these questions, χ^{2} can be used. More generally, if you have count data over a fixed set of mutually exclusive, exhaustive outcomes (one, and only one, of which must occur on any trial), and you have counts of which outcomes occurred in a sample, you can test the obtained distribution of counts against an expected distribution of counts using Pearson's χ^{2}.
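The statistic can be computed directly from observed and expected counts; a sketch using the class-enrolment counts from the first question above:

```r
# Class-enrolment counts from the first problem above
observed <- c(male = 7, female = 29)
expected <- sum(observed) * c(0.5, 0.5)          # H0: equal enrolment -> 18 and 18
stat <- sum((observed - expected)^2 / expected)  # Pearson's sum of (O - E)^2 / E
stat                                             # 13.444, matching chisq.test(observed)
pchisq(stat, df = 1, lower.tail = FALSE)         # p-value on k - 1 = 1 df
```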
The chi-squared distribution
Degrees of freedom
The degrees of freedom in this test are equal to the number of possible outcomes minus one (i.e., for a dice, df = 6 − 1 = 5)
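The degrees of freedom matter because the critical value of χ^{2} grows with df, so the same statistic can be significant on 1 df but not on 5 (this point returns in the dice examples below):

```r
# Critical values of chi-squared at alpha = .05 grow with df:
qchisq(0.95, df = 1)   # 3.84
qchisq(0.95, df = 5)   # 11.07
```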
Problems
 I count 7 men and 29 women in this class: Are males and females equally likely to enroll in MSc degrees in Psychology?
> x <- c(male=7, female=29); chisq.test(x)
# X-squared = 13.4444, df = 1, p-value = 0.0002457
> x <- c(male=27, female=29); chisq.test(x)
# X-squared = 0.0714, df = 1, p-value = 0.7893
nb: Because this test has exactly two outcomes, the underlying distribution under the null hypothesis is binomial. An exact test in this case is:
binom.test(c(7,29), p = .5)
# number of successes = 7, number of trials = 36, p-value = 0.0003126
# alternative hypothesis: true probability of success is not equal to 0.5
# 95 percent confidence interval: 0.08194364 0.36024802
# sample estimates: probability of success 0.194
binom.test(c(27,29), p = .5)
# p-value = 0.8939
 If I roll a dice six times, and get a six 3 times, is the dice fair?
What's wrong with this?
> x <- c(six=3, notsix=3); chisq.test(x)
# X-squared = 0, df = 1, p-value = 1
# (chisq.test defaults to equal expected probabilities, but under a fair dice
#  P(six) = 1/6, not 1/2)
 I got 3 sixes on six rolls: Is the dice fair?
x <- c(notsix=3, six=3);
p <- c(notsix = 5/6, six = 1/6);
chisq.test(x, p = p);
# X-squared = 4.8, df = 1, p-value = 0.02846
The importance of framing the hypothesis unambiguously
Compare "I got 3 sixes on six rolls: Is the dice fair?" to "I got 3 sixes (and 1 each of a 2, a 3, and a 5) out of six rolls: Is this dice fair?"
 "I got 3 sixes (and 1 each of a 2, a 3, and a 5) out of six rolls: Is this dice fair?"
x <- c(0, 1, 1, 0, 1, 3);
p <- rep(1/6, 6); # equivalent to c(1/6,1/6,1/6,1/6,1/6,1/6)
chisq.test(x, p = p);
# X-squared = 6, df = 5, p-value = 0.3062
One test is significant (suggesting the dice is crooked); the other is not (consistent with a fair dice).
The difference lies in the questions: one (posed ahead of time) asks "Is the number of 6s unusual?", a question with 1 degree of freedom. The other asks "Does each face of this dice appear as often as expected?", which has 5 degrees of freedom and is a much less powerful test. Let's push n up, rolling twice as many times with the same relative outcome:
x <- c(0, 2, 2, 0, 2, 6);
p <- rep(1/6, 6);
chisq.test(x, p = p);
# X-squared = 12, df = 5, p-value = 0.03479
What power do I have to detect an unfair dice with 6 rolls?
# nb: power.prop.test needs an effect to detect; p1 = p2 = .50 specifies none.
# e.g., power to detect a rise from 50% to 70% with n = 50 per group:
power.prop.test(n = 50, p1 = .50, p2 = .70)
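Base R has no power function for a one-sample goodness-of-fit test, but power for the dice question can be estimated by Monte Carlo simulation. This sketch assumes a particular crooked dice, P(six) = 1/2 (an assumption; power depends on how crooked the dice is), and runs the 1-df "six vs not-six" test on 6 rolls:

```r
set.seed(3)
n <- 6          # rolls per simulated experiment
p_six <- 1/2    # assumed P(six) for the crooked dice (an assumption)
pvals <- replicate(5000, {
  sixes <- rbinom(1, n, p_six)                # rolls from the crooked dice
  suppressWarnings(                           # small expected counts warn
    chisq.test(c(notsix = n - sixes, six = sixes), p = c(5/6, 1/6))$p.value
  )
})
mean(pvals < 0.05)   # estimated power: about 0.65 under these assumptions
```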
Correct expectations
x <- c(notsix=3, six=3);
p <- c(notsix = 5/6, six = 1/6);
chisq.test(x, p = p);
# X-squared = 4.8, df = 1, p-value = 0.02846
# Warning message:
# In chisq.test(x, p = p) : Chi-squared approximation may be incorrect
nb: The warning arises because the expected count in the "six" cell is only 1: the χ^{2} approximation becomes unreliable when expected counts are small (a common rule of thumb is that all expected counts should be at least 5).
Another view
x <- c(1, 1, 1, 0, 0, 3);
p <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6);
chisq.test(x, p = p);
# X-squared = 6, df = 5, p-value = 0.3062
 If I count fertile and infertile women under 33 and 33 or older, and find the following table, is fertility lower from age 33?
               fertile  infertile
 under 33         84        16
 33 or older      75        25
nb: in this solution the counts are rescaled (preserving the odds ratio) so that each group contains 184 fertile women, inflating the total n.
n = 184; fertility <-
+ matrix(c(n, (n/84)*16,
+          n, (n/75)*25),
+        nrow = 2,
+        byrow = TRUE,
+        dimnames = list(r = c("under33", "over33"), c = c("fertile", "infertile")))
> fertility
         c
r         fertile infertile
  under33     184  35.04762
  over33      184  61.33333
> fisher.test(fertility)
Fisher's Exact Test for Count Data
data: fertility
p-value = 0.02149
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.072184 2.859440
sample estimates:
odds ratio
1.740769
Warning message:
In fisher.test(fertility) :
'x' has been rounded to integer: Mean relative difference: 0.003952569
under33 = c(fertile=84, infertile=16);
over33 = c(fertile=75, infertile=25);
x = rbind(under33, over33)
chisq.test(x);
# X-squared = 1.9635, df = 1, p-value = 0.1611 (with Yates' continuity correction)
# nb: on the raw counts (n = 100 per group), the difference is not significant
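As with the class-enrolment example, an exact test is available here too; a sketch running Fisher's exact test on the raw counts rather than the inflated table:

```r
# Fisher's exact test on the raw 2x2 counts (no inflation of n)
under33 <- c(fertile = 84, infertile = 16)
over33  <- c(fertile = 75, infertile = 25)
fisher.test(rbind(under33, over33))$p.value   # > .05: not significant
```

Note the contrast with the inflated table above: the same proportions that reach p = 0.02 at the inflated n are not significant with only 100 women per group.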