Comparing IRT and EFA on Raven's-like tests

It's well known that you shouldn't use factor analysis on binary variables (see van der Ven and Ellis, 2000, p. 47), but it's good to try these things anyway to understand why. [FIXME: need to try again with polychoric correlation!]

This is a simulation of the hurdle-jumping model of Raven's Progressive Matrices. First we generate 1000 "IQ" scores, with a mean of 100 and an SD of 15, then rescale them so they have a range of 0 to 36 (as per the Advanced Progressive Matrices).

iq = rnorm(1000, 100, 15) # 1000 "IQ" scores with mean 100 and SD 15
hist(iq)

# rescale x linearly onto the interval `to` (simplified from the one in ggplot)
gscale = function (x, to = c(0, 1)) 
{
    from = range(x, na.rm = TRUE)
    (x - from[1])/diff(from) * diff(to) + to[1]
}

rpm = floor(gscale(iq,c(0,36))) # floor rounds down to the nearest integer
hist(rpm)

When I ran this, the resulting rpm vector had a mean of 17.6 (SD = 4.6). The next step is to generate item scores using hurdle-jumping logic, so that, e.g., someone scoring 7 gets

11111110000000000...

someone scoring 1 gets

1000000000000...

plus a bit of noise: randomly flipping 5% of the item responses from 1 to 0 or from 0 to 1.

randomflip = function(x, p) {
  breakv = sample(c(1, 0), length(x), replace = TRUE, prob = c(p, 1 - p)) # 1 marks a response to flip
  (x + breakv) %% 2 # adding 1 mod 2 turns 0 into 1 and 1 into 0
}

m = NULL
for (i in 1:length(rpm)) {
  # rpm[i] ones followed by zeros, with 5% of responses flipped at random
  resps = randomflip(c( rep(1,rpm[i]), rep(0,36 - rpm[i]) ), 0.05)
  m = rbind(m, resps)
}
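
(A vectorised equivalent of the loop above, if you prefer, using the same randomflip helper:)

m = t(sapply(rpm, function(s) randomflip(c(rep(1, s), rep(0, 36 - s)), 0.05)))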

So now we have a 1000 x 36 matrix of each "person's" item scores. How about a scree plot?
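
(One way to get a scree plot is from the eigenvalues of the item correlation matrix; something along these lines:)

ev = eigen(cor(m))$values # eigenvalues of the 36 x 36 item correlation matrix
plot(ev, type = "b", xlab = "Factor number", ylab = "Eigenvalue") # scree plot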

[Figure: rpm_scree.png, the scree plot]

More than one factor is suggested. Let's pull out one anyway.
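
(Something like this, using factanal's maximum-likelihood fit; the printout blanks out loadings below the default cutoff of 0.1:)

f1 = factanal(m, factors = 1)
print(f1$loadings, cutoff = 0.1)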

Factor1
 [1,]
 [2,]
 [3,]
 [4,]
 [5,]
 [6,]
 [7,]
 [8,]
 [9,]
[10,]  0.108
[11,]  0.136
[12,]  0.245
[13,]  0.344
[14,]  0.454
[15,]  0.502
[16,]  0.563
[17,]  0.589
[18,]  0.721
[19,]  0.760
[20,]  0.782
[21,]  0.772
[22,]  0.763
[23,]  0.655
[24,]  0.625
[25,]  0.587
[26,]  0.495
[27,]  0.401
[28,]  0.337
[29,]  0.286
[30,]  0.202
[31,]  0.148
[32,]
[33,]
[34,]
[35,]
[36,]

Two factors?
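
(The same thing with factors = 2:)

f2 = factanal(m, factors = 2)
print(f2$loadings, cutoff = 0.1)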

Factor1 Factor2
 [1,]
 [2,]
 [3,]
 [4,]
 [5,]
 [6,]
 [7,]
 [8,]
 [9,]
[10,]  0.155
[11,]  0.222
[12,]  0.369
[13,]  0.525
[14,]  0.642
[15,]  0.673
[16,]  0.745
[17,]  0.753
[18,]  0.757   0.247
[19,]  0.676   0.367
[20,]  0.605   0.459
[21,]  0.528   0.533
[22,]  0.414   0.667
[23,]  0.255   0.714
[24,]  0.193   0.755
[25,]  0.135   0.771
[26,]  0.113   0.647
[27,]          0.544
[28,]          0.500
[29,]          0.397
[30,]          0.314
[31,]          0.234
[32,]          0.132
[33,]
[34,]
[35,]
[36,]

Three?
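
(And with factors = 3:)

f3 = factanal(m, factors = 3)
print(f3$loadings, cutoff = 0.1)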

Factor1 Factor2 Factor3
 [1,]
 [2,]
 [3,]
 [4,]
 [5,]
 [6,]
 [7,]
 [8,]
 [9,]
[10,]                  0.181
[11,]                  0.259
[12,]                  0.436
[13,]  0.135   0.111   0.609
[14,]  0.242   0.114   0.680
[15,]  0.309   0.108   0.655
[16,]  0.408           0.640
[17,]  0.479           0.571
[18,]  0.662           0.438
[19,]  0.742           0.288
[20,]  0.790           0.174
[21,]  0.789   0.137
[22,]  0.769   0.283
[23,]  0.626   0.409
[24,]  0.565   0.502  -0.106
[25,]  0.482   0.608  -0.106
[26,]  0.332   0.628
[27,]  0.215   0.609
[28,]  0.142   0.609
[29,]          0.517
[30,]          0.410
[31,]          0.356
[32,]          0.228
[33,]          0.185
[34,]
[35,]
[36,]

Funny stuff going on there. We KNOW there's only one latent trait driving the scores, but poor FA is getting confused. By the way, note the plot over here and the authors' interpretation thereof. [FIXME: check again using polychoric correlation! THOUGH they appear to have used PCA here for the scree. Hmmm!]
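
(A sketch of what that polychoric re-run might look like, untested, using polychoric(), scree() and fa() from the psych package:)

library(psych)
pc = polychoric(m) # polychoric correlations for the binary items
scree(pc$rho) # redo the scree on the polychoric matrix
fa(pc$rho, nfactors = 1) # and the factor analysis on the same matrix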

Now for the IRT. It's not perfect, I guess because there were too few "people" down at the lower and upper ends, but it's more useful:

library(ltm) # tpm() fits Birnbaum's three-parameter logistic model
irt1 = tpm(m, type = "latent.trait", max.guessing = 1)
irt1
Item     Gussng   Dffclt   Dscrmn
Item 1    0.046  -20.922    0.149
Item 2    0.096  -50.475    0.052
Item 3    0.139  -10.991    0.256
Item 4    0.085  808.913   -0.003
Item 5    0.003  -13.796    0.198
Item 6    0.032  -22.429    0.134
Item 7    0.004  -17.542    0.164
Item 8    0.002  -13.332    0.193
Item 9    0.000  -12.005    0.225
Item 10   0.000   -5.356    0.509
Item 11   0.000   -4.319    0.617
Item 12   0.000   -2.536    0.928
Item 13   0.000   -1.925    1.455
Item 14   0.000   -1.404    1.986
Item 15   0.000   -1.109    1.874
Item 16   0.000   -0.925    2.217
Item 17   0.000   -0.659    1.991
Item 18   0.006   -0.301    3.198
Item 19   0.030   -0.069    3.781
Item 20   0.031    0.123    3.941
Item 21   0.023    0.327    4.151
Item 22   0.025    0.552    5.761
Item 23   0.049    0.783    4.723
Item 24   0.047    0.863    6.351
Item 25   0.036    1.026    7.962
Item 26   0.047    1.253    5.690
Item 27   0.048    1.460    4.424
Item 28   0.043    1.566    5.517
Item 29   0.033    1.689    8.308
Item 30   0.049    2.044  321.707
Item 31   0.050    2.047  251.781
Item 32   0.054    2.052  232.980
Item 33   0.055    2.056  207.787
Item 34   0.042    2.756  128.868
Item 35   0.060    2.897   15.462
Item 36   0.040   29.041    0.503

Looking at this without any a priori notion of what's going on, I'd keep items 10 to 29 and 35? The ICCs are lovely:
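
(The ICCs can be drawn with ltm's plot method:)

plot(irt1, type = "ICC") # item characteristic curves for all 36 items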

[Figure: irt.png, the item characteristic curves]

(It must be a real art getting similar curves across the population for a real test with 36 items!)

Now, what happens if we compare factor scores from the one-factor model with The Truth as imposed by the simulation? It doesn't look too bad actually, but (tilt your head anticlockwise 90 degrees) it tends towards a logit curve. Presumably near the centre of the items (and the ability distribution) the extracted factor does a fine job of predicting ability; near the edges, it gets a bit messier.

f1 = factanal(m, factors=1, scores = "regression") # Thomson's scores, as he was a local!
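
(Something like this for the comparison, plotting the regression factor scores against the simulated total scores:)

plot(f1$scores[,1], rpm, xlab = "Factor score", ylab = "Simulated RPM score")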
[Figure: thompson_score.png, factor scores plotted against simulated ability]

My guess is that Real Data would look a lot messier than this.
