Vectors
To and from data files where missing data is coded
Missing data is marked as NA in R. In some archival data sets it's common to use, e.g., -999 or some other out of range value. Also, most stats packages (SAS, SPSS, etc.) use a period to denote missing data.
Two functions that help:
add.na = Vectorize(function(x,naval) {
if (x == naval) NA else x
})
remove.na = Vectorize(function(x,naval) {
if (is.na(x)) naval else x
})
Best illustrated by example…
> x = c(100,300,400,5,7,8,-999,2,3,1)
> x.new = add.na(x,-999)
> x.new
[1] 100 300 400 5 7 8 NA 2 3 1
>
> x.old = remove.na(x,-999)
> x.old
[1] 100 300 400 5 7 8 -999 2 3 1
>
However, I've come upon an easier way; read.table has an option called 'na.string' which enables you to specify a string as indicating that a variable is missing. On the other end of things, write.table has an option called 'na' that lets you change the NAs into whatever you like. As such, my procedure for dealing with this data involved transferring those strings into NAs upon reading in the file and then outputting them as the strings upon writing the new file. For example, for some data I am looking at, I wanted to keep the missing value as -999999 after standardizing some scores.
vesmmpi <- read.table("./old_file.csv",header=F,sep=",",na.string="-999999")
names(vesmmpi)
.
.
.
write.table(vesmmpi,file="./new_file.csv",sep=",",row.names=F,col.names=F,na="-999999")
Dataframes
Switching from wide to long format for repeated measures data (and back again)
Having to transpose data from wide (one row per case) to long format (one row per measure) is very irritating. You can use SPSS, but R can do it too. This snippet should help. It specifies that the columns 2:4 are the repeated measures, and converts these into "score" with separate lines for each timepoint.
# make some data
wide = data.frame(gender = c(rep("Male",20),rep("Female",20))
,time1 = rnorm(40,25,5)
,time2 = rnorm(40,25,5)
,time3 = rnorm(40,25,5))
wide
long <- reshape(wide, idvar="Subject", varying=list(2:4),
v.names="score", direction="long")
long
wideagain <- reshape(long, timevar="time", idvar="Subject", direction="wide", v.names="score")
wideagain
See help(reshape) for more info.
See William Revelle's ANOVA page for another way to do it.
Merging dataframes
[FIXME: use merge, give some examples.]
Scoring
First, build a data frame for the (made up) correct answers
correct.answers = data.frame(
item = 1:4,
correct = c(3,1,4,1)
)
item | correct |
---|---|
1 | 3 |
2 | 1 |
3 | 4 |
4 | 1 |
i.e., the correct answer for item 1 is 3, for item 2 it's 1…
Now invent some actual answers for 4 pretend participants
actual.answers = data.frame(
subject = c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4),
item = rep(c(1,2,3,4),4),
answer = c(3,1,4,1, 3,2,4,1, 3,2,0,1, 1,2,3,4)
)
subject | item | answer |
---|---|---|
1 | 1 | 3 |
1 | 2 | 1 |
1 | 3 | 4 |
1 | 4 | 1 |
2 | 1 | 3 |
2 | 2 | 2 |
2 | 3 | 4 |
2 | 4 | 1 |
3 | 1 | 3 |
3 | 2 | 2 |
3 | 3 | 0 |
3 | 4 | 1 |
4 | 1 | 1 |
4 | 2 | 2 |
4 | 3 | 3 |
4 | 4 | 4 |
Sub 1 has done very well. Sub 4 hasn't done so well.
Merge the two tables, and add an extra columns which is TRUE iff the correct answer was given
scored = merge(actual.answers,correct.answers)
scored$is.correct = scored$correct == scored$answer
Now build a table to get a sumscore for each person
scored.table = table(scored$subject, scored$is.correct)
scored.table
and make into a dataframe again
scored.df = data.frame(
subject=dimnames(scored.table)[[1]],
score = scored.table[,2]
)
Removing Outliers
One way of doing things, there must be more!
testme <- data.frame(subject = gl(10,10), cond =gl(2,5,100), data=rnorm(100,25,5),error=rep(0,100))
# make sure there are some outliers
testme$data[c(10,44,79)] <- 100
The basic plan is to remove outliers that are 2.5SDs above/below the per-participant mean. Here, we tag them in an error column first, so we can keep records.
# generic outlier function (MC, 09-09-09)
outliers <- function(x, index=NULL, sds=2.5) {
if (is.data.frame(x)) {
as.data.frame(sapply(x, outliers, index, sds))
} else if (is.matrix(x)) {
apply(x, 2, outliers, index, sds)
} else if (is.list(x)) {
lapply(x, outliers, index, sds)
} else if (is.vector(x)) {
if (!is.null(index)) {
if (!is.list(index)) {
index <- list(index) # make sure index is a list
}
unsplit(outliers(split(x,index),index=NULL,sds=sds),index)
} else {
bound <- sds*sd(x,na.rm=T)
m <- mean(x,na.rm=T)
(abs(x-m) > bound)
}
} else {
cat("outliers not implemented for class ",class(x),"\n",sep="")
}
}
# find outliers by subject
testme$error[outliers(testme$data,testme$subject)] <- 'Outlier'
# find 2-SD outliers by subject and condition
#
# testme$error[outliers(testme$data,list(testme$subject,testme$cond),sds=2)] <- 'Outlier'
testme$data[testme$error == 'Outlier'] <- NA
Later, we can examine the record of what has been removed/altered and why.
table(testme$error)