Data manipulation

Vectors

To and from data files where missing data is coded

Missing data is marked as NA in R. In some archival data sets it's common to use, e.g., -999 or some other out of range value. Also, most stats packages (SAS, SPSS, etc.) use a period to denote missing data.

Two functions that help:

add.na = Vectorize(function(x,naval) {
  if (x == naval) NA else x
})

remove.na = Vectorize(function(x,naval) {
  if (is.na(x)) naval else x
})

Best illustrated by example…

> x = c(100,300,400,5,7,8,-999,2,3,1)
> x.new = add.na(x,-999)
> x.new
 [1] 100 300 400   5   7   8  NA   2   3   1
> 
> x.old = remove.na(x,-999)
> x.old
 [1]  100  300  400    5    7    8 -999    2    3    1
>

However, I've come upon an easier way; read.table has an option called 'na.string' which enables you to specify a string as indicating that a variable is missing. On the other end of things, write.table has an option called 'na' that lets you change the NAs into whatever you like. As such, my procedure for dealing with this data involved transferring those strings into NAs upon reading in the file and then outputting them as the strings upon writing the new file. For example, for some data I am looking at, I wanted to keep the missing value as -999999 after standardizing some scores.

vesmmpi <- read.table("./old_file.csv",header=F,sep=",",na.string="-999999")
names(vesmmpi)
.
.
.
write.table(vesmmpi,file="./new_file.csv",sep=",",row.names=F,col.names=F,na="-999999")

Dataframes

Switching from wide to long format for repeated measures data (and back again)

Having to transpose data from wide (one row per case) to long format (one row per measure) is very irritating. You can use SPSS, but R can do it too. This snippet should help. It specifies that the columns 2:4 are the repeated measures, and converts these into "score" with separate lines for each timepoint.

# make some data
wide = data.frame(gender = c(rep("Male",20),rep("Female",20))
                 ,time1 = rnorm(40,25,5)
                 ,time2 = rnorm(40,25,5)
                 ,time3 = rnorm(40,25,5))
wide
long <- reshape(wide, idvar="Subject", varying=list(2:4),
        v.names="score", direction="long")
long
wideagain <- reshape(long, timevar="time", idvar="Subject", direction="wide", v.names="score")
wideagain

See help(reshape) for more info.

See William Revelle's ANOVA page for another way to do it.

Merging dataframes

[FIXME: use merge, give some examples.]

Scoring

First, build a data frame for the (made up) correct answers

correct.answers = data.frame(
      item = 1:4,
      correct = c(3,1,4,1)
   )
item correct
1 3
2 1
3 4
4 1

i.e., the correct answer for item 1 is 3, for item 2 it's 1…

Now invent some actual answers for 4 pretend participants

actual.answers = data.frame(
      subject = c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4),
      item    = rep(c(1,2,3,4),4),
      answer =  c(3,1,4,1, 3,2,4,1, 3,2,0,1, 1,2,3,4)
   )
subject item answer
1 1 3
1 2 1
1 3 4
1 4 1
2 1 3
2 2 2
2 3 4
2 4 1
3 1 3
3 2 2
3 3 0
3 4 1
4 1 1
4 2 2
4 3 3
4 4 4

Sub 1 has done very well. Sub 4 hasn't done so well.

Merge the two tables, and add an extra columns which is TRUE iff the correct answer was given

scored = merge(actual.answers,correct.answers)
scored$is.correct = scored$correct == scored$answer

Now build a table to get a sumscore for each person

scored.table = table(scored$subject, scored$is.correct)
scored.table

and make into a dataframe again

scored.df = data.frame(
   subject=dimnames(scored.table)[[1]],
   score = scored.table[,2]
)

Removing Outliers

One way of doing things, there must be more!

testme <- data.frame(subject = gl(10,10), cond =gl(2,5,100), data=rnorm(100,25,5),error=rep(0,100))
# make sure there are some outliers
testme$data[c(10,44,79)] <- 100

The basic plan is to remove outliers that are 2.5SDs above/below the per-participant mean. Here, we tag them in an error column first, so we can keep records.

# generic outlier function (MC, 09-09-09)
outliers <- function(x, index=NULL, sds=2.5) {
  if (is.data.frame(x)) {
    as.data.frame(sapply(x, outliers, index, sds))
  } else if (is.matrix(x)) {
    apply(x, 2, outliers, index, sds)
  } else if (is.list(x)) {
    lapply(x, outliers, index, sds)
  } else if (is.vector(x)) {
    if (!is.null(index)) {
      if (!is.list(index)) {
        index <- list(index) # make sure index is a list
      }
      unsplit(outliers(split(x,index),index=NULL,sds=sds),index)
    } else {
      bound <- sds*sd(x,na.rm=T)
      m <- mean(x,na.rm=T)
      (abs(x-m) > bound)
    }
  } else {
    cat("outliers not implemented for class ",class(x),"\n",sep="")
  }
}

# find outliers by subject
testme$error[outliers(testme$data,testme$subject)] <- 'Outlier'

# find 2-SD outliers by subject and condition
# 
# testme$error[outliers(testme$data,list(testme$subject,testme$cond),sds=2)] <- 'Outlier'

testme$data[testme$error == 'Outlier'] <- NA

Later, we can examine the record of what has been removed/altered and why.

table(testme$error)
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License