I was interested in looking for ‘batch effects’ between sets of runs I was doing of an experiment.

Each batch includes control DNA with expression information for 2 genes.  I am using the ratio of these two genes to look for outlier runs.

1. Extracted data into excel spreadsheet with run name and expression information for the two genes only.


run01, 14.6, 27
run02, 15, 25.1
run03, 14.2, 32.9

2. File loaded into R.

methrun <-read.table("c:\\Alex\\R\\methrun.txt",header=T)
[1] "RunName" "ALUctrl" "MLH1ctrl"


3. The analysis I found here:

would normalise the data, so I made a copy

methrun1 <- methrun

and it also required me to remove RunName column

methrun1$RunName <- NULL

I was now able to apply the scale

mydata <- scale(methrun1)

The following applies k-means clustering to work out how much variation can be explained by between 1-15 clusters.

wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")


Based on this plot I decided to give three clusters a try.  Four clusters could have been a reasonable choice as well.  For my data, less than three, you’re leaving too much variance unexplained.  More than four, your returns are diminishing.

4. Distribute the batches into clusters:

# K-Means Cluster Analysis
# 3 cluster solution ***dependent on dataset***
fit <- kmeans(mydata, 3)
# get cluster means 
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
# plot data
plot(mydata$y ~ mydata$x, pch=fit.cluster)


Interesting! I should definitely be looking into the batches represented by the cross and the circles for inconsistencies.

5. Repeat analysis with more or less clusters:

# clean up after each cluster
mydata$fit.cluster <- NULL
# and reapply analysis of step 4

6. A neat way to summarise your clusters is to apply a Principle Components Analysis.


clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,  labels=2, lines=0)


Also, if I had taken more data on these runs

e.g. time taken to set up, age of reagents/primers/DNA, number of freeze-thaw cycles of reagents/primers/DNA

I could try to explain some of the variation depicted in these plots.