I was interested in looking for ‘batch effects’ between sets of runs of an experiment I was doing.

Each batch includes control DNA with expression information for two genes.  I am using the ratio of these two genes to look for outlier runs.

1. Extracted the data into an Excel spreadsheet containing only the run name and the expression values for the two genes, then saved it as a text file (methrun.txt).


run01, 14.6, 27
run02, 15, 25.1
run03, 14.2, 32.9
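The ratio idea can be sketched on the three example rows above. Which gene goes on top, and the 2-SD cutoff, are my own assumptions for illustration:

```r
# Per-run ratio of the two control genes (MLH1ctrl over ALUctrl is an
# arbitrary choice; the reciprocal works just as well for outlier-spotting)
ALUctrl  <- c(14.6, 15, 14.2)
MLH1ctrl <- c(27, 25.1, 32.9)
ratio <- MLH1ctrl / ALUctrl
# flag runs whose ratio sits more than 2 SDs from the mean (arbitrary cutoff)
outliers <- which(abs(ratio - mean(ratio)) > 2 * sd(ratio))
```

With only three runs nothing is flagged; the cutoff only becomes meaningful once many runs are accumulated.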

2. Loaded the file into R and checked the column names:

methrun <- read.table("c:\\Alex\\R\\methrun.txt", header=TRUE)
names(methrun)
[1] "RunName" "ALUctrl" "MLH1ctrl"


3. The analysis I found here would normalise the data, so I made a copy

methrun1 <- methrun

and it also required me to remove the RunName column

methrun1$RunName <- NULL

I was now able to apply scale()

mydata <- scale(methrun1)

The following applies k-means clustering to work out how much of the variation can be explained by between 1 and 15 clusters (an ‘elbow’ plot).

# total within-group sum of squares for one cluster
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
# then for 2 to 15 clusters (seed set so the random starts are reproducible)
set.seed(42)
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")


Based on this plot I decided to give three clusters a try; four clusters would have been a reasonable choice as well.  For my data, with fewer than three clusters too much variance is left unexplained, and beyond four the returns are diminishing.

4. Distribute the batches into clusters:

# K-Means Cluster Analysis
# 3 cluster solution ***dependent on dataset***
fit <- kmeans(mydata, 3)
# get cluster means
aggregate(mydata, by=list(fit$cluster), FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
# plot the two control genes, one plotting symbol per cluster
plot(MLH1ctrl ~ ALUctrl, data=mydata, pch=fit$cluster)


Interesting! I should definitely be looking into the batches represented by the cross and the circles for inconsistencies.

5. Repeat the analysis with more or fewer clusters:

# remove the appended cluster assignment after each run
mydata$fit.cluster <- NULL
# then reapply the analysis of step 4
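The repeat step can be sketched as a loop.  So that it runs on its own, this recreates a small stand-in for the scaled matrix from step 3 using the example rows above:

```r
# Stand-in for the scaled matrix from step 3, built from the example rows
methrun1 <- data.frame(ALUctrl = c(14.6, 15, 14.2),
                       MLH1ctrl = c(27, 25.1, 32.9))
mydata <- scale(methrun1)

# Re-run the step-4 clustering for two and then three clusters
for (k in 2:3) {
  fit <- kmeans(mydata, centers = k)
  plot(mydata[, "ALUctrl"], mydata[, "MLH1ctrl"],
       pch = fit$cluster, main = paste(k, "clusters"),
       xlab = "ALUctrl (scaled)", ylab = "MLH1ctrl (scaled)")
}
```

In practice you would loop over whichever values of k the elbow plot suggested.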

6. A neat way to summarise your clusters is to apply Principal Component Analysis (PCA).


library(cluster)   # clusplot() lives in the cluster package
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)


Also, if I had recorded more data on these runs (e.g. time taken to set up, age of reagents/primers/DNA, number of freeze-thaw cycles of reagents/primers/DNA), I could try to explain some of the variation depicted in these plots.

Notes on: Data Science for Business (textbook)

Data Science for Business: What you need to know about data mining and data-analytic thinking

by Foster Provost and Tom Fawcett

A useful resource on basic statistical models (generating, testing, visualising), applied in business contexts.

My notes up to page 112 are as follows:

Examples of classification and regression tasks:

What is the probability a class will respond to a situation?

e.g. which group is least likely to renew their contract

What is the probability an individual is part of a class?

e.g. user displays x, y, z characteristics = 70% chance male youth

Value estimation (regression)

e.g. how much will a customer use this service?

Similarity matching

e.g. find company profiles that are similar to our most valuable customers

Clustering – (unsupervised) group individuals together based on similarity

Co-occurrence grouping (aka market-basket analysis) – look for ‘interactions’

e.g. what items are commonly purchased together

Profiling – useful for establishing trends and identifying outliers

e.g. what is the typical cell phone usage of this customer segment?

e.g. a credit card is used for odd purchases in an odd location

Link prediction

e.g. you and Karen share 10 friends -> would you like to be friends with Karen?

e.g. movie prediction algorithms

Data reduction – simplify larger data sets into smaller data sets; complex models into simpler models


Who are the most profitable customers?

Which customers are most likely to pay late?

Which customers are most likely to respond to a campaign or advertisement?

Is there a statistically significant difference between the profitable and average customer groups?

With how much certainty can we characterise individuals into groups?

Will a particular new customer be profitable? How profitable?

More general notes

‘Tree induction’ is the same as model generation in science stats, and the relationships can be plotted in R using:

plot([response variable] ~ .,data=[data set])
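For instance, on R’s built-in iris data (my own choice of example dataset):

```r
# Plot the response against every other column, one panel per predictor
plot(Sepal.Length ~ ., data = iris)
```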

“(For simplification in the following examples) we ignore the need to normalize numeric measurements to a common scale. Attributes such as Age and Income have vastly different ranges and they are usually normalized to a common scale to help with model interpretability, as well as other things (to be discussed later). ”

  • Interesting – how are age and income usually normalized?
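Two standard answers (common normalisation techniques, not something the book has covered at this point) sketched on made-up values:

```r
# Made-up Age and Income values
age    <- c(23, 45, 31, 60)
income <- c(30000, 85000, 52000, 40000)

# z-score normalisation (mean 0, sd 1) -- what R's scale() does by default
z_age <- (age - mean(age)) / sd(age)

# min-max normalisation (rescale to the [0, 1] interval)
mm_income <- (income - min(income)) / (max(income) - min(income))
```

Either way, Age and Income end up on comparable scales, so neither dominates a distance-based model.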

Side-note: I got side-tracked thinking about Principal Component Analysis (PCA), which is used when a number of your variables are correlated.  The aim of PCA is to take high-dimensional, correlated data and reduce the number of dimensions, i.e. model simplification.

princomp([data set], scores=TRUE, cor=TRUE)

scores – return the data transformed into the principal-component space

cor – use the correlation matrix instead of the covariance matrix

Output: the proportion of variance explained by each component – useful.

Can visualise this (as a scree plot of the component variances) by:

plot(princomp([data set], scores=TRUE, cor=TRUE))

biplot() – how do the arrows for the original variables align?  Variables pointing in different directions are contributing distinct information, so you probably want to keep them.  Highly aligned variables probably aren’t adding much extra explanatory value.
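These calls can be tried end-to-end on R’s built-in USArrests data (four correlated numeric variables; my own choice of example):

```r
pca <- princomp(USArrests, scores = TRUE, cor = TRUE)
summary(pca)       # includes the "Proportion of Variance" per component
plot(pca)          # scree plot of the component variances
head(pca$scores)   # the data in the transformed (component) space
biplot(pca)        # arrows show how the original variables align
```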

  • Find out – what does the ‘transformed’ data look like after PCA? What form is it in?