I was interested in looking for ‘batch effects’ between sets of runs of an experiment I was doing.

Each batch includes control DNA with expression information for 2 genes.  I am using the ratio of these two genes to look for outlier runs.

1. Extracted the data into an Excel spreadsheet with the run name and expression information for the two genes only.


run01, 14.6, 27
run02, 15, 25.1
run03, 14.2, 32.9
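
The ratio itself is easy to compute once the data is in a data frame. A minimal sketch using only the three sample rows above: the column names come from the file header, while the direction of the ratio (ALUctrl over MLH1ctrl) and the use of distance from the median as a rough outlier score are my assumptions here, not a fixed method.

```r
# The three sample rows shown above; column names follow the file header
runs <- data.frame(
  RunName  = c("run01", "run02", "run03"),
  ALUctrl  = c(14.6, 15, 14.2),
  MLH1ctrl = c(27, 25.1, 32.9)
)

# Ratio of the two control genes for each run (direction is an assumption)
runs$ratio <- runs$ALUctrl / runs$MLH1ctrl

# Distance of each run's ratio from the median ratio (a rough outlier score)
runs$dev <- abs(runs$ratio - median(runs$ratio))

# Runs ordered from most to least unusual
print(runs[order(runs$dev, decreasing = TRUE), ])
```

With only three rows this is just an illustration; on the full set of runs the same two lines give a quick ranking before any clustering.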

2. File loaded into R.

# add sep="," if the file is comma-separated like the sample rows above
methrun <- read.table("c:\\Alex\\R\\methrun.txt", header=TRUE)
names(methrun)
[1] "RunName" "ALUctrl" "MLH1ctrl"


3. The analysis I found here:

would normalise the data, so I made a copy

methrun1 <- methrun

and it also required me to remove RunName column

methrun1$RunName <- NULL

I was now able to apply the scale

mydata <- scale(methrun1)

The following applies k-means clustering for 1 to 15 clusters and records the within-groups sum of squares, i.e. how much variation is left unexplained at each number of clusters.

wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
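
Since kmeans() starts from random centres, the curve can change slightly from run to run. Below is a sketch of a more repeatable version of the same elbow plot. The built-in iris measurements stand in for my run data, and the seed, nstart = 25 and the maximum of 10 clusters are arbitrary choices for illustration.

```r
# Built-in iris measurements stand in for the run data (an assumption)
mydata <- scale(iris[, 1:4])

set.seed(42)   # kmeans starts from random centres; fix the seed for repeatability
max_k <- 10    # arbitrary upper limit for this illustration
wss <- numeric(max_k)

# With a single cluster, the within-groups SS is simply the total SS
wss[1] <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))

for (k in 2:max_k) {
  # nstart = 25 keeps the best of 25 random starts for each k
  wss[k] <- kmeans(mydata, centers = k, nstart = 25)$tot.withinss
}

plot(1:max_k, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

tot.withinss is the same quantity as sum(withinss), just precomputed by kmeans().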


Based on this plot I decided to give three clusters a try.  Four clusters could have been a reasonable choice as well.  For my data, with fewer than three you’re leaving too much variance unexplained; with more than four, your returns are diminishing.

4. Distribute the batches into clusters:

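# K-Means Cluster Analysis
# 3 cluster solution ***dependent on dataset***
fit <- kmeans(mydata, 3)
# get cluster means
aggregate(mydata, by=list(fit$cluster), FUN=mean)
# append cluster assignment (the new column is named fit.cluster)
mydata <- data.frame(mydata, fit$cluster)
# plot data, one plotting symbol per cluster
plot(MLH1ctrl ~ ALUctrl, data=mydata, pch=mydata$fit.cluster)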


Interesting! I should definitely be looking into the batches represented by the cross and the circles for inconsistencies.

5. Repeat analysis with more or fewer clusters:

# clean up after each cluster
mydata$fit.cluster <- NULL
# and reapply analysis of step 4
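
An alternative that avoids the clean-up step is to leave the scaled data untouched and keep each fit in its own object. A rough sketch, again assuming two built-in iris columns as a stand-in for the two control genes; the object names (scaled, fit_k) are just for illustration.

```r
# Two numeric columns stand in for the two control genes (an assumption)
scaled <- scale(iris[, 1:2])

set.seed(1)  # arbitrary seed so the cluster assignments are repeatable
for (k in 3:4) {
  # Fit k clusters without modifying the scaled data itself
  fit_k <- kmeans(scaled, centers = k, nstart = 25)

  # One plotting symbol per cluster, taken straight from the fit
  plot(scaled[, 2] ~ scaled[, 1], pch = fit_k$cluster,
       xlab = colnames(scaled)[1], ylab = colnames(scaled)[2],
       main = paste(k, "clusters"))

  # How many points ended up in each cluster
  print(table(fit_k$cluster))
}
```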

6. A neat way to summarise your clusters is to apply a Principal Components Analysis.


library(cluster)  # clusplot() comes from the cluster package
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,  labels=2, lines=0)


Also, if I had recorded more data on these runs (e.g. time taken to set up, age of reagents/primers/DNA, number of freeze-thaw cycles of reagents/primers/DNA), I could try to explain some of the variation depicted in these plots.

Notes on: Data Science for Business (textbook)

Data Science for Business: What you need to know about data mining and data-analytic thinking

by Foster Provost and Tom Fawcett

Useful resource on basic statistical models (generating, testing, visualising), applied in business contexts.

My notes up to page 112 are as follows:

Examples of classification and regression tasks. 

What is the probability a class will respond to a situation?

e.g. which group is least likely to renew their contract

What is the probability an individual is part of a class?

e.g. user displays x, y, z characteristics = 70% chance male youth

Value estimation (regression)

e.g. how much will a customer use this service?

Similarity matching

e.g. find company profiles that are similar to our most valuable customers

Clustering – (unsupervised) group individuals together based on similarity

Co-occurrence grouping (aka market-basket analysis) – look for ‘interactions’

e.g. what items are commonly purchased together

Profiling – useful for establishing trends and identifying outliers

e.g. what is the typical cell phone usage of this customer segment?

e.g. a credit card is used for odd purchases in an odd location

Link prediction

e.g. you and Karen share 10 friends -> would you like to be friends with Karen?

e.g. movie prediction algorithms

Data reduction – simplify larger data sets into smaller data sets; complex models into simpler models


Who are the most profitable customers?

Which customers are most likely to pay late?

Which customers are most likely to respond to a campaign or advertisement?

Is there a statistically significant difference between profitable and average customer groups?

With how much certainty can we characterise individuals into groups?

Will a particular new customer be profitable? How profitable?

More general notes

‘Tree induction’ plays the same role as model fitting in science stats. The relationship between the response and each predictor can be plotted in R using:

plot([response variable] ~ .,data=[data set])
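
To actually induce and draw a tree, one option is the rpart package. This is a sketch on the built-in iris data rather than anything from the book; rpart is my choice of package here, and the settings are its defaults.

```r
library(rpart)  # recommended package included with standard R installs

# Induce a classification tree: Species is the response, "." means all other columns
tree <- rpart(Species ~ ., data = iris, method = "class")

# Text summary of the splits chosen at each node
print(tree)

# Draw the tree and label the nodes
plot(tree, margin = 0.1)
text(tree, use.n = TRUE)

# Predicted class for each row, and the share predicted correctly (training data)
pred <- predict(tree, iris, type = "class")
print(mean(pred == iris$Species))
```

The accuracy printed at the end is measured on the same data used to build the tree, so it is optimistic.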

“(For simplification in the following examples) we ignore the need to normalize numeric measurements to a common scale. Attributes such as Age and Income have vastly different ranges and they are usually normalized to a common scale to help with model interpretability, as well as other things (to be discussed later). ”

  • Interesting – how are age and income usually normalized?
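
Two common answers, as far as I can tell, are z-score standardisation (subtract the mean, divide by the standard deviation) and min-max rescaling to the 0-1 range. A small sketch of both; the customer values below are hypothetical and made up purely for illustration.

```r
# Hypothetical customers: these values are illustrative only
customers <- data.frame(
  Age    = c(23, 35, 47, 59),
  Income = c(28000, 52000, 61000, 120000)
)

# z-score: each column ends up with mean 0 and standard deviation 1
z <- scale(customers)

# min-max: each column is rescaled to run from 0 to 1
minmax <- as.data.frame(lapply(customers, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))

print(z)
print(minmax)
```

Either way, Age and Income end up on comparable scales, so neither dominates a distance calculation just because its raw numbers are bigger.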

Side-note: I got side-tracked thinking about Principal Components Analysis (PCA), which is used when a number of your variables are correlated.  The aim of PCA is to take correlated, high-dimensional data and reduce the number of dimensions, i.e. model simplification.

princomp([data set], scores=TRUE, cor=TRUE)

scores – also return the data projected into the transformed (component) space

cor – use the correlation matrix instead of the covariance matrix

Output: Proportion of the variance – useful.

Can visualise this by:

plot(princomp([data set], scores=TRUE, cor=TRUE))

biplot() – how do the directions of the variables align? Variables pointing in different directions you probably want to keep.  Highly aligned variables are correlated and probably aren’t adding much explanatory value.

  • Find out – what ‘transformed’ data looks like after PCA? What form is it in?
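
A quick way to look is to run princomp() on a built-in data set and inspect the scores. A sketch assuming USArrests as the example data (my choice, not from the book):

```r
# PCA on a built-in data set, using the correlation matrix
pca <- princomp(USArrests, cor = TRUE, scores = TRUE)

# Proportion of variance explained by each component
summary(pca)

# The transformed data: one row per observation, one column per component
print(head(pca$scores))
print(dim(pca$scores))

# The component scores are uncorrelated with one another
print(round(cor(pca$scores), 10))

# Observations and original variables drawn in the space of the first two components
biplot(pca)
```

So the transformed data has the same number of rows as the original, with the original variables replaced by component scores; dropping the later columns is the dimension reduction.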


Data repositories, tools and learning resources

My IT mentor Brendon Body linked me to this kdnuggets repository of data repositories – a solid place to start.  A brief poke around has led me to the following, which will be at the top of my list as I begin more research.

Compilations of Data Repositories

Data Repositories (of specific interest to me)


Learning Resources


Of note, I think this resource on How To Do a Big Data Project: A Template for Success will be useful to keep my research (and eventually projects) structured, tangible and goal-orientated.

I also hope to compile a list of Big Data thought leaders.

This year I semi-participated in a MOOC on SEO, The Challenge.  I loved the concept of ‘thought leaders’ that they introduced.  Our homework one week was to compile a list of twenty influencers in a field we were interested in.  People shared their lists on the forums, on topics from Education Innovation, to Mummy Bloggers, to Online Marketing.  I think this is a fantastic way to plug in to the thought processes and resources of these leaders, while making sure to have enough to keep ideas and opinions somewhat diversified.  In the past I’ve used Twitter (and specifically Twitter Lists) for this.

An Introduction

This is a blog to collate my thoughts and progress as I try to learn a little more about the sourcing, manipulation, analysis, visualisation and communication of large data sets.

Where I begin
I have studied a general Science degree (genetics and molecular biology) plus honours. I have spent two years since in a (molecular) cancer epidemiology research laboratory.

What I know
Standard relational database management system skills in Excel/Access.
Basic querying using SQL.  This MOOC on Introduction to Databases has been quite useful.
Basic coding structure and concepts. This MOOC on Systematic Program Design has been quite useful.
Standard scientific statistics applied through R and SPSS.
Basic unix/linux learned through my personal genome project – My23andY.

While this will be an opportunity to improve my writing and communication skills, I don’t plan on making posts too perfect with re-draft after re-draft – I don’t want this blog to become a chore.   So forgive my stream-of-consciousness writing and I’ll do my best to make sense!